§ 01 / Now / Case Study / Building · Live

LLMNarrative.

A platform that evaluates how language models reference your brand, product, and industry context across multiple LLMs — surfacing strengths and risks in generative model outputs, weekly.

Status
Live · day 47
Role
Architect & sole engineer
Stack
Python · Next.js · Postgres
Started
Q1 · 2026
§ Problem

Brands are invisible to the systems answering their customers.

When a buyer asks ChatGPT "what's the best CRM for a 50-person SaaS," the answer shapes the next purchase. Today, brands have no visibility into how often they show up, in what context, or against which competitors — across GPT, Claude, Gemini, Llama, and Mistral.

SEO solved this for search. No one has solved it for generative AI. That's the gap.

§ System

Five-stage pipeline, run on a weekly cadence.

Each tenant configures a brand profile and target query set. The pipeline fans queries across LLM providers, normalizes outputs, classifies mention sentiment, scores share-of-voice against competitors, and writes deltas to a tenant warehouse for dashboards and weekly digests.

fig.A · pipeline architecture
01 · BRAND PROFILE: tenant config · query bank · competitor set
02 · FAN-OUT: 5 LLM providers · async batch · retries · cost cap per tenant
03 · NORMALIZE: JSON schema · entity extraction · citation parsing
04 · SCORE: sentiment classifier · share-of-voice · accuracy verifier
05 · SURFACE: dashboards · weekly digest · webhook · API
DATA PLANE: Postgres · run history · audit log · per-tenant warehouse · 90-day retention
5 · LLM providers
~40k · prompts / tenant / week
7d · eval cadence
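The fan-out and normalize stages above can be sketched as a small async driver. This is a minimal illustration, not the production code: `query_provider`, `normalize`, and `run_week` are hypothetical names, and the real pipeline adds retries, batching, and per-tenant cost caps.

```python
import asyncio
from dataclasses import dataclass

PROVIDERS = ["gpt", "claude", "gemini", "llama", "mistral"]

@dataclass
class BrandProfile:
    tenant_id: str
    brand: str
    queries: list[str]
    competitors: list[str]

async def query_provider(provider: str, prompt: str) -> dict:
    # Stage 02 stand-in: a real build calls the provider API here,
    # with retries and a per-tenant cost cap.
    return {"provider": provider, "prompt": prompt, "text": f"[{provider}] answer"}

def normalize(raw: dict) -> dict:
    # Stage 03 stand-in: coerce each provider's output into one JSON
    # schema before entity extraction and citation parsing.
    return {"provider": raw["provider"], "prompt": raw["prompt"],
            "mentions": [], "citations": []}

async def run_week(profile: BrandProfile) -> list[dict]:
    # Stage 02: fan every query out across all providers concurrently.
    raw = await asyncio.gather(*(query_provider(p, q)
                                 for p in PROVIDERS
                                 for q in profile.queries))
    # Stage 03: normalize before scoring and surfacing.
    return [normalize(r) for r in raw]
```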
§ Decisions

Three calls that shaped the build.

[ 01 ]
Decision

Cache LLM outputs aggressively.

Considered

Live querying every week for freshness vs. content-hash caching with weekly invalidation.

Picked · why

Caching. Cuts cost ~70%; weekly cadence makes invalidation deterministic and auditable.
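The deterministic-invalidation property falls out of the key design: hash the full request together with the ISO week, and a new week produces a new key. A minimal sketch, with illustrative names (`cache_key`, `cached_call`) and an in-memory dict standing in for the real store:

```python
import hashlib
import json

_cache: dict[str, str] = {}

def cache_key(provider: str, model: str, prompt: str, week: str) -> str:
    # Hash the full request plus the ISO week: a new week yields a new
    # key, so weekly invalidation is deterministic and auditable.
    payload = json.dumps(
        {"provider": provider, "model": model, "prompt": prompt, "week": week},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_call(provider, model, prompt, week, call):
    key = cache_key(provider, model, prompt, week)
    if key not in _cache:
        _cache[key] = call(provider, model, prompt)  # cache miss: pay once
    return _cache[key]
```

Within one week, repeated runs of the same prompt hit the cache; the week rollover is the only invalidation event, so the audit log can name exactly which cached response backed which score.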

[ 02 ]
Decision

Score with a smaller, fine-tuned classifier.

Considered

GPT-4o-as-judge vs. a fine-tuned distilled model on labeled mentions.

Picked · why

Distilled classifier. 8× cheaper, deterministic, and removes the "judge sees self" bias.
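Once the classifier has labeled which entities each answer mentions, the share-of-voice half of stage 04 reduces to counting. A minimal sketch under that assumption (the function name and signature are illustrative):

```python
from collections import Counter

def share_of_voice(mentions: list[str], brand: str, competitors: set[str]) -> float:
    # Fraction of tracked mentions (brand plus competitor set) that
    # belong to the brand; 0.0 when nothing tracked was mentioned.
    counts = Counter(m for m in mentions if m == brand or m in competitors)
    total = sum(counts.values())
    return counts[brand] / total if total else 0.0
```

Because both the distilled classifier and this count are deterministic, the same week's cached outputs always score identically, which keeps week-over-week deltas meaningful.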

[ 03 ]
Decision

Per-tenant warehouse, not shared OLAP.

Considered

Single ClickHouse cluster with row-level tenant ID vs. per-tenant Postgres schemas.

Picked · why

Per-tenant. Simpler isolation contract; enterprise buyers ask "where is my data" and the answer is one schema.
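The one-schema answer can be made concrete with a small provisioning helper. A hedged sketch: `tenant_schema` and `provision_ddl` are hypothetical names, and the `run_history` columns here are illustrative, not the real table definition.

```python
import re

def tenant_schema(tenant_id: str) -> str:
    # One Postgres schema per tenant; fold the id to a safe identifier.
    safe = re.sub(r"[^a-z0-9_]", "_", tenant_id.lower())
    return f"tenant_{safe}"

def provision_ddl(tenant_id: str) -> list[str]:
    # DDL run once at tenant onboarding; every later query is pinned
    # to this schema, so "where is my data" has a one-schema answer.
    schema = tenant_schema(tenant_id)
    return [
        f"CREATE SCHEMA IF NOT EXISTS {schema}",
        f"CREATE TABLE IF NOT EXISTS {schema}.run_history ("
        "run_id uuid PRIMARY KEY, ran_at timestamptz NOT NULL, "
        "payload jsonb NOT NULL)",
    ]
```

Isolation by schema also makes the 90-day retention policy a per-tenant operation: dropping or archiving one schema touches exactly one customer's data.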

§ Outcome

Shipping into design partners now.

Three brands are running weekly evaluations. The first surprise: brands with strong SEO are sometimes weakest in LLM mention rate. The engines aren't crawling their press; they're inferring from documentation and Reddit. That mismatch is the wedge.

Next: open the platform, ship the API, and add longitudinal sentiment trends.

Want a deeper walkthrough — or to build something like this?

→ Get in touch