Methodology

How the Citelligence
Index works.

Published methodology. Auditable math. No black boxes. Here's exactly how we measure AI visibility, what the numbers mean, and where the limitations are.

Last updated: April 2026 | Version 1.0 | Questions? founder@citelligence.app
01 — THE SWEEP

How data collection works.

Every tracked prompt runs against every active platform, every day. The sweep is the atomic unit of measurement. Nothing is inferred — it's all live queries, live responses, live capture.

Schedule
Daily at 7am CT
Every tracked prompt runs against every active platform. Results are available in the dashboard by 8am.
Full prompt × platform matrix
100–1,000+ prompts per brand (tier-dependent) × 6 platforms = 600–6,000+ individual API calls per daily sweep. Most competing tools cap at 10–40 queries. We start at 100 because a 40-prompt sample misses the long-tail where competitive shifts happen first.
Prompt count is configurable
Tracking starts at 100 seeded prompts per brand and scales with tier. Subscribers can add custom prompts beyond the seeded set.
Platforms Tracked
Platform | Model | Grounding
ChatGPT | gpt-4o-search-preview | web search on
Gemini | gemini-2.5-flash | grounded search
Perplexity | sonar-pro | live web
Google AI Overviews | via Serper API | search-grounded
Claude | claude-3-5-sonnet | web access varies
DeepSeek | deepseek-chat | web access varies
What we capture per response
Presence: brand mention (yes/no); position (order of first appearance); mention count in response
Context: sentiment classification; competitor mentions + positions; raw response text (stored)
Sources: citation URLs (when provided); domain of cited sources; platform model version
The Prompt Portfolio

Prompts aren't static. Every tracked prompt belongs to one of three groups, each with a different lifecycle:

Core

Your crown queries. Tracked permanently, defended always. When a competitor threatens a Core position, the engine flags it immediately. These never rotate out.

Conquest

Offensive targets. Queries you don't win yet but want to. Track them, get ranked actions to win them, execute. Once held at top 3 for 4+ weeks, the engine recommends promoting to Core and loading a new Conquest target.

Discovery

Engine-surfaced opportunities. Queries you didn't know to track. Found by analyzing competitor citations, trending AI response patterns, and adjacent topics. The engine recommends promoting the best ones to Conquest.

This rotation cycle — discover, target, win, defend, rotate — keeps the platform fresh and prevents the "I've won everything, now what?" plateau that kills SaaS retention. Prompt capacity varies by tier: Lite up to 100, Pro up to 500, Hero 1,000+.

Prompt Intent Types

Within each group, prompts span four intent types to ensure coverage across the full buyer journey:

COMMERCIAL "best [product category] for [use case]" — captures recommendation intent, highest commercial value
COMPARISON "[Brand A] vs [Brand B]" — captures competitive displacement and brand-level positioning
INFORMATIONAL "how to find [X]" or "what makes [Y] good" — captures topical authority signals
BRAND-SPECIFIC "tell me about [Brand]" — captures entity recognition and direct brand awareness
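The four intent types can be seeded from fill-in templates. A sketch; the template strings and slot names are examples, not Citelligence's actual prompt library:

```python
# Illustrative seeding of the four intent types. Template strings and
# slot names are examples, not Citelligence's actual prompt library.
TEMPLATES = {
    "commercial":     "best {category} for {use_case}",
    "comparison":     "{brand} vs {competitor}",
    "informational":  "what makes {category} good",
    "brand_specific": "tell me about {brand}",
}

def seed_prompts(brand: str, competitor: str, category: str, use_case: str) -> dict:
    ctx = {"brand": brand, "competitor": competitor,
           "category": category, "use_case": use_case}
    return {intent: t.format(**ctx) for intent, t in TEMPLATES.items()}

prompts = seed_prompts("Acme CRM", "Rival CRM", "CRM software", "startups")
print(prompts["commercial"])  # -> best CRM software for startups
```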
02 — THE CITELLIGENCE INDEX

A 0–100 composite score.

The Citelligence Index is a single number that measures overall AI visibility strength. It's not a ranking — it's a composite across six distinct signals, each weighted by how much it actually affects whether AI platforms recommend your brand.

The formula: Index = Σ (component_score × component_weight). Six components, six weights, one number.
Topical Authority 30%

How many of your tracked prompts you win — defined as top 3 position on at least 2 platforms. This is the highest-weight component because content coverage is the strongest observed predictor of AI visibility. A brand that wins 60% of prompts across 6 platforms is in a structurally different position than one that dominates one query type and disappears on others. This measures breadth of territory, not depth on a single query.

Entity Strength 25%

How well AI platforms recognize your brand as a distinct, authoritative entity. Signals: Schema.org Organization markup with sameAs links, knowledge panel presence, consistent naming across platforms, and domain authority of pages that get cited. Brands with strong entity signals get cited more reliably — even when the prompt doesn't mention them by name. Entity recognition compounds over time as signals accumulate across the web.
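The sameAs signal described above is typically deployed as Organization JSON-LD. A minimal sketch with placeholder values, emitted from Python for illustration:

```python
import json

# Minimal Organization JSON-LD carrying the sameAs signal described
# above. All values are placeholders for illustration.
org = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Brand",
    "url": "https://example.com",
    "sameAs": [
        "https://www.linkedin.com/company/example-brand",
        "https://en.wikipedia.org/wiki/Example_Brand",
        "https://www.crunchbase.com/organization/example-brand",
    ],
}

# Embedded on the homepage inside <script type="application/ld+json">.
print(json.dumps(org, indent=2))
```

The sameAs array is what links the site to the brand's other authoritative profiles, which is the consistency signal entity recognition rewards.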

Citation Density 20%

What fraction of AI responses mention your brand, weighted by position. Being mentioned third in 80% of responses is very different from being mentioned first in 30%. Weighted at 20% because this component is directly measurable — every sweep produces a precise number — but its causal weight is moderate rather than dominant. High Citation Density without Topical Authority is a fragile position.

Structured Data 10%

A scored checklist of machine-readable signals deployed on your site: Organization schema, sameAs links (LinkedIn, Wikipedia, Crunchbase), Product schema, FAQ schema, Article author schema, BreadcrumbList, AggregateRating. Each signal is weighted by its mechanistic importance for AI citation. Score = deployed points / total possible points × 100. Weighted at 10% because the mechanistic logic is strong — AI models are trained to parse structured data — but the empirical correlation is newer and less established than the top three components.
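The checklist arithmetic can be sketched as follows; the per-signal point values are assumptions, since the source states only that each signal is weighted by mechanistic importance:

```python
# Sketch of the checklist score: deployed points / total points x 100.
# The per-signal point values below are assumptions, not published weights.
CHECKLIST = {
    "organization_schema":   3,
    "sameas_links":          3,
    "product_schema":        2,
    "faq_schema":            2,
    "article_author_schema": 1,
    "breadcrumb_list":       1,
    "aggregate_rating":      1,
}

def structured_data_score(deployed: set) -> float:
    total = sum(CHECKLIST.values())
    earned = sum(pts for sig, pts in CHECKLIST.items() if sig in deployed)
    return round(earned / total * 100, 1)

print(structured_data_score({"organization_schema", "sameas_links", "faq_schema"}))
# -> 61.5 (8 of 13 possible points)
```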

Surface Coverage 10%

What fraction of tracked AI platforms actively cite your brand. Cited on 4 of 6 platforms = 67% surface coverage. Cited on only ChatGPT = 17%. Weighted lower because platform reach is partially outside your control — some categories just don't get cited on certain platforms. But it matters as a diagnostic: if you're visible on ChatGPT but invisible on Perplexity, that's actionable.

Sentiment Quality 5%

The average sentiment of AI responses where your brand is mentioned. Classified by keyword heuristics: strong positive ("best", "top choice", "recommend"), moderate positive ("quality", "reliable"), neutral, moderate negative ("issues", "complaints"), strong negative ("avoid", "problems"). Score normalized to 0–100. Weighted lowest because sentiment is an outcome, not an input — improving it requires fixing the underlying content or entity signals, not sentiment directly. It's a signal, not a lever.
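The keyword heuristic above, as a minimal sketch; the tier scores and keyword lists are illustrative samples, not the production lexicon:

```python
# Minimal version of the keyword heuristic. Tier scores and keyword
# lists are illustrative samples, not the production lexicon. Matching
# is naive substring search; a real classifier would tokenize first.
TIERS = [
    ("strong positive",   100, ("best", "top choice", "recommend")),
    ("moderate positive",  75, ("quality", "reliable")),
    ("moderate negative",  25, ("issues", "complaints")),
    ("strong negative",     0, ("avoid", "problems")),
]

def sentiment_score(response: str) -> tuple:
    text = response.lower()
    for label, score, keywords in TIERS:
        if any(k in text for k in keywords):
            return label, score
    return "neutral", 50   # no keyword hit

print(sentiment_score("Acme is the best option and easy to recommend."))
# -> ('strong positive', 100)
```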

On the weights themselves

These weights represent our best judgment based on observed correlations in daily sweep data and mechanistic reasoning about how AI models process information. No peer-reviewed research exists on optimal AI visibility component weighting — the entire field is pre-empirical. We publish our reasoning so you can evaluate it yourself. Weights are reviewed quarterly as data accumulates.
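The composite formula with the published weights, applied to a hypothetical set of component scores:

```python
# The published composite: Index = sum(component_score x component_weight).
# The component inputs below are hypothetical 0-100 scores.
WEIGHTS = {
    "topical_authority": 0.30,
    "entity_strength":   0.25,
    "citation_density":  0.20,
    "structured_data":   0.10,
    "surface_coverage":  0.10,
    "sentiment_quality": 0.05,
}

def citelligence_index(components: dict) -> float:
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9   # weights sum to 100%
    return round(sum(components[k] * w for k, w in WEIGHTS.items()), 1)

scores = {
    "topical_authority": 62, "entity_strength": 48, "citation_density": 55,
    "structured_data": 70, "surface_coverage": 67, "sentiment_quality": 80,
}
print(citelligence_index(scores))  # -> 59.3
```

A brand with these hypothetical inputs lands at 59.3, in the COMPETITIVE band.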

Score bands
Score | Band | What it means
0–20 | INVISIBLE | AI platforms rarely or never mention this brand. Structural visibility work is needed before optimization is meaningful.
20–40 | EMERGING | Sporadic mentions across a narrow set of prompts or platforms. Entity and content gaps are the primary levers.
40–60 | COMPETITIVE | Consistent mentions in core prompts. Competing with established brands but not yet capturing disproportionate share.
60–80 | DOMINANT | Strong citation density, multi-platform presence, and positive sentiment. AI regularly recommends this brand by name.
80–100 | CATEGORY LEADER | AI treats this brand as the default recommendation in its category. Defending this position is the priority.
03 — POSITION-WEIGHTED SCORING

Why rank matters more
than mention.

AI responses are narrative, not lists. The brand mentioned first gets the lead paragraph — framed as the recommendation. The brand mentioned third gets "other options include..." That's a structurally different outcome, not a marginal one.

"The best option here is Brand X, which offers..." is categorically different from "You might also consider Brand Y." One owns the answer. The other is an afterthought.
Position | Weight | Typical framing in AI responses
#1 (named first) | 1.00 | Gets the lead paragraph. Framed as the recommendation. "The best option is X, which..." — owns the answer.
#2 | 0.85 | Named second, often with "also consider" or "another strong option is" framing. Visible, not dominant.
#3 | 0.70 | Mentioned, but with diminishing editorial weight. Usually the last brand given any substantive description.
#4–5 | 0.50 | Listed in a "you might also look at" section. Rarely described. Often just a name with no supporting detail.
#6–10 | 0.30 | Buried in the response. Minimal impact. Users who read this far are already in comparison mode, not recommendation mode.
Mentioned, no clear rank | 0.20 | Referenced in passing or only in citations. Brand name appears but isn't positioned as a recommendation.
Calibration note

This weighting was calibrated against observed patterns in how users engage with AI responses — specifically, how quickly attention drops after the first recommendation. The weights reflect that the first brand named captures disproportionate editorial framing. These weights can be adjusted per account if a category has unusual response patterns that don't match this model.
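The weight table translates directly into a lookup. A sketch; the treatment of positions beyond #10 is an assumption, since the table doesn't specify it:

```python
# Lookup for the position-weight table. Positions beyond #10 aren't
# specified in the table; treating them like an unranked mention (0.20)
# is an assumption.
def position_weight(position) -> float:
    if position is None:          # mentioned, no clear rank
        return 0.20
    if position == 1:  return 1.00
    if position == 2:  return 0.85
    if position == 3:  return 0.70
    if position <= 5:  return 0.50
    if position <= 10: return 0.30
    return 0.20

# One prompt observed across four platforms; None = unranked mention.
observations = [1, 3, None, 2]
weighted = sum(position_weight(p) for p in observations) / len(observations)
print(round(weighted, 3))  # -> 0.688
```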

04 — HANDLING NOISE

AI responses are not deterministic.
We account for that.

The same prompt can return different results across sweeps. Platforms rate-limit. Models update. These aren't problems we hide — they're things we surface explicitly so the data stays trustworthy.

Rate Limits

Gemini's free tier rate-limits aggressively during peak hours. When prompts are skipped because of quota exhaustion, the dashboard surfaces this honestly: "43 prompts skipped — Gemini quota exhausted." Scores are computed only from completed responses, not extrapolated from partial data.

We don't fill in missing data with estimates. A skipped prompt is a gap, logged as a gap.

Response Variability

AI responses change run-to-run. The same prompt can return different brand rankings on different days. We mitigate this three ways:

(a) Daily sweeps that build a rolling average — single-day anomalies smooth out over time.
(b) Flagging anomalous single-day movements: "Flagged: larger than usual variance — validating next sweep."
(c) Week-over-week deltas that smooth daily noise. The trend line is more reliable than any single day's data point.
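The three mitigations can be sketched with the standard library; the 7-day window and the 2-sigma anomaly threshold are illustrative parameters, not published values:

```python
from statistics import mean, pstdev

# (a) rolling average, (b) anomaly flag, (c) week-over-week delta.
# The 7-day window and 2-sigma threshold are illustrative, not published.
def rolling_avg(series, window=7):
    return mean(series[-window:])

def wow_delta(series):
    return round(mean(series[-7:]) - mean(series[-14:-7]), 2)

def flag_anomaly(series, z=2.0):
    history, today = series[:-1], series[-1]
    sd = pstdev(history)
    return sd > 0 and abs(today - mean(history)) > z * sd

daily = [52, 54, 51, 53, 55, 52, 54,    # last week's daily scores
         55, 56, 54, 57, 55, 56, 70]    # this week, spike on the last day
print(wow_delta(daily), flag_anomaly(daily))  # -> 4.57 True
```

The spike on the final day trips the anomaly flag while the week-over-week delta stays modest, which is exactly the behavior the mitigations are designed for.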
Personalization

AI platforms personalize responses based on user context, history, and location. Our sweeps use neutral, non-logged-in API endpoints to minimize personalization effects. This means our measurements reflect the "default" response — what a new, anonymous user would see — not what a specific user would get based on their history. It's a baseline, not a universal truth.

Model Updates

AI platforms update their models regularly. When a platform updates — say, ChatGPT ships a new model version — the sweep automatically captures the new behavior. The insight engine flags model-change-driven movements explicitly: "ChatGPT updated its model this week — movements may reflect model behavior changes, not content changes on your end." We don't want you optimizing against a ghost.

05 — HONEST LIMITATIONS

What we don't measure.

Any measurement system has edges. These are ours. We'd rather tell you upfront than have you discover them mid-campaign. Knowing where the model breaks is part of using it correctly.

No click-through data

No AI platform exposes CTR or downstream conversion data through their APIs. We measure position and citation — not whether a user clicked a link in the AI response, visited your site, or converted. What you can reasonably infer: higher position = higher intent capture. What you can't infer: exact revenue impact from a position change.

No user intent fulfillment

We know the AI recommended you. We can't tell you whether the user acted on it. Intent fulfillment would require tracking the full user journey from AI response to site behavior — data that lives in your analytics stack, not ours. Connecting the two is possible with GA4/Plausible; we can guide you on setting up AI referral tracking, but we don't capture it directly.

Daily cadence, not real-time

Sweeps run at 7am CT. If a competitor publishes content at 3pm Monday, we catch it in Tuesday morning's sweep — not instantly. Real-time monitoring would require continuous API polling, which is expensive and would exhaust platform rate limits within hours. Daily cadence is the right tradeoff between cost and data quality.

Representative coverage, not exhaustive

We track 100–1,000+ prompts per brand depending on tier (Lite: up to 100, Pro: up to 500, Hero: 1,000+). Real users ask thousands of variations. Our prompt library is constructed to cover the highest-value intent clusters in your category — commercial, comparison, informational, and brand-specific — but there are always long-tail queries outside our sample. Coverage expands as you add custom prompts and as the engine discovers new query patterns from competitor citations.

Claude and DeepSeek are newer additions

Claude and DeepSeek have less stable grounding and web-search behaviors compared to ChatGPT or Perplexity. Their responses can be more variable run-to-run, and their platform architectures change faster. We weight them equally in the Surface Coverage component, but the dashboard flags instability when it's detected. Treat their data as directional until the platforms mature.

06 — THE INSIGHT ENGINE

How actions are generated.

Raw sweep data doesn't tell you what to do. The insight engine reads the data and generates plain-English narratives and ranked actions. Here's exactly how it works and what guardrails we run on it.

What it reads
Current sweep data (all responses, all platforms)
Historical snapshots (week-over-week deltas)
Component scores (which of the six is weakest)
Competitor movements (who gained, who lost)
What it generates
5 Zone narratives — one per dashboard section, explaining what the data means in plain English
6 Component micro-narratives — one per Index component, explaining what's driving the score
3–5 Ranked weekly actions — specific moves, ranked by expected Index lift
How actions are ranked

Each action is drawn from a library of 15+ canonical move templates. Current examples: "publish a comparison page for [query X]," "respond on cited Reddit thread," "add sameAs schema links to homepage," "update FAQ schema on [page]," "publish vendor-perspective article on [topic]."

The engine scores each template against the current sweep data — which queries are losing, which components are weakest, which competitors are gaining ground — and ranks by expected Index lift. The highest-leverage moves surface first. Moves that wouldn't affect the current weakest signal are deprioritized automatically.
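A toy version of the ranking logic: each move template declares the component it targets, and expected lift scales with that component's gap to 100, so moves against the weakest signal surface first. Template names and base-lift numbers are invented for illustration:

```python
# Toy version of the action ranking. Template names, component mappings,
# and base-lift numbers are invented for illustration.
MOVE_TEMPLATES = [
    {"move": "publish comparison page",        "component": "topical_authority", "base_lift": 4.0},
    {"move": "add sameAs schema links",        "component": "entity_strength",   "base_lift": 2.5},
    {"move": "respond on cited Reddit thread", "component": "citation_density",  "base_lift": 1.5},
    {"move": "update FAQ schema",              "component": "structured_data",   "base_lift": 1.0},
]

def rank_actions(component_scores, top_n=3):
    # Expected lift grows with the gap between the component and 100,
    # so moves that target weaker components rank higher.
    def expected_lift(t):
        gap = 100 - component_scores.get(t["component"], 100)
        return t["base_lift"] * gap / 100
    return sorted(MOVE_TEMPLATES, key=expected_lift, reverse=True)[:top_n]

weak = {"topical_authority": 40, "entity_strength": 70,
        "citation_density": 55, "structured_data": 30}
print([t["move"] for t in rank_actions(weak)])
```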

What it does

Reads data → selects relevant templates from library → scores against current state → ranks by impact → generates narrative. Every insight references actual sweep numbers. The voice guide (internal document) requires all narratives to cite specific metrics.

What it doesn't do

Fabricate causes. If a metric moved and the engine can't identify why, it says "We saw a shift in Citation Density this week — no clear causal pattern identified yet. Watching next sweep." We don't invent explanations to fill the gap.

Validation

Every generated insight passes three checks before reaching the dashboard: (1) a schema check confirming required fields are populated, (2) an action-library reference check confirming the recommended action maps to a valid template, and (3) a numeric-consistency heuristic confirming cited numbers match the sweep data it claims to reference.
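The three checks can be sketched against a simple insight dict; the field names and the valid-template set are assumptions for illustration:

```python
# Sketch of the three pre-publication checks. Field names and the
# valid-template set are assumptions, not the production schema.
VALID_TEMPLATES = {"publish_comparison_page", "add_sameas_schema",
                   "respond_reddit_thread", "update_faq_schema"}
REQUIRED_FIELDS = {"narrative", "action_template", "cited_metrics"}

def validate_insight(insight: dict, sweep_metrics: dict) -> list:
    errors = []
    # (1) Schema check: required fields present and non-empty.
    populated = {k for k, v in insight.items() if v}
    for f in sorted(REQUIRED_FIELDS - populated):
        errors.append(f"missing field: {f}")
    # (2) Action-library reference check.
    if insight.get("action_template") not in VALID_TEMPLATES:
        errors.append("action does not map to a valid template")
    # (3) Numeric-consistency heuristic: cited numbers match sweep data.
    for metric, value in (insight.get("cited_metrics") or {}).items():
        if sweep_metrics.get(metric) != value:
            errors.append(f"cited {metric}={value} != sweep value")
    return errors

insight = {"narrative": "Citation Density rose to 55 this week.",
           "action_template": "publish_comparison_page",
           "cited_metrics": {"citation_density": 55}}
print(validate_insight(insight, {"citation_density": 55}))  # -> []
```

An empty error list means the insight can reach the dashboard; any non-empty list blocks it.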

07 — OPEN QUESTIONS

What we're still solving.

AI visibility measurement is a new field. We don't have every answer. These are the measurement problems we're actively working on. When the methodology evolves, we'll publish the changes here.

The methodology will evolve. These aren't embarrassments — they're the honest frontier of where the work is.
Platform weighting — should ChatGPT matter more than Perplexity?

ChatGPT has significantly more users than Perplexity. A citation on ChatGPT may have 10x the real-world impact of one on Perplexity. Right now we weight all platforms equally in the Surface Coverage component. We're exploring whether usage-weighted platform scoring would produce more accurate impact estimates — but user counts are hard to verify, change frequently, and vary by category and geography.

Indirect citations — when AI cites a review site that mentions you

Sometimes AI doesn't cite your domain directly — it cites a review article, a Reddit thread, or an industry directory that mentions you. We currently count this as a "mentioned, no clear rank" (0.20 weight), but that may undervalue the indirect presence. We're working on tracking second-order citations more precisely: you appear in a cited source, therefore you have partial citation credit.

Cumulative vs. per-sweep tracking

Ahrefs tracks cumulative lifetime backlink counts because links accrete over time. Should AI citations work the same way — do yesterday's citations still matter today? Or is AI visibility purely about current-state responses? Our current model is per-sweep snapshots with rolling averages. We haven't settled whether cumulative citation history should factor into the Index.

Causality attribution

If you publish a comparison page and your Citation Density improves three weeks later, did the page cause the improvement? Probably. But AI platforms don't expose ranking signals the way Google does, and the lag between content publication and AI model behavior changes is irregular. We flag correlations, but we don't claim to have solved causal attribution. This is a hard problem in AI visibility specifically because models re-train or update on opaque schedules.

Stay updated

When the methodology changes — new components, revised weights, new platforms — we update this page and log the change. If you're making decisions based on this data, you should know what changed and when.

08 — RESEARCH

We're not just building a tool.
We're studying how AI citation works.

Every daily sweep captures hundreds of raw AI responses across 6 platforms. Over weeks and months, that becomes a longitudinal dataset of AI recommendation behavior that doesn't exist anywhere else. We use it to test our own assumptions — and to push the frontier of what's actually understood about AI visibility.

The entire field of AI visibility optimization is pre-empirical. No peer-reviewed research exists on what makes AI recommend one brand over another. We're building the dataset to change that.
The Dataset

Every sweep captures the full response from each AI platform for each tracked prompt. Not just "mentioned: yes/no" — the actual words, the citation sources, the competitor positions, the sentiment. Over time, this lets us ask questions nobody has been able to answer before:

Per sweep
  • 100–1,000+ prompts × 6 platforms = 600–6,000+ individual responses
  • Full response text (not just mentions)
  • Citation URLs + source domains
  • Competitor positions + sentiment per response
  • Auto-discovered competitors
Over time
  • Position trajectories per prompt × platform
  • Before/after measurements when content ships
  • Competitor movement patterns across sweeps
  • Model-update detection (behavior changes after retraining)
  • Citation decay rates (how long does a citation persist?)
Active Research Questions
Can you influence AI upstream of the decision-to-cite layer?

Most AI visibility work focuses on the citation layer — getting AI to cite your page when it searches the web. But there's a deeper layer: the training data layer. When AI "knows" your brand from its training data, it recommends you without needing to search at all. We see this in our data — 27 responses where AI mentioned our test brand by name without citing the website. That's entity recognition from training, not from a search result.

The question we're actively researching: can you systematically influence training-data-level recognition? If your brand appears consistently across authoritative sources (Wikipedia, industry publications, Reddit, LinkedIn thought leadership), does that compound into stronger AI entity recognition over time? Our longitudinal data is beginning to show patterns, but it's too early to publish conclusions.

Does structured data actually move the needle?

Schema.org markup is machine-readable by design. AI models that crawl the web can parse it. But does deploying Organization schema with sameAs links actually increase your citation rate? We weight Structured Data at 10% in the Index — based on mechanistic reasoning, not empirical proof. As we accumulate before/after deployment data across multiple brands, we'll publish the actual correlation (or lack of one).

What's the half-life of an AI citation?

When ChatGPT starts citing your page, how long does that citation persist? Does it decay like a backlink, or does it persist indefinitely until something better displaces it? Our daily sweeps are building the first longitudinal dataset of citation persistence. Early observations suggest citations are less stable than backlinks — model updates can shuffle positions overnight — but we need more data before publishing decay curves.

Do different platforms respond to different signals?

Our sweep data shows platforms behave very differently. Gemini gives explicit ranked positions; Perplexity almost never does. ChatGPT mentions brands by name; Google AIO tends to cite domains without name-dropping. We're building per-platform signal models to understand whether optimizing for ChatGPT is different from optimizing for Perplexity — and whether brands should allocate effort differently by platform.

Our commitment

We will not claim to have solved problems we haven't solved. When we publish findings, we'll show the data, the methodology, the sample size, and the confidence level. If something is a hypothesis, we'll say so. If something is proven, we'll show the proof.

The AI visibility field needs less marketing and more measurement. We intend to be on the measurement side.