A behavioral simulation engine, calibrated against 200+ peer-reviewed papers.
Last updated: May 2026
Engine V2.8.x · 33 research-validated changes

WhyIQ is a behavioral simulation engine, not an LLM that “looks at” your landing page. 50 simulated visitors, drawn from a 21-archetype pool, traverse your page under five stacked psychological frameworks: Cognitive Load Theory, dual-process cognition, regulatory focus, prospect theory, and the Elaboration Likelihood Model. Every coefficient is anchored to a peer-reviewed paper, drawn from a 200+ paper corpus across CRO, behavioral science, and ecommerce calibration. Extraction accuracy is 89.5%. Scan-to-scan variance is under 2.1% on a fixed engine version. This page is the audit trail behind every score.
The numbers a one-shot LLM scan cannot produce
If a tool cannot tell you the source of a single coefficient, the output is a vibe. Below is the calibration surface area WhyIQ commits to, with the citation that produced each number.
- 200+ peer-reviewed papers
- 89.5% extraction accuracy
- <2.1% scan-to-scan variance
- 50 simulated visitors / 21-archetype pool
- 5 stacked frameworks
- 33 research-validated code changes
Every coefficient has a paper behind it
A typical AI page reviewer generates output at inference time. WhyIQ generates it at parameter time. The LLM only renders the simulation. Trust thresholds, patience curves, cognitive load capacity, frustration compounding, persuasion-knowledge dampeners: each is a numeric coefficient bounded by a published study.
When Cowan (2001) put working memory at 4 chunks, that is the gate the engine uses for above-fold interactive density. When Eisend and Tarrahi (2022) measured persuasion knowledge across 148 papers, that is the dampener applied to urgency, scarcity, and social proof signals. The model is not learning. It is carrying research.
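To make that concrete, here is a minimal sketch of what a research-bounded coefficient table can look like. The structure is the point; the field names and the dampener value are illustrative stand-ins, and only the cited anchors (Cowan 2001, Eisend and Tarrahi 2022) come from the corpus.

```python
# Illustrative sketch: a coefficient registry where every parameter carries
# the citation that bounds it. Names and the dampener value are hypothetical;
# the cited anchors (Cowan 2001, Eisend & Tarrahi 2022) are from the corpus.
from dataclasses import dataclass

@dataclass(frozen=True)
class Coefficient:
    value: float
    low: float       # lower bound from the source study
    high: float      # upper bound from the source study
    citation: str

COEFFICIENTS = {
    # Working-memory gate for above-fold interactive density (3-5 chunks).
    "wm_chunk_capacity": Coefficient(4.0, 3.0, 5.0, "Cowan (2001)"),
    # Dampener on urgency, scarcity, and social-proof signals.
    "persuasion_knowledge_dampener": Coefficient(
        0.72, 0.60, 0.85, "Eisend & Tarrahi (2022)"),  # value illustrative
}

def bounded(name: str) -> float:
    """Return a coefficient, refusing values that drift outside the bounds
    reported in the paper that anchors it."""
    c = COEFFICIENTS[name]
    assert c.low <= c.value <= c.high, f"{name} violates {c.citation}"
    return c.value
```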
Five frameworks, layered. Not name-dropped.
Most AI page tools cite “behavioral psychology” once and move on. The engine layers five frameworks because each governs a different inflection on the visitor's path from glance to click.
Cognitive Load Theory (CLT)
Above the fold, visitors can hold roughly four interactive elements in working memory before clarity collapses.
Applied to: Above-fold interactive density gate, hero scannability.
Cowan 2001; Sweller 2019.
Dual-process cognition
Visitors enter on heuristics. The engine flags pages that lose them before analytical evaluation ever engages.
Applied to: Per-visitor mode tracking, mode-transition triggers.
Kahneman 2011.
Hick-Hyman Law
Decision time scales log-linearly with option count. Drives CTA-density and nav-clutter penalties.
Applied to: CTA density, nav clutter, choice overload.
Hick 1952; Hyman 1953.
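The Hick-Hyman relationship is commonly written RT = a + b·log2(n + 1). A minimal sketch of how a CTA-density penalty can fall out of it; the constants are illustrative placeholders, not the engine's calibrated values.

```python
import math

def decision_time(n_options: int, a: float = 0.2, b: float = 0.15) -> float:
    """Hick-Hyman: expected decision time in seconds for n equally likely
    options. a is a non-decision baseline, b a per-bit processing cost;
    both constants are illustrative, not WhyIQ's calibrated values."""
    return a + b * math.log2(n_options + 1)

print(decision_time(1))  # ~0.35s: one clear CTA
print(decision_time(7))  # ~0.65s: competing CTAs plus a cluttered nav
```

Doubling the option count does not double decision time, but it steadily eats into a visitor's patience budget.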
NN/g attention distribution
Attention shares for above-fold, mid-page, and below-fold content, drawn from eye-tracking studies. Weights the priority of confusion findings.
Applied to: Finding severity, prioritisation, fold weighting.
Nielsen Norman Group.
Kim & Peterson trust hierarchy
Establishes the rank order of trust signals: third-party reviews above brand claims above visual polish.
Applied to: Trust formation scoring, signal weighting.
Kim and Peterson 2017.
Regulatory focus + prospect theory + ELM
Three additional spines. Regulatory focus per archetype, prospect theory in pricing framing, ELM in persuasion routing.
Applied to: Archetype calibration, pricing risk framing, persuasion route choice.
Higgins 1997; Kahneman and Tversky 1979; Petty and Cacioppo 1986.
Archetypes are parameter sets, not personas
Personas are marketing fiction. Archetypes in WhyIQ are vectors: each one is a calibrated combination of patience threshold, trust baseline, working-memory headroom, regulatory focus, technical literacy, and price sensitivity.
A “Skeptical evaluator” is not a vibe. It is a profile: trust 22, patience 5.4 seconds, prevention-focus +0.6, working-memory one chunk below median. When the engine reports that 14 of 50 visitors disengaged at the pricing block, those 14 share a quantifiable parameter signature. The report names it.
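A minimal sketch of that parameter signature as a data structure. The field names mirror the dimensions listed above, the quoted numbers are the ones in this section, and the last two values are illustrative fill-ins.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Archetype:
    name: str
    trust_baseline: float       # starting trust (assumed 0-100 scale)
    patience_s: float           # seconds before disengagement pressure
    regulatory_focus: float     # positive = prevention, negative = promotion
    wm_headroom_chunks: float   # offset from the median 4-chunk capacity
    technical_literacy: float   # 0-1
    price_sensitivity: float    # 0-1

SKEPTICAL_EVALUATOR = Archetype(
    name="Skeptical evaluator",
    trust_baseline=22,
    patience_s=5.4,
    regulatory_focus=+0.6,      # prevention-focus +0.6, as quoted above
    wm_headroom_chunks=-1.0,    # one chunk below median
    technical_literacy=0.8,     # illustrative; not quoted in this section
    price_sensitivity=0.7,      # illustrative; not quoted in this section
)
```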
WhyIQ vs a one-shot LLM scan
The shape of the output looks similar. The mechanism is not. Eight dimensions where the difference matters.
What “validated” means here
Validation is a methodology, not a marketing word. Every scan in the validation cohort was reviewed against the rendered page: the headline the engine extracted vs the headline a human reader would name first; the CTA detected vs the CTA actually present; trust signals enumerated vs trust signals on the page. 89.5% is the extraction-correctness floor.
Variance is measured by running the same URL twice on a fixed engine version, with randomised visitor seeds. Scores converge to within 2.1% absolute. That matches test-retest expectations for any panel-based methodology.
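A sketch of that test-retest check as described: same URL, fixed engine version, randomised visitor seeds. `run_scan` is a hypothetical stand-in for the engine's entry point, which this page does not specify.

```python
import random

def test_retest(url: str, run_scan, tolerance: float = 2.1) -> bool:
    """Score the same URL twice with independent visitor seeds on a fixed
    engine version; the two scores should agree within the published
    tolerance. `run_scan(url, seed)` is a hypothetical stand-in."""
    a = run_scan(url, seed=random.randrange(2**32))
    b = run_scan(url, seed=random.randrange(2**32))
    return abs(a - b) <= tolerance
```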
Independent corroboration: Zheng et al. 2023 found LLM evaluation reaches 80%+ agreement with human reviewers on structured annotation tasks. Baymard 2025 reports 95% accuracy for AI-assisted UX evaluation when calibrated against domain research, vs 50–75% for generic prompting.
A citation-readiness score, not a citation count
The AI Citability Index measures how citation-ready your page is for AI-generated answers. It scores eight dimensions: structured-data quality, factual density, claim-with-evidence pairing, topical authority, source-graph completeness, heading and answer clarity, crawler accessibility, and content freshness. It is not a count of actual citations in production LLM responses. We do not claim it is. The methodology draws on Aggarwal et al. (KDD 2024) on Generative Engine Optimization, which showed structured content and authoritative citations correlate with visibility in AI-generated responses.
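A sketch of what the eight-dimension aggregation can look like. The flat weights below are deliberate placeholders: the real per-signal weights and dimension formula are exactly the part WhyIQ does not publish (see below).

```python
# Placeholder weights only; the production weights are proprietary.
DIMENSIONS = {
    "structured_data_quality": 1.0,
    "factual_density": 1.0,
    "claim_evidence_pairing": 1.0,
    "topical_authority": 1.0,
    "source_graph_completeness": 1.0,
    "heading_answer_clarity": 1.0,
    "crawler_accessibility": 1.0,
    "content_freshness": 1.0,
}

def citability_index(scores: dict[str, float]) -> float:
    """Weighted mean of eight 0-100 dimension scores. Illustrative only:
    the real formula is not published."""
    total_weight = sum(DIMENSIONS.values())
    return sum(w * scores[d] for d, w in DIMENSIONS.items()) / total_weight
```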
How the scoring stays current
The platform layer changes faster than CRO frameworks do. Anthropic split ClaudeBot from Claude-SearchBot in February 2026. OpenAI introduced OAI-SearchBot ahead of ChatGPT search. Perplexity ships changelog-level updates to its citation logic on a near-monthly cadence.
WhyIQ separates two layers: a knowledge layer (which platform behaves how) and an engine layer (the scoring algorithm). Platform changes ship in the knowledge layer without touching scoring code. Every scan stamps the config version it was scored against, on the report.
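A minimal sketch of the split: platform knowledge ships as versioned data, the scoring code stays untouched, and every report carries both stamps. All identifiers here are illustrative.

```python
from dataclasses import dataclass

ENGINE_VERSION = "2.8.x"  # scoring algorithm: changes rarely

@dataclass(frozen=True)
class KnowledgeConfig:
    version: str                        # bumped when a platform changes behavior
    crawler_allowlist: tuple[str, ...]

# A platform change ships as a new config; the engine code is untouched.
KNOWLEDGE = KnowledgeConfig(
    version="2026-05",
    crawler_allowlist=("ClaudeBot", "Claude-SearchBot", "OAI-SearchBot",
                       "PerplexityBot"),
)

def stamp_report(report: dict) -> dict:
    """Attach the engine and knowledge versions a scan was scored against,
    so two reports are only comparable when the stamps match."""
    return {**report,
            "engine_version": ENGINE_VERSION,
            "knowledge_version": KNOWLEDGE.version}
```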
Signals split into two tiers. Tier 1 covers stable structural signals grounded in peer-reviewed or large-corpus research; these change on the order of years. Tier 2 covers tactical signals that evolve as platforms ship updates and ranking studies replicate; it is reviewed quarterly against the BrightEdge, Semrush 10M-keyword, AirOps 548K-page, and Princeton GEO studies, plus the platform vendors' own published guidance. The signal families under review (a tiering sketch follows the list):
- Crawler accessibility (robots, allow-lists)
- Content freshness signals
- Statistical and factual density
- Author attribution and entity graph
- FAQ quality and answer clarity
- Heading structure and passage shape
- Schema coverage and rich-result eligibility
- Per-platform citation cohort rules
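A minimal sketch of how tier tags can drive those review cadences; the tier assignments below are illustrative guesses, not WhyIQ's actual mapping.

```python
from enum import Enum

class Tier(Enum):
    STABLE = "reviewed on the order of years"   # peer-reviewed grounding
    TACTICAL = "reviewed quarterly"             # platform-driven

# Illustrative assignments only; the real mapping is not published.
SIGNALS = {
    "heading_structure": Tier.STABLE,
    "statistical_density": Tier.STABLE,
    "schema_coverage": Tier.TACTICAL,
    "per_platform_citation_rules": Tier.TACTICAL,
}

def quarterly_review_queue() -> list[str]:
    """Signals due for the quarterly re-check against the ranking studies."""
    return [name for name, tier in SIGNALS.items() if tier is Tier.TACTICAL]
```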
What we deliberately do not publish: the per-signal weights, the dimension formula, or the proprietary signal-extraction logic. The reason is symmetric to why search-ranking factors are not public. A published weight becomes a target the moment it ships, and targets get gamed. The methodology is open. The coefficients are not.
Browser-measured, not inferred
The accessibility scorer uses Chrome DevTools Protocol, the same data source as Lighthouse. That gives us browser-computed accessible names, roles, and contrast ratios. Every finding ties to a specific element with a measured value. No inferred issues. Sampled WCAG 2.2 AA criteria, with the sample size disclosed on each report. Mapped to ADA, EAA, Section 508, UK Equality Act, plus 10+ jurisdictions.
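A minimal sketch of the browser-measured approach using Playwright's CDP session: pull the browser-computed accessibility tree rather than inferring roles and names from raw HTML. The flagging rule at the end is illustrative, not WhyIQ's actual scorer.

```python
# Sketch: read browser-computed accessible names and roles over the Chrome
# DevTools Protocol, the same data source Lighthouse uses.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")

    cdp = page.context.new_cdp_session(page)
    cdp.send("Accessibility.enable")
    tree = cdp.send("Accessibility.getFullAXTree")

    # Interactive nodes whose browser-computed accessible name is empty:
    # each finding ties to a specific element with a measured value.
    for node in tree["nodes"]:
        role = node.get("role", {}).get("value")
        name = node.get("name", {}).get("value", "")
        if role in ("button", "link", "textbox") and not name.strip():
            print(f"missing accessible name: {role} (AXNode {node['nodeId']})")

    browser.close()
```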
Color Contrast
Text and interactive contrast ratios measured against WCAG AA/AAA thresholds.
Semantic Structure
Heading hierarchy, landmark regions, and document outline for screen-reader navigation.
Form Accessibility
Label associations, error identification, and input purpose disclosure.
Image and Media
Alt text presence and quality, decorative image handling, captions.
Keyboard and Navigation
Focus indicators, tab order, and keyboard operability across interactive elements.
Interactive Elements
Touch target sizes, button labels, ARIA role usage.
Legal Compliance Risk
Findings mapped to ADA, EAA, Section 508, UK Equality Act, plus 10+ jurisdictions.
Visual Cross-check
Screenshot analysis flags low-contrast text and small touch targets that DOM checks miss.
What WhyIQ deliberately does not claim
A research-calibrated tool earns trust by being explicit about its boundaries. WhyIQ does not replace real user testing. It does not measure actual LLM citations. It does not predict A/B test outcomes with statistical certainty. It does not score taste. It does not adjudicate brand strategy. It surfaces the structural clarity, trust-formation, and accessibility issues that should be debugged before traffic, so the test budget goes to the questions a simulation cannot answer.
Methodology questions, answered
How is WhyIQ different from ChatGPT or another AI page reviewer?
A one-shot LLM generates its review at inference time. WhyIQ runs 50 simulated visitors against research-bounded coefficients; the LLM only renders the simulation's output.
What does “research-calibrated” actually mean?
Every numeric coefficient in the engine is bounded by a published study, drawn from a 200+ paper corpus across CRO, behavioral science, and ecommerce.
How accurate is the extraction?
89.5%, measured by human review of engine-extracted headlines, CTAs, and trust signals against the rendered page.
How reproducible are the scores?
The same URL run twice on a fixed engine version, with randomised visitor seeds, converges to within 2.1% absolute.
What does pre-traffic CRO mean here?
Debugging structural clarity, trust formation, and accessibility before traffic arrives, so the test budget goes to the questions a simulation cannot answer.
What are the 5 frameworks the engine layers?
Cognitive Load Theory, dual-process cognition, regulatory focus, prospect theory, and the Elaboration Likelihood Model.
Source corpus
- Cowan, N. (2001). The magical number 4 in short-term memory. Behavioral and Brain Sciences, 24(1), 87–114.
- Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.
- Sweller, J. et al. (2019). Cognitive architecture and instructional design: 20 years later. Educational Psychology Review, 31, 261–292.
- Hick, W. E. (1952). On the rate of gain of information. Quarterly Journal of Experimental Psychology, 4(1).
- Hyman, R. (1953). Stimulus information as a determinant of reaction time. Journal of Experimental Psychology, 45(3).
- Higgins, E. T. (1997). Beyond pleasure and pain. American Psychologist, 52(12).
- Kahneman, D. & Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47(2), 263–291.
- Petty, R. E. & Cacioppo, J. T. (1986). The Elaboration Likelihood Model of persuasion. Advances in Experimental Social Psychology, 19, 123–205.
- Kim, D. J. & Peterson, R. A. (2017). Trust meta-analysis (150 studies). Decision Support Systems.
- Nielsen Norman Group. Eye-tracking attention distribution research.
- Aggarwal, P. et al. (2024). GEO: Generative Engine Optimization. KDD 2024.
- Eisend, M. & Tarrahi, M. J. (2022). Persuasion knowledge meta-analysis (148 papers). Journal of Marketing.
- Browne, G. J. & Jones, J. (2017). Urgency effects on online decision making (6,700 experiments). JMIS, 34(2).
- Bottger, T. et al. (2023). Distraction and online behaviour. Scientific Reports.
- Scheirer, J. et al. (2002). Frustrating the user on purpose. Interacting with Computers, 14(2).
- Zheng, C. et al. (2023). Is ChatGPT a good annotator? arXiv:2310.14424.
- Baymard Institute (2025). AI-assisted UX evaluation accuracy benchmark.
- Edelman Trust Institute (2025). Edelman Trust Barometer.
Run the engine on your own page
Same engine, same calibrated corpus, same parameter table this page describes. Paste a URL, get the audit trail behind every finding. Free first scan, no account needed.
~2 minutes. No card. Engine V2.8.x. Methodology open on this page.