A behavioral simulation engine, calibrated against 200+ peer-reviewed papers.
Last updated: May 2026
Engine V2.8.x · 33 research-validated changes

WhyIQ is a behavioral simulation engine, not an LLM that “looks at” your landing page. 50 simulated visitors, drawn from a 21-archetype pool, traverse your page under five stacked psychological frameworks: Cognitive Load Theory, dual-process cognition, regulatory focus, prospect theory, and the Elaboration Likelihood Model. Every coefficient is anchored to a peer-reviewed paper, drawn from a 200+ paper corpus across CRO, behavioral science, and ecommerce calibration. Extraction accuracy is 89.5%. Scan-to-scan variance is under 2.1% on a fixed engine version. This page is the audit trail behind every score.
The numbers a one-shot LLM scan cannot produce
If a tool cannot tell you the source of a single coefficient, the output is a vibe. Below is the calibration surface area WhyIQ commits to, with the citation that produced each number.
- 200+ peer-reviewed papers
- 89.5% extraction accuracy
- <2.1% scan-to-scan variance
- 50 simulated visitors / 21-archetype pool
- 5 stacked frameworks
- 33 research-validated code changes
Every coefficient has a paper behind it
A typical AI page reviewer generates output at inference time. WhyIQ generates it at parameter time. The LLM only renders the simulation. Trust thresholds, patience curves, cognitive load capacity, frustration compounding, persuasion-knowledge dampeners: each is a numeric coefficient bounded by a published study.
When Cowan (2001) put working memory at 4 chunks, that is the gate the engine uses for above-fold interactive density. When Eisend and Tarrahi (2022) measured persuasion knowledge across 148 papers, that is the dampener applied to urgency, scarcity, and social proof signals. The model is not learning. It is carrying research.
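To make that concrete, here is a minimal sketch of what a research-bounded coefficient table can look like. The structure is the point; the field names and the dampener value are illustrative stand-ins, and only the cited anchors (Cowan 2001, Eisend and Tarrahi 2022) come from the corpus.

```python
# Illustrative sketch: a coefficient registry where every parameter carries
# the citation that bounds it. Names and the dampener value are hypothetical;
# the cited anchors (Cowan 2001, Eisend & Tarrahi 2022) are from the corpus.
from dataclasses import dataclass

@dataclass(frozen=True)
class Coefficient:
    value: float
    low: float       # lower bound from the source study
    high: float      # upper bound from the source study
    citation: str

COEFFICIENTS = {
    # Working-memory gate for above-fold interactive density (3-5 chunks).
    "wm_chunk_capacity": Coefficient(4.0, 3.0, 5.0, "Cowan (2001)"),
    # Dampener on urgency, scarcity, and social-proof signals.
    "persuasion_knowledge_dampener": Coefficient(
        0.72, 0.60, 0.85, "Eisend & Tarrahi (2022)"),  # value illustrative
}

def bounded(name: str) -> float:
    """Return a coefficient, refusing values that drift outside the bounds
    reported in the paper that anchors it."""
    c = COEFFICIENTS[name]
    assert c.low <= c.value <= c.high, f"{name} violates {c.citation}"
    return c.value
```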
Five frameworks, layered. Not name-dropped.
Most AI page tools cite “behavioral psychology” once and move on. The engine layers five frameworks because each governs a different inflection on the visitor's path from glance to click.
Cognitive Load Theory (CLT)
Above the fold, visitors can hold roughly four interactive elements in working memory before clarity collapses.
Applied to: Above-fold interactive density gate, hero scannability.
Cowan 2001; Sweller 2019.
Dual-process cognition
Visitors enter on heuristics. The engine flags pages that lose them before analytical evaluation ever engages.
Applied to: Per-visitor mode tracking, mode-transition triggers.
Kahneman 2011.
Hick-Hyman Law
Decision time scales log-linearly with option count. Drives CTA-density and nav-clutter penalties.
Applied to: CTA density, nav clutter, choice overload.
Hick 1952; Hyman 1953.
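The Hick-Hyman relationship is commonly written RT = a + b·log2(n + 1). A minimal sketch of how a CTA-density penalty can fall out of it; the constants are illustrative placeholders, not the engine's calibrated values.

```python
import math

def decision_time(n_options: int, a: float = 0.2, b: float = 0.15) -> float:
    """Hick-Hyman: expected decision time in seconds for n equally likely
    options. a is a non-decision baseline, b a per-bit processing cost;
    both constants are illustrative, not WhyIQ's calibrated values."""
    return a + b * math.log2(n_options + 1)

print(decision_time(1))  # ~0.35s: one clear CTA
print(decision_time(7))  # ~0.65s: competing CTAs plus a cluttered nav
```

Doubling the option count does not double decision time, but it steadily eats into a visitor's patience budget.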
NN/g attention distribution
Attention shares for above-fold, mid-page, and below-fold content, drawn from eye-tracking studies. Weights the priority of confusion findings.
Applied to: Finding severity, prioritisation, fold weighting.
Nielsen Norman Group.
Kim & Peterson trust hierarchy
Establishes the rank order of trust signals: third-party reviews above brand claims above visual polish.
Applied to: Trust formation scoring, signal weighting.
Kim and Peterson 2017.
Regulatory focus + prospect theory + ELM
Three additional spines. Regulatory focus per archetype, prospect theory in pricing framing, ELM in persuasion routing.
Applied to: Archetype calibration, pricing risk framing, persuasion route choice.
Higgins 1997; Kahneman and Tversky 1979; Petty and Cacioppo 1986.
Archetypes are parameter sets, not personas
Personas are marketing fiction. Archetypes in WhyIQ are vectors: each one is a calibrated combination of patience threshold, trust baseline, working-memory headroom, regulatory focus, technical literacy, and price sensitivity.
A “Skeptical evaluator” is not a vibe. It is a profile: trust 22, patience 5.4 seconds, prevention-focus +0.6, working-memory one chunk below median. When the engine reports that 14 of 50 visitors disengaged at the pricing block, those 14 share a quantifiable parameter signature. The report names it.
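A minimal sketch of that parameter signature as a data structure. The field names mirror the dimensions listed above, the quoted numbers are the ones in this section, and the last two values are illustrative fill-ins.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Archetype:
    name: str
    trust_baseline: float       # starting trust (assumed 0-100 scale)
    patience_s: float           # seconds before disengagement pressure
    regulatory_focus: float     # positive = prevention, negative = promotion
    wm_headroom_chunks: float   # offset from the median 4-chunk capacity
    technical_literacy: float   # 0-1
    price_sensitivity: float    # 0-1

SKEPTICAL_EVALUATOR = Archetype(
    name="Skeptical evaluator",
    trust_baseline=22,
    patience_s=5.4,
    regulatory_focus=+0.6,      # prevention-focus +0.6, as quoted above
    wm_headroom_chunks=-1.0,    # one chunk below median
    technical_literacy=0.8,     # illustrative; not quoted in this section
    price_sensitivity=0.7,      # illustrative; not quoted in this section
)
```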
WhyIQ vs a one-shot LLM scan
The shape of the output looks similar. The mechanism is not. Eight dimensions where the difference matters.
What “validated” means here
Validation is a methodology, not a marketing word. Every scan in the validation cohort was reviewed against the rendered page: the headline the engine extracted vs the headline a human reader would name first; the CTA detected vs the CTA actually present; trust signals enumerated vs trust signals on the page. 89.5% is the extraction-correctness floor.
Variance is measured by running the same URL twice on a fixed engine version, with randomised visitor seeds. Scores converge to within 2.1% absolute. That matches test-retest expectations for any panel-based methodology.
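A sketch of that test-retest check as described: same URL, fixed engine version, randomised visitor seeds. `run_scan` is a hypothetical stand-in for the engine's entry point, which this page does not specify.

```python
import random

def test_retest(url: str, run_scan, tolerance: float = 2.1) -> bool:
    """Score the same URL twice with independent visitor seeds on a fixed
    engine version; the two scores should agree within the published
    tolerance. `run_scan(url, seed)` is a hypothetical stand-in."""
    a = run_scan(url, seed=random.randrange(2**32))
    b = run_scan(url, seed=random.randrange(2**32))
    return abs(a - b) <= tolerance
```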
Independent corroboration: Zheng et al. 2023 found LLM evaluation reaches 80%+ agreement with human reviewers on structured annotation tasks. Baymard 2025 reports 95% accuracy for AI-assisted UX evaluation when calibrated against domain research, vs 50–75% for generic prompting.
A citation-readiness score, not a citation count
The AI Citability Index measures how citation-ready your page is for AI-generated answers. It scores eight dimensions: structured-data quality, factual density, claim-with-evidence pairing, topical authority, source-graph completeness, heading and answer clarity, crawler accessibility, and content freshness. It is not a count of actual citations in production LLM responses. We do not claim it is. The methodology draws on Aggarwal et al. (KDD 2024) on Generative Engine Optimization, which showed structured content and authoritative citations correlate with visibility in AI-generated responses.
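A sketch of what the eight-dimension aggregation can look like. The flat weights below are deliberate placeholders: the real per-signal weights and dimension formula are exactly the part WhyIQ does not publish (see below).

```python
# Placeholder weights only; the production weights are proprietary.
DIMENSIONS = {
    "structured_data_quality": 1.0,
    "factual_density": 1.0,
    "claim_evidence_pairing": 1.0,
    "topical_authority": 1.0,
    "source_graph_completeness": 1.0,
    "heading_answer_clarity": 1.0,
    "crawler_accessibility": 1.0,
    "content_freshness": 1.0,
}

def citability_index(scores: dict[str, float]) -> float:
    """Weighted mean of eight 0-100 dimension scores. Illustrative only:
    the real formula is not published."""
    total_weight = sum(DIMENSIONS.values())
    return sum(w * scores[d] for d, w in DIMENSIONS.items()) / total_weight
```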
How the scoring stays current
The platform layer changes faster than CRO frameworks do. Anthropic split ClaudeBot from Claude-SearchBot in February 2026. OpenAI introduced OAI-SearchBot ahead of ChatGPT search. Perplexity ships changelog-level updates to its citation logic on a near-monthly cadence.
WhyIQ separates two layers: a knowledge layer (which platform behaves how) and an engine layer (the scoring algorithm). Platform changes ship in the knowledge layer without touching scoring code. Every scan stamps the config version it was scored against, on the report.
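A minimal sketch of the split: platform knowledge ships as versioned data, the scoring code stays untouched, and every report carries both stamps. All identifiers here are illustrative.

```python
from dataclasses import dataclass

ENGINE_VERSION = "2.8.x"  # scoring algorithm: changes rarely

@dataclass(frozen=True)
class KnowledgeConfig:
    version: str                        # bumped when a platform changes behavior
    crawler_allowlist: tuple[str, ...]

# A platform change ships as a new config; the engine code is untouched.
KNOWLEDGE = KnowledgeConfig(
    version="2026-05",
    crawler_allowlist=("ClaudeBot", "Claude-SearchBot", "OAI-SearchBot",
                       "PerplexityBot"),
)

def stamp_report(report: dict) -> dict:
    """Attach the engine and knowledge versions a scan was scored against,
    so two reports are only comparable when the stamps match."""
    return {**report,
            "engine_version": ENGINE_VERSION,
            "knowledge_version": KNOWLEDGE.version}
```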
Signals split into two tiers. Tier 1 covers stable structural signals grounded in peer-reviewed or large-corpus research; these change on the order of years. Tier 2 covers tactical signals that evolve as platforms ship updates and ranking studies replicate; it is reviewed quarterly against the BrightEdge, Semrush 10M-keyword, AirOps 548K-page, and Princeton GEO studies, plus the platform vendors' own published guidance. The signal families under review (a tiering sketch follows the list):
- Crawler accessibility (robots, allow-lists)
- Content freshness signals
- Statistical and factual density
- Author attribution and entity graph
- FAQ quality and answer clarity
- Heading structure and passage shape
- Schema coverage and rich-result eligibility
- Per-platform citation cohort rules
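A minimal sketch of how tier tags can drive those review cadences; the tier assignments below are illustrative guesses, not WhyIQ's actual mapping.

```python
from enum import Enum

class Tier(Enum):
    STABLE = "reviewed on the order of years"   # peer-reviewed grounding
    TACTICAL = "reviewed quarterly"             # platform-driven

# Illustrative assignments only; the real mapping is not published.
SIGNALS = {
    "heading_structure": Tier.STABLE,
    "statistical_density": Tier.STABLE,
    "schema_coverage": Tier.TACTICAL,
    "per_platform_citation_rules": Tier.TACTICAL,
}

def quarterly_review_queue() -> list[str]:
    """Signals due for the quarterly re-check against the ranking studies."""
    return [name for name, tier in SIGNALS.items() if tier is Tier.TACTICAL]
```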
What we deliberately do not publish: the per-signal weights, the dimension formula, or the proprietary signal-extraction logic. The reason is symmetric to why search-ranking factors are not public. A published weight becomes a target the moment it ships, and targets get gamed. The methodology is open. The coefficients are not.
Browser-measured, not inferred
The accessibility scorer uses Chrome DevTools Protocol, the same data source as Lighthouse. That gives us browser-computed accessible names, roles, and contrast ratios. Every finding ties to a specific element with a measured value. No inferred issues. Sampled WCAG 2.2 AA criteria, with the sample size disclosed on each report. Mapped to ADA, EAA, Section 508, UK Equality Act, plus 10+ jurisdictions.
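A minimal sketch of the browser-measured approach using Playwright's CDP session: pull the browser-computed accessibility tree rather than inferring roles and names from raw HTML. The flagging rule at the end is illustrative, not WhyIQ's actual scorer.

```python
# Sketch: read browser-computed accessible names and roles over the Chrome
# DevTools Protocol, the same data source Lighthouse uses.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")

    cdp = page.context.new_cdp_session(page)
    cdp.send("Accessibility.enable")
    tree = cdp.send("Accessibility.getFullAXTree")

    # Interactive nodes whose browser-computed accessible name is empty:
    # each finding ties to a specific element with a measured value.
    for node in tree["nodes"]:
        role = node.get("role", {}).get("value")
        name = node.get("name", {}).get("value", "")
        if role in ("button", "link", "textbox") and not name.strip():
            print(f"missing accessible name: {role} (AXNode {node['nodeId']})")

    browser.close()
```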
Color Contrast
Text and interactive contrast ratios measured against WCAG AA/AAA thresholds.
Semantic Structure
Heading hierarchy, landmark regions, and document outline for screen-reader navigation.
Form Accessibility
Label associations, error identification, and input purpose disclosure.
Image and Media
Alt text presence and quality, decorative image handling, captions.
Keyboard and Navigation
Focus indicators, tab order, and keyboard operability across interactive elements.
Interactive Elements
Touch target sizes, button labels, ARIA role usage.
Legal Compliance Risk
Findings mapped to ADA, EAA, Section 508, UK Equality Act, plus 10+ jurisdictions.
Visual Cross-check
Screenshot analysis flags low-contrast text and small touch targets that DOM checks miss.
What WhyIQ deliberately does not claim
A research-calibrated tool earns trust by being explicit about its boundaries. WhyIQ does not replace real user testing. It does not measure actual LLM citations. It does not predict A/B test outcomes with statistical certainty. It does not score taste. It does not adjudicate brand strategy. It surfaces the structural clarity, trust-formation, and accessibility issues that should be debugged before traffic, so the test budget goes to the questions a simulation cannot answer.
Methodology questions, answered
How is WhyIQ different from ChatGPT or another AI page reviewer?
A one-shot LLM generates its review at inference time. WhyIQ runs 50 simulated visitors against research-bounded coefficients; the LLM only renders the simulation's output.
What does “research-calibrated” actually mean?
Every numeric coefficient in the engine is bounded by a published study, drawn from a 200+ paper corpus across CRO, behavioral science, and ecommerce.
How accurate is the extraction?
89.5%, measured by human review of engine-extracted headlines, CTAs, and trust signals against the rendered page.
How reproducible are the scores?
The same URL run twice on a fixed engine version, with randomised visitor seeds, converges to within 2.1% absolute.
What does pre-traffic CRO mean here?
Debugging structural clarity, trust formation, and accessibility before traffic arrives, so the test budget goes to the questions a simulation cannot answer.
What are the 5 frameworks the engine layers?
Cognitive Load Theory, dual-process cognition, regulatory focus, prospect theory, and the Elaboration Likelihood Model.
Source corpus
- Cowan, N. (2001). The magical number 4 in short-term memory. Behavioral and Brain Sciences, 24(1), 87–114.
- Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.
- Sweller, J. et al. (2019). Cognitive architecture and instructional design: 20 years later. Educational Psychology Review, 31, 261–292.
- Hick, W. E. (1952). On the rate of gain of information. Quarterly Journal of Experimental Psychology, 4(1).
- Hyman, R. (1953). Stimulus information as a determinant of reaction time. Journal of Experimental Psychology, 45(3).
- Higgins, E. T. (1997). Beyond pleasure and pain. American Psychologist, 52(12).
- Kahneman, D. & Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47(2), 263–291.
- Petty, R. E. & Cacioppo, J. T. (1986). The Elaboration Likelihood Model of persuasion. Advances in Experimental Social Psychology, 19, 123–205.
- Kim, D. J. & Peterson, R. A. (2017). Trust meta-analysis (150 studies). Decision Support Systems.
- Nielsen Norman Group. Eye-tracking attention distribution research.
- Aggarwal, P. et al. (2024). GEO: Generative Engine Optimization. KDD 2024.
- Eisend, M. & Tarrahi, M. J. (2022). Persuasion knowledge meta-analysis (148 papers). Journal of Marketing.
- Browne, G. J. & Jones, J. (2017). Urgency effects on online decision making (6,700 experiments). JMIS, 34(2).
- Bottger, T. et al. (2023). Distraction and online behaviour. Scientific Reports.
- Scheirer, J. et al. (2002). Frustrating the user on purpose. Interacting with Computers, 14(2).
- Zheng, C. et al. (2023). Is ChatGPT a good annotator? arXiv:2310.14424.
- Baymard Institute (2025). AI-assisted UX evaluation accuracy benchmark.
- Edelman Trust Institute (2025). Edelman Trust Barometer.
Run the engine on your own page
Same engine, same calibrated corpus, same parameter table this page describes. Paste a URL, get the audit trail behind every finding. Free first scan, no account needed.
~2 minutes. No card. Engine V2.8.x. Methodology open on this page.