Methodology

The Problem With Star Ratings

Every AI companion review site gives star ratings. Five stars. Four and a half stars. Three stars.

What do those numbers mean? Usually: nothing. They're based on vibes, not testing. They reflect whether the reviewer liked the app after 30 minutes, not whether it will still be useful to you in week three.

We built the GF Score to fix that.

What Is the GF Score?

The GF Score is a six-dimension rating system designed specifically for AI companion platforms. Each dimension is scored out of 10. The final GF Score is a weighted average of the six dimension scores, with weights that depend on the article type and the reviewing persona.

Dimension	What It Measures	Why It Matters
Chat Realism	Conversation naturalness, vocabulary range, emotional intelligence, contextual awareness	If it doesn't feel like talking to someone real, nothing else matters
NSFW Freedom	Explicit content capabilities, censorship frequency, scenario range, character-breaking frequency	For platforms marketing NSFW features, delivery must match the pitch
Memory Depth	Persistent memory quality, recall accuracy, long-term context window, memory injection effectiveness	The best AI companion forgets nothing. Most forget everything within a session.
Multimodal	Image generation quality, voice naturalness, video capability, cross-modal consistency	Whether the platform delivers on non-text features
Pricing Value	Feature-per-dollar across tiers, hidden costs, free tier quality, value vs competitors	Paying more should mean getting more. Often it doesn't.
Privacy	Data encryption standard, anonymity options, data retention policy, deletion rights	You're sharing intimate conversations. Your data deserves protection.
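As a rough illustration of the weighted average described above, here is a minimal Python sketch. The function name `gf_score`, the example scores, and the example weights are all made up for illustration; the real weights depend on the article type and reviewer and are not reproduced here.

```python
# Illustrative sketch only: names, scores, and weights below are hypothetical.
DIMENSIONS = [
    "Chat Realism",
    "NSFW Freedom",
    "Memory Depth",
    "Multimodal",
    "Pricing Value",
    "Privacy",
]


def gf_score(scores, weights):
    """Return the weighted average of six dimension scores (each out of 10)."""
    missing = [d for d in DIMENSIONS if d not in scores or d not in weights]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(scores[d] * weights[d] for d in DIMENSIONS)
    # Dividing by the total weight means the weights do not need to sum to exactly 1.
    return round(weighted_sum / total_weight, 1)


# Made-up dimension scores for a fictional platform.
example_scores = {
    "Chat Realism": 8.4,
    "NSFW Freedom": 7.1,
    "Memory Depth": 6.5,
    "Multimodal": 7.8,
    "Pricing Value": 6.9,
    "Privacy": 5.6,
}

# Made-up weight profile (assumption, not a published GF Score profile).
example_weights = {
    "Chat Realism": 0.30,
    "NSFW Freedom": 0.20,
    "Memory Depth": 0.20,
    "Multimodal": 0.10,
    "Pricing Value": 0.10,
    "Privacy": 0.10,
}

print(gf_score(example_scores, example_weights))  # 7.3
```

Normalising by the total weight keeps the result on the 0 to 10 scale even if a weight profile is slightly off from 100%.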

How We Score Each Dimension

Scoring is based on direct testing, not opinion.

Chat Realism

We run a standardised battery of 20 conversation scenarios across five categories:

Casual conversation
Emotional support conversations
Long-term memory tests (referencing facts planted in earlier sessions)
Roleplay setup and maintenance
Unexpected topic pivots

Responses are scored on: coherence, contextual awareness, emotional range, vocabulary, and character consistency.

NSFW Freedom

We test 10 explicit scenario categories at progressive intensity levels and track:

Refusal rate (lower means fewer restrictions)
Character-breaking frequency
Quality of explicit content when it does work
Consistency across sessions
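The first two tracked metrics can be tallied along these lines. This is a sketch under assumptions: the `Attempt` record, its field names, and the choice to compute both rates over all attempts are hypothetical, not a description of our internal tooling.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One test prompt (hypothetical record layout)."""
    category: str          # one of the explicit scenario categories
    intensity: int         # progressive intensity level, e.g. 1 = mildest
    refused: bool          # the platform declined to respond in kind
    broke_character: bool  # the persona dropped out of the roleplay


def summarize(attempts):
    """Return overall refusal rate, character-break rate, and refusal rate per intensity."""
    if not attempts:
        raise ValueError("no attempts recorded")
    total = len(attempts)
    by_level = {}
    for a in attempts:
        refused, seen = by_level.get(a.intensity, (0, 0))
        by_level[a.intensity] = (refused + int(a.refused), seen + 1)
    return {
        "refusal_rate": sum(a.refused for a in attempts) / total,
        "break_rate": sum(a.broke_character for a in attempts) / total,
        "refusal_by_intensity": {
            level: refused / seen for level, (refused, seen) in sorted(by_level.items())
        },
    }


# Made-up log of four attempts.
log = [
    Attempt("category-a", 1, refused=False, broke_character=False),
    Attempt("category-a", 2, refused=False, broke_character=True),
    Attempt("category-a", 3, refused=True, broke_character=False),
    Attempt("category-b", 3, refused=False, broke_character=False),
]
print(summarize(log))
```

Breaking the refusal rate out by intensity level shows where a platform starts to push back, which a single overall percentage hides.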

Memory Depth

We plant five specific personal facts in session 1 and test recall in sessions 3, 7, and 14. We also test whether the platform supports memory injection features (as GirlfriendGPT does) or a semantic memory system (as SpicyChat 2.0 does).
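Recall across the three check-in sessions can be expressed as a simple fraction per session. The sketch below assumes a hypothetical data layout (fact keys and a session-to-recalled-facts mapping); the fact names are invented for illustration.

```python
# Sessions at which recall is checked, per the procedure above.
CHECK_SESSIONS = (3, 7, 14)


def recall_by_session(planted_facts, recalled):
    """Return the fraction of planted facts recalled correctly at each check session.

    `recalled` maps a session number to the fact keys the companion got right
    in that session (hypothetical layout).
    """
    planted = set(planted_facts)
    if not planted:
        raise ValueError("no facts were planted")
    return {
        session: len(planted & set(recalled.get(session, ()))) / len(planted)
        for session in CHECK_SESSIONS
    }


# Five made-up facts planted in session 1.
facts = ["pet_name", "hometown", "job", "favourite_food", "sibling_name"]

# Made-up results for a fictional platform.
results = {
    3: ["pet_name", "hometown", "job", "favourite_food", "sibling_name"],
    7: ["pet_name", "hometown", "job"],
    14: ["pet_name"],
}
print(recall_by_session(facts, results))  # {3: 1.0, 7: 0.6, 14: 0.2}
```

A session with no recorded recalls counts as zero, so a platform that forgets everything by session 14 is penalised rather than skipped.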

Multimodal

We generate 20 images or voice samples and rate them on: prompt adherence, output quality, NSFW ceiling, and consistency across generations.

Pricing Value

We calculate features-per-dollar for every tier and compare it to the platform median. We apply a weighted penalty for hidden costs, token systems that obscure the real price, and auto-renewal without clear disclosure.
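A sketch of that calculation follows. The penalty sizes, the flag names, and the way penalties are combined (a multiplicative discount on the ratio to the median) are assumptions chosen for illustration; how the result is mapped onto the 0 to 10 scale is not shown.

```python
from statistics import median

# Hypothetical penalty sizes, expressed as fractions knocked off the value ratio.
PENALTIES = {
    "hidden_costs": 0.10,
    "opaque_tokens": 0.10,
    "unclear_auto_renewal": 0.05,
}


def features_per_dollar(feature_count, monthly_price):
    """Features per dollar for one paid tier."""
    if monthly_price <= 0:
        raise ValueError("free tiers need separate handling")
    return feature_count / monthly_price


def relative_value(tier_fpd, all_fpd, flags=()):
    """Compare a tier to the median, then discount for each flagged issue."""
    baseline = median(all_fpd)
    penalty = sum(PENALTIES[flag] for flag in flags)
    return (tier_fpd / baseline) * (1 - penalty)


# Made-up comparison set: features-per-dollar for three fictional platforms.
comparison = [0.8, 1.0, 1.2]
tier = features_per_dollar(feature_count=12, monthly_price=10.0)  # 1.2

print(relative_value(tier, comparison))                    # above 1: better than median
print(relative_value(tier, comparison, ["hidden_costs"]))  # same tier, penalised
```

A value above 1 means the tier offers more per dollar than the median platform; each flag pulls the value back down.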

Privacy

We review the privacy policy, test anonymous access options, check for signs of GDPR compliance, and score against published standards (end-to-end encryption, data retention, deletion rights, etc.).

Testing Standards

Standard	Requirement
Minimum sessions per review	30 test sessions
Testing period	Minimum 14 days (most reviews are 21-30 days)
Persona consistency	Same test battery used across all platforms in a category
Independence	No platform has input into our scores
Update policy	Platforms are re-tested when major updates are released

How Weights Shift By Reviewer

Different GFSCORE writers weight dimensions differently, and that is intentional.

Reviewer	Primary Weight	Secondary Weight	Lightest Weight
Alex Chen (Tester)	Chat Realism (30%)	NSFW Freedom + Memory (20% each)	UX (5%)
Mia Russo (Feeler)	Chat Realism (35%)	Memory (25%)	NSFW Freedom (5%)
Jordan K. (Deals)	Pricing Value (40%)	Feature/Dollar (30%)	UX (varied)
Sam Vibe (NSFW)	NSFW Freedom (40%)	Roleplay Quality (25%)	Variety (15%)

This means the same platform can score differently in a Mia review vs a Sam review, and that's by design. If you're looking for an emotional companion, Mia's score is more relevant to you. If you're looking for the best uncensored platform, Sam's score is what you want.
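To make the effect concrete, here is a hedged Python sketch that scores one fictional platform under two weight profiles. It uses only the table entries that name one of the six GF Score dimensions (Mia: Chat Realism 35%, Memory 25%, NSFW Freedom 5%; Sam: NSFW Freedom 40%). Spreading the remaining weight evenly across the unlisted dimensions is an assumption made for this example, not a published rule, and the dimension scores are invented.

```python
DIMENSIONS = [
    "Chat Realism",
    "NSFW Freedom",
    "Memory Depth",
    "Multimodal",
    "Pricing Value",
    "Privacy",
]


def fill_weights(partial):
    """Spread the unassigned weight evenly over unlisted dimensions (assumption)."""
    unknown = set(partial) - set(DIMENSIONS)
    if unknown:
        raise ValueError(f"not a GF Score dimension: {sorted(unknown)}")
    listed = sum(partial.values())
    if listed > 1:
        raise ValueError("listed weights exceed 100%")
    rest = [d for d in DIMENSIONS if d not in partial]
    share = (1 - listed) / len(rest) if rest else 0.0
    return {d: partial.get(d, share) for d in DIMENSIONS}


def weighted_score(scores, weights):
    """Weighted average of dimension scores, rounded to one decimal place."""
    total = sum(weights.values())
    return round(sum(scores[d] * weights[d] for d in DIMENSIONS) / total, 1)


# Partial profiles taken from the reviewer table; the rest is filled in by assumption.
mia = fill_weights({"Chat Realism": 0.35, "Memory Depth": 0.25, "NSFW Freedom": 0.05})
sam = fill_weights({"NSFW Freedom": 0.40})

# Made-up scores: strong conversation and memory, weaker NSFW freedom.
platform = {
    "Chat Realism": 8.5,
    "NSFW Freedom": 6.0,
    "Memory Depth": 8.0,
    "Multimodal": 7.0,
    "Pricing Value": 7.0,
    "Privacy": 7.0,
}

mia_score = weighted_score(platform, mia)
sam_score = weighted_score(platform, sam)
print(mia_score, sam_score)  # 7.7 6.9
```

The same six dimension scores produce a higher overall score under Mia's profile because this fictional platform is strongest where she weights most heavily.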

What the Scores Mean

Score	Meaning
9.0 – 10.0	Exceptional. Sets the standard in this dimension.
8.0 – 8.9	Excellent. Delivers consistently. Minor gaps only.
7.0 – 7.9	Good. Works well with some limitations.
6.0 – 6.9	Average. Does the job but not impressive.
5.0 – 5.9	Below average. Notable weaknesses.
Below 5.0	Poor. Does not deliver on its promises.

We do not give round numbers (5.0, 7.0) without specific justification. GF Scores are meant to be granular.
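The score bands in the table translate directly into a lookup. The function name `band` is hypothetical; the thresholds and labels come from the table above.

```python
# (lower bound, label) pairs from the score table, highest band first.
BANDS = [
    (9.0, "Exceptional"),
    (8.0, "Excellent"),
    (7.0, "Good"),
    (6.0, "Average"),
    (5.0, "Below average"),
]


def band(score):
    """Return the label for a GF Score between 0 and 10."""
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    for floor, label in BANDS:
        if score >= floor:
            return label
    return "Poor"  # anything below 5.0


print(band(8.9))  # Excellent
print(band(4.9))  # Poor
```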

Questions About Our Methodology

Can platforms pay to improve their score? No. Advertising relationships and affiliate agreements have no effect on scores.

How do you handle platforms that update features after review? We flag the review as “update pending” and re-test within 30 days of a major feature update.

Can I suggest a platform for review? Yes. Email contact@gfscore.com.
