Methodology

The Problem With Star Ratings

Every AI companion review site gives star ratings. Five stars. Four and a half stars. Three stars.

What do those numbers mean? Usually: nothing. They're based on vibes, not testing. They reflect whether the reviewer liked the app after 30 minutes, not whether it will still be useful to you in week three.

We built the GF Score to fix that.

What Is the GF Score?

The GF Score is a 6-dimension rating system designed specifically for AI companion platforms. Each dimension is scored out of 10. The final GF Score is a weighted average based on the article type and reviewing persona.

Dimension | What It Measures | Why It Matters
Chat Realism | Conversation naturalness, vocabulary range, emotional intelligence, contextual awareness | If it doesn't feel like talking to someone real, nothing else matters
NSFW Freedom | Explicit content capabilities, censorship frequency, scenario range, character breaking frequency | For platforms marketing NSFW features, delivery must match the pitch
Memory Depth | Persistent memory quality, recall accuracy, long-term context window, memory injection effectiveness | The best AI companion forgets nothing. Most forget everything within a session.
Multimodal | Image generation quality, voice naturalness, video capability, cross-modal consistency | Whether the platform delivers on non-text features
Pricing Value | Feature-per-dollar across tiers, hidden costs, free tier quality, value vs competitors | Paying more should mean getting more. Often it doesn't.
Privacy | Data encryption standard, anonymity options, data retention policy, deletion rights | You're sharing intimate conversations. Your data deserves protection.

How We Score Each Dimension

Scoring is based on direct testing, not opinion.

Chat Realism

We run a standardised battery of 20 conversation scenarios across five categories:

  • Casual conversation
  • Emotional support conversations
  • Long-term memory tests (referencing facts planted in earlier sessions)
  • Roleplay setup and maintenance
  • Unexpected topic pivots

Responses are scored on: coherence, contextual awareness, emotional range, vocabulary, and character consistency.

NSFW Freedom

We test 10 explicit scenario categories at progressive intensity levels and track:

  • Refusal rate (lower = more free)
  • Character-breaking frequency
  • Quality of explicit content when it does work
  • Consistency across sessions
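
As a rough illustration, the refusal rate and character-breaking frequency can be tallied from a log of test results. This is a minimal sketch: the `ScenarioResult` fields, the category names, and the sample data are hypothetical and only show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    category: str          # hypothetical label for one of the scenario categories
    intensity: int         # hypothetical intensity level, 1 = mildest
    refused: bool          # the platform declined to respond
    broke_character: bool  # the platform dropped its persona mid-scene


def rate(results, flag):
    """Share of results in which the named boolean field is True."""
    if not results:
        raise ValueError("no results to summarise")
    return sum(getattr(r, flag) for r in results) / len(results)


# Made-up sample log, for illustration only.
results = [
    ScenarioResult("category_a", 1, False, False),
    ScenarioResult("category_a", 2, False, True),
    ScenarioResult("category_a", 3, True, False),
    ScenarioResult("category_b", 1, False, False),
]

refusal_rate = rate(results, "refused")        # 1 refusal out of 4 results
break_rate = rate(results, "broke_character")  # 1 character break out of 4 results
```

A lower `refusal_rate` corresponds to the "more free" end of the scale described above.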

Memory Depth

We plant 5 specific personal facts in session 1 and test recall in sessions 3, 7, and 14. We also test whether platforms support memory injection features (GirlfriendGPT) or semantic memory systems (SpicyChat 2.0).
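
The recall test can be pictured as a simple set comparison per session. The sketch below assumes five hypothetical fact labels and made-up recall results. It illustrates the calculation only and is not the actual test script.

```python
# Five facts planted in session 1 (labels are hypothetical placeholders).
PLANTED_FACTS = {"pet_name", "hometown", "job", "favourite_food", "birthday_month"}


def recall_accuracy(recalled):
    """Fraction of the planted facts that the platform recalled correctly."""
    return len(PLANTED_FACTS & set(recalled)) / len(PLANTED_FACTS)


# Facts correctly recalled at each checkpoint session (made-up sample data).
recall_by_session = {
    3: {"pet_name", "hometown", "job", "favourite_food", "birthday_month"},
    7: {"pet_name", "hometown", "job"},
    14: {"pet_name"},
}

accuracy = {
    session: recall_accuracy(facts)
    for session, facts in recall_by_session.items()
}
# accuracy maps session number to a value between 0.0 and 1.0,
# so a falling curve across sessions 3, 7, and 14 shows memory decay.
```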

Multimodal

We generate 20 images or voice samples and rate them on: prompt adherence, output quality, NSFW ceiling, and consistency across generations.

Pricing Value

We calculate features-per-dollar for every tier and compare it to the platform median. We apply a weighted penalty for hidden costs, token systems that obscure the real price, and auto-renewal without clear disclosure.
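
A minimal sketch of that comparison follows. The feature counts, prices, and penalty sizes are invented for illustration, and the penalty values in particular are assumptions rather than the weights actually used.

```python
from statistics import median

# Hypothetical penalty sizes, in score points. The real weights are not published here.
HYPOTHETICAL_PENALTIES = {
    "hidden_costs": 0.5,
    "opaque_tokens": 0.5,
    "unclear_auto_renewal": 1.0,
}


def features_per_dollar(feature_count, monthly_price):
    """Number of features delivered per dollar of monthly price."""
    if monthly_price <= 0:
        raise ValueError("monthly_price must be positive")
    return feature_count / monthly_price


def relative_value(tier_fpd, comparison_fpd):
    """Ratio of one tier's features-per-dollar to the median of the comparison set."""
    return tier_fpd / median(comparison_fpd)


def total_penalty(flags):
    """Sum of the penalties for every issue flagged during testing."""
    return sum(HYPOTHETICAL_PENALTIES[flag] for flag in flags)


# Made-up tiers: (feature count, monthly price in dollars).
fpd_values = [
    features_per_dollar(12, 10.0),
    features_per_dollar(8, 10.0),
    features_per_dollar(20, 20.0),
]

ratio = relative_value(fpd_values[0], fpd_values)  # above 1.0 means better than the median
penalty = total_penalty(["opaque_tokens", "unclear_auto_renewal"])
```

A ratio above 1.0 means the tier delivers more features per dollar than the median of the comparison set, before any penalties are subtracted.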

Privacy

We review the privacy policy, test anonymous access options, check for GDPR-compliance signals, and score based on published standards (E2EE, data retention, deletion rights, etc.).

Testing Standards

Standard | Requirement
Minimum sessions per review | 30 test sessions
Testing period | Minimum 14 days (most reviews are 21-30 days)
Persona consistency | Same test battery used across all platforms in a category
Independence | No platform has input into our scores
Update policy | Reviews are re-tested when major platform updates are released
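
The two numeric standards above (30 sessions, 14 days) lend themselves to a simple check. This is a hypothetical sketch: the function name is invented, and counting the first and last day inclusively is an assumption.

```python
from datetime import date

MIN_SESSIONS = 30  # minimum test sessions per review
MIN_DAYS = 14      # minimum testing period in days


def meets_testing_standard(session_count, first_session, last_session):
    """True if a review meets both the session and testing-period minimums."""
    # Assumption: the testing period counts both the first and last day.
    days_tested = (last_session - first_session).days + 1
    return session_count >= MIN_SESSIONS and days_tested >= MIN_DAYS


ok = meets_testing_standard(30, date(2025, 1, 1), date(2025, 1, 14))
too_few_sessions = meets_testing_standard(29, date(2025, 1, 1), date(2025, 1, 14))
too_short = meets_testing_standard(30, date(2025, 1, 1), date(2025, 1, 10))
```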

How Weights Shift By Reviewer

Different GFSCORE writers weight dimensions differently — intentionally.

Reviewer | Primary Weight | Secondary Weight | Lightest Weight
Alex Chen (Tester) | Chat Realism (30%) | NSFW Freedom + Memory (20% each) | UX (5%)
Mia Russo (Feeler) | Chat Realism (35%) | Memory (25%) | NSFW Freedom (5%)
Jordan K. (Deals) | Pricing Value (40%) | Feature/Dollar (30%) | UX (varied)
Sam Vibe (NSFW) | NSFW Freedom (40%) | Roleplay Quality (25%) | Variety (15%)

This means the same platform can score differently in a Mia review vs a Sam review — and that's by design. If you're looking for an emotional companion, Mia's score is more relevant to you. If you're looking for the best uncensored platform, Sam's score is what you want.
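
To see how the same dimension scores can produce different final scores, here is a minimal weighted-average sketch. The dimension scores are made up. The two weight profiles borrow the headline percentages from the table above (35% Chat Realism, 25% Memory, and 5% NSFW Freedom in one, 40% NSFW Freedom in the other), but every remaining weight is a hypothetical filler chosen so each profile sums to 100%.

```python
def gf_score(dimension_scores, weights):
    """Weighted average of dimension scores, normalised by the total weight."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(dimension_scores[dim] * w for dim, w in weights.items())
    return weighted_sum / total_weight


# Made-up dimension scores for one platform (each out of 10).
scores = {
    "chat_realism": 8.0,
    "nsfw_freedom": 6.0,
    "memory_depth": 9.0,
    "multimodal": 7.0,
    "pricing_value": 7.0,
    "privacy": 8.0,
}

# Hypothetical profile for an emotional-companion focus.
emotional_weights = {
    "chat_realism": 0.35, "memory_depth": 0.25, "nsfw_freedom": 0.05,
    "multimodal": 0.10, "pricing_value": 0.10, "privacy": 0.15,
}

# Hypothetical profile for an uncensored-platform focus.
nsfw_weights = {
    "nsfw_freedom": 0.40, "chat_realism": 0.25, "memory_depth": 0.10,
    "multimodal": 0.10, "pricing_value": 0.10, "privacy": 0.05,
}

emotional_score = gf_score(scores, emotional_weights)  # about 7.95
nsfw_score = gf_score(scores, nsfw_weights)            # about 7.10
```

With these sample numbers, the platform's strong memory and weaker NSFW score lift it under the first profile and pull it down under the second, which is the effect described above.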

What the Scores Mean

Score | Meaning
9.0 – 10.0 | Exceptional. Sets the standard in this dimension.
8.0 – 8.9 | Excellent. Delivers consistently. Minor gaps only.
7.0 – 7.9 | Good. Works well with some limitations.
6.0 – 6.9 | Average. Does the job but not impressive.
5.0 – 5.9 | Below average. Notable weaknesses.
Below 5.0 | Poor. Does not deliver on its promises.

We do not give round numbers (5.0, 7.0) without specific justification. GF Scores are meant to be granular.
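
The score bands in the table above translate directly into a lookup. The sketch below is illustrative. The `score_band` name is hypothetical, and it assumes scores are reported to one decimal place.

```python
# Lower bound of each band, highest first, taken from the table above.
BANDS = [
    (9.0, "Exceptional"),
    (8.0, "Excellent"),
    (7.0, "Good"),
    (6.0, "Average"),
    (5.0, "Below average"),
]


def score_band(score):
    """Return the band label for a score between 0.0 and 10.0."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("score must be between 0.0 and 10.0")
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    return "Poor"


labels = [score_band(s) for s in (9.2, 8.9, 5.0, 4.9)]
# labels == ["Exceptional", "Excellent", "Below average", "Poor"]
```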

Questions About Our Methodology

Can platforms pay to improve their score? No. Advertising relationships and affiliate agreements have no effect on scores.

How do you handle platforms that update features after review? We flag the review as “update pending” and re-test within 30 days of a major feature update.

Can I suggest a platform for review? Yes. Email contact@gfscore.com.