The Problem With Star Ratings
Every AI companion review site gives star ratings. Five stars. Four and a half stars. Three stars.
What do those numbers mean? Usually: nothing. They're based on vibes, not testing. They reflect whether the reviewer liked the app after 30 minutes, not whether it will still be useful to you in week three.
We built the GF Score to fix that.
What Is the GF Score?
The GF Score is a six-dimension rating system designed specifically for AI companion platforms. Each dimension is scored out of 10. The final GF Score is a weighted average of the six dimension scores, with the weights set by the article type and the reviewing persona.
| Dimension | What It Measures | Why It Matters |
| --- | --- | --- |
| Chat Realism | Conversation naturalness, vocabulary range, emotional intelligence, contextual awareness | If it doesn't feel like talking to someone real, nothing else matters |
| NSFW Freedom | Explicit content capabilities, censorship frequency, scenario range, character-breaking frequency | For platforms marketing NSFW features, delivery must match the pitch |
| Memory Depth | Persistent memory quality, recall accuracy, long-term context window, memory injection effectiveness | The best AI companion forgets nothing. Most forget everything within a session. |
| Multimodal | Image generation quality, voice naturalness, video capability, cross-modal consistency | Whether the platform delivers on non-text features |
| Pricing Value | Feature-per-dollar across tiers, hidden costs, free tier quality, value vs competitors | Paying more should mean getting more. Often it doesn't. |
| Privacy | Data encryption standard, anonymity options, data retention policy, deletion rights | You're sharing intimate conversations. Your data deserves protection. |
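The weighted average can be written out as a rough sketch. The symbols are illustrative assumptions: $s_i$ is the score for dimension $i$ and $w_i$ is the weight the reviewing persona gives that dimension.

```latex
\[
\text{GF Score} \;=\; \frac{\sum_{i=1}^{6} w_i \, s_i}{\sum_{i=1}^{6} w_i},
\qquad 0 \le s_i \le 10, \quad w_i \ge 0
\]
```

When the weights are expressed as percentages that sum to 100%, the denominator is 1 and the score is simply the sum of each dimension score multiplied by its weight.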
How We Score Each Dimension
Scoring is based on direct testing, not opinion.
Chat Realism
We run a standardised battery of 20 conversation scenarios across five categories:
- Casual conversation
- Emotional support conversations
- Long-term memory tests (referencing facts planted in earlier sessions)
- Roleplay setup and maintenance
- Unexpected topic pivots
Responses are scored on: coherence, contextual awareness, emotional range, vocabulary, and character consistency.
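One plausible way to turn those ratings into a single number is sketched below. It assumes each scenario is rated 0 to 10 on each of the five criteria and that the criteria count equally. Neither assumption is stated above, and the function and key names are hypothetical.

```python
from statistics import mean

# The five criteria named above, as hypothetical dictionary keys.
CRITERIA = (
    "coherence",
    "contextual_awareness",
    "emotional_range",
    "vocabulary",
    "character_consistency",
)

def chat_realism_score(scenario_ratings):
    """Average the criteria within each scenario, then average across scenarios.

    Equal weighting of criteria and scenarios is an assumption of this sketch.
    """
    per_scenario = []
    for ratings in scenario_ratings:
        missing = [c for c in CRITERIA if c not in ratings]
        if missing:
            raise ValueError(f"missing criteria: {missing}")
        per_scenario.append(mean(ratings[c] for c in CRITERIA))
    return round(mean(per_scenario), 1)

# Two made-up scenarios; a full battery would contain 20.
example = [
    {"coherence": 8, "contextual_awareness": 7, "emotional_range": 6,
     "vocabulary": 8, "character_consistency": 9},
    {"coherence": 6, "contextual_awareness": 6, "emotional_range": 5,
     "vocabulary": 7, "character_consistency": 6},
]
print(chat_realism_score(example))
```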
NSFW Freedom
We test 10 explicit scenario categories at progressive intensity levels and track:
- Refusal rate (lower means fewer restrictions)
- Character-breaking frequency
- Quality of explicit content when it does work
- Consistency across sessions
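The first two tracked metrics are simple proportions. A minimal sketch, assuming each test prompt is logged with one of three hypothetical outcome labels (`"ok"`, `"refused"`, `"broke_character"`):

```python
from collections import Counter

def outcome_rates(outcomes):
    """Return the share of prompts that were refused or broke character.

    The outcome labels are illustrative, not part of the published method.
    """
    if not outcomes:
        raise ValueError("no outcomes logged")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {
        "refusal_rate": counts["refused"] / total,
        "character_break_rate": counts["broke_character"] / total,
    }

# Made-up log of ten prompts.
log = ["ok"] * 7 + ["refused"] * 2 + ["broke_character"]
print(outcome_rates(log))
```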
Memory Depth
We plant five specific personal facts in session 1 and test recall in sessions 3, 7, and 14. We also check whether the platform supports memory injection (for example, GirlfriendGPT) or a semantic memory system (for example, SpicyChat 2.0).
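The recall check amounts to counting how many planted facts come back at each checkpoint. A sketch, with hypothetical function and variable names and an assumed input format (a mapping from session number to the number of facts correctly recalled):

```python
PLANTED_FACTS = 5            # facts planted in session 1
CHECK_SESSIONS = (3, 7, 14)  # sessions in which recall is tested

def recall_by_session(recalled):
    """Map each check session to the fraction of planted facts recalled."""
    rates = {}
    for session in CHECK_SESSIONS:
        hits = recalled.get(session, 0)  # a missing session counts as zero recall
        if not 0 <= hits <= PLANTED_FACTS:
            raise ValueError(f"session {session}: invalid hit count {hits}")
        rates[session] = hits / PLANTED_FACTS
    return rates

# Made-up result: recall fades as the sessions go on.
print(recall_by_session({3: 5, 7: 4, 14: 2}))
```

How these fractions feed into the 0 to 10 Memory Depth score is not specified here, so the sketch stops at the raw recall rates.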
Multimodal
We generate 20 images or voice samples and rate them on prompt adherence, output quality, NSFW ceiling, and consistency across generations.
Pricing Value
We calculate features-per-dollar for every tier and compare it to the platform median. We apply weighted penalties for hidden costs, token systems that obscure the real price, and auto-renewal without clear disclosure.
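The comparison can be sketched as follows. Everything here is an assumption for illustration: the function names, the idea of counting features as a plain integer, and the penalty sizes, none of which are published values.

```python
from statistics import median

def features_per_dollar(feature_count, monthly_price):
    """Features delivered per dollar of monthly price."""
    if monthly_price <= 0:
        raise ValueError("price must be positive; free tiers need separate handling")
    return feature_count / monthly_price

def relative_value(tier_fpd, peer_fpds, penalties=()):
    """Ratio of a tier's features-per-dollar to the peer median, minus penalties.

    A result above 1.0 means better value than the median tier.
    """
    return tier_fpd / median(peer_fpds) - sum(penalties)

# Made-up tier: 12 features for $10 a month, with a hypothetical
# 0.2 penalty for a token system that obscures the real price.
tier = features_per_dollar(12, 10.0)
print(relative_value(tier, peer_fpds=[0.5, 1.0, 2.0], penalties=[0.2]))
```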
Privacy
We review the privacy policy, test anonymous access options, check for GDPR-compliance signals, and score against published standards (end-to-end encryption, data retention, deletion rights, etc.).
Testing Standards
| Standard | Requirement |
| --- | --- |
| Minimum sessions per review | 30 test sessions |
| Testing period | Minimum 14 days (most reviews are 21 to 30 days) |
| Persona consistency | Same test battery used across all platforms in a category |
| Independence | No platform has input into our scores |
| Update policy | Reviews are re-tested when major platform updates are released |
How Weights Shift By Reviewer
GFSCORE reviewers intentionally weight the dimensions differently.
| Reviewer | Primary Weight | Secondary Weight | Lightest Weight |
| --- | --- | --- | --- |
| Alex Chen (Tester) | Chat Realism (30%) | NSFW Freedom and Memory Depth (20% each) | UX (5%) |
| Mia Russo (Feeler) | Chat Realism (35%) | Memory Depth (25%) | NSFW Freedom (5%) |
| Jordan K. (Deals) | Pricing Value (40%) | Feature/Dollar (30%) | UX (varied) |
| Sam Vibe (NSFW) | NSFW Freedom (40%) | Roleplay Quality (25%) | Variety (15%) |
This means the same platform can score differently in a Mia review than in a Sam review, and that's by design. If you're looking for an emotional companion, Mia's score is more relevant to you. If you're looking for the best uncensored platform, Sam's score is what you want.
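To see how the same dimension scores can produce different GF Scores, here is a small sketch. Both weight profiles are hypothetical. The first borrows the three weights listed for Mia and fills in the rest so the total reaches 100. The second is an invented NSFW-leaning profile, since the table does not give a full six-dimension breakdown for any reviewer.

```python
DIMENSIONS = (
    "chat_realism", "nsfw_freedom", "memory_depth",
    "multimodal", "pricing_value", "privacy",
)

def gf_score(scores, weights):
    """Weighted average of the six dimension scores, rounded to one decimal.

    Weights are normalised, so they may be given as percentages.
    """
    total = sum(weights[d] for d in DIMENSIONS)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(scores[d] * weights[d] for d in DIMENSIONS)
    return round(weighted / total, 1)

# Made-up dimension scores for one platform.
platform = {"chat_realism": 8.0, "nsfw_freedom": 4.0, "memory_depth": 9.0,
            "multimodal": 6.0, "pricing_value": 7.0, "privacy": 8.0}

# Hypothetical profiles; only the 35/25/5 figures come from the table.
emotional_profile = {"chat_realism": 35, "memory_depth": 25, "nsfw_freedom": 5,
                     "multimodal": 10, "pricing_value": 15, "privacy": 10}
nsfw_profile = {"nsfw_freedom": 40, "chat_realism": 25, "memory_depth": 10,
                "multimodal": 10, "pricing_value": 10, "privacy": 5}

print(gf_score(platform, emotional_profile))  # 7.7
print(gf_score(platform, nsfw_profile))       # 6.2
```

With these invented numbers, a platform with strong memory but heavy filtering lands a point and a half higher under the emotional profile than under the NSFW-leaning one.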
What the Scores Mean
| Score | Meaning |
| --- | --- |
| 9.0 – 10.0 | Exceptional. Sets the standard in this dimension. |
| 8.0 – 8.9 | Excellent. Delivers consistently. Minor gaps only. |
| 7.0 – 7.9 | Good. Works well with some limitations. |
| 6.0 – 6.9 | Average. Does the job but not impressive. |
| 5.0 – 5.9 | Below average. Notable weaknesses. |
| Below 5.0 | Poor. Does not deliver on its promises. |
We do not give round numbers (5.0, 7.0) without specific justification. GF Scores are meant to be granular.
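The bands in the table translate directly into a lookup. A minimal sketch, assuming scores are reported to one decimal place (so there is no gap between, say, 8.9 and 9.0); the function name is hypothetical.

```python
# Lower bound of each band, highest first, taken from the table above.
BANDS = (
    (9.0, "Exceptional"),
    (8.0, "Excellent"),
    (7.0, "Good"),
    (6.0, "Average"),
    (5.0, "Below average"),
)

def score_band(score):
    """Return the band label for a score between 0 and 10."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("score must be between 0 and 10")
    for floor, label in BANDS:
        if score >= floor:
            return label
    return "Poor"

print(score_band(8.4))  # Excellent
```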
Questions About Our Methodology
Can platforms pay to improve their score? No. Advertising relationships and affiliate agreements have no effect on scores.
How do you handle platforms that update features after review? We flag the review as “update pending” and re-test within 30 days of a major feature update.
Can I suggest a platform for review? Yes. Email contact@gfscore.com.

