Every AI coding tool on the market right now claims to be the one that finally replaces the grunt work of development. Bold claims are easy. Delivering a complete, working product from a plain English prompt — without hand-holding, without follow-up prompts, without you fixing what it broke — that’s something else entirely.
So I stopped reading the marketing pages and ran a real test. One brief. Three tools. Claude Code, Google Antigravity, and OpenAI Codex. The task was simple on the surface and genuinely difficult underneath: build a working 2D platformer game in Python, from scratch, using nothing but plain English.
What came back from each tool was more revealing than any benchmark chart I’ve ever read.
Why This Test Actually Means Something
Most AI coding comparisons test autocomplete. They paste in a function signature and ask the model to fill in the body. That’s useful, but it tells you almost nothing about how a tool behaves when you throw a real project at it.
A 2D platformer is a system, not a snippet. It needs gravity, collision detection, player input, rendering logic, a game loop, hazards, objectives, a win condition, and a user interface — all working together, all at once. You cannot fake your way through it. The game either runs and feels complete, or it exposes exactly where the tool’s understanding breaks down.
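To make concrete what "a system, not a snippet" means, here is a minimal sketch of the kind of physics step every one of these tools had to get right — gravity, vertical velocity, and floor collision resolved each tick. This is pure Python rather than Pygame, and every name in it (`Player`, `GRAVITY`, `GROUND_Y`) is hypothetical, chosen for illustration rather than taken from any of the tools' actual output:

```python
# Hypothetical sketch of one platformer physics tick: apply gravity,
# integrate velocity, then resolve collision with the floor.

GRAVITY = 0.5     # downward acceleration per tick
GROUND_Y = 300    # y-coordinate of the floor (screen coords: y grows downward)

class Player:
    def __init__(self):
        self.y = GROUND_Y      # start standing on the floor
        self.vy = 0.0          # vertical velocity
        self.on_ground = True

    def jump(self):
        # Only allow jumping from the ground, a classic platformer rule.
        if self.on_ground:
            self.vy = -10.0    # negative = upward in screen coordinates
            self.on_ground = False

    def step(self):
        """One tick of the game loop: gravity, then collision resolution."""
        self.vy += GRAVITY
        self.y += self.vy
        if self.y >= GROUND_Y:  # fell into (or below) the floor: snap back
            self.y = GROUND_Y
            self.vy = 0.0
            self.on_ground = True

p = Player()
p.jump()
for _ in range(100):  # run enough ticks for the full jump arc to complete
    p.step()
print(p.on_ground, p.y)  # -> True 300 (back on the floor after the arc)
```

Even this toy version shows why the test is unforgiving: the jump rule, the integration order, and the collision snap all have to agree, and that is before input handling, rendering, hazards, or a win condition enter the picture.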
Same prompt. Same machine. Same evaluation criteria. Here is what each tool actually produced.
Google Antigravity: Big Vision, Underwhelming Output
Google Antigravity arrived with serious credentials. It was built by the former Windsurf team following Google’s $2.4 billion licensing deal, runs Gemini 3.1 Pro under the hood, and ships with a dual Editor and Manager interface that genuinely looks like mission control for multi-agent workflows. The built-in browser for live visual testing is something neither competitor offers. On paper, this was the most technically ambitious tool in the room.
The game it produced told a humbler story.
The structural foundations were there — basic gravity, collision detection, scrollable levels, core Pygame setup. Technically, the code held together. But when you actually ran it, the experience was hollow. No title screen. No objectives. No score. No sense of why you were there or what you were trying to do. The visuals resembled a rough prototype sketched in an afternoon — functional in the narrowest possible definition, but clearly incomplete.
Pygame is fully capable of everything a polished game needs. Antigravity simply didn’t use those capabilities. The gap between Gemini 3.1 Pro’s documented reasoning ability and what Antigravity actually shipped here was stark. Strong benchmark numbers did not translate into strong product judgement. Third place, and it wasn’t particularly close.
OpenAI Codex: Dependable, Disciplined, and Deliberately Safe
OpenAI Codex, running on GPT-5.4, produced something genuinely playable. The game was called Pudding Paws — a platformer where the player collects five fireflies before reaching a cosy pillow, navigating pitfalls and small spike hazards along the way. The HUD was clean. Controls were shown on screen. The full game loop was intact.
Compared to Antigravity, Codex delivered a real product. It had an objective, a challenge, and a reason to keep playing. Technically, the output was thorough and well-structured.
Where it fell short was in the space between instruction and intent. Codex followed the brief faithfully and produced exactly what was asked — no more, no less. There was no creative interpretation. No unprompted additions that would make the experience better. No moment where the tool seemed to understand what the game needed beyond what the words literally said.
For a lot of professional work, that disciplined restraint is actually valuable. Codex is predictable. It ships clean, organised code that stays inside the lines. If you already pay for ChatGPT Plus, it comes bundled at no extra cost — making it the best value agentic coding tool available right now. But in a test of which tool best understands what a developer actually wants, staying strictly inside the brief costs it first place.
Claude Code: The Only One That Understood the Assignment
Claude Code is Anthropic’s terminal-native agentic coding tool. It doesn’t live inside an IDE — it runs in your command line, works across your entire codebase, edits files, executes commands, and iterates on its own output. The model behind it in this test was Claude Sonnet 4.6, tuned specifically for agentic coding tasks.
What Claude Code produced was the only output in this test that felt finished.
The platformer had a title screen, clear objectives, meaningful hazards, visual polish, and a complete game loop — none of which were explicitly requested in the brief. Claude Code didn’t just implement the prompt. It interpreted what the prompt was pointing toward. It understood that a game needs a reason to exist, not just mechanics that technically function.
That gap — between what you typed and what you meant — is where every agentic tool eventually gets tested. Claude Code bridged it. The other two waited to be told.
Raw model size isn’t what made the difference here. Sonnet 4.6 is not Anthropic’s largest or most expensive model. What it demonstrates is that intent-understanding is a distinct capability from raw reasoning power — and right now, Claude Code has more of it than the competition.
The Real Differences Under the Hood
Intent vs instruction is the sharpest line between Claude Code and everything else. When your prompt is imprecise — and real-world prompts always are — Claude Code fills the gaps with judgement. It adds what the experience needs, not just what the words said. That capability is extraordinarily difficult to develop, and it is the single most important thing a coding agent can do.
Workflow fit separates the tools differently for different developers. Claude Code is terminal-first, meaning it slots into your existing setup without displacing it. Antigravity is a full IDE replacement — you either adopt its environment or you don’t use it. Codex lands in between, integrating with VS Code and JetBrains while also offering a standalone desktop app.
Context window depth matters on real projects. Claude Code handles up to one million tokens in full configuration, meaning large multi-file codebases stay coherent across an entire session. For architectural-level work, cross-module refactoring, or anything that spans dozens of files, that depth is not a nice-to-have — it’s essential.
Cost structure is where honest comparison gets uncomfortable. Claude Code is token-priced and can generate surprise bills during heavy agentic sessions. Codex is bundled with ChatGPT Plus at $20 per month. Antigravity is currently free in preview. If budget is the primary constraint, Codex wins on pure economics. If output quality is the primary constraint, Claude Code justifies what it costs.
Honest Weaknesses of Claude Code
Picking a winner doesn’t mean pretending it’s perfect.
Token costs during long agentic loops can escalate quickly. Developers running Claude Code heavily through complex multi-file tasks should monitor usage actively rather than discovering the bill at month end.
The terminal-only interface has a real learning curve for anyone accustomed to graphical IDEs. If you have never worked extensively in a command-line environment, Claude Code requires an adjustment period before it clicks.
First-pass success on complex autonomous tasks runs at roughly 33% in real-world conditions. Iterative prompting is part of the workflow, not an exception. Set that expectation from the beginning and it stops feeling like a limitation.
Context retention in very long sessions can degrade beyond roughly 30 to 40 interactions. The /compact command manages this gracefully — use it proactively rather than reactively.
Where Each Tool Belongs
Claude Code belongs on anything that requires genuine understanding of intent — complex projects, security research, architectural refactoring, creative technical problems where the brief is imprecise by nature.
OpenAI Codex belongs on budget-conscious workflows, teams that need reliable and predictable output, and anyone already paying for ChatGPT Plus who wants agentic coding without an additional subscription.
Google Antigravity belongs on your watchlist. The multi-agent orchestration and built-in browser testing are genuinely ahead of the competition. The output quality needs to catch up with the tooling vision — and based on the trajectory of Google’s AI investment, it might not take long.
The Bottom Line
One prompt. Three tools. One clear winner.
Claude Code won because it understood what the brief meant rather than just what it said. In real development work, where prompts are always imprecise and requirements are always incomplete, that gap-filling intelligence compounds into an enormous productivity advantage over weeks and months.
Codex is an excellent, honest, reliable tool that finished a genuine second. Antigravity showed the most ambitious vision and the most immature output — a combination that is frustrating in the short term and potentially exciting in the long term.
For right now, if you are choosing one agentic coding tool to build real things with, Claude Code is the answer. Not because of benchmarks. Because it shipped the only game in this test that felt like a game someone actually made.
AI coding tools are evolving faster than any single comparison can fully track. But the thing that separates genuinely useful tools from sophisticated toys has stayed constant: understanding what a developer means is always worth more than processing what they typed.