Today OpenAI launched its new $100/month Pro plan. Heavy Codex users get five times the capacity of the $20 Plus plan without jumping straight to the full $200 Pro tier. Anthropic offers exactly the same structure: Claude Max at $100 (5x) and $200 (20x). The symmetry is no coincidence.
Both companies are competing for the same audience: developers running coding agents daily, reviewing code, orchestrating refactors. Both are hitting real limits — not strategic ones, but economic ones. The hardware running these models has never been more expensive.
Why Prices Are Rising, Not Falling
Technology gets cheaper over time. Moore’s Law, scale effects, competition. That’s the standard narrative. In AI infrastructure in 2026, it isn’t holding.
Nvidia H100 GPU rental prices have risen nearly 40% in the past six months, from $1.70/hour in October 2025 to $2.35/hour in March 2026. The spot market is effectively sold out. Teams that need compute on short notice are paying up to $14/hour for Blackwell B200 instances on AWS. New Blackwell cluster delivery times stretch to June–July 2026, and production capacity through August–September is already fully pre-booked.
Memory is even more extreme: LPDDR5 contract prices in Q1 2026 are running at four times their prior-year level, DDR5 at five times. Server OEMs are passing these costs through, plus a margin on top.
In the background: the four largest hyperscalers (Alphabet, Microsoft, Meta, Amazon) are collectively planning around $700 billion in AI infrastructure spending for 2026 alone. GPU spend from just these four buyers likely exceeds $140 billion. At this level of demand, no model efficiency gain can absorb the rising operational costs.
The result: OpenAI and Anthropic need to recover these costs somewhere. The new $100 plans are the cleanest answer they have.
What You Get With $100 and $200 — and What You Don’t
Both providers mirror each other almost exactly:
| Plan | Provider | Monthly | Capacity |
|---|---|---|---|
| Plus | OpenAI | $20 | Base Codex limits |
| Pro (new) | OpenAI | $100 | 5x Codex (10x until May 31) |
| Pro (full) | OpenAI | $200 | 20x Codex, parallel workflows |
| Claude Max 5x | Anthropic | $100 | 5x Claude Code |
| Claude Max 20x | Anthropic | $200 | 20x Claude Code |
That sounds like freedom of choice, and it is, as long as you stay inside one of these two ecosystems. The catch: you're not just paying for compute. You're paying for data that passes through a vendor's servers, you're paying for rate limits that can change at any time, and you're paying every month regardless of how heavily you actually use the capacity.
For many developers and teams this is completely fine — the overhead of self-hosting far outweighs the subscription costs. But that equation flips once token volume grows.
Gemma 4: The Math OpenAI and Anthropic Don’t Love
In early April, Google released Gemma 4 under the Apache 2.0 license: four models, from edge devices to workstation GPUs. The flagship, a dense 31B model, scores 89.2% on the AIME 2026 math benchmark and reaches a Codeforces Elo of 2,150, which is expert-level competitive programming. For context: Gemma 3 scored 20.8% on the same benchmark with an Elo of 110.
What this means in practice shows up in a simple cost comparison:
| Setup | Cost per 1M tokens |
|---|---|
| Claude Sonnet 4.6 (API, blended) | ~$9.00 |
| Claude Opus 4.6 (API, blended) | ~$15.00 |
| Gemma 4 31B local (RTX 4090, Q4) | ~$0.002 |
An RTX 4090 costs around $1,600 one-time. At one million tokens per day, it pays for itself in roughly six months compared to Claude Sonnet API costs. At five million tokens daily — not unusual for an active development team running coding agents — that drops to 36 days. At ten million tokens: 18 days.
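This break-even math is easy to rerun with your own numbers. Here is a minimal sketch in Python, using the rates from the table above; the daily token volumes are placeholders you should replace with your own:

```python
# Break-even calculator: one-time GPU purchase vs. pay-per-token API.
# Rates taken from the comparison table above; adjust to your contract.

GPU_COST_USD = 1600.0   # one-time RTX 4090 purchase
API_PER_1M = 9.00       # Claude Sonnet blended rate, $ per 1M tokens
LOCAL_PER_1M = 0.002    # local running-cost estimate, $ per 1M tokens

def payback_days(tokens_per_day_millions: float) -> float:
    """Days until the GPU purchase undercuts the API bill."""
    daily_savings = tokens_per_day_millions * (API_PER_1M - LOCAL_PER_1M)
    return GPU_COST_USD / daily_savings

for volume in (1, 5, 10):  # million tokens per day
    print(f"{volume}M tokens/day -> payback in {payback_days(volume):.0f} days")
```

Running it reproduces the figures above: roughly 178, 36, and 18 days.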
The Apache 2.0 license allows unrestricted commercial use. A LoRA fine-tune costs a one-time $50–500 in GPU compute; after that, the custom model runs with zero API markup.
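What such a fine-tune looks like in practice: a minimal sketch assuming the Hugging Face transformers + peft stack with QLoRA-style 4-bit loading. The checkpoint name google/gemma-4-31b is a placeholder, not a confirmed identifier; check the actual name on the Hub.

```python
# Minimal LoRA fine-tuning setup sketch (QLoRA-style, 4-bit base weights).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "google/gemma-4-31b"  # placeholder name, verify on the Hub

# Load the frozen base model in 4-bit to fit a workstation memory budget.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters to the attention projections.
lora = LoraConfig(
    r=16,                                 # adapter rank: quality vs. compute
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # typical choice; model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of all weights
# From here, train with transformers.Trainer or trl's SFTTrainer as usual.
```

Because only the adapters are trained, the compute bill stays in the $50–500 range mentioned above rather than the cost of a full fine-tune.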
Which Approach Makes Sense When
No single framework fits every team. Here is our rule of thumb from day-to-day practice:
API subscription (OpenAI Pro / Claude Max) makes sense when:
- Token volume stays below ~500K per day
- No DevOps capacity to run GPU servers
- Absolute frontier quality is required — the most complex agentic workflows still run more reliably on proprietary models
- Time-to-market matters more than cost control
Self-hosting (Gemma 4 + Ollama / vLLM; see the connection sketch after these lists) makes sense when:
- Volume is high and growing daily
- Data privacy is a requirement, not a preference (GDPR, healthcare, legal, finance)
- Domain-specific fine-tuned quality is needed
- An RTX 4090, M4 Mac Pro, or small GPU cluster is available or budgeted
Pay-as-you-go (API without subscription) stays useful for:
- Irregular or sporadic usage
- Experimental workloads and prototypes
- Teams that want to switch providers without lock-in
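One practical detail that lowers the switching cost between these options: both Ollama and vLLM expose an OpenAI-compatible HTTP endpoint, so existing agent code can be pointed at a local Gemma 4 instance by changing a base URL. A minimal sketch follows; the model tag gemma-4:31b is a placeholder for whatever your local setup registers.

```python
# Reuse OpenAI-client code against a local server instead of the cloud.
# Ollama serves an OpenAI-compatible API at localhost:11434/v1 by default;
# a vLLM server typically listens on localhost:8000/v1.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama default; vLLM: port 8000
    api_key="unused",                      # required by the client, ignored locally
)

response = client.chat.completions.create(
    model="gemma-4:31b",  # placeholder tag for the local model
    messages=[{"role": "user", "content": "Review this diff for race conditions."}],
)
print(response.choices[0].message.content)
```

The same pattern also keeps the pay-as-you-go fallback trivial: maintain one code path and swap the base URL and key per environment.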
The Real Question
OpenAI and Anthropic build excellent models. The $100 plans aren’t predatory — they’re the economically logical response to hardware costs that have structurally changed.
The question isn’t whether these providers are worth the money. The question is whether growing dependence on a handful of platforms fits your strategic goals. Every price round, every limit adjustment, every API change is outside your control.
With Gemma 4 31B under Apache 2.0, there's finally a realistic answer: a model that runs on consumer hardware, performs at or near frontier level, and hands data sovereignty back to you. Completely.
Considering whether self-hosting makes sense for your team? We help with the evaluation, the setup, and the decision about which workloads should actually run locally — and which shouldn’t. Get in touch.