The AI Infrastructure War Is Now a Cloud Contract War
by Founder @devroaks
SpaceX Is Now an AI Compute Provider — and Anthropic Is Paying $1.25B/Month for It
In a freshly filed SpaceX S-1, a detail slipped through that deserves more attention than it’s getting: Anthropic has signed Cloud Services Agreements with SpaceX, committing to pay $1.25 billion per month for compute access across COLOSSUS and COLOSSUS II — SpaceX’s Elon Musk-backed supercomputer clusters — through May 2029. Capacity is ramping through May and June 2026 at a reduced rate, with either party able to exit on 90 days’ notice.
To put that number in perspective: $1.25B/month is $15B/year in raw compute spend, from a single vendor. That’s not a pilot program or a research budget line — that’s a strategic infrastructure bet.
What does this tell us technically?
- Training at this scale requires bespoke arrangements. The public cloud (AWS, Azure, GCP) simply cannot guarantee the contiguous GPU/TPU allocations needed for frontier model training runs. COLOSSUS II’s purpose-built architecture — originally designed for Grok 5 training — is being time-shared with Anthropic.
- The inference cliff is real. The compute required to serve frontier models at scale is now approaching the cost of training them. Anthropic’s API traffic is significant enough that leasing dedicated clusters is economically competitive with spot instances.
- Vendor lock-in is a two-way street. SpaceX’s S-1 notes the agreement is terminable on 90 days’ notice — a surprisingly short leash for this level of spend. That clause likely reflects Anthropic’s negotiating leverage (or hedging) as alternative compute providers mature.
The broader signal: the AI infrastructure layer is bifurcating. Hyperscalers handle commodity inference; bespoke clusters handle frontier training and high-throughput serving. Expect more deals like this.
Google I/O 2026: A Lot of Announcements, Not Much You Can Touch Yet
Google I/O was, by most accounts, overwhelming in breadth and underwhelming in immediate availability. A few things worth parsing technically:
Gemini Spark: The Agentic Play
Google’s most consequential I/O announcement is Gemini Spark, a personal AI agent that plugs natively into Gmail, Calendar, Drive, Docs, Sheets, Slides, YouTube, and Maps. It’s positioned as a direct competitor to OpenClaw-style agents, and it’s launching first for AI Ultra subscribers at $100/month.
The security architecture is worth scrutinizing. Google’s enterprise documentation describes Spark as running in:
“a fully managed, secure runtime on Google Cloud… Every task executes in a fresh, strictly isolated, ephemeral VM to help ensure data never overlaps between sessions. All traffic routes through our secure Agent Gateway that enforces Data Loss Prevention (DLP) policies, while user credentials remain fully encrypted and are never exposed directly to the agent.”
This is the right architecture on paper — ephemeral VMs, credential abstraction, DLP enforcement. But the threat model for a personal agent with Gmail access is brutal. Prompt injection via a malicious email is a trivially simple attack vector. The question isn’t whether Google’s infrastructure is secure; it’s whether the model can be manipulated into acting on injected instructions embedded in content it reads. That’s an alignment and evaluation problem, not an infra problem.
Notably, Gemini Spark runs on Gemini 3.5 Flash and Antigravity — the latter being Google’s new closed-source Go-based agent runtime. The open-source Gemini CLI (Apache 2.0 TypeScript) is being deprecated on June 18th in favor of the Antigravity CLI. That’s a significant philosophical pivot: Google is closing the stack at the agent layer.
The Search Box Redesign
The NYT led with the fact that Google is widening its search bar for the first time since 2001. This sounds trivial but signals something meaningful: Google is institutionally acknowledging that the query paradigm has changed. Users now type multi-sentence natural language questions, attach images and video, and expect dialogue — not a list of blue links. The UX is catching up to the model capability.
The subtler story, as several commentators noted, is what this means for the open web. Google’s agentic search doesn’t just answer your apartment-hunting query — it subscribes to listings and notifies you without you ever visiting Zillow. That’s a structural threat to any site that currently earns traffic from search intent.
Gemini 3.5 Flash
Gemini 3.5 Flash is in general availability. Simon Willison has already noted that this is one of the few I/O announcements you can actually test today. Most of the headline features — Spark, Project Aura smart glasses, the revamped Search — are “coming soon.” The developer story for Gemini 3.5 Flash is worth watching, particularly its positioning in latency-sensitive agentic pipelines.
Supply Chain Risk in AI Tooling: Malware in PyTorch Lightning
A Semgrep security disclosure trending on Hacker News this week: malicious code was found embedded in the PyTorch Lightning library, a widely used training abstraction layer. The dependency was used in AI training pipelines across the ecosystem before detection.
This is a recurring pattern that deserves more systemic attention. The AI/ML toolchain has an enormous attack surface:
- Training frameworks with complex C++/CUDA extension chains
- Hub-style model distribution (Hugging Face, PyPI) with limited signing infrastructure
- Notebook-first workflows where
pip installis reflexive
The irony is that as AI systems become more autonomous and start writing and executing their own code (vibe-coding, agents), the blast radius of a compromised dependency grows proportionally. An agent that auto-installs packages from model-generated code is a supply chain attack waiting to happen.
For teams running serious ML infrastructure: pin your dependencies, use reproducible builds, and treat your training environment with the same paranoia you’d apply to production auth services.
LLM Token Speed: What the Numbers Actually Mean
A small but useful tool making the rounds: tokenspeed by Mike Veerman lets you viscerally feel the difference between 10, 30, 100, and 800 tokens/second by simulating live output at each rate.
This matters more than it might seem. Latency perception in LLM interfaces is non-linear. 10 tok/s feels painfully slow for code completion but is fine for a long-form essay you’re reading. 800 tok/s is invisible — faster than human reading comprehension. The useful range for most interactive use cases is roughly 30–150 tok/s, which is where most hosted inference APIs land today.
When evaluating models for production use, token throughput is often more important than raw benchmark scores. A model that scores 5% higher on MMLU but runs at 40 tok/s will feel worse than a slightly weaker model at 120 tok/s in any latency-sensitive application.
The Week’s Throughline
The story across all of these data points is the same: the AI layer is moving from research artifact to infrastructure primitive, and all the classic infrastructure problems are showing up on schedule — supply chain security, vendor concentration risk, adversarial inputs at the application layer, and the ongoing tension between openness and control.
Google is closing the Gemini CLI stack. Anthropic is signing billion-dollar compute contracts with SpaceX. Malware in training libraries. These aren’t isolated news items; they’re the predictable growing pains of a technology transitioning from lab to production at a massive scale.
The developers who will navigate this well are the ones treating AI components with the same rigour they’d apply to any other critical dependency: threat model it, pin it, monitor it, and have a fallback.
Sources: Simon Willison’s Weblog, Hacker News, Daring Fireball, SpaceX S-1 (SEC EDGAR)