Clockwork.io Launches The Industry's First Contractual Commitment to End GPU Waste in AI Training

TECHNOLOGY 01.07.2026

"You Only Compute Once" (YOCO) guarantees that at least 90% of AI training failures are resolved with no lost progress, or customers receive a credit

PALO ALTO, CA / ACCESS Newswire / July 1, 2026 / Clockwork.io, pioneer of Software-Driven AI Fabrics™ and the company behind TorchPass AI fault tolerance, today announced the YOCO Guarantee - the industry's first contractual commitment to dramatically reduce the hidden, compounding cost of training failure in large-scale AI infrastructure. The announcement marks a turning point in how the AI industry measures infrastructure reliability - moving beyond uptime metrics designed for a previous era toward the outcome AI teams value most: whether the job finishes on time, without losing work.

Under the YOCO (You Only Compute Once) Guarantee, Clockwork.io commits that at least 90% of training failures on supported TorchPass workloads will be resolved through live GPU migration, with no lost training progress, no checkpoint rollback, and no recompute. If Clockwork.io falls short of that commitment in any contract year, customers receive a 25% credit against their next TorchPass renewal or expansion.
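
The terms above can be read as a simple threshold check. The following Python sketch is illustrative only: the 90% threshold and the 25% credit come from this announcement, while the `ContractYear` record, its field names, both functions, and the handling of a year with zero failures are assumptions rather than Clockwork.io's contract language or measurement method.

```python
from dataclasses import dataclass

YOCO_THRESHOLD = 0.90    # from the announcement: at least 90% of failures resolved
YOCO_CREDIT_RATE = 0.25  # from the announcement: 25% credit if the commitment is missed


@dataclass
class ContractYear:
    """Hypothetical per-year tally; field names are illustrative assumptions."""
    total_failures: int         # training failures on supported workloads
    resolved_by_migration: int  # failures resolved with no lost progress


def resolved_fraction(year: ContractYear) -> float:
    """Share of failures resolved through live migration in the contract year."""
    if year.total_failures == 0:
        # Assumption: a year with no failures counts as meeting the commitment.
        return 1.0
    return year.resolved_by_migration / year.total_failures


def renewal_credit(year: ContractYear, renewal_amount: float) -> float:
    """Credit against the next renewal: zero if the threshold was met."""
    if resolved_fraction(year) >= YOCO_THRESHOLD:
        return 0.0
    return YOCO_CREDIT_RATE * renewal_amount
```

Under these assumptions, a year with 100 failures of which 89 were resolved by live migration would fall short of the threshold, and a $1,000,000 renewal would carry a $250,000 credit.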

"We built TorchPass to make training failure irrelevant," said Suresh Vasudevan, CEO of Clockwork.io. "The YOCO Guarantee is a line in the contract. We're putting skin in the game because we know TorchPass delivers, and we want our customers to know it too."

The Hidden Tax on AI Progress

Every AI organization training at scale faces the same brutal math: GPU clusters fail constantly, and every failure triggers an expensive restart cycle. According to research published by Meta FAIR at HPCA 2025, a 1,024-GPU cluster experiences a mean time to failure of just 7.9 hours - and at 16,384 GPUs, that drops to 1.8 hours. Each failure forces teams to provision replacement nodes, restore from the last checkpoint, and recompute every training step since that checkpoint was taken. That recomputed work costs full GPU dollars - compute you already paid for, run again from scratch. The cycle typically costs three or more hours of progress per failure event, with losses accumulating daily.
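
The mean-time-to-failure figures above translate directly into failures per day. A minimal Python sketch of that arithmetic follows, assuming failures arrive at a steady average rate of one per MTTF interval. The function name is illustrative, and the MTTF values are the ones cited above.

```python
HOURS_PER_DAY = 24.0


def failures_per_day(mttf_hours: float) -> float:
    """Average failures per day, assuming one failure per MTTF interval."""
    if mttf_hours <= 0:
        raise ValueError("mttf_hours must be positive")
    return HOURS_PER_DAY / mttf_hours


# MTTF values cited in the text, keyed by cluster size in GPUs.
cited_mttf_hours = {1024: 7.9, 16384: 1.8}

for gpus, mttf in cited_mttf_hours.items():
    print(f"{gpus} GPUs: about {failures_per_day(mttf):.1f} failures per day")
```

On these figures, a 1,024-GPU cluster averages roughly 3 failures per day and a 16,384-GPU cluster roughly 13.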

The consequence is that current GPU clusters effectively operate at 30-50% of their theoretical performance - not because the hardware is slow, but because the reliability framework governing it was never designed for workloads of this nature, duration, or scale.

"AI teams need their models to be done, not their nodes to be up. The industry has been measuring node uptime and calling it reliability. YOCO holds us accountable for the only thing that matters - your model, done," said Vasudevan.

The financial toll is severe. In a typical 2,048-GPU H200 deployment, failure-driven restarts drain over $6 million per year in wasted compute - hundreds of thousands of GPU-hours lost to cascading retries, idle recovery time, and recomputed training steps. For AI builders, the real unit of value is not GPU uptime but time to trained model - yet the infrastructure contract they've been buying guarantees node availability, not job continuity. For AI operators, the gap is equally costly: when a customer's training job fails, restarts, and loses days of progress, the experience is one of unreliability - regardless of what the SLA technically said.

"Recompute and restart is the hidden tax of large-scale training," said Vasudevan. "Most teams treat it as a fact of life. It isn't."

The YOCO Guarantee changes that contract.

TorchPass: Reliability Redefined in Software

Clockwork.io's answer is to make reliability a software-defined property rather than a function of hardware uptime - a fundamental architectural rethink that decouples job continuity from the failure rate of any individual component.

TorchPass addresses failure at its root through live GPU migration: when a fault occurs, TorchPass transfers the training job's full in-memory state, including model weights, gradients, and optimizer state, to a healthy spare node. Training continues from exactly where it stopped, with recovery typically completing in approximately three minutes. No checkpoint restore. No recompute. No lost progress.

TorchPass handles three classes of failure: unplanned migration for sudden, catastrophic faults - kernel crashes, power failures, GPU failures - where state is reconstructed from healthy replicas; pre-emptive migration triggered by early warning signals like rising ECC error rates or thermal thresholds, enabling a controlled handoff before failure occurs; and planned migration for proactive maintenance, security patching, and firmware updates, allowing infrastructure hygiene without interrupting training. Across all three scenarios, the job never stops.

This approach reduces wasted training progress by 90%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster - meaning research teams no longer discover hours of progress silently erased, and model release timelines become predictable rather than probabilistic.
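
As a rough check on the figures above, the reduction implied by going from approximately three hours to ten minutes of lost time per day can be computed directly. This Python sketch uses only the two figures quoted in the text. The function name is illustrative, and ten minutes is taken as the upper bound of "under ten minutes".

```python
def reduction_fraction(before_minutes: float, after_minutes: float) -> float:
    """Fractional reduction in lost time when going from before to after."""
    if before_minutes <= 0:
        raise ValueError("before_minutes must be positive")
    return 1.0 - after_minutes / before_minutes


lost_before = 3 * 60  # approximately three hours per day, in minutes
lost_after = 10       # upper bound of "under ten minutes"

print(f"Reduction in lost time: {reduction_fraction(lost_before, lost_after):.1%}")
```

With these inputs the reduction works out to about 94%, which is at least the 90% stated above.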

In independent testing conducted by SemiAnalysis, a leading AI infrastructure research firm, TorchPass outperformed every competing fault-tolerance framework and was the only solution that "maintains the same training performance as jobs without fault tolerance."

TorchPass is 100% software-based, runs in cloud and on-premises environments, and supports popular training frameworks including TorchTitan, Megatron-LM, and DeepSpeed, on schedulers including Kubernetes and Slurm. It works across NVIDIA and AMD hardware, and across InfiniBand, RoCE, and Ethernet fabrics - with no hardware lock-in of any kind.

Why the Guarantee Changes the Market

For AI builders, it redefines the SLA they should demand. The question is no longer "what is your node uptime?" but "what percentage of my training failures will be resolved without losing progress?" - a metric tied directly to GPU ROI, not an availability percentage that has historically had little relationship to whether models get trained on time. The YOCO Guarantee makes that question answerable and auditable.

For AI operators, it raises the competitive bar. AI cloud operators and infrastructure providers who can offer job-level continuity guarantees - backed by contractual credits - will command premium pricing, win customers burned by restart-driven losses, and protect their margins by dramatically reducing GPU idle time. Those who cannot will find themselves competing only on raw GPU price in a commoditizing market.

And for the industry as a whole, it establishes a new accountability standard. The AI infrastructure market has long accepted vendor claims about fault tolerance at face value, with no contractual obligation behind them. The YOCO Guarantee - measurable and contractually backed - introduces a standard the market will increasingly expect others to match or explain why they cannot.

"There's a big difference between a vendor making a slide that says their product works and them writing it into a contract," said Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX at SemiAnalysis. "In our testing, TorchPass delivered the fastest and most efficient fault-tolerant performance for a GPT-OSS-120B training run on a 64x H200 cluster when compared to checkpoint-restart on job completion time. TorchPass also outperformed TorchFT (in terms of MFU and tokens/sec/GPU) for this job, while matching its recovery time. The YOCO Guarantee just reflects what we saw in testing, and makes it contractual."

"Every enterprise running large-scale AI training knows the cost of a failed job: hours of progress lost, recomputes billed, model timelines slipping. Every product decision we make at Scaleway comes back to one question: are we making our customers' outcomes more predictable? Node uptime answers a different question entirely. The YOCO Guarantee is the first infrastructure commitment we've seen built around the right metric - whether progress is protected and the jobs keep running to completion, not whether the hardware stays up. That's the accountability model the AI infrastructure market has been missing," said Fred Bardolle, Head of Products and AI at Scaleway.

Availability

The YOCO Guarantee is available to new and renewing TorchPass customers effective August 3, 2026. Existing TorchPass customers should contact their Clockwork.io account team to discuss adding the guarantee to their current agreement. To learn more or get started, visit clockwork.io/yoco.

Clockwork.io will be at RAISE Summit in Paris, France, July 8-9, Booth #27A. Suresh Vasudevan, CEO of Clockwork.io, will also take part in the panel "Infrastructure as Destiny: The Compute-Capital-Cloud Trinity" on July 8th at 10:40 a.m. local time on the Main Stage.

About Clockwork.io

Clockwork.io pioneers Software-Driven AI Fabrics™ - a programmable layer between hardware and workload that delivers nanosecond-accurate telemetry, AI fault tolerance, and performance optimization across any accelerator, network, or deployment model. Modern AI workloads need the whole cluster to act as one machine, but failures and infrastructure bottlenecks severely compromise efficiency. Clockwork.io's FleetIQ platform recovers that lost capacity, letting enterprises train, deploy, and serve the world's most demanding AI workloads faster, more reliably, and at lower cost - across any Ethernet, RoCE, or InfiniBand fabric, without hardware lock-in. TorchPass, Clockwork.io's AI fault tolerance product, is independently benchmarked by SemiAnalysis as the only solution that maintains full training throughput during failures, outperforming checkpoint-restart and leading open-source frameworks. Uber, Wells Fargo, DCAI, Nebius, NScale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact

Dana Trismen
[email protected]
650-269-7478

SOURCE: Clockwork

A.Maldonado--TFWP