The Fort Worth Press - Clockwork.io Launches The Industry's First Contractual Commitment to End GPU Waste in AI Training

"You Only Compute Once" (YOCO) guarantees to resolve 90% of AI training failures with no lost progress, or customers get credit

PALO ALTO, CA / ACCESS Newswire / July 1, 2026 / Clockwork.io, pioneer of Software-Driven AI Fabrics™ and the company behind TorchPass AI fault tolerance, today announced the YOCO Guarantee - the industry's first contractual commitment to dramatically reduce the hidden, compounding cost of training failure in large-scale AI infrastructure. The announcement marks a turning point in how the AI industry measures infrastructure reliability - moving beyond uptime metrics designed for a previous era and toward what AI teams value most: whether the job finishes on time, without losing work.

Under the YOCO (You Only Compute Once) Guarantee, Clockwork.io commits that at least 90% of training failures on supported TorchPass workloads will be resolved through live GPU migration, with no lost training progress, no checkpoint rollback, and no recompute. If Clockwork.io falls short of that commitment in any contract year, customers receive a 25% credit against their next TorchPass renewal or expansion.
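The commitment above reduces to a threshold check over one contract year. The Python sketch below is illustrative only: the function name, its parameters, and the treatment of a year with no failures are assumptions, not terms of the actual contract. Only the 90% threshold and the 25% credit come from the announcement.

```python
# Figures stated in the announcement.
RESOLVED_THRESHOLD_PERCENT = 90  # at least 90% of failures resolved via live GPU migration
CREDIT_RATE = 0.25               # 25% credit against the next renewal or expansion


def yoco_credit(resolved_failures: int, total_failures: int, next_renewal_usd: float) -> float:
    """Return the hypothetical credit owed for one contract year.

    Assumption (not from the source): a year with zero failures
    counts as meeting the commitment, so no credit is owed.
    """
    if total_failures < 0 or not 0 <= resolved_failures <= total_failures:
        raise ValueError("counts must satisfy 0 <= resolved <= total")
    if total_failures == 0:
        return 0.0
    # Integer comparison avoids floating-point rounding at the exact 90% boundary.
    if resolved_failures * 100 >= RESOLVED_THRESHOLD_PERCENT * total_failures:
        return 0.0
    return CREDIT_RATE * next_renewal_usd
```

For example, under these assumptions a year with 89 of 100 failures resolved and a hypothetical $100,000 renewal would yield a $25,000 credit, while 90 of 100 would yield none.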

"We built TorchPass to make training failure irrelevant," said Suresh Vasudevan, CEO of Clockwork.io. "The YOCO Guarantee is a line in the contract. We're putting skin in the game because we know TorchPass delivers, and we want our customers to know it too."

The Hidden Tax on AI Progress

Every AI organization training at scale faces the same brutal math: GPU clusters fail constantly, and every failure triggers an expensive restart cycle. According to research published by Meta FAIR at HPCA 2025, a 1,024-GPU cluster experiences a mean time to failure of just 7.9 hours - and at 16,384 GPUs, that drops to 1.8 hours. Each failure forces teams to provision replacement nodes, restore from the last checkpoint, and recompute every training step since that checkpoint was taken. That recomputed work costs full GPU dollars - compute you already paid for, run again from scratch. The cycle typically costs three or more hours of progress per failure event, with losses accumulating daily.
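As a back-of-the-envelope illustration of the failure rates cited above, the Python sketch below converts a mean time to failure into an expected number of failures per day. It assumes failures arrive at a constant average rate, which is a simplification. The function name is hypothetical, and only the two MTTF values come from the cited research.

```python
def expected_failures_per_day(mttf_hours: float) -> float:
    """Expected failures in 24 hours, assuming a constant average failure rate."""
    if mttf_hours <= 0:
        raise ValueError("mttf_hours must be positive")
    return 24.0 / mttf_hours


# MTTF figures quoted from the Meta FAIR research cited in the text.
for gpus, mttf_hours in ((1024, 7.9), (16384, 1.8)):
    rate = expected_failures_per_day(mttf_hours)
    print(f"{gpus:>6} GPUs: about {rate:.1f} failures per day")
```

Under that assumption, a 7.9-hour MTTF works out to roughly 3 failures per day, and a 1.8-hour MTTF to roughly 13.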

The consequence is that current GPU clusters effectively operate at 30-50% of their theoretical performance - not because the hardware is slow, but because the reliability framework governing it was never designed for workloads of this nature, duration, or scale.

"AI teams need their models to be done, not their nodes to be up. The industry has been measuring node uptime and calling it reliability. YOCO holds us accountable for the only thing that matters - your model, done," said Vasudevan.

The financial toll is severe. In a typical 2,048-GPU H200 deployment, failure-driven restarts drain over $6 million per year in wasted compute - hundreds of thousands of GPU-hours lost to cascading retries, idle recovery time, and recomputed training steps. For AI builders, the real unit of value is not GPU uptime but time to trained model - yet the infrastructure contract they've been buying guarantees node availability, not job continuity. For AI operators, the gap is equally costly: when a customer's training job fails, restarts, and loses days of progress, the experience is one of unreliability - regardless of what the SLA technically said.

"Recompute and restart is the hidden tax of large-scale training," said Vasudevan. "Most teams treat it as a fact of life. It isn't."

The YOCO Guarantee changes that contract.

TorchPass: Reliability Redefined in Software

Clockwork.io's answer is to make reliability a software-defined property rather than a function of hardware uptime - a fundamental architectural rethink that decouples job continuity from the failure rate of any individual component.

TorchPass addresses failure at its root through live GPU migration - when a fault occurs, TorchPass transfers the training job's full in-memory state, including model weights, gradients, and optimizer state, to a healthy spare node. Training continues from exactly where it stopped, typically completing recovery in approximately three minutes. No checkpoint restore. No recompute. No lost progress.

TorchPass supports three kinds of migration: unplanned migration for sudden, catastrophic faults (kernel crashes, power failures, GPU failures), where state is reconstructed from healthy replicas; pre-emptive migration triggered by early warning signals such as rising ECC error rates or thermal thresholds, enabling a controlled handoff before failure occurs; and planned migration for proactive maintenance, security patching, and firmware updates, allowing infrastructure hygiene without interrupting training. Across all three scenarios, the job never stops.

This approach reduces wasted training progress by 90%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster - meaning research teams no longer discover hours of progress silently erased, and model release timelines become predictable rather than probabilistic.

In independent testing conducted by SemiAnalysis, a leading AI infrastructure research firm, TorchPass outperformed every competing fault-tolerance framework and was the only solution that "maintains the same training performance as jobs without fault tolerance."

TorchPass is 100% software-based, runs in cloud and on-premises environments, and supports popular training frameworks including TorchTitan, Megatron-LM, and DeepSpeed, on schedulers including Kubernetes and Slurm. It works across NVIDIA and AMD hardware, and across InfiniBand, RoCE, and Ethernet fabrics - with no hardware lock-in of any kind.

Why the Guarantee Changes the Market

For AI builders, it redefines the SLA they should demand. The question is no longer "what is your node uptime?" but "what percentage of my training failures will be resolved without losing progress?" - a metric tied directly to GPU ROI, not an availability percentage that has historically had little relationship to whether models get trained on time. The YOCO Guarantee makes that question answerable and auditable.

For AI operators, it raises the competitive bar. AI cloud operators and infrastructure providers who can offer job-level continuity guarantees - backed by contractual credits - will command premium pricing, win customers burned by restart-driven losses, and protect their margins by dramatically reducing GPU idle time. Those who cannot will find themselves competing only on raw GPU price in a commoditizing market.

And for the industry as a whole, it establishes a new accountability standard. The AI infrastructure market has long accepted vendor claims about fault tolerance at face value, with no contractual obligation behind them. The YOCO Guarantee - measurable and contractually backed - introduces a standard the market will increasingly expect others to match or explain why they cannot.

"There's a big difference between a vendor making a slide that says their product works and them writing it into a contract," said Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX at SemiAnalysis. "In our testing, TorchPass delivered the fastest and most efficient fault-tolerant performance for a GPT-OSS-120B training run on a 64x H200 cluster when compared to checkpoint-restart on job completion time. TorchPass also outperformed TorchFT (in terms of MFU and tokens/sec/GPU) for this job, while matching its recovery time. The YOCO Guarantee just reflects what we saw in testing, and makes it contractual."

"Every enterprise running large-scale AI training knows the cost of a failed job: hours of progress lost, recomputes billed, model timelines slipping. Every product decision we make at Scaleway comes back to one question: are we making our customers' outcomes more predictable? Node uptime answers a different question entirely. The YOCO Guarantee is the first infrastructure commitment we've seen built around the right metric - whether progress is protected and the jobs keep running to completion, not whether the hardware stays up. That's the accountability model the AI infrastructure market has been missing," said Fred Bardolle, Head of Products and AI at Scaleway.

Availability

The YOCO Guarantee is available to new and renewing TorchPass customers effective August 3, 2026. Existing TorchPass customers should contact their Clockwork.io account team to discuss adding the guarantee to their current agreement. To learn more or get started, visit clockwork.io/yoco.

Clockwork.io will be at RAISE Summit in Paris, France, July 8-9, Booth #27A. Suresh Vasudevan, CEO of Clockwork.io, will also take part in the panel "Infrastructure as Destiny: The Compute-Capital-Cloud Trinity" on July 8th at 10:40 a.m. local time on the Main Stage.

About Clockwork.io

Clockwork.io pioneers Software-Driven AI Fabrics™ - a programmable layer between hardware and workload that delivers nanosecond-accurate telemetry, AI fault tolerance, and performance optimization across any accelerator, network, or deployment model. Modern AI workloads need the whole cluster to act as one machine, but failures and infrastructure bottlenecks severely compromise efficiency. Clockwork.io's FleetIQ platform recovers that lost capacity, letting enterprises train, deploy, and serve the world's most demanding AI workloads faster, more reliably, and at lower cost - across any Ethernet, RoCE, or InfiniBand fabric, without hardware lock-in. TorchPass, Clockwork.io's AI fault tolerance product, is independently benchmarked by SemiAnalysis as the only solution that maintains full training throughput during failures, outperforming checkpoint-restart and leading open-source frameworks. Uber, Wells Fargo, DCAI, Nebius, NScale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io

© 2026 Clockwork Systems Inc. TorchPass and YOCO Guarantee are trademarks of Clockwork Systems Inc. All other trademarks are the property of their respective owners.

Media Contact

Dana Trismen
[email protected]
650-269-7478

SOURCE: Clockwork



A.Maldonado--TFWP