The Fort Worth Press - Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multi-million dollar challenge with AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the costliest failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

"Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant," said Suresh Vasudevan, CEO of Clockwork.io. "We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure."

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

"As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra's NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable," said Patel. "TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics."

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure, and fragility increases sharply with cluster size. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. For most large AI-focused enterprises and AI clouds, failure-driven restarts are therefore inevitable, making reliability a major barrier to scaling AI's impact.
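To put those MTTF figures in concrete terms, a quick back-of-envelope calculation helps. The constant-failure-rate model below is a simplifying assumption for illustration, not a claim from the Meta FAIR study:

```python
# Illustrative arithmetic based on the MTTF figures cited above;
# the constant-failure-rate (1/MTTF) model is a simplifying assumption.
def expected_failures(mttf_hours: float, window_hours: float) -> float:
    """Expected number of failures in a window, at a rate of 1/MTTF."""
    return window_hours / mttf_hours

WEEK = 7 * 24  # 168 hours
print(f"1,024 GPUs:  ~{expected_failures(7.9, WEEK):.0f} failures/week")
print(f"16,384 GPUs: ~{expected_failures(1.8, WEEK):.0f} failures/week")
```

At 16,384 GPUs that works out to roughly one interruption every two hours around the clock, which is why checkpoint-restart alone scales so poorly.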

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning, and restarts. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.

TorchPass addresses this problem proactively, resolving costly AI workload failures before the job stops or needs to restart. Vital for enterprises running large AI workloads and for AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. AI clouds can now service impacted GPUs while the training run proceeds as planned, which translates into better customer SLAs and overall AI cloud economics, improving their ability to protect margin and deliver new models sooner.

"Managing compute output across large-scale GPU clusters is vital to ensuring we're delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations," said David Power, CTO of Nscale. "In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale."

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training
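The three scenarios above amount to a simple trigger taxonomy. The sketch below is a hypothetical illustration of that classification; the signal names and thresholds are our assumptions, not Clockwork.io's actual API:

```python
# Hypothetical sketch of the three migration triggers described above.
# Signal names and thresholds are illustrative assumptions only.
from enum import Enum, auto
from typing import Optional

class Migration(Enum):
    UNPLANNED = auto()   # sudden fault: rebuild state from healthy replicas
    PREEMPTIVE = auto()  # early-warning signal: move before a hard failure
    PLANNED = auto()     # maintenance or rebalancing, operator-initiated

def classify(gpu_faulted: bool, ecc_errors: int, temp_c: float,
             maintenance_requested: bool) -> Optional[Migration]:
    if gpu_faulted:
        return Migration.UNPLANNED
    if ecc_errors > 0 or temp_c > 85.0:  # illustrative thresholds
        return Migration.PREEMPTIVE
    if maintenance_requested:
        return Migration.PLANNED
    return None  # healthy rank: no migration needed
```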

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX, SemiAnalysis' independent benchmark for large-scale AI training, stress-tested Clockwork.io TorchPass and found that it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead. He shared the following results:

"In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out of memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective," concluded Nanos.
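For readers unfamiliar with the MFU metric cited in these benchmarks: Model FLOPs Utilization is conventionally the ratio of FLOPs the model actually performs to the hardware's theoretical peak. The sketch below uses entirely illustrative inputs, not figures from the test above (the ~6N FLOPs-per-token rule applies to dense N-parameter models, and peak throughput depends on GPU and precision):

```python
# Sketch of the conventional MFU calculation; all inputs below are
# illustrative assumptions, not figures from the benchmark above.
def mfu(tokens_per_sec: float, flops_per_token: float,
        n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved model FLOPs / peak hardware FLOPs."""
    achieved = tokens_per_sec * flops_per_token
    return achieved / (n_gpus * peak_flops_per_gpu)

# Hypothetical dense 70B-parameter model (~6 * 70e9 FLOPs per token)
# on 64 GPUs at an assumed 989 TFLOPS peak dense BF16 each.
print(f"MFU: {mfu(60_000, 6 * 70e9, 64, 989e12):.1%}")
```

Because recovery stalls push achieved FLOPs down while peak FLOPs stay fixed, every minute spent restarting shows up directly as lost MFU, which is the mechanism behind the JCT and MFU gaps reported above.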

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.
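As a sanity check, the headline savings can be translated back into GPU-hours. The hourly rate below is an assumed on-demand H200 price used only for illustration, not a quoted figure from Clockwork.io:

```python
# Back-of-envelope check of the $6M/year figure for a 2,048-GPU
# deployment; the $10/GPU-hour rate is an illustrative assumption.
GPUS = 2048
HOURS_PER_YEAR = 365 * 24
RATE_USD_PER_GPU_HOUR = 10.0
SAVINGS_USD = 6_000_000

recovered_gpu_hours = SAVINGS_USD / RATE_USD_PER_GPU_HOUR
annual_capacity = GPUS * HOURS_PER_YEAR
share = recovered_gpu_hours / annual_capacity
print(f"Recovered: {recovered_gpu_hours:,.0f} GPU-hours "
      f"(~{share:.1%} of annual capacity)")
```

Under that assumed rate, $6 million corresponds to roughly 600,000 GPU-hours per year, consistent with the "hundreds of thousands of GPU-hours" cited above.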

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io's prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io's Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in-person at NVIDIA GTC from March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world's most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
[email protected]
650-269-7478

SOURCE: Clockwork



View the original press release on ACCESS Newswire
