THU FASTsys

Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints

Mar 12, 2024

 SOSP 23


---
 Reporter
  • 周与祺
 Tags
  •  ML
  •  Checkpointing
  •  Fault Tolerance
 Related
  • Removing Obstacles before Breaking Through the Memory Wall: A Close Look at HBM Errors in the Field
  • Cachew: ML Input Data Processing as a Service
  • Demystifying and Checking Silent Semantic Violations in Large Distributed Systems
  • Carbink: Fault-Tolerant Far Memory
  • Geometric Partitioning: Explore the Boundary of Optimal Erasure Code Repair
  • NVMe SSD Failures in the Field: the Fail-Stop and the Fail-Slow
  • Characterizing User Profiles of Storage Disks
  • TVStore: Automatically Bounding Time Series Storage via Time-Varying Compression
  • DeepSketch: A New Machine Learning-Based Reference Search Technique for Post-Deduplication Delta Compression
  • FlashNeuron: SSD-Enabled Large-Batch Training of Very Deep Neural Networks
  • Privacy Budget Scheduling
  • Jump-Starting Multivariate Time Series Anomaly Detection for Online Service Systems