Research Paper | Researchia:202605.19004

A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability

Ruitao Liu


Submitted: May 19, 2026
Subjects: Machine Learning; Data Science

Description / Details

Pipeline parallelism is a key technique for scaling large-model training, but modern workloads exhibit runtime variability in computation and communication. Existing pipeline systems typically consume static, profiled, or adaptively generated schedules as pre-committed execution orders. When realized task readiness diverges from the pre-committed order, stages may wait for not-yet-ready work even though other executable work is available, creating stage misalignment, idle bubbles, and reduced utilization. We present Runtime-Readiness-First Pipeline (RRFP), a readiness-driven runtime for pipeline-parallel training. RRFP changes how schedules are consumed at runtime: instead of treating a schedule as a sequence that stages must wait to follow, it treats the schedule as a non-binding hint order for ranking currently ready work. To support this model, RRFP combines message-driven asynchronous communication, lightweight tensor-parallel coordination for collective consistency, and ready-set arbitration for low-overhead dispatch. We implement RRFP in a Megatron-based training framework and evaluate it on language-only and multimodal workloads at up to 128 GPUs. RRFP improves over fixed-order pipeline baselines across all settings. Using the BFW hint, RRFP achieves up to 1.77× speedup on language-only workloads and up to 2.77× on multimodal workloads. In cross-framework comparisons, RRFP with the default BF hint outperforms the faster available external system by up to 1.84× while preserving training correctness.
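The difference between following a schedule and using it as a hint can be illustrated with a toy single-stage simulation. The sketch below is not RRFP's implementation. The function names, the task labels, and the externally supplied `ready_at` times are all assumptions made for illustration, and the sketch ignores inter-stage dependencies, communication, and tensor-parallel coordination. It only contrasts a stage that waits for the next task in a pre-committed order with a stage that picks, among the tasks currently ready, the one ranked earliest by the hint.

```python
from typing import Dict, List, Tuple

Task = str  # hypothetical task label, e.g. "F0" = forward of microbatch 0


def run_fixed_order(order: List[Task],
                    ready_at: Dict[Task, float],
                    cost: Dict[Task, float]) -> float:
    """Execute tasks strictly in `order`, idling until each one is ready."""
    clock = 0.0
    for task in order:
        clock = max(clock, ready_at[task])  # idle bubble if the task is late
        clock += cost[task]
    return clock


def run_readiness_first(hint: List[Task],
                        ready_at: Dict[Task, float],
                        cost: Dict[Task, float]) -> Tuple[float, List[Task]]:
    """Use `hint` only to rank the tasks that are ready right now."""
    rank = {task: i for i, task in enumerate(hint)}
    pending = set(hint)
    clock = 0.0
    executed: List[Task] = []
    while pending:
        ready = [t for t in pending if ready_at[t] <= clock]
        if not ready:
            # Nothing is runnable: advance to the earliest upcoming arrival.
            clock = min(ready_at[t] for t in pending)
            continue
        task = min(ready, key=rank.__getitem__)  # earliest-ranked ready task
        pending.remove(task)
        clock += cost[task]
        executed.append(task)
    return clock, executed


# Assumed toy workload: F1 arrives late, while B0 and F2 are ready early.
hint = ["F0", "F1", "B0", "F2"]
ready_at = {"F0": 0.0, "F1": 5.0, "B0": 1.0, "F2": 1.0}
cost = {task: 1.0 for task in hint}

fixed_finish = run_fixed_order(hint, ready_at, cost)
rf_finish, rf_order = run_readiness_first(hint, ready_at, cost)
print(fixed_finish, rf_finish, rf_order)
```

In this toy setting the fixed-order stage idles from time 1 to time 5 waiting for F1, while the readiness-first stage fills that gap with B0 and F2 and finishes earlier. When every task is ready in hint order, both functions execute the same sequence, which is the sense in which the hint stays non-binding rather than being discarded.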


Source: arXiv:2605.18750v1 - http://arxiv.org/abs/2605.18750v1
PDF: https://arxiv.org/pdf/2605.18750v1
