Proposed: Managing Multiple Training Runs in Separate Pods
The Context
We are training a model on multiple GPUs and have run into problems when trying to run two training runs in the same pod.
The Problem Statement
We cannot run multiple GPU training runs in the same pod, which limits our ability to parallelize training and make full use of our GPUs.
Proposal
Based on the discussion, the proposal is to manage the training runs in separate pods. Specifically:
- Start a pod with a single 4090 (1x 4090) for stable training.
- Start a second pod with 4x 4090s to test the theory that a multi-GPU run can only be launched once per session (see the pod-creation sketch below).
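As a concrete starting point, here is a minimal sketch of creating the two pods. It assumes the pods are RunPod instances managed through the runpod Python SDK's create_pod helper; the image name, the gpu_type_id string, and the API-key handling are placeholders to check against the SDK documentation before use.

```python
import os

import runpod  # assumed: the RunPod Python SDK (pip install runpod)

runpod.api_key = os.environ["RUNPOD_API_KEY"]  # assumed env var for credentials

# Hypothetical training image; replace with the real one.
TRAINING_IMAGE = "ghcr.io/v-sekai/training:latest"

# Pod 1: a single RTX 4090 for the stable training run.
stable_pod = runpod.create_pod(
    name="stable-train-1x4090",
    image_name=TRAINING_IMAGE,
    gpu_type_id="NVIDIA GeForce RTX 4090",  # assumed gpu_type_id string
    gpu_count=1,
)

# Pod 2: four RTX 4090s to test whether a multi-GPU run can only be
# launched once per session.
multi_pod = runpod.create_pod(
    name="multi-gpu-test-4x4090",
    image_name=TRAINING_IMAGE,
    gpu_type_id="NVIDIA GeForce RTX 4090",
    gpu_count=4,
)

# create_pod is assumed to return a dict describing the new pod, including its id.
print("stable pod id:", stable_pod["id"])
print("multi-GPU pod id:", multi_pod["id"])
```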
The Benefits
- This approach allows us to run multiple training runs simultaneously, making full use of our resources.
- It also isolates the training runs from each other, reducing the risk of one affecting the other (a per-pod launch sketch follows this list).
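Once a pod is up, it can size its training launch to the GPUs it actually sees, which is what keeps the runs independent. Below is a minimal sketch assuming PyTorch and a hypothetical train.py entry point, using torchrun for the multi-GPU pod and a plain python invocation for the single-GPU pod.

```python
import subprocess

import torch


def launch_training(script: str = "train.py") -> None:
    """Launch a training run sized to the GPUs visible in this pod."""
    gpu_count = torch.cuda.device_count()

    if gpu_count <= 1:
        # Single-GPU pod (the stable run): plain python invocation.
        cmd = ["python", script]
    else:
        # Multi-GPU pod (the 4x 4090 test): one process per GPU via torchrun.
        cmd = [
            "torchrun",
            "--standalone",
            f"--nproc_per_node={gpu_count}",
            script,
        ]

    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    launch_training()
```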
The Downsides
- Managing multiple pods might be more complex than managing a single one.
- There may be additional overhead associated with running multiple pods.
The Road Not Taken
- We could have kept trying to run multiple training runs in the same pod, but this kept causing issues (the sketch below shows roughly what same-pod multiplexing would have required).
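For reference, our understanding is that keeping everything in one pod would have required manually partitioning the GPUs and the distributed rendezvous between runs, roughly as sketched below. The GPU split, port numbers, and script names are illustrative assumptions, not a tested recipe.

```python
import os
import subprocess

# Two concurrent runs in one 4-GPU pod would each need their own GPU subset
# and their own rendezvous port so the torch.distributed launches do not collide.
runs = [
    {"gpus": "0,1", "port": "29500", "script": "train_a.py"},  # hypothetical scripts
    {"gpus": "2,3", "port": "29600", "script": "train_b.py"},
]

procs = []
for run in runs:
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = run["gpus"]  # restrict this run to its two GPUs
    procs.append(
        subprocess.Popen(
            [
                "torchrun",
                "--nproc_per_node=2",
                f"--master_port={run['port']}",  # keep the rendezvous ports apart
                run["script"],
            ],
            env=env,
        )
    )

for proc in procs:
    proc.wait()
```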
The Infrequent Use Case
- This approach may not be necessary if we only need to run a single training run at a time.
In Core and Done by Us
- The decision on how to manage the pods and the training runs will be made by us, using our acquired resources and expertise.
Status
Status: Proposed
Decision Makers
- V-Sekai development team
Further Reading
- V-Sekai · GitHub - the official GitHub account for the V-Sekai development community, focusing on social VR functionality for the Godot Engine.
- V-Sekai/v-sekai-game - the GitHub page for the V-Sekai open-source project, which brings social VR/VRSNS/metaverse components to the Godot Engine.
AI assistant Aria assisted with this article.