Proposed: Managing Multiple Training Runs in Separate Pods
The Context
We are training a model on multiple GPUs and have run into problems when trying to run two training runs in the same pod.
The Problem Statement
We cannot run multiple GPU training runs in the same pod, which limits our ability to parallelize training and make full use of our GPUs.
Proposal
Based on the discussion, the proposal is to manage the training runs in separate pods. Specifically:
- Start a pod with a single 4090 (1x 4090) for stable training.
- Start a second pod with 4x 4090s to test the theory that a multi-GPU run can only be launched once per session (see the pod-creation sketch below).
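As a concrete starting point, here is a minimal sketch of creating the two pods. It assumes the pods are RunPod instances managed through the runpod Python SDK's create_pod helper; the image name, the gpu_type_id string, and the API-key handling are placeholders to check against the SDK documentation before use.

```python
import os

import runpod  # assumed: the RunPod Python SDK (pip install runpod)

runpod.api_key = os.environ["RUNPOD_API_KEY"]  # assumed env var for credentials

# Hypothetical training image; replace with the real one.
TRAINING_IMAGE = "ghcr.io/v-sekai/training:latest"

# Pod 1: a single RTX 4090 for the stable training run.
stable_pod = runpod.create_pod(
    name="stable-train-1x4090",
    image_name=TRAINING_IMAGE,
    gpu_type_id="NVIDIA GeForce RTX 4090",  # assumed gpu_type_id string
    gpu_count=1,
)

# Pod 2: four RTX 4090s to test whether a multi-GPU run can only be
# launched once per session.
multi_pod = runpod.create_pod(
    name="multi-gpu-test-4x4090",
    image_name=TRAINING_IMAGE,
    gpu_type_id="NVIDIA GeForce RTX 4090",
    gpu_count=4,
)

# create_pod is assumed to return a dict describing the new pod, including its id.
print("stable pod id:", stable_pod["id"])
print("multi-GPU pod id:", multi_pod["id"])
```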
The Benefits
- This approach allows us to run multiple training runs simultaneously, making full use of our resources.
- It also isolates the training runs from each other, reducing the risk of one affecting the other (a per-pod launch sketch follows this list).
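Once a pod is up, it can size its training launch to the GPUs it actually sees, which is what keeps the runs independent. Below is a minimal sketch assuming PyTorch and a hypothetical train.py entry point, using torchrun for the multi-GPU pod and a plain python invocation for the single-GPU pod.

```python
import subprocess

import torch


def launch_training(script: str = "train.py") -> None:
    """Launch a training run sized to the GPUs visible in this pod."""
    gpu_count = torch.cuda.device_count()

    if gpu_count <= 1:
        # Single-GPU pod (the stable run): plain python invocation.
        cmd = ["python", script]
    else:
        # Multi-GPU pod (the 4x 4090 test): one process per GPU via torchrun.
        cmd = [
            "torchrun",
            "--standalone",
            f"--nproc_per_node={gpu_count}",
            script,
        ]

    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    launch_training()
```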
The Downsides
- Managing multiple pods might be more complex than managing a single one.
- There may be additional overhead associated with running multiple pods.
The Road Not Taken
- We could have kept trying to run multiple training runs in the same pod, but this kept causing issues (the sketch below shows roughly what same-pod multiplexing would have required).
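For reference, our understanding is that keeping everything in one pod would have required manually partitioning the GPUs and the distributed rendezvous between runs, roughly as sketched below. The GPU split, port numbers, and script names are illustrative assumptions, not a tested recipe.

```python
import os
import subprocess

# Two concurrent runs in one 4-GPU pod would each need their own GPU subset
# and their own rendezvous port so the torch.distributed launches do not collide.
runs = [
    {"gpus": "0,1", "port": "29500", "script": "train_a.py"},  # hypothetical scripts
    {"gpus": "2,3", "port": "29600", "script": "train_b.py"},
]

procs = []
for run in runs:
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = run["gpus"]  # restrict this run to its two GPUs
    procs.append(
        subprocess.Popen(
            [
                "torchrun",
                "--nproc_per_node=2",
                f"--master_port={run['port']}",  # keep the rendezvous ports apart
                run["script"],
            ],
            env=env,
        )
    )

for proc in procs:
    proc.wait()
```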
The Infrequent Use Case
- This approach may not be necessary if we only need to run a single training run at a time.
In Core and Done by Us
- The decision on how to manage the pods and the training runs will be made by us, using our acquired resources and expertise.
Status
Status: Proposed
Decision Makers
- V-Sekai development team
Further Reading
- V-Sekai · GitHub - the official GitHub account for the V-Sekai development community, focusing on social VR functionality for the Godot Engine.
- V-Sekai/v-sekai-game - the GitHub page for the V-Sekai open-source project, which brings social VR/VRSNS/metaverse components to the Godot Engine.
AI assistant Aria assisted with this article.