Typically, we want to
- train, validate, and test our models with various choices of training hyper-parameters in an effort to optimize performance; and,
- depending on our problem, experiment with tweaked or wholly different model architectures.
Any ML framework or library facilitates fine-tuning your training process. DUlib is, in addition, designed to assist in the search for a preferred architecture for your problem.
In any case, once you have your code working, you no doubt want to run some experiments. There are several ways to do this. The following are example solutions that we use at the DL@DU Project.
Suppose that you have a program named myprogram.py that accepts various command-line arguments. Maybe you run your program like this:
python3 myprogram.py -bs 20 -epochs 10 -lr 0.05 -mo 0.95
In the sections that follow, we look at recipes for running experiments on various hardware configurations.
Many of the advanced DL@DU students have accounts on the dual-GPU Arch Linux machine in Simmons' office (and they can RDP into that machine over a VPN). If you are running jobs on the Arch machine, have a look at the next section; but, in practice, be sure to use the job queue as described in the multiple-users section.
Single node, multiple GPUs, one user
Scenario 1
Suppose that you have a program that runs on a single GPU. Since there are 2 GPUs in the Arch machine, you want to run 2 instances of your program in parallel, one on each GPU. To keep things simple, suppose you just want to tweak a few of the learning parameters.
Method 1
parallel -j2 eval CUDA_VISIBLE_DEVICES='$(( {%} - 1 ))' python3 myprogram.py -bs 20 -epochs 10 -lr {1} -mo {2} ::: 0.05 0.01 :::+ 0.95 0.98

The command above is equivalent to running (simultaneously) the following commands:
CUDA_VISIBLE_DEVICES=0 python3 myprogram.py -bs 20 -epochs 10 -lr 0.05 -mo 0.95
CUDA_VISIBLE_DEVICES=1 python3 myprogram.py -bs 20 -epochs 10 -lr 0.01 -mo 0.98
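Here {1} and {2} are replaced by arguments drawn from the first and second input sources, and :::+ links the two sources positionally; a plain ::: would instead form their cross product (four jobs rather than two). You can see the difference with a toy command:

parallel echo ::: 1 2 ::: a b    # cross product: 1 a, 1 b, 2 a, 2 b
parallel echo ::: 1 2 :::+ a b   # linked: 1 a, 2 b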
Method 2

The following is just a different, but equivalent, way to organize Method 1.
Create a file named, say, params containing these two lines:
-bs 20 -epochs 10 -lr .05 -mo .95
-bs 20 -epochs 10 -lr .01 -mo .98

Then run
parallel -j2 eval CUDA_VISIBLE_DEVICES='$(( {%} - 1 ))' python3 myprogram.py {} :::: params
By modifying Method 2, you can run many experiments. Suppose that the following lines are in a file in your current directory called params2:
-bs 20 -epochs 10 -lr .001 -mo .99
-bs 20 -epochs 10 -lr .001 -mo .99 -channels 1 16 32
-bs 20 -epochs 10 -lr .001 -mo .99 -channels 1 32 16
-bs 20 -epochs 10 -lr .001 -mo .99 -channels 1 16 32 -widths 1000 500
Then, issuing the command
parallel -j2 eval CUDA_VISIBLE_DEVICES='$(( {%} - 1 ))' python3 myprogram.py {} :::: params2
results in running (possibly in a different order, depending on which ones complete first):
CUDA_VISIBLE_DEVICES=0 python3 myprogram.py -bs 20 -epochs 10 -lr 0.001 -mo 0.99
CUDA_VISIBLE_DEVICES=1 python3 myprogram.py -bs 20 -epochs 10 -lr 0.001 -mo 0.99 -channels 1 16 32
CUDA_VISIBLE_DEVICES=0 python3 myprogram.py -bs 20 -epochs 10 -lr 0.001 -mo 0.99 -channels 1 32 16
CUDA_VISIBLE_DEVICES=1 python3 myprogram.py -bs 20 -epochs 10 -lr 0.001 -mo 0.99 -channels 1 16 32 -widths 1000 500
Assuming the order above, whenever the first line finishes running on GPU 0, the third line starts running on GPU 0, and similarly for GPU 1.
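This works because {%} is parallel's job-slot replacement string: with -j2 there are two slots, numbered 1 and 2, so $(( {%} - 1 )) evaluates to 0 or 1, and each job runs on whichever GPU's slot is free. A toy command makes the slot assignment visible:

parallel -j2 'echo {} ran in slot {%}' ::: a b c d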
You can capture the output (for either method) in a file by redirecting:
parallel -j2 eval CUDA_VISIBLE_DEVICES='$(( {%} - 1 ))' python3 myprogram.py {} :::: params2 >> logfile
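For a larger sweep, you need not write the params file by hand. Here is a sketch that generates one with nested shell loops (the file name params3 and the particular values are just for illustration):

# write one line of flags per (lr, mo) combination; 12 lines total
for lr in .1 .05 .01 .005; do
  for mo in .9 .95 .99; do
    echo "-bs 20 -epochs 10 -lr $lr -mo $mo"
  done
done > params3

Then run the resulting 12 jobs, two at a time, exactly as before:

parallel -j2 eval CUDA_VISIBLE_DEVICES='$(( {%} - 1 ))' python3 myprogram.py {} :::: params3 >> logfile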
Single node, multiple GPUs, multiple users
Method 3
As an alternative to the methods above, you could put these lines in a file called jobs:
CUDA_VISIBLE_DEVICES=0 python3 myprogram.py -bs 20 -epochs 10 -lr 0.001 -mo 0.99
CUDA_VISIBLE_DEVICES=1 python3 myprogram.py -bs 20 -epochs 10 -lr 0.001 -mo 0.99 -channels 1 16 32
CUDA_VISIBLE_DEVICES=0 python3 myprogram.py -bs 20 -epochs 10 -lr 0.001 -mo 0.99 -channels 1 32 16
CUDA_VISIBLE_DEVICES=1 python3 myprogram.py -bs 20 -epochs 10 -lr 0.001 -mo 0.99 -channels 1 16 32 -widths 1000 500

and then simply issue the command
parallel -j2 {} :::: jobs >> logfile

or, just,
parallel -j2 -a jobs >> logfile

But, since we are assuming multiple users, we may have more than one person issuing commands like these, and that will very quickly lead to collisions.
Method 4
The following is specific to the way that Simmons' office machine is set up.
This method solves the problem of collisions by using a job queue. In other words, whenever you want to run programs that are GPU (or even CPU) intensive, you encapsulate those in a file and put them on the queue. Then the programs are dequeued (in a FIFO manner) and run as resources become available.
Let us write down how to do this on the Arch machine in Simmons' office. (This setup involves some shell scripts that ultimately call the parallel command discussed above. Elsewhere, we will provide the code for running such a queue. Here we discuss solely how to use the queue.)
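As an aside, the GNU parallel manual documents a bare-bones version of this idea (see its example "GNU parallel as queue system/batch manager"). A minimal sketch, not the actual setup on the Arch machine:

# create an empty queue file, then follow it forever, handing each
# appended line to parallel, which runs at most two jobs at a time
true > jobqueue
tail -n +0 -f jobqueue | parallel -j2

# meanwhile, from another terminal, enqueue a job by appending a line
echo "CUDA_VISIBLE_DEVICES=0 python3 myprogram.py -bs 20 -epochs 10 -lr 0.001 -mo 0.99" >> jobqueue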
Suppose that the file jobs looks like this:
CUDA_VISIBLE_DEVICES=0 python3 myprogram.py -bs 20 -epochs 10 -lr 0.001 -mo 0.99
CUDA_VISIBLE_DEVICES=1 python3 myprogram.py -bs 20 -epochs 10 -lr 0.001 -mo 0.99 -channels 1 16 32
CUDA_VISIBLE_DEVICES=0 python3 myprogram.py -bs 20 -epochs 10 -lr 0.001 -mo 0.99 -channels 1 32 16
CUDA_VISIBLE_DEVICES=1 python3 myprogram.py -bs 20 -epochs 10 -lr 0.001 -mo 0.99 -channels 1 16 32 -widths 1000 500

On Simmons' machine, you can place those 4 jobs on the queue simply by typing:
qq jobs

As each job completes, the output will be displayed in the terminal from which the command above was issued.
If we wish instead to log the output to logfile:
qq jobs __log logfile

Note the double underscore.
Alternatively, if you wish to quickly throw a job onto the queue without bothering with a jobs file, then, e.g.:
qq CUDA_VISIBLE_DEVICES=0 python3 myprogram.py -bs 20 -epochs 10 -lr 0.001 -mo 0.99

and, if you want to log the output to a file instead of seeing it in your terminal:
qq CUDA_VISIBLE_DEVICES=0 python3 myprogram.py -bs 20 -epochs 10 -lr 0.001 -mo 0.99 __log logfile

Now, the queue is just a certain file that resides on the Arch machine. You don't want to edit that file manually; the qq commands above edit it for you.
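In spirit, the core of such a command could hardly be simpler. A hypothetical sketch (this is not the actual qq script, which also handles __log, __clear, and so on):

#!/bin/sh
# append the given job, as a single line, to the shared queue file
echo "$@" >> /home/sharedData/jobqueue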
Of course, others might already have their jobs on the queue. You can check what's on the queue:
cat /home/sharedData/jobqueue

or you can just issue the command
qq
which is a shortcut for the command above.

In the current implementation, the lines in the file /home/sharedData/jobqueue remain even after those programs have completed. Over time, that file grows unless we remove lines. Here is how we can do that manually:
echo __clear >> /home/sharedData/jobqueue

(That's a double underscore in front of the word clear.)
When the line __clear gets evaluated, the jobs above it (which have all already completed) are removed. If you wish, you can add __clear to your jobs file. Then when your jobs finish running, they will be automatically removed from the queue.
Or, you can put __clear at the beginning of your jobs file.
__clear
CUDA_VISIBLE_DEVICES=0 python3 myprogram.py -bs 20 -epochs 10 -lr 0.001 -mo 0.99
CUDA_VISIBLE_DEVICES=1 python3 myprogram.py -bs 20 -epochs 10 -lr 0.001 -mo 0.99 -channels 1 16 32
CUDA_VISIBLE_DEVICES=0 python3 myprogram.py -bs 20 -epochs 10 -lr 0.001 -mo 0.99 -channels 1 32 16
CUDA_VISIBLE_DEVICES=1 python3 myprogram.py -bs 20 -epochs 10 -lr 0.001 -mo 0.99 -channels 1 16 32 -widths 1000 500

Tips
- The __clear line at the top of jobs clears the lines of completed jobs from the file /home/sharedData/jobqueue, but it also guarantees that the first two jobs in jobs start at the same time; without the leading __clear, the first job in jobs would start before the second if someone was still using CUDA_VISIBLE_DEVICES=1. So, when reasoning about optimal GPU utilization, you may or may not want to include __clear; you can safely omit it.
- Type qq -h or qq -help (or qq --h or qq --help) to see more usage examples involving putting jobs on the queue.
Resources
- GNU Parallel article