Condor is a workload management system for compute-intensive tasks. It has been used in production environments for more than 15 years, so it is considered a stable and mature project.
Job submission examples: http://vivaldi.ll.iac.es/sieinvens/siepedia/pmwiki.php?n=HOWTOs.CondorSubmitFile
This section details the Condor setup on our cluster.
We have four nodes, each with 6 physical cores (12 hyper-threads).
All threads are available to Condor, which gives 4 × 12 = 48 Condor slots.
To use Condor, you must be connected to the head node (zarco). The compute nodes are only workers.
Each user has a 48GB memory limit on each machine; each slot has a 5GB memory limit.
Take these limits into account when sizing your jobs: a job that reaches a limit is killed.
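As a rough planning aid, the limits above can be turned into a small calculation. This is only a sketch: it assumes the 48GB per-user limit applies to the sum of all of a user's jobs on one machine, and the function name `max_jobs_per_node` is hypothetical. The constants are copied from the text, not queried from Condor.

```python
# Limits quoted in the text for this cluster (not queried from Condor).
USER_MEM_LIMIT_GB = 48   # per user, per machine
SLOT_MEM_LIMIT_GB = 5    # per slot
THREADS_PER_NODE = 12    # 6 cores x 2 hardware threads


def max_jobs_per_node(job_mem_gb):
    """Return how many identical jobs fit on one node under both limits.

    Assumption: the per-user limit is shared by all of the user's jobs
    running on the same machine.
    """
    if job_mem_gb <= 0:
        raise ValueError("job_mem_gb must be positive")
    if job_mem_gb > SLOT_MEM_LIMIT_GB:
        return 0  # a single job would already exceed the slot limit
    by_user_limit = int(USER_MEM_LIMIT_GB // job_mem_gb)
    return min(THREADS_PER_NODE, by_user_limit)


if __name__ == "__main__":
    for mem in (2, 4, 5, 6):
        print(f"{mem} GB per job: up to {max_jobs_per_node(mem)} jobs per node")
```

Under these assumptions, jobs that each need the full 5GB would be capped by the per-user limit before they fill all 12 slots of a node.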
$ vi sim.submit
Universe = vanilla
Executable = sim
Output = sim.out
Log = sim.log
Error = sim.err
GetEnv = True
Queue
Do not forget “GetEnv” to load your environment!
$ condor_submit sim.submit
The real benefit of Condor comes from managing thousands of jobs.
$ vi sim.submit
Executable = sim
getenv = True
Input = sim.$(PROCESS)
Output = sim.$(PROCESS)
Log = sim.log
Error = sim.err
Queue 1000
Alternatively, pass the process id to the executable as an argument:
Executable = sim
getenv = True
Arguments = $(PROCESS)
Output = sim.$(PROCESS)
Log = sim.log
Error = sim.err
Queue 1000
This executes 1000 processes, each receiving its process id as a parameter: sim 0, sim 1, sim 2, …, sim 999
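On the program side, the process id can be used to select one parameter combination per job. The following is a minimal sketch, assuming the executable is a Python script; the parameter grid (`TEMPERATURES`, `SEEDS`) and the function `params_for` are hypothetical names used only for illustration.

```python
import itertools
import sys

# Hypothetical parameter grid; replace with the values your simulation needs.
TEMPERATURES = [0.5, 1.0, 1.5, 2.0]
SEEDS = range(250)  # 4 temperatures x 250 seeds = 1000 combinations


def params_for(process_id):
    """Map a Condor process id (0..999) to one (temperature, seed) pair."""
    grid = list(itertools.product(TEMPERATURES, SEEDS))
    if not 0 <= process_id < len(grid):
        raise IndexError(f"process id {process_id} outside 0..{len(grid) - 1}")
    return grid[process_id]


if __name__ == "__main__":
    # With "Arguments = $(PROCESS)", the id arrives as the first argument.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else 0
    temperature, seed = params_for(pid)
    print(f"process {pid}: temperature={temperature} seed={seed}")
```

Keeping the mapping inside the program means the submit file stays a single `Queue 1000` statement regardless of how many parameters vary.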
If you need to vary several arguments, set the shared parameters at the beginning and then override the job-specific parameters before each Queue statement:
Executable = sim
getenv = True
Output = sim.$(PROCESS)
Log = sim.log
Error = sim.err

Arguments = a
Output = a.out
Queue

Arguments = b
Output = b.out
Queue

Arguments = c
Output = c.out
Queue
This executes 3 processes with the selected parameters (sim a, sim b, sim c), writing to a.out, b.out, and c.out respectively.
If you have many parameter combinations, we suggest you generate this condor file with a script.
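One possible way to generate such a file is a short Python script. This is a sketch rather than a recommended tool: the function name `build_submit` is hypothetical, and it assumes each argument is a single word that is also safe to use as a file name.

```python
# Settings shared by every job, mirroring the three-job example above.
HEADER = """\
Executable = sim
getenv = True
Log = sim.log
Error = sim.err
"""


def build_submit(arguments):
    """Return submit-file text with one Arguments/Output/Queue group per entry."""
    parts = [HEADER]
    for arg in arguments:
        # Assumption: arg has no spaces, so it can double as the output name.
        parts.append(f"Arguments = {arg}\nOutput = {arg}.out\nQueue\n")
    return "\n".join(parts)


if __name__ == "__main__":
    # Print the generated file; redirect it to disk when calling the script.
    print(build_submit(["a", "b", "c"]))
```

If the script is saved as, say, make_submit.py, running `python make_submit.py > sim.submit` writes a file that can then be passed to condor_submit.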
getenv = True
This parameter copies your local environment variables to the remote execution environment. This is useful, for example, if you want to share global library paths.
initialdir = <path>
Sets the base directory for the job's execution. Useful for referencing files with relative paths.
Requirements = (Machine == "compute-0-1.local")
Restricts the job to machines that satisfy the requirement. In this example, the jobs will only be deployed to compute-0-1.
Here you can find information regarding managing GPUs in Condor.
When launching a job, if you need a GPU, you can specify that in your condor submit file with:
request_GPUs = 1
To check the status of the Condor slots and see how many GPUs are available or in use, run:
$ condor_status -af Machine Gpus RemoteUser State
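The output of this command can also be summarized with a few lines of Python. The sketch below assumes one whitespace-separated row per slot in the order Machine, Gpus, RemoteUser, State, with missing values printed as "undefined"; the function name `summarize_gpus` and the sample rows are made up for illustration and are not real output from this cluster.

```python
import shutil
import subprocess
from collections import Counter

# Made-up rows in the assumed format, used when condor_status is unavailable.
SAMPLE = """\
compute-0-0.local 1 alice@local Claimed
compute-0-0.local 0 undefined Unclaimed
compute-0-1.local 1 undefined Unclaimed
compute-0-1.local undefined undefined Unclaimed
"""


def summarize_gpus(status_text):
    """Sum the Gpus column per slot State."""
    totals = Counter()
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        _machine, gpus, _remote_user, state = fields
        if not gpus.isdigit():
            continue  # slot without a numeric Gpus attribute
        totals[state] += int(gpus)
    return dict(totals)


if __name__ == "__main__":
    if shutil.which("condor_status"):
        text = subprocess.run(
            ["condor_status", "-af", "Machine", "Gpus", "RemoteUser", "State"],
            capture_output=True, text=True, check=True,
        ).stdout
    else:
        text = SAMPLE  # fall back to the illustrative sample
    print(summarize_gpus(text))
```

How the per-state totals map to "free" and "in use" depends on how the slots are configured on the cluster, so treat the numbers as a quick overview rather than an exact count.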
You can adapt the following example .submit file to use your own Python environment.
Edit a .submit file:
$ vi example.submit
It should look like this:
Universe = vanilla
Executable = /home/myusername/.conda/envs/myenv/bin/python
Arguments = your_script.py
getenv = True
Transfer_executable = False
# Point to the base folder of your code (i.e. where the your_script.py file is)
Initialdir = /home/myusername/
Log = /home/myusername/condor_test.log.$(PROCESS)
Output = /home/myusername/condor_test.out.$(PROCESS)
Error = /home/myusername/condor_test.err.$(PROCESS)
# If you need a GPU, you must specify it
request_GPUs = 1
Queue 1
The important parts are:
Some users may have problems using option #1. Alternatively, we can define a bash script that will call Python. This option provides more flexibility since you can run any command before and after executing your Python script (e.g. copying output files to a given folder).
First create a bash script that will invoke Python:
$ touch example.sh # Note the .sh extension
Add the following lines to the .sh file:
#!/bin/bash
# Setup anaconda
. /share/apps/anaconda3/2019.10/etc/profile.d/conda.sh
conda activate myenv
python your_script.py
Next, create the .submit file:
$ vi example_bash.submit
It should look like this:
Universe = vanilla
Executable = /bin/bash
Arguments = example.sh
getenv = True
Transfer_executable = False
# Point to the base folder of your code (i.e. where the your_script.py file is)
Initialdir = /home/myusername/
Log = /home/myusername/condor_test.log.$(PROCESS)
Output = /home/myusername/condor_test.out.$(PROCESS)
Error = /home/myusername/condor_test.err.$(PROCESS)
# If you need a GPU, you must specify it
request_GPUs = 1
Queue 1
Before you start submitting jobs, it is advisable to run a quick test to confirm that everything works correctly. To do so, create a simple Python script:
$ touch your_script.py
$ vi your_script.py
Add the following code to the script:
print("My python script is running!")
# Check if GPUs are seen by PyTorch (if you're not using PyTorch, remove the next two lines)
import torch
print("CUDA is available:", torch.cuda.is_available())
Then create a Condor .submit file using one of the options above as a template (use your own environment and set each path accordingly), and submit it:
# For option #1
$ condor_submit example.submit
# OR, for option #2
$ condor_submit example_bash.submit
Your job is then submitted. Once it has finished, its output is available in the Output file (/home/myusername/condor_test.out.0). It should look like this:
$ cat condor_test.out.0
My python script is running!
CUDA is available: True
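If the CUDA check prints False instead, a few extra lines in the test script can help narrow down the cause without depending on PyTorch. This is a hedged sketch: the helper name `gpu_report` is hypothetical, and it assumes that Condor exposes the assigned GPU through the `CUDA_VISIBLE_DEVICES` environment variable, which is common but depends on the cluster configuration.

```python
import os
import shutil
import socket


def gpu_report(env=os.environ):
    """Collect a few facts that help explain why a job can or cannot see a GPU."""
    return {
        "host": socket.gethostname(),
        # Assumption: set for GPU jobs; may be absent on some configurations.
        "CUDA_VISIBLE_DEVICES": env.get("CUDA_VISIBLE_DEVICES", "<not set>"),
        # True if the NVIDIA command-line tool can be found on the search path.
        "nvidia-smi on PATH": shutil.which("nvidia-smi", path=env.get("PATH")) is not None,
    }


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

The host name shows which compute node ran the job, which is useful when only some nodes have GPUs or when the submit file is missing the request_GPUs line.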