Control Where Your Jobs Run / Job Requirements¶
By default, your jobs will match any available slot in the OSG. This is fine for very generic jobs. However, in some cases a job may have one or more system requirements in order to complete successfully. For instance, your job may need to run on a node with a specific operating system.
HTCondor provides several options for "steering" your jobs to appropriate nodes and system environments. The request_cpus, request_gpus, request_memory, and request_disk submit file attributes should be used to specify the hardware needs of your jobs. Please see our guides Multicore Jobs and Large Memory Jobs for more details.
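As a sketch, a submit file using these attributes might look like the following. The executable name and resource amounts here are illustrative placeholders, not recommendations:

```
# Hardware needs of the job (example values only - size these to your workload)
request_cpus   = 1
request_memory = 2 GB
request_disk   = 4 GB

executable = my_job.sh
log        = job.log
error      = job.err
output     = job.out

queue
```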
HTCondor also provides a requirements attribute and feature-specific attributes that can be added to your submit files to target specific environments in which to run your jobs.
Lastly, there are some custom attributes you can add to your submit file to either focus on, or avoid, certain execution sites.
Requirements¶
The requirements attribute is formatted as an expression, so you can use logical operators to combine multiple requirements, where && is used for AND and || is used for OR. For example, the following requirements statement will direct jobs only to 64-bit RHEL (Red Hat Enterprise Linux) 8 nodes.
requirements = OSGVO_OS_STRING == "RHEL 8" && Arch == "X86_64"
Alternatively, if you have code which can run on either RHEL 7 or 8, you can use OR:
requirements = (OSGVO_OS_STRING == "RHEL 7" || OSGVO_OS_STRING == "RHEL 8") && Arch == "X86_64"
Note that parentheses placement is important for controlling how the logical operations are interpreted by HTCondor.
Another common requirement is to land on a node which provides CVMFS. In that case, the requirements would be:
requirements = HAS_oasis_opensciencegrid_org == True
AVX (segfault / illegal instruction) and Other Hardware Attributes¶
A common problem in distributed computing infrastructures is a mismatch between the executable and the hardware. On OSG, this can happen if you compile a code which automatically detects hardware features such as AVX or AVX2. When you then run the resulting executable in a job, and that job lands on a perhaps slightly older execution endpoint which does not have those hardware features, the execution will fail with an error like segmentation fault or illegal instruction. Sometimes it is difficult to determine exactly which hardware feature is the cause, but a very common one is AVX and AVX2, both of which are advertised and can be matched against. If you are experiencing these problems, try:
requirements = HAS_AVX == True
or
requirements = HAS_AVX2 == True
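These hardware attributes can be combined with other requirements using &&. For example, a possible expression asking for both AVX2 support and RHEL 8 nodes (adjust the pieces to what your job actually needs) would be:

```
requirements = HAS_AVX2 == True && OSGVO_OS_STRING == "RHEL 8"
```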
Additional Feature-Specific Attributes¶
There are many attributes that you can use with requirements. To see what values you can specify for a given attribute, run the following command while connected to your login node:
$ condor_status -af {ATTR_NAME} | sort -u
For example, to see what values you can specify for the OSGVO_OS_STRING attribute run:
$ condor_status -af OSGVO_OS_STRING | sort -u
RHEL 7
RHEL 8
This means that we can specify an OS version of RHEL 7 or RHEL 8. Alternatively, you will find that many attributes take the boolean values true or false.
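For example, the same pattern can be used to inspect a boolean attribute such as HAS_SINGULARITY. The exact values printed will depend on the current state of the pool:

```
$ condor_status -af HAS_SINGULARITY | sort -u
```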
Below is a list of common attributes that you can include in your submit file requirements statement.
- HAS_SINGULARITY - Boolean specifying the need to use Singularity containers in your job.
- OSGVO_OS_NAME - The name of the operating system of the compute node. The most common name is RHEL.
- OSGVO_OS_VERSION - Version of the operating system.
- OSGVO_OS_STRING - Combined OS name and version. Common values are RHEL 7 and RHEL 8. Please see the requirements string above for the recommended setup.
- OSGVO_CPU_MODEL - The CPU model identifier string as presented in /proc/cpuinfo.
- HAS_CVMFS_oasis_opensciencegrid_org - Attribute specifying the need to access specific oasis /cvmfs file system repositories.
- GPUs_Capability - For GPU jobs, specifies the GPUs' compute capability. See our GPU guide for more details.
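Several of these attributes can be combined into a single requirements expression. The following is only a sketch; pick the attributes that match what your job actually needs:

```
# Target RHEL 8 nodes that provide Singularity and the oasis CVMFS repository
requirements = OSGVO_OS_STRING == "RHEL 8" && HAS_SINGULARITY == True && HAS_CVMFS_oasis_opensciencegrid_org == True
```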
Specifying Sites / Avoiding Sites¶
To run your jobs on a list of specific execution sites, or to avoid a set of sites, use the +DESIRED_Sites/+UNDESIRED_Sites attributes in your job submit file. These attributes should only be used as a last resort. For example, it is much better to use feature attributes (see above) to make your job go to nodes matching what you really require than to broadly allow or block whole sites. We encourage you to contact the facilitation team before taking this action, to make sure it is right for you.
To avoid certain sites, first find the site names. You can find a current list by querying the pool:
$ condor_status -af GLIDEIN_Site | sort -u
In your submit file, add a comma separated list of sites like:
+UNDESIRED_Sites = "ISI,SU-ITS"
Those sites will now be excluded from the set of sites your job can run at.
Similarly, you can use +DESIRED_Sites to list a subset of sites you want to target. For example, to run your jobs at the SU-ITS site, and only at that site, use:
+DESIRED_Sites = "SU-ITS"
Note that you should only specify one of +DESIRED_Sites/+UNDESIRED_Sites in the submit file. Using both at the same time will prevent the job from running.
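In context, such a site attribute is just one more line in the submit file. A minimal sketch follows; the executable name and site list are illustrative placeholders:

```
executable = my_job.sh

# Avoid these sites (comma separated list, inside quotes)
+UNDESIRED_Sites = "ISI,SU-ITS"

log    = job.log
error  = job.err
output = job.out

queue
```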