Roadmap to HTC Workload Submission via OSG Connect¶
This guide lays out the steps needed to go from logging in to an OSG Connect login node to running a full-scale high throughput computing (HTC) workload on OSG's Open Science Pool (OSPool). The steps listed here, each with links to the relevant documentation pages, apply to any new workload submission, whether you are a long-time OSG user or just getting started with your first workload.
This guide assumes that you have applied for an account on the OSG Connect service and
have been approved after meeting with an OSG Research Computing Facilitator.
If you don't yet have an account, you can apply for one at
Learning how to get started on the OSG does not need to end with this document or our guides! Learn about our training opportunities and personal facilitation support in the Getting Help section below.
1. Introduction to the OSPool and OSG Connect¶
The OSG's Open Science Pool is best suited for computing work that can be run as many independent tasks, in an approach called "high throughput computing." For more information on what kind of work is a good fit for the OSG, see Is the Open Science Pool for You?.
Learn more about the services provided by the OSG that can support your HTC workload:
2. Get on OSG Connect¶
After your OSG account has been approved, go through the following guides to complete your access to the login node and to enable your account to submit jobs.
3. Learn to Submit HTCondor Jobs¶
Computational work is run on the OSPool by submitting it as “jobs” to the HTCondor scheduler. Jobs submitted to HTCondor are then scheduled and run on different resources that are part of the Open Science Pool. Before submitting your own computational work, it is important to understand how HTCondor job submission works. The following guides show how to submit basic HTCondor jobs. The second example allows you to see where in the OSPool your jobs run.
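As a rough illustration of what a basic job submission involves, the sketch below uses Python to render a minimal HTCondor submit file as text. The file names (`hello.sh`, `hello.log`, and so on) and the resource values are hypothetical placeholders, not OSG recommendations; the submit commands themselves (`executable`, `log`, `output`, `error`, `request_cpus`, `request_memory`, `request_disk`, `queue`) are standard HTCondor submit commands.

```python
def render_submit(commands, queue_statement="queue 1"):
    """Render a dict of submit commands as HTCondor submit-file text."""
    lines = [f"{key} = {value}" for key, value in commands.items()]
    lines.append(queue_statement)  # the queue statement goes last
    return "\n".join(lines) + "\n"


# Hypothetical single test job; names and sizes are placeholders.
test_job = {
    "executable": "hello.sh",
    "log": "hello.log",
    "output": "hello.out",
    "error": "hello.err",
    "request_cpus": "1",
    "request_memory": "1GB",
    "request_disk": "1GB",
}

submit_text = render_submit(test_job)
print(submit_text)
```

Saving the printed text to a file (for example `hello.sub`, a hypothetical name) and passing that file to `condor_submit` on the login node would submit the job; the linked guides remain the authoritative walkthrough.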
4. Test a First Job¶
After learning about the basics of HTCondor job submission, you will need to generate your own HTCondor job -- including the software needed by the job and the appropriate mechanism to handle the data. We recommend doing this using a single test job.
Prepare your software¶
Software is an integral part of your HTC workflow. Whether you’ve written it yourself, inherited it from your research group, or use common open-source packages, any required executables and libraries will need to be made available to your jobs if they are to run on the OSPool.
Read through this overview of Using Software in OSG Connect to help you determine the best way to provide your software. We also have the following guides/tutorials for each major software portability approach:
- To install your own software, begin with the guide on Compiling Software for OSG Connect and then complete the Example Software Compilation tutorial.
- To use precompiled binaries, try the example presented in the AutoDock Vina tutorial and/or the Julia tutorial.
- To use Docker containers for your jobs, start with the Docker and Singularity Containers guide, and (optionally) work through the TensorFlow tutorial (which uses Docker/Singularity).
- To use Distributed Environment Modules for your jobs, start with this Modules guide and then complete the Module example in this R tutorial.
Finally, here are some additional guides specific to some of the most common scripting languages and software tools used on OSG**:
**This is not a complete list. Feel free to search for your software in our Knowledge base.
Manage your data¶
The data for your jobs will need to be transferred to each job that runs in the OSPool, and HTCondor has built-in features for getting data to jobs. Our Data Management Policies guide discusses the relevant approaches, when to use them, and where to stage data for each.
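To make HTCondor's built-in file transfer concrete, here is a minimal sketch that builds the relevant submit-file lines in Python. The input file names are hypothetical, and this shows only one of the approaches; which approach fits a given dataset is covered by the Data Management Policies guide, not decided by this example.

```python
def transfer_commands(input_files):
    """Return submit-file lines asking HTCondor to ship input files with a job."""
    return [
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        # HTCondor expects a comma-separated list of files here.
        "transfer_input_files = " + ", ".join(input_files),
    ]


# Hypothetical inputs for one job.
lines = transfer_commands(["analyze.py", "sample_001.csv"])
print("\n".join(lines))
```

These lines would sit alongside the other commands in a submit file, before the `queue` statement.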
Assign the Appropriate Job Duration Category¶
Jobs running in the OSPool may be interrupted at any time, and will be re-run by HTCondor, unless a single execution of a job exceeds the allowed duration. Jobs expected to take longer than 10 hours will need to identify themselves as 'Long' according to our Job Duration policies. Remember that jobs expected to take longer than 20 hours are not a good fit for the OSPool (see Is the Open Science Pool for You?) without implementing self-checkpointing (further below).
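The thresholds above can be read as a simple decision rule. The sketch below encodes only what this section states (over 10 hours means 'Long'; over 20 hours is a poor fit without self-checkpointing). The attribute name `+JobDurationCategory` is an assumption here and should be checked against the Job Duration policies before use.

```python
def duration_advice(expected_hours):
    """Map an expected single-execution runtime (hours) to the guidance above."""
    if expected_hours > 20:
        return "poor fit for the OSPool without self-checkpointing"
    if expected_hours > 10:
        # Assumed attribute name; verify against the Job Duration policies.
        return '+JobDurationCategory = "Long"'
    return "no 'Long' category needed"


for hours in (3, 12, 25):
    print(hours, "->", duration_advice(hours))
```

Measuring the runtime of a single test job is a reasonable way to obtain the `expected_hours` estimate.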
5. Scale Up¶
After you have a sample job running successfully, you’ll want to scale up in one or two steps (first run several jobs, before running ALL of them). HTCondor has many useful features that make it easy to submit multiple jobs with the same submit file.
- Easily submit multiple jobs
- Scaling up after success with test jobs discusses how to test your jobs for duration, memory and disk usage, and the total amount of space you might need on the
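One common way to submit many jobs from a single submit file is HTCondor's `queue N` statement combined with the `$(Process)` macro, which takes the values 0 through N-1. The sketch below renders such a file for a small test batch and for a larger run; the executable name, the log and output file names, and the batch sizes are hypothetical.

```python
def multi_job_submit(n_jobs):
    """Render submit-file text that queues n_jobs jobs from one description."""
    return "\n".join([
        "executable = analyze.sh",
        # $(Process) expands to 0, 1, ..., n_jobs - 1, one value per job.
        "arguments = $(Process)",
        "output = job_$(Process).out",
        "error = job_$(Process).err",
        "log = batch.log",
        f"queue {n_jobs}",
    ]) + "\n"


small_test = multi_job_submit(10)    # first, run several jobs
full_run = multi_job_submit(1000)    # then the full workload
print(small_test)
```

Keeping everything identical except the `queue` count means the resource usage observed in the small batch should carry over to the full run.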
6. Special Use Cases¶
If you think any of the below applies to you, please get in touch and our facilitation team will be happy to discuss your individual case.
- Run sequential workflows of jobs: Workflows with HTCondor's DAGMan
- Implement self-checkpointing for long jobs: HTCondor Checkpointing Guide
- Build your own Docker container: Creating a Docker Container Image
- Submit more than 10,000 jobs at once: FAQ, search for 'max_idle'
- Larger or speciality resource requests:
Getting Help¶
The OSG Facilitation team is here to help with questions and issues that come up as you work through these roadmap steps. We are available via email, office hours, and appointments, and we offer regular training opportunities. See our Get Help page and OSG Training page for all the different ways you can reach us. Our purpose is to assist you with achieving your computational goals, so we want to hear from you!