- AIDA is a private cluster with restricted access to members of the rab38_0001 project
- Head node: aida.cac.cornell.edu accessed via ssh (aida2.cac.cornell.edu also works as an alias)
- Running Rocky 9.4 and built with OpenHPC 3.2.1 and Slurm 23.11.10
- 17 GPU nodes (most with additional access restriction):
- 6 with V100 GPUs (c0010-c0015)
- 6 with A100 GPUs (c0016-c0021)
- 5 with H100 GPUs (c0001-c0005)
- 4 CPU-only nodes (c0006-c0009)
- New users might find the Getting Started information helpful
The regular partition includes nodes open to all cluster users (including the general account) and is the default, while the full partition has access restricted to users in the allaccess slurm account.
Partitions | Node Names | CPU (Intel) | S:C:T | Clock | GPU (Nvidia) | RAM | Swap | $TMPDIR (approx) | SIMD | Hyperthreading |
---|---|---|---|---|---|---|---|---|---|---|
full | c00[01-05] | Xeon 8462Y+ | 2:32:2 | 2.8 GHz | (4) H100-80GB SXM | 1.0 TB | 250 GB | 3.0 TB | AVX-512, AMX | On |
regular, full | c00[06-09] | Xeon 8462Y+ | 2:32:2 | 2.8 GHz | (none) | 1.0 TB | 250 GB | 3.0 TB | AVX-512, AMX | On |
full | c0010 | Xeon 6154 | 2:18:2 | 3.0 GHz | (5) V100-16GB | 768 GB | 187 GB | 700 GB | AVX-512 | On |
regular (2 nodes), full | c00[11-14] | Xeon 6154 | 2:18:2 | 3.0 GHz | (5) V100-16GB | 384 GB | 187 GB | 700 GB | AVX-512 | On |
full | c0015 | Xeon 6154 | 2:18:2 | 3.0 GHz | (2) V100-16GB | 1.5 TB | 535 GB | 700 GB | AVX-512 | On |
full | c00[16-21] | Xeon 6348 | 2:28:2 | 2.6 GHz | (4) A100-80GB | 1.0 TB | 250 GB | 3.0 TB | AVX-512 | On |
Nodes | Infiniband | Ethernet |
---|---|---|
c00[01-09] | 2x400 Gb NDR | 25 GbE |
c00[10-15] | 100 Gb EDR | 10 GbE |
c00[16-21] | 100 Gb EDR | 10 GbE |
See the AIDA Filesystems page for more information.
- Path: `~` OR `$HOME` OR `/home/fs01/<username>`
- Users' home directories are located on an NFS export from the AIDA head node; access is via 10 Gb/25 Gb Ethernet.
- Most data should go on BeeGFS (datasets, results, write-once data), but certain things make more sense in your home directory:
- Scripts, code, profiles, and other files and user-installed software where this is the assumed location.
- Small datasets or low I/O applications that don't benefit from a high-performance filesystem.
- Data rarely or never accessed from compute nodes.
- Applications where client-side caching is important: binaries, libraries, virtual/conda environments, Apptainer containers (unless staging to /tmp on compute nodes is feasible).
- Data in users' home directories are NOT backed up; users are responsible for backing up their own data.
- Cornell Box (or another cloud service) might be a good option for general data.
- COECIS Github (or github.com) is a good option for code and scripts.
- Data not in active use should be archived elsewhere and removed from the cluster.
- Parallel File System (BeeGFS)
- Path: `/mnt/beegfs/`
- Directories: `bulk` and `fast`, each with subdirectories `mirror` and `stripe`, are named for their default storage patterns:
- Bulk: HDD storage pool, intended for general data storage and write-based workloads (results, logs, checkpoints). Data should ideally be organized in larger files where possible.
- Fast: NVMe storage pool, intended for staging data for read-intensive workloads, especially if the data must be stored in many small files (high IOPS workloads) or read repeatedly.
- Mirror: Data mirrored across two servers; more fault tolerance, but slower writes.
- Stripe: Data not mirrored; space efficiency and faster write speed, but less fault tolerance.
- Please try not to structure your data in large directories of tiny files (especially under bulk).
- Create a subdirectory under `/mnt/beegfs/bulk/mirror/` matching your username for most use cases.
- Use the other directory trees instead for specific access patterns that benefit.
- For collaborative projects, use a directory name that clearly indicates the owners and purpose.
- BeeGFS is configured to use buffered reads/writes, rather than client-side page cache, so workloads that strongly benefit from that type of cache should go in `$HOME` or be staged to `$TMPDIR`.
- All users have access to the BeeGFS file system; use `chmod` to set appropriate access rules.
- Data on BeeGFS are NOT backed up; users are responsible for backing up their own data.
- Cornell Box (or another cloud service) might be a good option for general data.
- COECIS Github (or github.com) is a good option for code and scripts.
- Data not in active use should be archived elsewhere and removed from the cluster.
- The cluster scheduler is Slurm v23.11.10.
- See the Slurm documentation page for details.
- Some of the Nvidia A100 cards have MIG configured (the H100 are not currently using MIG).
- MIG instances are intended for single-GPU tasks, especially ones that don't keep a full GPU busy, such as interactive work, debugging, testing, and jobs that only use a GPU some of the time.
- It is recommended not to do GPU-GPU communication on MIG instances.
- Pay attention to which sizes are the most useful; they can be reconfigured periodically based on demand.
- Use sinfo to list the GPU resources available for scheduling:
$ sinfo -o "%20N %10c %10m %50G"
- See the Requesting GPUs section for information on how to request GPUs on compute nodes for your jobs. There are multiple options to do this (see the example script after this list):
  - `-G 1` to request 1 GPU of any available type, total for this job (the job should be flexible enough to work well with any size GPU).
  - `--gres=gpu:v100:<number of devices>` to request a certain number of V100 GPUs per node.
  - `--gpus-per-task 1g.10gb:1` to request one 1g.10gb MIG instance per job task.
  - `--gpus-per-node 2g.20gb:2` to request two 2g.20gb MIG instances per job node.
  - `--gpus a100:2` to request 2 entire A100 GPUs, total for this job.
  - See also the other `--gpu*` and `*-per-gpu` options in the slurm docs, e.g., to request RAM or CPUs proportional to the GPU count.
    - Note that the `cpus-per-gpu` option doesn't appear to always distribute the CPUs across nodes in proportion to the GPU count as it should in slurm 23.11.10, which means `srun` might not inherit job settings as expected from `sbatch`, for example. This can cause it to fail to start the job step, or give it the wrong CPU affinity. You can still use this if you don't mind CPUs being allocated disproportionately, but you might need to either specify more precise resources to `srun`, or in some cases options like `--overlap -O --gres none` (for `srun` within a job allocation, not `sbatch` itself) might work.
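As an illustration, here is a minimal sketch of a batch script that requests a single V100 with `--gres` (the script name `my_train.py` and the resource values are placeholders; adjust them to your job):

```bash
#!/bin/bash
#SBATCH -J gpu-example            # job name
#SBATCH -p regular                # partition (use "full" if you have allaccess)
#SBATCH --gres=gpu:v100:1         # one V100 GPU on the node
#SBATCH -c 4                      # 4 logical CPUs (2 physical cores with hyperthreading)
#SBATCH --mem=16G                 # RAM for the job
#SBATCH -t 02:00:00               # wall-time limit

# Only the requested GPU(s) will be visible inside the job
nvidia-smi

python my_train.py                # placeholder for your own program
```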
- Cgroups restrict jobs to the requested CPU, RAM, and GPU resources, and terminate all tasks when the job ends.
  - Jobs will not use additional idle resources on the nodes, and will have access to swap equal to 20% of the assigned RAM.
  - If you don't request enough CPUs, your threads will still be restricted to the requested core count rather than spreading out and competing with other jobs. Note that in recent Slurm versions, `srun --pty` will generally grab all the allocated CPUs, and won't behave as you might expect in all but the simplest cases. The FAQ recommends using `salloc` for interactive jobs instead.
  - Disk cache for the files you use, memory fragmentation, etc., count towards RAM usage, so you might need more than you think; allocation of more RAM will fail, or the kernel will start killing your processes if RAM is over-committed.
  - Only the requested GPUs (or MIG instances) will be visible, e.g., to `nvidia-smi`.
  - You can only `ssh` to a compute node if you have a running job there, and the shell will run in the cgroup context of the job it finds.
    - Be careful with this if you have multiple jobs running on the node; this shell might not be in the job context you expect.
    - This shell will be a login shell, and does not appear to inherit all of the SLURM environment variables, so it might be of limited use if you intend to use Slurm commands or launch job steps, for example.
  - You can run additional tasks in the context of an existing job with `srun --jobid <id>`.
    - This will assign specific resources from the job allocation by default, and in the case of `--pty` it will want to allocate all of the resources. Therefore, launching interactive steps this way in an existing allocation is not recommended and might not behave as you expect.
    - If you want it to share resources with existing tasks, you will likely want to request to overlap the job's resources with `--overlap`.
  - You can also connect to an existing job step with `sattach`.
    - This attaches to the stdin and stdout for the specified job step. If that step is an interactive shell, exiting will likely terminate the job.
  - Remember, hyperthreading is enabled on the cluster, so Slurm considers each physical core to consist of two logical CPUs.
    - You can ensure that your MPI tasks use a full physical core each by specifying `-c 2` in your Slurm job.
    - Many tasks will perform best by requesting 2 CPUs per thread in the job (with an appropriate combination of `-c` and `-n` values, depending on program structure); see the sketch below.
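To make the CPU counting concrete, here is a minimal sketch of a multi-task job that gives each task a full physical core (the program and module names are assumptions; confirm module names with `module avail`):

```bash
#!/bin/bash
#SBATCH -n 8                  # 8 tasks (e.g., MPI ranks)
#SBATCH -c 2                  # 2 logical CPUs per task = 1 full physical core each (hyperthreading on)
#SBATCH --mem-per-cpu=2G
#SBATCH -t 01:00:00

module load openmpi5          # assumed module name for the installed Open MPI 5.0.5
srun ./my_mpi_app             # placeholder for your MPI program
```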
- There are two overlapping partitions (queues), which are used for access control on AIDA:
Name | Nodes | Access | Description | Max Time Limit* |
---|---|---|---|---|
regular | c00[06-09,13-14] | general and allaccess accounts (all members of the rab38_0001 CAC project) | CPU nodes, 2 V100 nodes | 21 days |
full | c00[01-21] | allaccess account (by permission of cluster owners) | All nodes (H100, A100, V100, CPU-only) | 7 days |
(*) For any QOS available on this partition.
- These partitions are heterogeneous.
- Most limits and other scheduling constraints are provided through other mechanisms (QOS, node features, associations).
- Choose a partition with `-p`; the default is regular.
- If you want your job to run on specific hardware types, you can specify constraints with `-C` (see the example after this list).
- Features are defined for each node, reflecting various items that distinguish them:
  - `(skylake|icelake|sapphirerapids)`: CPU architecture codename
  - `cpu-(6154|6348|8462)`: CPU model
  - `[no-]simd-(avx512|amx)`: SIMD extensions supported (or not supported)
  - `(volta|ampere|hopper)`: GPU architecture codename
  - `[no-]gpu`: GPU of any kind present (or not)
  - `[no-]gpu-(v100|a100|h100|1g.10gb|2g.20gb|3g.40gb|4g.40gb|7g.80gb)`: specific GPU types present (or not); #g.##gb refer to A100 MIG instances
  - `[no-]gpucc-(7.0|8.0|9.0)`: Nvidia compute capability supported (or not)
  - `ib-(100g|2x400g)`: InfiniBand support/speed
  - `eth-(10g|25g)`: Ethernet speed
  - `rack-(18al|18am|18an)`: physical server location
- Use `sinfo -eo "%35N %10w %10z %10m %230b"` to see the active features on each node.
- Node weights are defined for each node, which will generally cause jobs to use the highest-RAM and largest-GPU nodes last.
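For example, a minimal sketch of constraining a job to the CPU-only Sapphire Rapids nodes using the features above (the feature expression is an example; `&` is AND and `|` is OR in Slurm constraint syntax):

```bash
#!/bin/bash
#SBATCH -p regular
#SBATCH -C "simd-amx&no-gpu"   # AMX-capable nodes with no GPU
#SBATCH -c 8
#SBATCH -t 01:00:00

# A constraint like "gpu-v100|gpu-a100" would instead accept either GPU node type
./my_cpu_program               # placeholder for your program
```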
- Time limits and priority are specified by QOS.
  - Each partition has a QOS with default limits, with the prefix `p-`, i.e., p-regular and p-full.
  - The default job QOS for each user is normal, which currently doesn't modify the partition QOS limits in any way; others can be chosen with the `-q` option (see the submission sketches after this list).
  - The low, high, and urgent QOS get modified priority and have different time limits; high is sometimes restricted, and urgent always requires cluster owner permission.
    - low is good to use for job arrays (also use slot limits for these), short and low-resource jobs that can squeeze in between higher-resource jobs, or anything else that might flood the queue or block large amounts of resources for long periods of time, in addition to jobs that can wait.
    - normal is for typical jobs using reasonable amounts of resources for reasonable runtimes.
    - high is good to use for jobs that are higher priority (deadlines), need short turnaround (quick debugging, incremental testing on small datasets, or quick interactive work), or would otherwise get wrongly starved for resources; please coordinate usage with the other users who have queued jobs. It can also be used in some cases to jump ahead of your own queued jobs, but reducing the priority of the other job with `--nice` is better. Do not abuse.
    - urgent is for tight deadlines and other emergencies where jobs need to jump ahead of even high-priority ones. Access to this QOS will only be granted temporarily (for a specific time window) with specific permission from the cluster owner.
    - The long QOS is intended for CPU nodes and can only be used on the regular partition. It has a much longer time limit to accommodate jobs using long-running software that can't use checkpoints.
  - Use `sacctmgr show qos` for a current list of QOS and associated limits, and `sacctmgr show assoc user=<username>` for your own limits and QOS access.
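A couple of submission sketches following the guidelines above (script names are placeholders):

```bash
# Low-priority work that can wait or might otherwise flood the queue
sbatch -q low my_sweep_job.sh

# Deadline or quick-turnaround work; coordinate with other users who have queued jobs
sbatch -q high -t 00:30:00 my_debug_job.sh
```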
- The priority calculation for queued jobs can be seen with the `sprio` command, and current priority coefficients can be seen in the output of `scontrol show config`.
  - The job's QOS determines the overall job priority, and jobs in a higher priority tier will always be selected first for available resources.
  - Jobs on the full partition get a boost to their priority value, compared to those on the regular partition.
  - A fairshare policy generally causes jobs from users with lower usage to run earlier; use `sshare -a` to see current fairshare values; jobs run under the allaccess account get a higher fairshare value.
  - The job's age (time queued) also increases priority, so that jobs from high-usage users run eventually.
  - You can reduce your job's priority voluntarily with the `--nice` option.
  - Job size has a minimal effect.
  - Related Slurm documentation:
    - Multi-factor job priority
    - QOS
    - Fair-share algorithm
    - Commands: sacctmgr, sbatch, scontrol, sprio, sshare
- There are resource limits configured at the partition, QOS, and association (e.g., individual user) level. In most cases, the most restrictive value out of all of these applies.
  - Walltime limits vary across QOS and partitions.
  - Resource limits will be added as needed; at times they have included:
    - `GrpTRES`: total GPU or other resources in use at once for a user (or other grouping).
    - `GrpTRESRunMins`: combines resources and remaining walltime so that larger-resource jobs must have shorter (remaining) walltimes.
    - `MaxJobsAccrue`: number of queued jobs that can accrue age priority.
    - `MaxJobs`: number of jobs that can be running at once.
  - Get info on current limits with `scontrol show part`, `sacctmgr show qos`, `scontrol show assoc`, and `sacctmgr show assoc`.
  - Resource limits can be added to individual users (associations), which you might find useful if you want your own resource usage limited in a more flexible way than through something like job array slot limits.
Set up the working environment for each software package using the module command. The module command will activate dependent modules if there are any.
- To show all available modules: `$ module avail`
- To load a module: `$ module load <software>`
- To unload a module: `$ module unload <software>`
- To swap compilers: `$ module swap gnu14 intel`
- Use `module spider` to find all possible modules and extensions.
- Use `module keyword key1 key2 ...` to search for all possible modules matching any of the "keys".
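Modules can also be loaded inside batch scripts; a minimal sketch (the resource values are examples, and you should confirm module names with `module avail`):

```bash
#!/bin/bash
#SBATCH -c 2
#SBATCH -t 00:30:00

module load cuda      # CUDA matching the current driver (see the CUDA note further down this page)
module list           # record the loaded modules in the job output
nvcc --version
```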
It is possible to create your own personal modulefiles to support the software and settings you use.
A lot of software can be installed in a user's `$HOME` directory without root access, or is easy to build from source. Please check for such options, as well as the virtual environment and container solutions described below, before requesting system-wide software installation (unless there are licensing issues).
Because BeeGFS is not configured to use client-side page cache, user software, code, binaries, virtual environments, and containers are best put in your `$HOME` directory, not on BeeGFS.
Python versions 3.9 (default) and 3.11 are installed.
Users can manage their own python environment (including installing needed modules) using virtual environments. Please see the documentation on virtual environments on python.org for details.
You will likely want to reinstall a clean environment after system upgrades, Python updates, etc.
You can create as many virtual environments, each in their own directory, as needed.
- python3.11: `python3.11 -m venv <your virtual environment directory>`
- Remember to create virtual environments in your `$HOME` directory, not on BeeGFS.

You need to activate a virtual environment before using it:
`source <your virtual environment directory>/bin/activate`
Once such an environment is activated, both `python` and `python3` should become aliases for `python3.11`.
After activating your virtual environment, you can install python modules for the activated environment:
- It's always a good idea to update pip first: `pip install --upgrade pip`
- Install the module: `pip install <module name>`
- List installed python modules in the environment: `pip list`
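Putting it together, a minimal sketch of creating and using a venv (the environment path, package, and script names are examples):

```bash
# One-time setup on the head node (keep the venv in $HOME, not on BeeGFS)
python3.11 -m venv ~/venvs/myproject
source ~/venvs/myproject/bin/activate
pip install --upgrade pip
pip install numpy                    # example package

# Inside a batch script, activate the venv before running your code
source ~/venvs/myproject/bin/activate
python my_script.py                  # placeholder for your program
```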
- Anaconda can be used to maintain custom environments for R, Python, and other software, including alternate interpreter versions and dependencies.
- Reference to help decide if Miniconda is enough: https://conda.io/docs/user-guide/install/download.html
  - NOTE: Consider starting with Miniconda if you do not need a multitude of packages, since it will be smaller and faster to install and update.
- Reference for Anaconda R Essentials: https://conda.io/docs/user-guide/tasks/use-r-with-conda.html
- Reference for Linux install: https://conda.io/docs/user-guide/install/linux.html
- Please take the tutorials to assist you with your management of conda packages.
- Conda environments are best put in your `$HOME` directory, not on BeeGFS. You will likely want to reinstall a clean environment after system upgrades, Python updates, or when updating packages installed with a mix of pip and conda.
Anaconda by default puts its initialization code in `~/.bashrc`. Some methods you might use to launch a job will result in either a login or interactive shell, which will likely reload your bashrc file, and thus reactivate your base conda environment. This means your currently active conda environment might carry over to your job, or the base environment might get reactivated, depending on how the job is launched. If you work interactively (or use login shells for your job scripts), and use multiple conda or virtual environments, then you might want to disable this behavior (`conda config --set auto_activate_base false`), or explicitly activate (and verify) the environment you want in your job script or interactive shell.
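A minimal sketch of explicitly activating (and verifying) a conda environment inside a batch script (the conda install path, environment name, and script are assumptions; adjust to your setup):

```bash
#!/bin/bash
#SBATCH -c 2
#SBATCH -t 01:00:00

# Make conda's activation function available without relying on ~/.bashrc,
# then activate the environment this job actually needs.
source "$HOME/miniconda3/etc/profile.d/conda.sh"   # adjust if conda is installed elsewhere
conda activate myenv                               # placeholder environment name

# Verify we got the environment we expect before doing real work
which python
python my_script.py                                # placeholder for your program
```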
Apptainer (formerly Singularity) is a container system similar to Docker, but suitable for running in HPC environments without root access. You might want to use Apptainer if:
- You're using software or dependencies designed for a different Linux distribution or version than the one on AIDA.
- Your software is easy to install using a Linux distribution's packaging system, which would require root access.
- There's a Docker or Apptainer image available from a registry like Nvidia NGC, Docker Hub, Singularity Hub, or from another cluster user with the software you need.
- There's a Dockerfile or Apptainer recipe that is close to what you need with a few modifications.
- You want a reproducible software environment on different systems, or for publication.
Download an existing image with `apptainer pull`, which doesn't require root access. If multiple people will use the same image, we can publish them in a shared location.
Build a new image with `apptainer build`, which usually must be run on an outside machine where you have root access. Then you can upload it directly to the cluster to run, or transfer it through a container registry.
Run software in the container using `apptainer run`, `apptainer exec`, or `apptainer shell`. Performance will likely be best if you copy the image to local disk on the compute node, or run it from the `$HOME` filesystem (not BeeGFS). Remember that the container potentially works like a different OS distribution and software stack, even though it can access some host filesystems by default (such as `$HOME`), so be careful about interactions with your existing environment (shell startup files, lmod, Anaconda, venv, etc.). Consider using the `-c` option or maintaining an environment specific to each container you use.
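A minimal sketch of pulling an image and running it on a GPU node (the image name/tag, paths, and script are examples; `--nv` exposes the allocated Nvidia GPUs inside the container):

```bash
# On the head node: pull an image into $HOME (no root needed)
apptainer pull ~/images/pytorch-24.06.sif docker://nvcr.io/nvidia/pytorch:24.06-py3

# Inside a GPU job: copy the image to local disk for best performance, then run it
cp ~/images/pytorch-24.06.sif "$TMPDIR/"
apptainer exec --nv "$TMPDIR/pytorch-24.06.sif" python my_script.py   # placeholder script
```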
Some container images that you might find useful enough without modification are provided in `/home/containers/` (more can be added on request):
- Nvidia NGC (nvcr.io) frameworks containers generated with `apptainer pull nvcr.io-nvidia-<name>-<tag>.sif docker://nvcr.io/nvidia/<name>:<tag>`
  - Include versions of PyTorch, MXNet (deprecated; last version is 24.06), TensorFlow (available combinations of tf1 and tf2 with py2 and py3), and TensorRT.
  - See the Frameworks Containers Support Matrix for the OS and software details of each image, which are numbered by release month.
  - The `nvidia-ngc/` directory contains images with key platform versions:
    - 24.06/24.09 versions are the latest available as of Oct. 2024 (based on Ubuntu 22.04).
  - The `nvidia-ngc-legacy/` directory contains images that are not necessarily tested against RHEL/Rocky/CentOS 9, and may not work correctly, and also may not support the newer GPUs (i.e., version 23.05 or older):
    - 19.06 versions are the latest available based on Ubuntu 16.04.
    - 20.11 versions are the latest available based on Ubuntu 18.04.
    - 23.03/23.04 versions are the latest available based on Ubuntu 20.04.
  - The `nvidia-ngc-compat/` directory contains images that have newer CUDA than was supported by the installed driver (at the time downloaded), and will use forward compatibility (may not support all features) if still ahead of the current driver:
    - None yet, as the latest Nvidia driver is installed.
Software | Path | Notes |
---|---|---|
CPLEX 22.1.1 | `/opt/ohpc/pub/apps/ibm/ILOG/CPLEX_Studio2211/` | |
CUDA 12.6.2 | `/usr/local/cuda-12.6/` | |
GCC 14.2 | `/opt/ohpc/pub/compiler/gcc/14.2.0/` | |
Intel Compiler 2023.2.1, 2024.0.0, 2024.2.1 | | |
Gurobi (license only) | `/opt/ohpc/pub/gurobi/` | |
Open MPI 5.0.5 | `/opt/ohpc/pub/mpi/openmpi5-gnu14/5.0.5/` | |
Software will generally only be installed system-wide if:
- It's a simple package install from a standard repository (Rocky, EPEL, OpenHPC-3) with no likely conflicts.
  - Try `dnf search` as a quick check for package availability.
- It's required for licensing reasons, subject to additional direct approval by the cluster owner (potentially only the license infrastructure will be installed).
- It can't be installed by the mechanisms above, or is version-stable and widely used, with direct approval of the cluster owner.
- New `$HOME` directories are empty.
- Atlas/Atlas2 `$HOME` directories and `/home/shared/` are no longer mounted for transfer (deprecated since 2022).
- Old home directories from the previous head node are mounted under `/aidahome/fs01/` on the head node only.
  - Not reachable from compute nodes.
  - Will be shut down after a transition period (date TBD).
- You should not blindly copy everything from the old to the new.
  - Update your default .bash_profile and .bashrc as needed (and any other profile/config files).
    - The default `.bashrc` in Rocky 9 now contains a code block with a mechanism to load settings from separate files in `~/.bashrc.d/` (so each group of related settings can go in a separate file), which you might find cleaner to use than loading your main `.bashrc` with all your settings.
  - It is probably cleanest to reload your code from git repos and rebuild conda environments from a package list.
  - For actively used data, see the guidance above for $HOME vs. BeeGFS.
  - Most other data (inactive/archived) should be deleted or moved to the cloud.
The default shell for all CAC users is `bash`, but if your CAC account is from before 2024 (even from a different cluster or project), it could be set to `sh`, which has some legacy behavior. You can check with `getent passwd <username>` or check the value of `$SHELL`. To change it to bash or another shell of your choice, submit a CAC ticket (`chsh` won't work permanently). If you prefer another shell such as `zsh`, keep in mind that some defaults and tools integrated with Rocky 9, OpenHPC, etc., assume you use bash, so you might lose some minor shell-integrated features.
- Compilers: gnu14 is now the default (replacing gnu13).
  - You likely want to recompile your code from source.
- Many version updates from OpenHPC and Rocky Linux repos.
  - Check the software you use to see if anything is missing, or for version issues.
- CUDA: the version matching the current driver is included as an env module (`module load cuda` or `module load cuda/12.6`); for other versions, install in a virtual environment (e.g., Anaconda), or use an Apptainer container (such as from NGC).
- Partition structure changed; use `scontrol show part` for info.
- Node feature lists updated to account for new and removed hardware.
- Some changes to QOS and resource and time limits; use commands like `sacctmgr` to view details.
- Job confinement now uses cgroups v2 (previously used v1).
- `/tmp` (which is also `$TMPDIR`) and `/var/tmp` are now in a private directory on local disk. It is only accessible from within the job, and is deleted when the job ends. It can no longer be used to store persistent data or to share data between different jobs. If you save your job output there, be sure to copy it to a persistent filesystem before your job ends (see the sketch below). This is controlled by the Slurm job_container/tmpfs plugin.
- `/dev/shm` (shared memory ramdisk) is also private for each job and deleted when the job ends, via the Slurm job_container/tmpfs plugin.
- Added a `/staging/` directory to compute nodes for local data that needs to be persistent across multiple jobs.
- If you use `srun --jobid` to run a new interactive shell within an existing job allocation, use `--overlap` (instead of `-s`, which was used in earlier versions). You no longer need to explicitly specify 0 GPUs or other GRES, as the overlap option can now handle those.
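A minimal sketch of working in the per-job `$TMPDIR` and saving results before the job ends (the BeeGFS paths and archive names are examples):

```bash
#!/bin/bash
#SBATCH -c 4
#SBATCH -t 02:00:00

# Stage input data to fast node-local storage ($TMPDIR is private and deleted when the job ends)
cp /mnt/beegfs/bulk/mirror/$USER/input.tar "$TMPDIR/"
cd "$TMPDIR" && tar xf input.tar

# ... run your program here, writing scratch output under $TMPDIR ...

# Copy anything worth keeping back to a persistent filesystem before the job exits
cp -r "$TMPDIR/results" /mnt/beegfs/bulk/mirror/$USER/
```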
- List node resources and state
- sinfo -NO "Partition:.10,NodeList:.8,SocketCoreThread:.8,CPUsState:.16,Oversubscribe:.10,Memory:.10,AllocMem:.10,Freemem:.10,Gres:.50,GresUsed:.80,Features:.230"
- #shorter alternative w/o allocated GRES/Mem
- sinfo -No "%.10P %.8N %.8z %.16C %.10m %.10e %.10h %.50G %.8t %.230f"
- List running and queued jobs, with more details than squeue (including allocated GRES)
- sacct -aX -s R,PD -o "JobID%15,JobName%10,User%10,Partition%10,NodeList,ReqTres%60,AllocTRES%80,Start,TimeLimit,State,QOS,Priority"
- #Just running jobs
- sacct -aX -s R -o "JobID%15,JobName%10,User%10,Partition%10,NodeList,ReqTres%60,AllocTRES%80,Start,TimeLimit,State,QOS,Priority"
- Information about unavailable/down nodes
- sinfo -R
- List fairshare information for all users (second example widens the TRESRunMins column)
- sshare -al
- sshare -a -o "Account,User,RawShares,NormShares,RawUsage,NormUsage,EffectvUsage,FairShare,LevelFS,GrpTRESMins,TRESRunMins%230"
- List my own fairshare
- sshare -o "Account,User,RawShares,NormShares,RawUsage,NormUsage,EffectvUsage,FairShare,LevelFS,GrpTRESMins,TRESRunMins%230"
- When will queued jobs run, and how is their priority calculated?
- squeue --start
- sprio
- What are the current relevant time and resource limits on jobs based on partitions, qos, and associations (i.e., individual users)?
- scontrol show part
- sacctmgr show qos
- scontrol show assoc
- sacctmgr show assoc
- Head node. Don't run jobs on the head node, as it can make things unresponsive for other users or in the worst case take down the whole cluster. It's ok to compile code, scan files, and do other minor administrative tasks on the head node though.
- Persistent connections. SSH connections to the head node become disconnected frequently (this will often either kill interactive jobs, or keep them holding resources while you might not know they're still running). You should generally edit your code locally and run things with batch jobs, so this isn't a big problem. However, you can use `tmux` or `screen` to keep a session open even if you become disconnected (typically this is done on the head node). They have the additional advantage of terminal/window management. If you are running interactive jobs within a persistent terminal, remember that you will have to manually cancel the job, or end the session, for the job to terminate if you're done before timeout. Another option is mosh, which will reconnect an SSH session as long as you keep the client window open (however, one downside is that the terminal you get doesn't have scrollback history).
- Resource allocation. Always reserve the maximum resources your job might use. Please make some effort to choose the amount of RAM (otherwise your job might crash), number of cores (usually 0.5 or 1 per thread, depending on workload), and time limit you actually need, with a bit of extra as a buffer. On the other hand, try not to use excessive values, because this makes it hard to schedule jobs efficiently. You can additionally restrict the RAM used by your program (at least on a per-process basis) from inside your job script with `ulimit -Sv`, or equivalently inside a python script with `resource.setrlimit()`, or externally with the `prlimit` command, especially if you want different limits per process (this might be necessary if your processes get killed by the kernel's OOM killer rather than adjusting their own RAM usage). A sketch follows below.
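A minimal sketch of a per-process soft limit on virtual memory from inside a job script (the 16 GB value and script name are only examples):

```bash
# Limit each process started by this script to about 16 GB of virtual memory (ulimit -Sv takes KB);
# a process exceeding this fails its allocations instead of being OOM-killed unexpectedly.
ulimit -Sv $((16 * 1024 * 1024))
python my_memory_hungry_script.py    # placeholder for your program
```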
- Threads. If you have a multithreaded job, you might want to limit the number of threads to something like 1 or 2 per core reserved, or reserve one core per thread or two. The simplest way to do this is usually to use `-c` with a value that is double the number of threads (e.g., if you want 4 cores/task, use `-c 8`). Slurm sees each core as being 2 CPUs due to hyperthreading; however, your program might not use hyperthreading well. Many multithreaded programs will default to the number of CPUs they see on the system, and are not aware of scheduled resources. We are now forcing CPU affinity, so jobs with too many threads should no longer interfere with other jobs (but might hurt their own performance).
- SIMD jobs / job arrays. If you are running many instances of the same job with different data or settings, please don't just launch tons of separate jobs in a loop. Use a job array, because arrays are easier to monitor, manage, and cancel. Be sure to set a slot limit (% notation) to avoid flooding the queue and to allow others to use the resources too; as a rule of thumb, you should be using less than 25% of any in-demand resources such as GPUs. It's good for there to always be some idle resources in case someone needs to test something quickly, etc., using a small amount of resources. Also, please run such jobs at lower priority (using QOS or `--nice`) when possible, or otherwise communicate directly with other users in case there's a problem. See the ics-research/ics-cluster-script-examples repository on COECIS Github for an example, and the sketch below.
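A minimal sketch of a job array with a slot limit and reduced priority (the input naming scheme and program are placeholders):

```bash
#!/bin/bash
#SBATCH -q low                  # lower QOS so the array yields to other work
#SBATCH --array=0-99%10         # 100 tasks, at most 10 running at once (the %10 slot limit)
#SBATCH -c 2
#SBATCH -t 01:00:00

# Each task picks its own input based on the array index
python process.py --input "data_${SLURM_ARRAY_TASK_ID}.csv"
```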
- Job priority / nice. Please follow the guidelines for QOS level, and use `--nice` as applicable. Keep in mind that if everyone runs at the highest priority all the time, the priority levels will become useless. See above.
- Storage performance. For jobs with significant I/O requirements, please use the BeeGFS filesystem for input and output data, not $HOME. Single-pass data can be accessed on BeeGFS directly. If reading/writing data multiple times, it's probably best to work in $TMPDIR (copying input data and code at the start of the job, and copying results at the end); binaries (programs and libraries) will likely perform best if copied to $TMPDIR (or local staging if accessed from multiple jobs), or run from $HOME. This is because BeeGFS doesn't use the native page cache on the client side (even though the server-side caching is really good for access to the same data by multiple clients).
- $HOME I/O throttling. If you must access significant data from $HOME, please throttle the bandwidth if you have many jobs running at the same time. You can do this by using `rsync` to copy the data to/from the local $TMPDIR, with the `--bwlimit` option (instead of accessing the data directly, or copying it with `cp`); a throttled-copy sketch follows below. Note that the units are KB/s by default. $HOME is network-limited at about 2 GB/s from compute nodes (in aggregate), so make sure each job only uses a small fraction of this, depending on the number of jobs that might do so at the same time (keep in mind access by other users too). Saturating these network links can affect the performance of the scheduler, all jobs, and interactive use, even if it's not obvious that they access much data from there.
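A minimal sketch of throttled staging from `$HOME` (the 50 MB/s value and paths are only examples; scale the limit to the number of simultaneous jobs):

```bash
# Copy input from $HOME to node-local $TMPDIR at a limited rate (--bwlimit is in KB/s)
rsync -a --bwlimit=50000 "$HOME/datasets/mydata/" "$TMPDIR/mydata/"

# ... run against $TMPDIR/mydata ...

# Throttle the copy back to $HOME as well
rsync -a --bwlimit=50000 "$TMPDIR/results/" "$HOME/results/"
```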
- Interactive use. The recommended way to run interactive jobs is with `salloc`, within a persistent session (such as a `tmux` session on the head node). It is best to do your code development and initial testing on your own desktop (with smaller datasets or models), and only use interactive jobs for final compatibility and script testing. Then run your real experiments using `sbatch` (using job arrays if applicable).
  - If you must work interactively, make sure you're running on a compute node (not the head node) with `salloc` (or `srun` might be suitable for some cases).
    - Please be sure to exit your shell or `scancel` the job to release the resources when you're not actively using it (e.g., if you have a class, meeting, lunch, or go home for the night).
    - Before running new tasks in a session that was left open, be sure it has not been terminated (for example, by checking the output of `hostname`); otherwise those tasks will run on the head node, possibly crashing it and making the cluster inaccessible.
  - If your ssh session to the head node gets disconnected, or if you close the window without exiting the job, please log back in and verify the job has terminated, and either cancel it or reconnect to the already running job (using sattach or `srun --jobid`) as appropriate; don't just start a new job while the old one is still holding resources.
  - Please don't choose an excessive time limit and leave your session open to hold resources. If you can't get resources you need for interactive use because someone has flooded the queue, you can ask them to cancel their jobs, and teach them to use job arrays, etc., correctly to avoid this.
- Interactive use with multiple windows. Warning: while the following are options, they don't all behave as you might expect (especially for jobs on multiple nodes, or with multiple job steps or tasks launched), and it's probably best to stick to a single `salloc` shell for all but the simplest interactive work.
  - If you work interactively, or want to interact with a batch job, you can first create your allocation with `salloc --no-shell`, start an interactive `salloc` as above, or launch a batch job with steps/tasks running as usual.
  - Then you can create a new job step in that job allocation (in a new terminal window, ideally a persistent one) with something like `srun --jobid ### --pty --overlap $SHELL`.
    - Here, `--jobid` tells it to run under the specified job allocation, `--pty` gives you a pseudoterminal, `--overlap` lets it share the allocated resources with the job's other step(s), and `bash` is the command you want to run (in this case a shell).
    - Without `--overlap`, `--pty` will make it allocate all of the node's resources to the new job step, which means no other steps/tasks can run at the same time (not usually what you want).
    - Note that if the execution environment of this new job step closes (this can happen on disconnect, for example, for an interactive step), everything launched as part of that step will also be immediately killed, so you generally want to make sure the shell you launch this from is persistent.
    - If you create a `tmux` session on a compute node inside your job (and use something else to make the outer shell persistent, or are very careful about how you manage your nested tmux shells), the tmux server will be killed if the step in which the session was created ends, even if you attach to it from a different job step/shell. The tmux server doesn't escape job step containment.
  - You can also connect to the console (stdin/stdout) for a running job step with `sattach`.
    - You might use this to watch the output of a batch job, or create a separate monitoring job step (perhaps a task that overlaps resources and just outputs resource information periodically), and use `sattach` to watch that.
    - In some situations you might find the shell you're attaching to frozen, so it might not be the best solution in general, especially when using a pty.
    - Remember that commands you run in this attached session are executed by whatever is connected to stdin in the job step, which means, for example, that `exit` could end that job step or the whole job.
  - You can also `ssh` to a node on which you have a running job, and your shell will run within the allocation, but if you have multiple jobs running on that node, you might join the wrong job allocation. Also, while this gets you access to the allocated resources on that node, the Slurm environment variables won't be set.
- Code development/testing/debugging. Keep your code in a repository in the COECIS Github Enterprise, under the ics-research organization (adjust permissions as appropriate). As you implement changes and test/debug on your personal machine, this will make it easy to sync those changes to the cluster, as well as make it easy to keep track of changes as you make them, and keep track of multiple development paths or versions using branches. Simply push your changes locally, and pull them from the cluster. Use Anaconda or similar to set up your environment in order to avoid most incompatibilities between environments. Also, it will likely help you to include command-line options early on to switch between CPU and GPU use, and between shorter small-data small-model runs and full-scale runs. That way you can test quickly on your personal machine, as well as do a quick test for compatibility or script issues on the cluster (in many cases even if all the GPUs are in use — just do a CPU-only test) before running a long experiment.
- Long-running jobs / checkpoint-and-resume. Where possible, please avoid long jobs. It is generally possible to checkpoint and resume most jobs (by saving the state manually or using functions provided by the libraries you use), so please do so. For example, you might set a 2 hour wall-time limit, but checkpoint your job after an hour, terminate your script, and resume it in a new job (you can requeue from your script and use job dependencies to manage this, or use a mechanism for job array tasks to look for the next work that needs to be done; a sketch follows below). Note that any checkpointing features you might find advertised by Slurm, Apptainer, etc. are probably experimental, so the point here is to save the state and resume yourself. Shorter jobs are best because they allow other people's jobs to make progress, higher priority jobs to run, and quicker access for interactive use, and in general let resources get scheduled more efficiently. Checkpointing also helps you recover from errors without losing as much work. Remember, checkpoints should go on the BeeGFS filesystem (and definitely not in `$TMPDIR`, since this will be deleted when the job ends, and also you might resume on a different machine).
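A minimal sketch of one checkpoint-and-resume pattern using resubmission from the job script itself (all paths, the program, its options, and the exit code are placeholders; your own code must implement the actual save/restore):

```bash
#!/bin/bash
#SBATCH -t 02:00:00
#SBATCH -c 4

CKPT=/mnt/beegfs/bulk/mirror/$USER/myrun/checkpoint.pt     # example checkpoint location on BeeGFS
SCRIPT=/mnt/beegfs/bulk/mirror/$USER/myrun/train_job.sh    # persistent path of this script (example)

# The program is expected to load the checkpoint if it exists, run for a while,
# save a new checkpoint, and exit with a distinctive code if there is more work to do.
python train.py --checkpoint "$CKPT" --max-minutes 60
status=$?

# 85 is an arbitrary "not finished yet" exit code chosen for this sketch;
# a job dependency (sbatch -d) could be used instead to chain jobs.
if [ "$status" -eq 85 ]; then
    sbatch "$SCRIPT"      # submit a follow-up job that continues from the checkpoint
fi
```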
How your profile configuration files are used can be confusing, due to the way ssh and Slurm work. Assuming you use the `bash` shell (note this is different if it is invoked using the alias `sh`; see the `bash` man page):
- The profile files are loaded at interactive login, but not in every subshell. So /etc/profile, and ~/.bash_profile, ~/.bash_login, or ~/.profile should contain settings that get propagated to subshells, or aren't needed in subshells.
- The bashrc files are loaded in every interactive shell (if the user profile is configured properly), and so ought to contain things that don't propagate to subshells (or don't need to), but don't do lots of extra work.
- When starting an interactive login shell, it reads /etc/profile (and in turn /etc/profile.d/*), followed by the first user file it finds out of: ~/.bash_profile, ~/.bash_login, and ~/.profile. It doesn't directly load bashrc, but ~/.bash_profile by default includes commands to load ~/.bashrc, and ~/.bashrc by default includes commands to load /etc/bashrc.
- When starting an interactive non-login shell, it reads ~/.bashrc. It doesn't directly load /etc/bashrc, however by default ~/.bashrc includes commands to do so.
- When using ssh to connect to the head node, or any other node, you get an interactive login shell (loading profile, and possibly indirectly loading bashrc).
- When Slurm runs a job script submitted using sbatch or srun, the script is executed in the interpreter specified in the shebang line. If that's a shell such as `/bin/bash` or `/bin/bash -l`, that would result in a non-interactive shell (unless `-i` is specified), either non-login or login, respectively, sourcing the corresponding files as described above. In most cases, the environment (shown by `env`) at submission time is captured and applied to the job environment.
- When Slurm runs an inline batch job submitted with `sbatch --wrap`, it creates a temporary job script that begins with `#!/bin/sh`, which thus runs as a non-interactive non-login shell and doesn't load any profile or bashrc files.
- When starting an interactive job in Slurm using salloc, you get an interactive non-login shell on the compute node, which reads only ~/.bashrc (and /etc/bashrc if specified there).
- `sh`'s legacy behavior is that a login shell will only look for ~/.profile (not ~/.bash_profile), and interactive non-login shells will not load .bashrc, so that file will be ignored unless sourced from ~/.profile in a login shell.
- It's easy for user config files to be missing or broken. Compare your files to the default ones in /etc/skel/ to fix anything missing, in particular:
  - ~/.bash_profile should have lines that load ~/.bashrc, and add $HOME/.local/bin and $HOME/bin to PATH. Note that ~/.bash_profile would override ~/.profile if present and if the shell is `bash`, so be careful about that. You can make one a symlink to the other if you want them to be identical.
  - ~/.bashrc should have lines that load /etc/bashrc (which in turn loads /etc/profile.d/*) and `~/.bashrc.d/`.
  - Without those instructions in bashrc and bash_profile, your environment might be significantly different.
  - You might also be interested in the defaults for zsh, ksh, and emacs (or at least copy them over if missing).
- If your CAC account was first created before 2024, your default shell might be `/bin/sh`, which can have legacy behavior, but `bash` is recommended, and some tools might assume that's what you're using. You can check your default shell with `echo $SHELL` (if that variable hasn't been modified), and the list of available ones with `chsh -l`. However, it is set through CAC's Active Directory, so you can only change it to `bash` (or your preferred shell) by submitting a ticket to CAC (not with `chsh`). This is the shell you see when using `ssh` or by default in `salloc`.
  - One type of legacy behavior is that `sh` login shells will look for `.profile`, not `.bash_profile`, while a new user's `$HOME` will come with `.bash_profile`. This is because the OS config assumes people use `bash`, but CAC's user account system overrides this to `sh`.
- If properly configured and in a bash login shell, the default prompt on the cluster should show your username and the hostname of the current node.
- If you typically work interactively on compute nodes, you might want to add information to the command prompt by setting PS1 in ~/.bashrc (or a file under ~/.bashrc.d/). In most cases, the default is `[\u@\h \W]\\$` (showing username, hostname, working directory suffix). You might want to add Slurm job information, for example $SLURM_JOB_ID, $SLURM_JOB_NODELIST, or any of the other SLURM environment variables. You can run commands to get things like the job end time. You can also add colors to make it easier to read. Remember to escape variables and commands to be executed when the prompt is set. Examples:
  `PS1="[ \u@\$(hostname -s) \$([ \$SLURM_JOB_ID ] && printf \"\$SLURM_JOB_ID.\${SLURM_STEP_ID/4294967290/interactive} \$SLURM_JOB_NODELIST \$(squeue -h -j \$SLURM_JOB_ID -o %e) \")\W ]\\$ "`
  `PS1="\[\e[1;33m\][ \[\e[1;32m\]\u\[\e[1;33m\]@\[\e[1;32m\]\$(hostname -s) \$([ \$SLURM_JOB_ID ] && printf \"\[\e[1;36m\]\$SLURM_JOB_ID.\${SLURM_STEP_ID/4294967290/interactive} \[\e[1;32m\]\$SLURM_JOB_NODELIST \[\e[1;36m\]\$(squeue -h -j \$SLURM_JOB_ID -o %e) \")\[\e[1;31m\]\W \[\e[1;33m\]]\\$\[\e[m\] "`
- One quirk of launching an interactive job with a non-login shell (which is likely the default behavior when using `salloc`) is that the `$HOSTNAME` env variable inherits the value from the head node rather than being set by the shell on the compute node (this is because it is set in /etc/profile, which is not executed); `sbatch` explicitly sets this as a special case when it sets SLURM environment variables. In general, you're probably better off using another method to get the hostname, like `$(hostname -s)`, but there could also be other differences between interactive and batch environments, depending on your shell.
- Submit questions or requests at the help page or by sending email to help@cac.cornell.edu. Please include AIDA in the subject line.
- Password resets and changes: https://www.cac.cornell.edu/services/myacct.aspx
To request an account on AIDA with prior permission of the cluster owner, fill out the CAC userid request form for the project:
This is also necessary for continuing users who previously only had access through the bs54_0001 project (no longer used for AIDA access). For users without a NetID who have a CAC account on another project (including bs54_0001), a ticket must be submitted to CAC to have the user added to the rab38_0001 project manually.