Machine Specific Build Instructions
Specific instructions for ChaNGa builds on various parallel architectures are documented below. Preferred Charm++ build options are noted. If the machine corresponds to an NSF XSEDE or other national facility, the particular machine is identified.
In the netlrts configuration, Charm uses the UDP protocol to communicate over the network.
Most recent machines are 64 bit. In this case, build charm with
./build ChaNGa netlrts-linux-x86_64 --with-production
Then configure and make ChaNGa. For 32 bit machines, omit the 'x86_64' in the charm build command.
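For reference, a typical sequence for the ChaNGa side of the build (assuming the charm and changa source trees sit side by side so ChaNGa's configure can find charm; adjust the paths to your layout) is:
cd ../changa
./configure
make -j4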
Almost all modern clusters are built from "nodes", each with many compute cores. Within a node, the cores can directly access each other's memory, while communication between nodes is done via a message passing protocol (e.g., MPI).
On these architectures, Charm++ and ChaNGa can be compiled to take advantage of the shared memory within a node and reduce the amount of communication. This is done when charm is built by adding smp
to the build command line, e.g.,
./build ChaNGa netlrts-linux-x86_64 smp --with-production
Then "configure" and "make" ChaNGa the same as a non-SMP build.
Running an SMP build presents a number of options that affect performance. On a given physical node, one can run one or more ChaNGa processes, where each process has one thread for communication and many worker threads. Furthermore, each thread can be tied to a particular core, which can affect performance depending on the core and memory layout of the node. For example, a node may have two CPU chips in two separate sockets on the motherboard, in which case better performance may be obtained by having two ChaNGa processes, each running its communication and worker threads on one of the CPU chips. To be more specific, consider a two-socket Intel Ivy Bridge node with 12 cores on each chip; the command line to run on 4 such nodes would be:
charmrun +p 88 ChaNGa ++ppn 11 +setcpuaffinity +commap 0,12 +pemap 1-11,13-24 sample.param
In this case, 8 processes would be created (88/11), with 2 processes on each node. The first process on each node will have its communication thread on core "0" and 11 worker threads on cores 1 through 11 (all on the first chip), while the second process will have its communication thread on core "12" and 11 worker threads on cores 13 through 23 (all on the second chip).
When using mpiexec or mpirun, the command line layout is slightly different. The above example would be:
mpiexec -np 8 ChaNGa ++ppn 11 +setcpuaffinity +commap 0,12 +pemap 1-11,13-24 sample.param
Again, 8 processes (or MPI ranks, or "virtual nodes") are created, two on each of 4 hardware nodes, with 11 worker threads per process.
Some architectures can have multiple virtual cores per physical core, referred to as hyperthreading. ChaNGa generally does not benefit from hyperthreading.
See the Charm++ SMP documentation for other ways to specify layout of processes and threads.
For a single multicore machine, ChaNGa can be built to utilize all the cores. In this case build charm with
./build ChaNGa multicore-linux-x86_64
or
./build ChaNGa multicore-linux-i386
depending on whether you are running 64 bit or 32 bit Linux. Then configure and make ChaNGa.
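A multicore build runs as a single process, with the number of threads given by +p on the ChaNGa command line. A sketch (the thread count and the parameter file name are placeholders for your machine and simulation):
./ChaNGa +p 8 simulation.param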
macOS is subtly different from Linux. First, be sure you have the development tools installed with xcode-select --install. Two additional packages need to be installed with homebrew via the command brew install autoconf automake. The charm system can then be built with:
./build ChaNGa netlrts-darwin-x86_64 smp --with-production -j4
For older versions of charm (before March, 2019), you may have to add -stdlib=libc++ at the end of the above command. Then cd into the ChaNGa source directory and run "configure" and "make". For older versions of ChaNGa (before November, 2018), "sinks.cpp" will have an undeclared identifier "MAXPATHLEN". Either upgrade ChaNGa, or change "MAXPATHLEN" to "PATH_MAX" in sinks.cpp.
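If you choose to edit sinks.cpp rather than upgrade, a one-line fix using the BSD sed shipped with macOS might look like the following (back up the file first; this simply performs the substitution described above):
sed -i '' 's/MAXPATHLEN/PATH_MAX/g' sinks.cpp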
The Cray XC series is very similar since it uses the same GNI interface to the network. For XC series, replace gni-crayxe
with gni-crayxc
below.
Charm runs natively on the Gemini interconnect used by the XE6/XK7 series. With 32 cores/node, the "SMP" version of charm offers advantages. Running out of memory in the GNI layer can be a problem; this is fixed with the hugepages option below.
- Switch to the GNU programming environment:
module swap PrgEnv-cray PrgEnv-gnu
- Load the cray resiliency communication agent (RCA) library with
module load rca
- Load the hugepage module with
module load craype-hugepages2M
- Build charm with
./build ChaNGa gni-crayxe hugepages smp -j4 --with-production
- Configure and make ChaNGa
At run time, set the hugepage environment variables in your job script:
export HUGETLB_DEFAULT_PAGE_SIZE=2M
export HUGETLB_MORECORE=no # this line may give problems on small core counts
Note that on Cray architectures, one usually uses aprun, not charmrun, to start parallel programs.
A typical aprun command would look like:
aprun -n 8 -N 1 -d 32 ./ChaNGa -p 4096 +ppn 31 +setcpuaffinity +pemap 1-31 +commap 0 dwf1.2048.param
where -n 8 starts 8 processes, -N 1 puts 1 process on each physical node, -d 32 reserves 32 threads per process, -p 4096 divides the simulation into 4096 domains, +ppn 31 requests 31 worker threads per process, and +setcpuaffinity +pemap 1-31 +commap 0 explicitly maps the threads to CPU cores, with the worker threads going on cores 1 to 31 and the communication thread going on core 0.
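Putting the pieces above together, a PBS batch script for this example might look like the following sketch (the walltime is a placeholder, the node/core request line depends on your site, and it assumes the hugepage module is also wanted at run time):
#!/bin/bash
#PBS -l walltime=02:00:00
# add the node/core request line required by your Cray system
cd $PBS_O_WORKDIR
module load craype-hugepages2M
export HUGETLB_DEFAULT_PAGE_SIZE=2M
export HUGETLB_MORECORE=no
aprun -n 8 -N 1 -d 32 ./ChaNGa -p 4096 +ppn 31 +setcpuaffinity +pemap 1-31 +commap 0 dwf1.2048.param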
GPU support is in development.
In addition to the above:
- Load the CUDA development environment with
module load cudatoolkit
- Use the CUDA_DIR environment variable to point at this environment:
export CUDA_DIR=$CRAY_CUDATOOLKIT_DIR
- Build charm with
./build ChaNGa gni-crayxe cuda hugepages -j4 --with-production
The charm build may fail with an error like:
CrayNid.c: In function 'getXTNodeID':
CrayNid.c:32:2: error: #error "Cannot get network topology information on a Cray build. Swap current module xt-mpt with xt-mpt/5.0.0 or higher and xt-asyncpe with xt-asyncpe/4.0 or higher and then rebuild
This can be fixed by setting the following environment variable before running the build command:
export PE_PKGCONFIG_LIBS=cray-pmi:cray-ugni:$PE_PKGCONFIG_LIBS
- Configure ChaNGa with ./configure --with-cuda=$CUDA_DIR, then make.
As of v3.3, the GPU build of ChaNGa can run in SMP mode (one process, multiple threads). To build for this mode, replace the charm build command above with ./build ChaNGa gni-crayxe cuda hugepages smp -j4 --with-production
before compiling ChaNGa.
On an infiniband cluster there are two options for building ChaNGa. The most straightforward option is using MPI (the mpi-linux-x86_64 build below), but occasionally the verbs-linux-x86_64 build may work better.
Bridges 2 has 128 CPU cores per node. ChaNGa running on this many cores generates a lot of messages, which can cause problems with the MPI implementations. As of March 2021, the charm verbs build seems to be the only machine layer that works and scales well, and even then only with a more recent version of charm++. The procedure at the moment is:
- Check out version v7.0.0 of charm.
- Load the "mvapich2/2.3.5-gcc8.3.1" and "python/2.7" modules. The MPI module is only needed to provide an "mpiexec" for the sbatch submission.
- build charm with
./buildold ChaNGa verbs-linux-x86_64 smp -j8 --with-production
- build ChaNGa with the usual configure and make.
- The run line in your sbatch script should look like (e.g. for running on 4 nodes):
./charmrun.smp +p 504 ++mpiexec ./ChaNGa.smp ++ppn 63 +setcpuaffinity +commap 0,64 +pemap 1-63,65-127 +IBVBlockAllocRatio 1024 +IBVBlockThreshold 11 XXX.param
This runs 2 SMP processes on each node, one per socket. The IBVBlock flags allocate bigger chunks of pinned memory for the Infiniband card.
IBVBlockAllocRatio specifies how many messages go in a single allocation. IBVBlockThreshold is related to the message size below which messages are allocated in blocks; the byte threshold is 128*2^IBVBlockThreshold bytes. The current (7/24/24) defaults are +IBVBlockAllocRatio 128 and +IBVBlockThreshold 9.
If you are having trouble running with mpiexec, you can generate a nodelist and use ssh for the spinup. An example SLURM run script would look like this:
cd $SLURM_SUBMIT_DIR
# Generate node list
echo "group main ++shell /usr/bin/ssh ++cpus $SLURM_CPUS_ON_NODE" > nodelist
for i in `scontrol show hostnames $SLURM_NODELIST`
do
echo host $i >> nodelist
done
./charmrun.smp +p 504 ./ChaNGa.smp ++ppn 63 +setcpuaffinity +commap 0,64 +pemap 1-63,65-127 +IBVBlockAllocRatio 1024 +IBVBlockThreshold 11 XXX.param
This machine is very similar to Bridges-2: 128 cores per node. See the Bridges-2 instructions.
Update 7/12/24: The MVAPICH2 implementation on Expanse has some corruption issue that I have not resolved. I currently recommend using the verbs build as described in the Bridges-2 instructions, and ignore the following.
While the OpenMPI implementation on this machine tends to fail, MVAPICH2 is also installed, and that implementation works well. The following modules need to be loaded in this order: cpu/0.17.3b gcc/10.2.0/npcyll4 mvapich2/2.3.7 slurm. To run with mpi (rather than verbs), a typical run command with 2 SMP processes per node would be the following (follow the Bridges-2 directions above if running with verbs):
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=2
...
srun --mpi=pmi2 -n 16 ./ChaNGa.smp ++ppn 63 +setcpuaffinity +commap 0,64 +pemap 1-63,65-127 XXX.param
See the SDSC Expanse documentation at https://www.sdsc.edu/support/user_guides/expanse.html#running for more information.
This is another Infiniband machine with lots of cores per node, in this case 56. See the instructions for Bridges-2, but now the run command would be
./charmrun.smp +p 216 ++mpiexec ./ChaNGa.smp ++ppn 27 +setcpuaffinity +commap 0,28 +pemap 1-27,29-55 XXX.param
to run 2 SMP processes per node (one on each socket) on 4 nodes.
With the Fall 2021 "upgrade" of the operating system on Pleiades, the default MPI implementation (mpi-hpe/mpt) no longer works. However, the Intel MPI implementation is installed, and it seems to work. Update 5/2/22: The Intel MPI implementation is also based on UCX, which has known problems with large numbers of messages. While the following works for most jobs, more network-intensive jobs (e.g. toward the end of a zoom simulation) will fail with UCX errors. In that case, use the verbs build as described under PSC Bridges-2 above.
To use Intel MPI, load the mpi-intel module, then follow the directions for mpi-linux-x86_64 below. For jobs on larger node counts, the smp build can be used (see above). For SMP with MPI, care must be taken with the PBS options. For example, to run ChaNGa on 24 "Ivy" nodes, where each node has two Intel Ivybridge sockets with 10 cores each, one uses the PBS line
#PBS -lselect=24:ncpus=20:mpiprocs=2:model=ivy
The mpiprocs option specifies that only two MPI processes run per node. The corresponding command to start ChaNGa is:
mpiexec $PWD/ChaNGa ++ppn 9 +setcpuaffinity +commap 0,10 +pemap 1-9,11-19 XXX.param
which puts one MPI process with 9 worker threads and 1 communication thread on each socket.
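Putting the PBS line and the run command together, a job script sketch for this example might be the following (the walltime is a placeholder; add any queue or account directives your allocation requires):
#!/bin/bash
#PBS -lselect=24:ncpus=20:mpiprocs=2:model=ivy
#PBS -l walltime=08:00:00
cd $PBS_O_WORKDIR
module load mpi-intel
mpiexec $PWD/ChaNGa ++ppn 9 +setcpuaffinity +commap 0,10 +pemap 1-9,11-19 XXX.param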
The Intel compilers and MPI distribution seem to work best; gcc and OpenMPI can run into issues with hanging.
module load intel intelmpi autotools
Charm can then be built with
./build ChaNGa mpi-linux-x86_64 --with-production mpicxx
A basic submission script looks like this:
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=40
#SBATCH --time=4:00:00
#SBATCH --job-name=mychangajob
#SBATCH --output=%x.%j_%A.out
cd $SLURM_SUBMIT_DIR
module load intel intelmpi
charmrun ++mpiexec ++remote-shell mpirun +p 320 ChaNGa +balancer MultistepLB_notopo BLAH.param
which can then be submitted with sbatch.
Charm should be built with
./build ChaNGa mpi-linux-x86_64 --with-production
Then in the changa directory type
- ./configure; make
Charm has a native infiniband driver that is more efficient than using MPI. To use it first build charm with
./build ChaNGa verbs-linux-x86_64 --with-production
Then in the ChaNGa directory,
./configure; make
will produce the charmrun and ChaNGa executables.
Even if mpi is not being used, the MPI infrastructure is useful for starting ChaNGa. Charmrun has a ++mpiexec
option that takes advantage of this infrastructure.
For example,
charmrun +p 144 ++mpiexec ChaNGa -wall 600 +balancer MultistepLB_notopo simulation.param
Charmrun assumes that "mpiexec" is used, but stampede uses "ibrun". Therefore a small shell script is needed to overcome this difference. Call it "mympiexec"; it will contain:
#!/bin/csh
shift; shift; exec ibrun $*
Then call charmrun with (e.g.):
charmrun +p 36 ++mpiexec ++remote-shell mympiexec ChaNGa -wall 60 +balancer OrbLB Disk_Collapse_1e6.param
If the MPI runtime is not available, or you don't wish to use it, charmrun
needs a nodelist file to inform it which nodes are available to run on. An example is:
group main ++shell /usr/bin/ssh
host maia0
host maia1
Call this file
nodelist
and have it in the directory from which you run ChaNGa. The node names can be found from the queueing system. For example, in the PBS system, one can use the short script:
#!/bin/bash
echo 'group main ++shell /usr/bin/ssh' > nodelist
for i in `cat $PBS_NODEFILE` ; do
echo host $i >> nodelist
done
to create
nodelist
. ChaNGa can then be started with charmrun +p 2 ChaNGa simulation.param
For SLURM systems the above nodelist
generation script can be written as:
#!/bin/bash
echo 'group main ++shell /usr/bin/ssh' > nodelist
for i in `scontrol show hostnames` ; do
echo host $i >> nodelist
done
GPU support is still experimental. Also, on many machines the CUDA device can only be accessed by one process on a node. Hence charm needs to be built with the SMP option so that all cores can use the GPU, and only one charm process is run per node (mpiprocs=1 in the PBS -lselect option, and +p N ++ppn M options such that N/M equals the number of GPU nodes used).
For any of the machines below that have GPUs more advanced than Kepler, special compile flags need to be passed to the NVidia compiler that depend on the machine architecture. We don't have an automatic way of detecting the GPU architecture (particularly when you are compiling on a different host), so an appropriate cuda-level
needs to be added to the ChaNGa configure command. For Pascal GPUs (P100), add --with-cuda-level=60
to the configure line. For Volta GPUs (V100) add --with-cuda-level=70
. The GPU code can be compiled to perform part of the tree-walk on the GPU with --enable-gpu-local-tree-walk
. Note that the gravity algorithm is slightly different (more like the traditional Barnes-Hut) with this option.
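For example, a configure line for a Pascal (P100) machine with the GPU tree walk enabled might look like the following (CUDA_DIR here stands for wherever your CUDA toolkit is installed):
./configure --with-cuda=$CUDA_DIR --with-cuda-level=60 --enable-gpu-local-tree-walk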
To build for CUDA on NAS Pleiades:
- As of June 2017, the build steps need to be done on one of the GPU nodes. Use "qsub -I -q gpu_k40" to get an interactive session.
- Load the CUDA development environment with
module load cuda
- If you haven't already, load a modern C compiler with
module load gcc
(Intel should also work, but gcc also needs to be loaded for its libraries: module load gcc; module load comp-intel)
- Set the CUDATOOLKIT_HOME environment variable to point at the development environment. Use which nvcc to find the directory, then, e.g., set the environment variable with
setenv CUDATOOLKIT_HOME /nasa/cuda/8.0
- In the charm source directory, build charm with
./build ChaNGa verbs-linux-x86_64 cuda smp -j4 --with-production
Note that, built with this configuration, charm++ can only be used to compile ChaNGa on nodes with the full CUDA development environment.
- In the changa source directory, configure ChaNGa with
./configure --with-cuda=$CUDATOOLKIT_HOME
- Compile ChaNGa with "make".
To build for CUDA on Maverick or Stampede:
- Load the CUDA development environment with
module load cuda
- Use the CUDA_DIR environment variable to point at this environment:
export CUDA_DIR=$TACC_CUDA_DIR
- Build charm with
./build ChaNGa verbs-linux-x86_64 cuda smp -j4 --with-production
- Configure ChaNGa with
./configure --with-cuda=$CUDA_DIR
- Compile ChaNGa with "make"
- Load the CUDA development environment with
module load cuda
- Build charm with
./build ChaNGa verbs-linux-x86_64 cuda smp -j4 --with-production
- Configure ChaNGa with
./configure --with-cuda
- Compile ChaNGa with "make"
- You must compile in an interactive environment on a GPU node. Use
srun --partition=gpu-debug --pty --account=<<project>> --ntasks-per-node=10 --nodes=1 --mem=96G --gpus=1 -t 00:30:00 --wait=0 --export=ALL /bin/bash
to get an interactive session.
- Get the modules reset:
module purge; module restore; module load cuda
Do NOT load the gcc module.
- Build charm with
./build ChaNGa verbs-linux-x86_64 cuda smp -j4 --with-production
- Configure ChaNGa with
./configure --with-cuda --with-cuda-level=70
The 70 level is for the V100 GPU.
- Compile ChaNGa with
make
The interconnect on the Stampede 3 cluster uses the Intel OmniPath architecture which does not work well with the verbs API. Use an MPI build instead:
./build ChaNGa mpi-linux-x86_64 smp mpicxx --with-production --march=skylake
- Configure ChaNGa with the --enable-avx flag.
- Add the -march=skylake flag to the opt_flag line in the Makefile.
The -march=skylake
flag allows ChaNGa to take advantage of the AVX2 vector instructions to calculate gravity. The Skylakes (SKX) are the lowest common denominator on the Stampede 3 system. If you are going to be running exclusively on the Icelake (ICX) or Sapphire Rapids (SPR) nodes, the -march
flag can be changed accordingly.
All the compute nodes have two sockets on each node, so having at least 2 SMP processes per node helps with performance. Be aware that the mapping between cores and sockets is different than on most other machines: all the even numbered cores are on one socket and all the odd numbered cores are on the other. For good performance all the threads of an SMP process should be on a single socket, so a typical run command (on a Skylake (skx) partition) would be:
ibrun ./ChaNGa ++ppn 23 +setcpuaffinity +commap 0,1 +pemap 2-46:2,3-47:2 xxx.param
Submit this with, e.g., sbatch -n 10 -N 5 xxx.job, where "10" is the total number of tasks, and 5 is the number of nodes, each of which is running 2 tasks. Within a given node, the first task will use core 0 to communicate and cores 2, 4, 6, ..., 46 as workers, while the second task will use core 1 to communicate and cores 3, 5, 7, ..., 47 as workers. If your simulation has a lot of communication, you might get better performance with more communication threads, which means more tasks. To run 4 MPI tasks per physical Skylake node, the sbatch command will be something like sbatch -n 20 -N 5 xxx.job
, so there are now 20 total tasks on 5 nodes, and each node will run 4 tasks. To get an efficient thread layout, the run command would be:
ibrun ./ChaNGa ++ppn 11 +setcpuaffinity +commap 0,24,1,25 +pemap 2-22:2,26-46:2,3-23:2,27-47:2 xxx.param
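For reference, a minimal Stampede 3 job script wrapping the first run command above might look like this sketch (the partition name, time limit, and any account directive are placeholders; check the Stampede 3 user guide for the correct values):
#!/bin/bash
#SBATCH -N 5
#SBATCH -n 10
#SBATCH -p skx
#SBATCH -t 04:00:00
ibrun ./ChaNGa ++ppn 23 +setcpuaffinity +commap 0,1 +pemap 2-46:2,3-47:2 xxx.param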
The original hyak nodes (ITK) have out-of-date compilers which are unable to compile recent versions of Charm++ and ChaNGa. If you must use the old hyak nodes, use charm version 6.8.0 or earlier, and ChaNGa version 3.3 or earlier. However, it is recommended that you move to the new MOX nodes (see below).
Updated 04/14/17
There are some GPU enabled nodes on hyak. Currently the vsm group has 1 GPU node.
Steps:
Download charm++ from github:
git clone https://github.com/UIUC-PPL/charm
As of writing, this works with the current development version of charm++ (the default charm branch). Request a GPU node by submitting an interactive job to the GPU queue, e.g.:
qsub -IV -q gpu -l walltime=2:00:00
Load CUDA, find the cuda toolkit directory, and choose the directory corresponding to the cuda version loaded.
module load cuda_7
ls -d /sw/cuda*
export CUDATOOLKIT_HOME=/sw/cudatoolkit-7.0.28
This will point charm to the right directory. cd into the charm++ directory and build it:
./build ChaNGa mpi-linux-x86_64 cuda -j12
(-j12 assumes you are on a 12 core hyak node). cd into the ChaNGa directory, configure ChaNGa to use cuda and build it:
./configure --with-cuda=$CUDATOOLKIT_HOME
make -j 12
Quick test results for the testcollapse simulation show a factor of 6x speedup:
Walltime with cuda: 0m48.023s
Walltime without cuda: 4m55.079s
Updated 06/26/17
When building on mox, make sure to request an interactive session on a compute node, e.g.:
srun -t 0 -N 1 -p vsm --pty /bin/bash
Then load the gnu compiler with intel mpi
module load gcc_4.8.5-impi_2017
Build charm with mpi linux, without SMP (tested for speed on protoplanetary disks). Mox nodes have 28 cores, so you can use a high -j value.
./build ChaNGa mpi-linux-x86_64 -j20
ChaNGa can be built with defaults.
When submitting jobs with sbatch, you should not need to specify the number of tasks, just the number of nodes.
ChaNGa should be run with mpirun. For a job on 7 nodes for 1 day, your submission script can look like:
#!/bin/bash -l
#SBATCH -N 7
#SBATCH -J jobname
#SBATCH -t 24:00:00
#SBATCH -p vsm
#SBATCH --mem=500G
#SBATCH --mail-type=ALL --mail-user=username@uw.edu
cd path/to/run
mpirun ChaNGa paramfile.param &> stdoutfile
Updated 01/10/23
Building ChaNGa on Klone is very similar to the build on Mox. To get an interactive node on Klone (in this case, using the stf partition), run the command
salloc -A stf -p compute-int -N 1 --time=00:30:00
The main difference is that the MPI module must be loaded slightly differently:
module load stf/mpich/4.0a2
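After loading the module, the rest of the build mirrors Mox. A sketch (assuming the same non-SMP MPI target is appropriate on Klone):
./build ChaNGa mpi-linux-x86_64 -j20
Then configure and make ChaNGa with the defaults, as on Mox.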
The interconnect on the Stampede 2 KNL cluster uses the Intel OmniPath architecture which does not work well with the verbs API. Use an MPI build instead:
./build ChaNGa mpi-linux-x86_64 smp mpicxx --with-production -xCORE-AVX2 -axMIC-AVX512
Configure ChaNGa with the AVX flag (--enable-avx) to take advantage of the KNL floating point units. Also add -xCORE-AVX2 -axMIC-AVX512 to the C/C++ compiler flags in the Makefile.
Since this is an MPI build, ChaNGa can be executed from within a batch script. The Omnipath network requires more CPU for communication, so multiple processes/node is helpful. Four processes/node can be specified in the sbatch command with a "-N nnn" argument where "nnn" is the total tasks divided by four. E.g. for a job running on 8 nodes, use sbatch -n 32 -N 8 xxx.qsub
. Then in the script, ChaNGa is executed with:
ibrun ./ChaNGa ++ppn 16 xxx.param
Thread affinity can be specified with +setcpuaffinity +commap 0,17,34,51 +pemap 1-16,18-33,35-50,52-67. To also use the hyperthreads (two hardware threads per core), the run command becomes:
ibrun ./ChaNGa ++ppn 32 +setcpuaffinity +commap 0,17,34,51 +pemap 1-16+68,18-33+68,35-50+68,52-67+68 xxx.param
Update from Jim Phillips, NAMD developer: It's a bad idea to split tiles across SMP nodes, so the pemaps should start on even PEs. Furthermore, the comm threads can be anywhere on the chip since they are going to the network anyway. His preferred map is therefore:
ibrun ./ChaNGa ++ppn 32 +setcpuaffinity +commap 64-67 +pemap 0-63+68 xxx.param
A similar mapping with 6 processes per node (20 worker threads each) would be:
ibrun ./ChaNGa ++ppn 20 +setcpuaffinity +commap 60-65 +pemap 0-59+68 xxx.param
submitted with sbatch -n 30 -N 5 xxx.job, where "30" is the total number of tasks, and 5 is the number of nodes, each of which is running 6 tasks.
Stampede2 also includes a Skylake partition. Much of the description for the KNL partition holds here since the interconnect is the same, but details of the processors are quite different. Build the MPI target of charm with:
./build ChaNGa mpi-linux-x86_64 smp mpicxx --with-production
- Configure ChaNGa with the --enable-avx flag.
- Add the -xCORE-AVX2 flag to the opt_flag line in the Makefile.
The -xCORE-AVX2
flag allows ChaNGa to take advantage of the AVX2 vector instructions to calculate gravity.
The Skylake nodes have two sockets on each node, so having 2 SMP processes per node helps with performance. Be aware that the mapping between cores and sockets is different than on most other machines: all the even numbered cores are on one socket and all the odd numbered cores are on the other. For good performance all the threads of an SMP process should be on a single socket, so a typical run command would be:
ibrun ./ChaNGa ++ppn 23 +setcpuaffinity +commap 0,1 +pemap 2-46:2,3-47:2 xxx.param
Submit this with, e.g., sbatch -n 10 -N 5 xxx.job, where "10" is the total number of tasks, and 5 is the number of nodes, each of which is running 2 tasks. Within a given node, the first task will use core 0 to communicate and cores 2, 4, 6, ..., 46 as workers, while the second task will use core 1 to communicate and cores 3, 5, 7, ..., 47 as workers. If your simulation has a lot of communication, you might get better performance with more communication threads, which means more tasks. To run 4 MPI tasks per physical Skylake node, the sbatch command will be something like sbatch -n 20 -N 5 xxx.job
, so there are now 20 total tasks on 5 nodes, and each node will run 4 tasks. To get an efficient thread layout, the run command would be:
ibrun ./ChaNGa ++ppn 11 +setcpuaffinity +commap 0,24,1,25 +pemap 2-22:2,26-46:2,3-23:2,27-47:2 xxx.param
Use ./build ChaNGa mpi-bluegenel -O3 to compile charm++ with the GCC compiler and a Blue Gene-specific communication library, or ./build ChaNGa mpi-bluegenel xlc to compile charm++ with the IBM C compiler and a Blue Gene-specific communication library. The IBM C compiler (v. 9) introduces bugs at high optimizations, so beware.
This architecture does not come with an XDR library which ChaNGa uses for machine independent output. For this machine a compiled version of the XDR library is provided on our distribution site. Download the file xdr.tgz from the distribution site http://faculty.washington.edu/trq/hpcc/distribution/changa/ and unpack it in the ChaNGa directory. The configure script for ChaNGa will then detect it and link to it appropriately.
Previous problems with linking in the XDR library have been fixed.
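For example, the download-and-unpack step might look like this (run from the top of the ChaNGa source directory; the file name is taken from the distribution page above):
wget http://faculty.washington.edu/trq/hpcc/distribution/changa/xdr.tgz
tar xzf xdr.tgz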
The mvapich2_ib MPI implementation seems to be the most performant. To use this, load the intel module and the mvapich2_ib module, then follow the directions for mpi-linux-x86_64 below.
- For systems with a Federation switch (ARSC iceberg), directly using IBM's communication layer may give better performance.
- May need to use gmake instead of make (depending on which make is installed).
- A few extra options are needed: -qstrict and -qrtti=dyna. Make using
make OPTS="-O3 -qstrict -qrtti=dyna"
- Builds fine otherwise.
- Note that charmrun is just a wrapper around poe. It is more robust just to use poe to start a parallel job.
- If there is a complaint at runtime that libjpeg cannot be loaded, modify conv-autoconfig.h in the tmp directory of charm++. Enter the libs/ck-libs/liveViz directory and make clean; make.
Builds and runs out of the box.
- May need to configure --host aix if the C/C++ compiler produces an executable that needs an MPI environment to run.
- Need to use gmake instead of make.
- Builds fine otherwise.
- Note that charmrun is just a wrapper around poe. It is more robust just to use poe to start a parallel job.
The Cray OS (catamount) does not have the xdr library available. Download it from our distribution site, and compile it with "gcc" before building ChaNGa.
- The following commands need to be executed before Charm and ChaNGa can be built.
module load gcc/4.0.2
module remove acml
- configure needs to be run as
./configure -host linux
since the cray front end is actually a cross-compilation environment.
- charmrun doesn't work on bigben. Use the standard pbsyod to run ChaNGa.
This uses the default pathScale compiler which may give better performance. However building charm is a little tricky.
- Use
./build charm++ mpi-crayxt3
to build charm.
- Change to the tmp directory, edit conv-mach.sh, and change the CMK_SEQ_CC and CMK_SEQ_LD definitions to gcc.
- Type
make charm++
to rebuild with these changes.
- Build ChaNGa with the standard configure and make commands.
- As above, use pbsyod to run.
alloca() error.
For some reason, at this date (July 2009) the PGI compiler produces code that is an order of magnitude slower (!!!) than code from the GCC compiler. The procedure is therefore as follows.
- Switch to the GNU programming environment:
module swap PrgEnv-pgi PrgEnv-gnu
- Build charm with
./build ChaNGa mpi-crayxt -O3
. - configure and make ChaNGa
These directions will also help with a single workstation with a CUDA capable GPU.
When building charm++, it needs to know where the CUDA compiler lives. This can be set with an environment variable; here is an example for forge:
export CUDA_DIR=/usr/local/cuda_4.0.17/cuda
./build ChaNGa net-linux-x86_64 cuda -O2
Once ChaNGa is configured, the Makefile needs to be edited to point CUDA_DIR and NVIDIA_CUDA_SDK at the above directories, but also to uncomment the CUDA = ...
line. Also CUDA does not handle hexadecapole multipole moments, so the HEXADECAPOLE =
line needs to be commented out.
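A sketch of the hand edits described above (the SDK location is a placeholder; keep the right-hand side of the existing CUDA line as generated by configure):
# In the ChaNGa Makefile, after running configure:
CUDA_DIR = /usr/local/cuda_4.0.17/cuda
NVIDIA_CUDA_SDK = /path/to/nvidia/cuda/sdk
# uncomment the existing "CUDA = ..." line (do not change its value)
# comment out the "HEXADECAPOLE = ..." line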