
Artifact for OSDI'23 paper

Yuke Wang, et al. MGG: Accelerating Graph Neural Networks with Fine-grained Intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms. OSDI'23.

[Paper] [Bibtex] DOI

1. Setup (skip to Section 2 if evaluating on the provided GCP instance)

1.1. Clone this project from GitHub.

git clone --recursive git@github.com:YukeWang96/MGG-OSDI23-AE.git

1.2. Download libraries and datasets.

  • Download libraries (cudnn-v8.2, nvshmem_src_2.0.3-0, openmpi-4.1.1).
wget https://proj-dat.s3.us-west-1.amazonaws.com/local.tar.gz
tar -zxvf local.tar.gz && rm local.tar.gz
tar -zxvf local/nvshmem_src_2.0.3-0/build_cu112.tar.gz
wget https://proj-dat.s3.us-west-1.amazonaws.com/dataset.tar.gz && tar -zxvf dataset.tar.gz && rm dataset.tar.gz
  • Set up the DGL baseline.
cd dgl_pydirect_internal
wget https://proj-dat.s3.us-west-1.amazonaws.com/graphdata.tar.gz && tar -zxvf graphdata.tar.gz && rm graphdata.tar.gz
cd ..
  • Set up the ROC baseline.
wget https://proj-dat.s3.us-west-1.amazonaws.com/roc-new.tar.gz && tar -zxvf roc-new.tar.gz && rm roc-new.tar.gz

1.3. Launch Docker for MGG.

cd docker 
./launch.sh

1.4. Compile the implementation.

mkdir build && cd build && cmake .. && cd ..
./0_mgg_build.sh

2. Run the initial test experiment.

  • Please try the study experiments in Section 3.4 and Section 3.5 below.

3. Reproduce the major results from the paper.

3.1 Compare with UVM on 4xA100 and 8xA100 (Fig.8a and Fig.8b).

./0_run_MGG_UVM_4GPU_GCN.sh
./0_run_MGG_UVM_4GPU_GIN.sh
./0_run_MGG_UVM_8GPU_GCN.sh
./0_run_MGG_UVM_8GPU_GIN.sh

Note that the results can be found at Fig_8_UVM_MGG_4GPU_GCN.csv, Fig_8_UVM_MGG_4GPU_GIN.csv, Fig_8_UVM_MGG_8GPU_GCN.csv, and Fig_8_UVM_MGG_8GPU_GIN.csv.

3.2 Compare with DGL on 8xA100 for GCN and GIN (Fig.7a and Fig.7b).

./launch_docker.sh
cd gcn/
./0_run_gcn.sh
cd ../gin/
./0_run_gin.sh

Note that the results can be found at 1_dgl_gcn.csv and 1_dgl_gin.csv; our MGG references are in MGG_GCN_8GPU.csv and MGG_8GPU_GIN.csv.

3.3 Compare with ROC on 8xA100 (Fig.9).

cd roc-new/docker
./launch.sh
./run_all.sh

Note that the results can be found at Fig_9_ROC_MGG_8GPU_GCN.csv and Fig_9_ROC_MGG_8GPU_GIN.csv.

The results of ROC should be similar to the following:

| Dataset       | Time (ms) |
|---------------|-----------|
| reddit        | 425.67    |
| enwiki-2013   | 619.33    |
| it-2004       | 5160.18   |
| paper100M     | 8179.35   |
| ogbn-products | 529.74    |
| ogbn-proteins | 423.82    |
| com-orkut     | 571.62    |

3.4 Compare with NP vs. without NP (Fig.10a).

python 2_MGG_NP.py

Note that the results can be found at MGG_NP_study.csv and should be similar to the following table. The speedup is the ratio of the two time columns, e.g., 76.797 / 16.716 ≈ 4.594 for Reddit.

| Dataset      | MGG_WO_NP | MGG_W_NP | Speedup (x) |
|--------------|-----------|----------|-------------|
| Reddit       | 76.797    | 16.716   | 4.594       |
| enwiki-2013  | 290.169   | 88.249   | 3.288       |
| ogbn-product | 86.362    | 26.008   | 3.321       |

3.5 Compare with WL vs. without WL (Fig.10b).

python 3_MGG_WL.py

Note that the results can be found at MGG_WL_study.csv and should be similar to the following table.

| Dataset      | MGG_WO_WL | MGG_W_WL | Speedup (x) |
|--------------|-----------|----------|-------------|
| Reddit       | 75.035    | 18.92    | 3.966       |
| enwiki-2013  | 292.022   | 104.878  | 2.784       |
| ogbn-product | 86.632    | 29.941   | 2.893       |

3.6 Compare API (Fig.10c).

python 4_MGG_API.py

Note that the results can be found at MGG_API_study.csv and should be similar to the following table.

| Norm. Time w.r.t. Thread | MGG_Thread | MGG_Warp | MGG_Block |
|--------------------------|------------|----------|-----------|
| Reddit                   | 1.0        | 0.299    | 0.295     |
| enwiki-2013              | 1.0        | 0.267    | 0.263     |
| ogbn-product             | 1.0        | 0.310    | 0.317     |

3.7 Design Space Search (Fig.11a).

python 5_MGG_DSE_4GPU.py

Note that the results can be found at Reddit_4xA100_dist_ps.csv and Reddit_4xA100_dist_wpb.csv and should be similar to the following:

  • Reddit_4xA100_dist_ps.csv

| dist\ps | 1      | 2      | 4      | 8      | 16     | 32     |
|---------|--------|--------|--------|--------|--------|--------|
| 1       | 17.866 | 17.459 | 16.821 | 16.244 | 16.711 | 17.125 |
| 2       | 17.247 | 16.722 | 16.437 | 16.682 | 17.053 | 17.808 |
| 4       | 16.826 | 16.41  | 16.583 | 17.217 | 17.627 | 18.298 |
| 8       | 16.271 | 16.725 | 17.193 | 17.655 | 18.426 | 18.99  |
| 16      | 16.593 | 17.214 | 17.617 | 18.266 | 19.009 | 19.909 |
  • Reddit_4xA100_dist_wpb.csv

| dist\wpb | 1      | 2      | 4      | 8      | 16     |
|----------|--------|--------|--------|--------|--------|
| 1        | 34.773 | 23.164 | 16.576 | 15.235 | 16.519 |
| 2        | 34.599 | 23.557 | 17.254 | 15.981 | 19.56  |
| 4        | 34.835 | 23.616 | 17.674 | 17.034 | 22.084 |
| 8        | 34.729 | 23.817 | 18.302 | 18.708 | 25.656 |
| 16       | 34.803 | 24.161 | 18.879 | 23.44  | 32.978 |

python 5_MGG_DSE_8GPU.py

Note that the results can be found at Reddit_8xA100_dist_ps.csv and Reddit_8xA100_dist_wpb.csv.

4. Use MGG as a tool or library in your project.

Building a new design based on MGG with NVSHMEM is simple; it takes only a few steps:

4.1 Build the C++ design based on our existing examples.

  • Create a new .cu file under src/. An example of the MPI/NVSHMEM setup is shown below.

#include <mpi.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Inside main(int argc, char* argv[]): handles used below.
int rank, nranks;
cudaStream_t stream;
nvshmemx_init_attr_t attr;
MPI_Comm mpi_comm = MPI_COMM_WORLD;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nranks);
attr.mpi_comm = &mpi_comm;
// Set up NVSHMEM device over the MPI communicator.
nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);
int mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
cudaSetDevice(mype_node);
cudaStreamCreate(&stream);
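
At program exit, tear down in reverse order. A minimal sketch of the standard NVSHMEM/MPI shutdown sequence (not an excerpt from this repo):

// Release the stream, the NVSHMEM runtime, and MPI, in that order.
cudaStreamDestroy(stream);
nvshmem_finalize();
MPI_Finalize();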

4.2 Build the CUDA kernel design based on our existing examples.

  • Add a kernel design in include/neighbor_utils.cuh. An example is shown below.

// Host-side wrapper (output/input live in NVSHMEM symmetric memory).
void mgg_SAG_np_div(
    float* output,          // NVSHMEM
    const float* input,     // NVSHMEM
    const nidType* row_pointers_l,
    const nidType* column_index_l,
    const nidType* row_pointers_r,
    const nidType* column_index_r,
    const nidType lb,
    const nidType ub,
    const int dim,
    const int nodePerPE,
    const int peid,
    const int np_size,
    const int warpPerBlock,
    const int interleaved_dist
) {
    // ... (body elided; see the link below)
}

// CUDA kernel behind the wrapper.
__global__
void mgg_SAG_np_div_cuda(
    float* output,
    const float* input,
    const nidType* row_pointers_l,
    const nidType* column_index_l,
    const nidType* row_pointers_r,
    const nidType* column_index_r,
    const nidType lb,
    const nidType ub,
    const int dim,
    const int nodePerPE,
    const int partSize,
    const int warpPerBlock,
    const int interleaved_dist
) {
    // ... (body elided; see the link below)
}

The full implementation is at:
https://github.com/YukeWang96/MGG_OSDI23/blob/73e1866f23d001491f0c69d5216dec680593de27/include/neighbor_utils.cuh#L277C1-L292
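
For orientation, the host wrapper is where the grid is sized and the kernel is launched. The sketch below is illustrative only: the one-warp-per-node grid math and the WARP_SIZE constant are assumptions for this example, not the exact launch configuration used in neighbor_utils.cuh.

// Illustrative sketch (inside mgg_SAG_np_div): sizing is an assumption,
// not the exact configuration used in the repo.
const int WARP_SIZE = 32;
const int threadsPerBlock = warpPerBlock * WARP_SIZE;
const int numNodes = ub - lb;  // nodes owned by this PE
const int numBlocks = (numNodes + warpPerBlock - 1) / warpPerBlock;

mgg_SAG_np_div_cuda<<<numBlocks, threadsPerBlock>>>(
    output, input,
    row_pointers_l, column_index_l,
    row_pointers_r, column_index_r,
    lb, ub, dim, nodePerPE,
    np_size, warpPerBlock, interleaved_dist);
cudaDeviceSynchronize();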

4.3 Register the new design in CMake.

  • Add a compilation entry in CMakeLists.txt.
  • Add the corresponding build command for the new target in 0_mgg_build.sh (see the sketch after the CMake excerpt below).
  • An example is shown below. Please match the filename with the .cu file you created in step 4.1.

add_executable(MGG_np_div_kernel
    src/mgg_np_div_kernel.cu
    include/loss.cu
    include/layer.cu
)

An excerpt from MGG_OSDI23/CMakeLists.txt (lines 218 to 249 at commit 73e1866):

    # (tail of the previous target's link libraries)
    cudnn
    gomp
    curand
)
#############################################################################
target_include_directories(MGG_np_div_kernel
    PRIVATE
    ${NVSHMEM_HOME}/include
    ${CUDA_HOME}/include
    ${MPI_HOME}/include
    ${CUDNN_HOME}/include
    include
)
target_compile_options(MGG_np_div_kernel PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:
    -Xcompiler
    -pthread
    -rdc=true
    -ccbin g++
    -arch ${SM_ARCH}
>)
set_target_properties(MGG_np_div_kernel PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
set_target_properties(MGG_np_div_kernel PROPERTIES CUDA_ARCHITECTURES "80")
target_link_libraries(MGG_np_div_kernel
    nvshmem
    cuda
    mpi_cxx
    mpi
    cublas
    # ... (excerpt truncated at line 249)
4.4 Launch the MGG docker and recompile.

cd docker 
./launch.sh
cd build && cmake ..
cd .. && ./0_mgg_build.sh

  • The compiled executable will be located under build/.

4.5 Run the compiled executable.

An excerpt from MGG_OSDI23/bench_MGG.py (lines 5 to 51 at commit 73e1866):

import os, sys  # imported above the excerpted region in bench_MGG.py

os.environ["PYTHONWARNINGS"] = "ignore"
os.environ["LD_LIBRARY_PATH"] += os.pathsep + 'local/openmpi-4.1.1/lib/'
os.environ["PATH"] += os.pathsep + 'local/openmpi-4.1.1/bin/'
os.environ["LD_LIBRARY_PATH"] += os.pathsep + 'local/cudnn-v8.2/lib64'

# os.environ["NVSHMEM_SYMMETRIC_SIZE"] = '3690987520'   # paper100M
# os.environ["NVSHMEM_SYMMETRIC_SIZE"] = '7381975040'   # paper100M
# os.environ["NVSHMEM_SYMMETRIC_SIZE"] = '14763950080'  # paper100M
os.environ["NVSHMEM_SYMMETRIC_SIZE"] = '21474836480'    # paper100M

num_GPUs = int(sys.argv[1])
hidden = 16
warpPerblock = 4
partSize = 16
interleaved_dist = 16

dataset = [
    ('Reddit',        602, 41),
    ('enwiki-2013',   300, 12),
    ('it-2004',       256, 64),
    ('paper100M',     128, 64),
    ('ogbn-products', 100, 47),
    ('ogbn-proteins',   8, 112),
    ('com-Orkut',     128, 32),
]

GPU_avail = "CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 "
# GPU_avail = "CUDA_VISIBLE_DEVICES=4,5,6,7 "
pre_condit = GPU_avail + 'OMPI_MCA_plm_rsh_agent=sh mpirun --allow-run-as-root -np {} '.format(num_GPUs)

command = "build/MGG_np_div_mem "
# command = "build/MGG_np_div "
# command = "build/MGG_np_div_kernel "

for data, in_dim, out_dim in dataset:
    beg_file = "dataset/bin/{}_beg_pos.bin".format(data)
    csr_file = "dataset/bin/{}_csr.bin".format(data)
    weight_file = "dataset/bin/{}_weight.bin".format(data)
    if data != 'enwiki-2013':
        os.system(pre_condit + "{0} {1} {2} {3} {4} {5} {6} {7} {8} {9} {10}".
                  format(command, beg_file, csr_file, weight_file,
                         num_GPUs, partSize, warpPerblock, interleaved_dist,
                         in_dim, hidden, out_dim))
    else:
        pass  # ... (excerpt truncated at bench_MGG.py line 51)
Reference

Yuke Wang, et al. MGG: Accelerating Graph Neural Networks with Fine-grained Intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms. OSDI'23.
