Merge pull request #506 from m-yamanashi/add-v3-document
Make corrections for the 3rd division
s-yama authored Dec 2, 2024
2 parents 7ad28bd + 027d04e commit 9fab183
Showing 6 changed files with 61 additions and 63 deletions.
14 changes: 7 additions & 7 deletions v3/en/docs/getting-started.md
@@ -2,10 +2,10 @@

## Connecting to Interactive Node

To connect to the interactive node (*int*), the ABCI frontend, two-step SSH public key authentication is required.
To connect to the interactive node (*login*), the ABCI frontend, two-step SSH public key authentication is required.

1. Login to the access server (*as.v3.abci.ai*) with SSH public key authentication, so as to create an *SSH tunnel* between your computer and *int*.
2. Login to the interactive node (*int*) with SSH public key authentication via the SSH tunnel.
1. Login to the access server (*as.v3.abci.ai*) with SSH public key authentication, so as to create an *SSH tunnel* between your computer and *login*.
2. Login to the interactive node (*login*) with SSH public key authentication via the SSH tunnel.

In this document, ABCI server names are written in *italics*.

@@ -33,7 +33,7 @@ In this section, we will describe two methods to login to the interactive node u
Login to the access server (*as.v3.abci.ai*) with following command:

```
[yourpc ~]$ ssh -i /path/identity_file -L 10022:int:22 -l username as.v3.abci.ai
[yourpc ~]$ ssh -i /path/identity_file -L 10022:login:22 -l username as.v3.abci.ai
The authenticity of host 'as.v3.abci.ai (0.0.0.1)' can't be established.
RSA key fingerprint is XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX. <- Display only at the first login
Are you sure you want to continue connecting (yes/no)? <- Enter "yes"
@@ -61,7 +61,7 @@ RSA key fingerprint is XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX. <- Displ
Are you sure you want to continue connecting (yes/no)? <- Enter "yes"
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Enter passphrase for key '/path/identity_file': <- Enter passphrase
[username@int1 ~]$
[username@login1 ~]$
```

#### ProxyJump
@@ -72,7 +72,7 @@ First, add the following configuration to your ``$HOME/.ssh/config``:

```
Host abci
HostName int
HostName login
User username
ProxyJump %r@as.v3.abci.ai
IdentityFile /path/to/identity_file
@@ -92,7 +92,7 @@ ProxyJump does not work with OpenSSH_for_Windows_7.7p1 which is bundled with Win

```
Host abci
HostName int
HostName login
User username
ProxyCommand C:\WINDOWS\System32\OpenSSH\ssh.exe -W %h:%p %r@as.v3.abci.ai
IdentityFile C:\path\to\identity_file
28 changes: 14 additions & 14 deletions v3/en/docs/job-execution.md
@@ -127,7 +127,7 @@ $ qsub -I -P group -q resource_type -l select=num [options]
Example) Executing an interactive job (On-demand service)

```
[username@int1 ~]$ qsub -I -P grpname -q rt_HF -l select=1
[username@login1 ~]$ qsub -I -P grpname -q rt_HF -l select=1
[username@hnode001 ~]$
```

@@ -180,7 +180,7 @@ $ qsub job_script
Example) Submission job script run.sh as a batch job (Spot service)

```
[username@int1 ~]$ qsub run.sh
[username@login1 ~]$ qsub run.sh
1234.pbs1
```

@@ -210,7 +210,7 @@ The major options of the `qstat` command are follows.
Example)

```
[username@int1 ~]$ qstat
[username@login1 ~]$ qstat
Job id Name User Time Use S Queue
--------------------- ---------------- ---------------- -------- - -----
12345.pbs1 run.sh username 00:01:23 R rt_HF
@@ -236,12 +236,12 @@ $ qdel job_ID
Example) Delete a batch job

```
[username@int1 ~]$ qstat
[username@login1 ~]$ qstat
Job id Name User Time Use S Queue
--------------------- ---------------- ---------------- -------- - -----
12345.pbs1 run.sh username 00:01:23 R rt_HF
[username@int1 ~]$ qdel 12345.pbs1
[username@int1 ~]$
[username@login1 ~]$ qdel 12345.pbs1
[username@login1 ~]$
```


@@ -296,7 +296,7 @@ $ qrsub options
Example) Make a reservation 4 compute nodes (H) from 2024/07/05 to 1 week (7 days)

```
[username@int1 ~]$ qrsub -a 20240705 -d 7 -P grpname -n 4 -N "Reserve_for_AI"
[username@login1 ~]$ qrsub -a 20240705 -d 7 -P grpname -n 4 -N "Reserve_for_AI"
Your advance reservation 12345 has been granted
```

@@ -318,7 +318,7 @@ To show the current status of reservations, use the `qrstat` command.
Example)

```
[username@int1 ~]$ qrstat
[username@login1 ~]$ qrstat
ar-id name owner state start at end at duration sr
----------------------------------------------------------------------------------------------------
12345 Reserve_fo root w 07/05/2024 10:00:00 07/12/2024 09:30:00 167:30:00 false
@@ -339,7 +339,7 @@ If you want to show the number of nodes that can be reserved, use`qrstat` comman

Checking the Number of Reservable Nodes for Compute Nodes
```
[username@int1 ~]$ qrstat --available
[username@login1 ~]$ qrstat --available
06/27/2024 441
07/05/2024 432
07/06/2024 434
@@ -359,7 +359,7 @@ To cancel a reservation, use the `qrdel` command. When canceling reservation wit
Example) Cancel a reservation

```
[username@int1 ~]$ qrdel 12345,12346
[username@login1 ~]$ qrdel 12345,12346
```

### How to use reserved node
@@ -369,14 +369,14 @@ To run a job using reserved compute nodes, specify reservation ID with the `-ar`
Example) Execute an interactive job on compute node reserved with reservation ID `12345`.

```
[username@int1 ~]$ qrsh -g grpname -ar 12345 -l rt_HF=1 -l h_rt=1:00:00
[username@login1 ~]$ qrsh -g grpname -ar 12345 -l rt_HF=1 -l h_rt=1:00:00
[username@hnode001 ~]$
```

Example) Submit a batch job on compute node reserved with reservation ID `12345`.

```
[username@int1 ~]$ qsub -P grpname -ar 12345 run.sh
[username@login1 ~]$ qsub -P grpname -ar 12345 run.sh
Your job 12345 ("run.sh") has been submitted
```

@@ -403,9 +403,9 @@ Advance Reservation does not guarantee the health of the compute node for the du

Example) hnode001 is available, hnode002 is unavailable
```
[username@int1 ~]$ qrsub -a 20240705 -d 7 -P grpname -n 2 -N "Reserve_for_AI"
[username@login1 ~]$ qrsub -a 20240705 -d 7 -P grpname -n 2 -N "Reserve_for_AI"
Your advance reservation 12345 has been granted
[username@int1 ~]$ qrstat -ar 12345
[username@login1 ~]$ qrstat -ar 12345
(snip)
message reserved queue gpu@hnode002 is disabled
message reserved queue gpu@hnode002 is unknown
21 changes: 10 additions & 11 deletions v3/en/docs/system-overview.md
@@ -2,7 +2,7 @@

## System Architecture

The ABCI system consists of 766 compute nodes with 6,128 NVIDIA H200 GPU accelerators and other computing resources, shared file systems with total capacity of approximately 74 PB, InfiniBand network that connects these elements at high speed, firewall, and so on. It also includes software to make the best use of these hardware. And, the ABCI system uses SINET6, the Science Information NETwork, to connect to the Internet at 100 Gbps.
The ABCI system consists of 766 compute nodes with 6,128 NVIDIA H200 GPU accelerators and other computing resources, 75PB of physical storage, InfiniBand network that connects these elements at high speed, firewall, and so on. It also includes software to make the best use of these hardware. And, the ABCI system uses SINET6, the Science Information NETwork, to connect to the Internet at 100 Gbps.

## Computing Resources

@@ -11,35 +11,34 @@ Below is a list of the computational resources of the ABCI system.
| Node Type | Hostname | Description | # |
|:--|:--|:--|:--|
| Access Server | *as.v3.abci.ai* | SSH server for external access | 2 |
| Interactive Node | *int* | Login server, the frontend of the ABCI system | 5 |
| Interactive Node | *login* | Login server, the frontend of the ABCI system | 5 |
| Compute Node (H) | *hnode001*-*hnode108*[^1] | Server w/ NVIDIA H200 GPU accelerators | 108 |

[^1]: 766 compute nodes (H) will become available around January 2025.

!!! note
Due to operational and maintenance reasons, some computing resources may not be provided.

Among them, each interactive node and compute node (H) are equipped with InfiniBand HDR and are connected to Storage Systems described later by InfiniBand switch group.
Also, each compute node (H) is equipped with 8 port of InfiniBand NDR and the compute nodes (H) are connected by InfiniBand switch.
Among them, each interactive node and compute node (H) are equipped with InfiniBand HDR (200 Gbps) and are connected to Storage Systems described later by InfiniBand switch group.
Also, each compute node (H) is equipped with 8 port of InfiniBand NDR (200 Gbps) and the compute nodes (H) are connected by InfiniBand switch.

Below are the details of these nodes.

### Interactive Node

The interactive node of ABCI system consists of HPE ProLiant DL380 Gen11.
The interactive node is equipped with two Intel Xeon Platinum 8468 Processors and approximately 1100 GB of main memory available.
The interactive node is equipped with two Intel Xeon Platinum 8468 Processors and approximately 1024 GB of main memory available.

The specifications of the interactive node are shown below:

| Item| Description | # |
|:--|:--|:--|
| CPU | Intel Xeon Platinum 8468 Processor 2.1 GHz, 48 Cores | 2 |
| Memory | 68 GB DDR5-4800 | 16 |
| Memory | 64 GB DDR5-4800 | 16 |
| SSD | SAS SSD 960 GB | 2 |
| SSD | NVMe SSD 3.2 TB | 4 |
| Interconnect | InfiniBand HDR (200 Gbps) | 2 |
| | 10GBASE-SR | 1 |
| | 1GBASE-SR | 1 |

Users can login to the interactive node, the frontend of the ABCI system, using SSH tunneling via the access server.

@@ -66,7 +65,7 @@ The specifications of the compute node (H) are shown below:
|:--|:--|:--|
| CPU | Intel Xeon Platinum 8558 2.1GHz, 48cores | 2 |
| GPU | NVIDIA H200 SXM 141GB | 8 |
| Memory | 68 GB DDR5-5600 4400 MHz | 32 |
| Memory | 64 GB DDR5-5600 4400 MHz | 32 |
| SSD | NVMe SSD 7.68 TB | 2 |
| Interconnect | InfiniBand NDR (200 Gbps) | 8 |
| | InfiniBand HDR (200 Gbps) | 1 |
@@ -75,7 +74,7 @@ The specifications of the compute node (H) are shown below:

## Storage Systems

The ABCI system has three storage systems for storing large amounts of data used for AI and Big Data applications, and these are used to provide shared file systems. The total effective capacity is up to approximately 74 PB.
The ABCI system has three storage systems for storing large amounts of data used for AI and Big Data applications, and these are used to provide shared file systems. Combined, /home, /groups, and /groups_s3 have an effective capacity of approximately 74 PB.

| # | Storage System | Media | Usage |
|:--|:--|:--|:--|
@@ -85,7 +84,7 @@ The ABCI system has three storage systems for storing large amounts of data used

Below is a list of shared file systems provided by the ABCI system using the above storage systems.

| Usage | Mount point | Capacity | File system | Notes |
| Usage | Mount point | Effective capacity | File system | Notes |
|:--|:--|:--|:--|:--|
| Home area | /home | 10 PB | Lustre | |
| Group area | /groups | 63 PB | Lustre | |
@@ -125,7 +124,7 @@ The software available on the ABCI system is shown below. Details on the version
| File System | DDN Lustre | | |
| | BeeOND | | |
| Object Storage | DDN S3 API | | |
| Container | Singularity-CE | | |
| Container | SingularityCE | | |
| MPI | Intel MPI | | |
| Library | cuDNN | | |
| | NCCL | | |
12 changes: 6 additions & 6 deletions v3/ja/docs/getting-started.md
@@ -2,10 +2,10 @@

## インタラクティブノードへの接続 {#connecting-to-interactive-node}

ABCIシステムのフロントエンドであるインタラクティブノード(ホスト名: *int*)に接続するには、二段階のSSH公開鍵認証による接続を行います。
ABCIシステムのフロントエンドであるインタラクティブノード(ホスト名: *login*)に接続するには、二段階のSSH公開鍵認証による接続を行います。

1. SSH公開鍵認証を用いてアクセスサーバ(ホスト名: *as.v3.abci.ai*)にログインして、ローカルPCとインタラクティブノードの間にSSHポートフォワーディングによるトンネリング(以下「SSHトンネル」という)を作成
2. SSHトンネルを介して、SSH公開鍵認証を用いてインタラクティブノード(*int*)にログイン
2. SSHトンネルを介して、SSH公開鍵認証を用いてインタラクティブノード(*login*)にログイン

なお本章では、ABCIのサーバ名は *イタリック* で表記します。

@@ -33,7 +33,7 @@ ABCIシステムのフロントエンドであるインタラクティブノー
以下のコマンドでアクセスサーバ(*as.v3.abci.ai*)にログインし、SSHトンネルを作成します。

```
[yourpc ~]$ ssh -i /path/identity_file -L 10022:int:22 -l username as.v3.abci.ai
[yourpc ~]$ ssh -i /path/identity_file -L 10022:login:22 -l username as.v3.abci.ai
The authenticity of host 'as.v3.abci.ai (0.0.0.1)' can't be established.
RSA key fingerprint is XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX. <- 初回ログイン時のみ表示
Are you sure you want to continue connecting (yes/no)? <- yesを入力
@@ -61,7 +61,7 @@ RSA key fingerprint is XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX. <- 初
Are you sure you want to continue connecting (yes/no)? <- yesを入力
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Enter passphrase for key '-i /path/identity_file': <- パスフレーズ入力
[username@int1 ~]$
[username@login1 ~]$
```

#### ProxyJumpの使用 {#proxyjump}
@@ -72,7 +72,7 @@ Enter passphrase for key '-i /path/identity_file': <- パスフレーズ入力

```
Host abci
HostName int
HostName login
User username
ProxyJump %r@as.v3.abci.ai
IdentityFile /path/to/identity_file
@@ -92,7 +92,7 @@ Windows 10 バージョン 1803 以降に標準でバンドルされている Op

```
Host abci
HostName int
HostName login
User username
ProxyCommand C:\WINDOWS\System32\OpenSSH\ssh.exe -W %h:%p %r@as.v3.abci.ai
IdentityFile C:\path\to\identity_file