
Local

On this page you can find information about both the hardware and software setups for distributed training.

1. Hardware configuration

Hardware performance has a strong influence on training speed. Before moving on to the software environment, let's talk about the hardware setup first:

  1. GPUs Make sure the remote servers have NVIDIA GPUs available (GTX 1080, RTX 2080, RTX 3080, etc.). These processors are the key to training deep learning models faster.

    GPU manufacturers

    The code only supports NVIDIA GPUs for now.

  2. CPUs Type head /proc/cpuinfo to check CPU information. Although most of the computation is offloaded to the GPUs, CPU performance (e.g. the number of cores) has a strong influence on multi-processing tasks such as loading the training and testing data. Besides, the choice of the number of workers for the data loader depends on the number of CPU cores.

  3. Storage Device Check the information of the storage device where the training and testing data are stored. A solid-state drive (SSD) usually has much faster read and load speeds than a hard disk drive (HDD).
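The three hardware checks above can also be scripted. The following is a minimal Python sketch (the helper names are my own, not part of any library); it degrades gracefully on machines without NVIDIA GPUs or a Linux /sys filesystem:

```python
import os
import shutil
import subprocess
from pathlib import Path

def list_nvidia_gpus():
    """Return NVIDIA GPU names via nvidia-smi, or [] if the tool is absent."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    if out.returncode != 0:
        return []
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]

def suggest_num_workers(reserved=2):
    """Heuristic: leave a couple of CPU cores for the main process and the OS."""
    cores = os.cpu_count() or 1
    return max(1, cores - reserved)

def classify_block_devices():
    """Map each block device to 'HDD' or 'SSD' using the kernel's rotational flag."""
    result = {}
    for dev in Path("/sys/block").glob("*"):
        flag = dev / "queue" / "rotational"
        if flag.exists():
            result[dev.name] = "HDD" if flag.read_text().strip() == "1" else "SSD"
    return result

print(list_nvidia_gpus())
print(suggest_num_workers())
print(classify_block_devices())
```

The num_workers heuristic is only a starting point; the best value also depends on the dataset and the preprocessing pipeline, so it is worth benchmarking a few settings.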

2. Software environment

To properly configure the software environment used for distributed training, the following steps are necessary:

NVIDIA driver

Depending on the CUDA version, the NVIDIA driver has to be installed properly. Take a look at the NVIDIA website for CUDA compatibility.

apt search nvidia-driver              # search resources to install Nvidia driver
sudo apt install nvidia-driver-460    # install a specific driver version (460 in this example)
sudo reboot                           # reboot the computer

Note

Usually, installing the latest version of the NVIDIA driver is the best choice.

After installation and rebooting, type the following commands in the terminal.

nvidia-smi                 # check current status of GPUs on the machine
watch -n0.1 nvidia-smi     # refresh the GPU status every 0.1 seconds

If the returned information looks like the following, then the GPUs and the NVIDIA driver are working correctly.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:09:00.0 Off |                  N/A |
| 27%   29C    P8     5W / 180W |      2MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:0A:00.0 Off |                  N/A |
| 28%   28C    P8     6W / 180W |      2MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 00000000:41:00.0 Off |                  N/A |
| 28%   32C    P8     5W / 180W |      2MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    Off  | 00000000:42:00.0 Off |                  N/A |
| 28%   33C    P8     6W / 180W |     17MiB /  8117MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Help

  1. The top row of the table shows the versions of both the NVIDIA driver and the CUDA library.
  2. Besides the NVIDIA driver and CUDA library information, the table also indicates the number of GPUs the operating system recognizes and the status of each GPU (e.g. temperature, memory usage, utilization, etc.).
  3. Usually we care about two columns: Memory-Usage and GPU-Util. If the memory consumption exceeds the capacity of the current GPU, the GPU will stop working immediately (out of memory). If the value of GPU-Util stays frozen, the GPU is not working as expected.
  4. The NVIDIA driver is installed system-wide, so different users share the same driver.
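The two columns called out above can also be queried programmatically rather than read off the table; a minimal sketch (gpu_memory_and_util is my own helper name, and it returns an empty list when nvidia-smi is unavailable):

```python
import shutil
import subprocess

def gpu_memory_and_util():
    """Query Memory-Usage (MiB) and GPU-Util (%) per GPU; [] if nvidia-smi is absent."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if out.returncode != 0:
        return []
    rows = []
    for line in out.stdout.splitlines():
        used, util = (int(x) for x in line.split(","))
        rows.append({"memory_used_mib": used, "gpu_util_pct": util})
    return rows

print(gpu_memory_and_util())
```

This is handy for logging GPU status periodically during a long training run instead of watching the nvidia-smi table by hand.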

PyTorch

Go to the official website of PyTorch and install the corresponding PyTorch version.

Note

  1. The CUDA libraries will be installed implicitly along with PyTorch.
  2. PyTorch will be installed in each user's local home folder, so different users have their own PyTorch installation.
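After installation, a quick sanity check confirms that PyTorch can see the GPUs. The sketch below assumes torch is installed and degrades gracefully otherwise:

```python
def report_cuda():
    """Print the CUDA GPUs PyTorch can see; returns the device count, or None if torch is missing."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return None
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            print(i, torch.cuda.get_device_name(i))
        return torch.cuda.device_count()
    print("CUDA is not available to PyTorch")
    return 0

report_cuda()
```

If the device count printed here does not match what nvidia-smi reports, the installed PyTorch build is likely a CPU-only wheel or was built against an incompatible CUDA version.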