Local¶
This page provides information about both the hardware and software setup for distributed training.
1. Hardware configuration¶
Hardware performance has a strong influence on training speed. Before moving on to the software environment, let's talk about the hardware setup first:
- GPUs
Make sure the remote servers have available NVIDIA GPUs (GTX 1080, RTX 2080, RTX 3080, etc.). These computational processors are the key to training deep learning models faster.
GPU manufacturers
My code only supports NVIDIA GPUs for now.
- CPUs
Type
head /proc/cpuinfo
to check CPU information. Although most of the computation should be handled by the GPUs, the CPU (e.g. the number of cores) has a strong influence on multi-processing tasks such as loading the training and testing data. Besides, the choice of the number of workers for the data loader depends on the number of CPU cores.
- Storage Device
Check the storage device where the training and testing data is stored. A solid-state drive (SSD) usually has much faster read and load speeds than a hard disk drive (HDD). A quick way to check all three items from the command line is sketched below.
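The exact commands depend on the Linux distribution, but as a minimal sketch, the following standard utilities can be used to check the three items above (the installed GPUs, the number of CPU cores, and whether the data drive is an SSD or an HDD):
lspci | grep -i nvidia # list the NVIDIA GPUs visible on the PCI bus
nproc # print the number of available CPU cores
lsblk -d -o NAME,ROTA,SIZE # ROTA=0 means SSD, ROTA=1 means rotational HDD
A common starting point is to keep the data loader's number of workers at or below the core count reported by nproc.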
2. Software environment¶
To properly configure the software environment used for distributed training, the following steps are necessary:
NVIDIA driver¶
Depending on the CUDA version, a compatible NVIDIA driver has to be installed. Take a look at the NVIDIA website for CUDA compatibility.
apt search nvidia-driver # search resources to install Nvidia driver
sudo apt install nvidia-driver-460 # install the NVIDIA driver (version 460 here)
sudo reboot # reboot the computer
Note
Usually, installing the latest version of the NVIDIA driver is the best choice.
After installation and rebooting, type the following commands in the terminal.
nvidia-smi # check current status of GPUs on the machine
watch -n0.1 nvidia-smi # refresh the GPU status every 0.1 seconds
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:09:00.0 Off |                  N/A |
| 27%   29C    P8     5W / 180W |      2MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:0A:00.0 Off |                  N/A |
| 28%   28C    P8     6W / 180W |      2MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 00000000:41:00.0 Off |                  N/A |
| 28%   32C    P8     5W / 180W |      2MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    Off  | 00000000:42:00.0 Off |                  N/A |
| 28%   33C    P8     6W / 180W |     17MiB /  8117MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
Help
- The top row of the table shows the versions of both the NVIDIA driver and the CUDA library.
- Besides the NVIDIA driver and CUDA library information, the table also indicates the number of GPUs the operating system recognizes and the status of each GPU (e.g. temperature, memory usage, utilization). See the example after this list for restricting a run to a subset of GPUs.
- We usually care most about two columns: Memory-Usage and GPU-Util. If memory consumption exceeds the capacity of the current GPU, the process stops with an out-of-memory error, while if the value of GPU-Util stays frozen at 0%, the GPU is not working as expected.
- The NVIDIA driver is installed system-wide, so different users share the same driver.
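When only some of the recognized GPUs should be used for a run, the CUDA_VISIBLE_DEVICES environment variable restricts which devices a process can see. In the sketch below, train.py is a hypothetical training script used only for illustration:
CUDA_VISIBLE_DEVICES=0,1 python train.py # this process only sees GPU 0 and GPU 1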
PyTorch¶
Go to the official PyTorch website and install the PyTorch version that matches your CUDA version; a quick way to verify the installation is shown after the note below.
Note
- The CUDA runtime library is installed implicitly along with PyTorch, so it does not need to be installed separately.
- PyTorch is installed in each user's local home folder, so different users have their own PyTorch installation.
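As a minimal sanity check (assuming the python on your PATH is the one PyTorch was installed into), the following one-liner prints the PyTorch version, whether CUDA is usable, and the number of GPUs PyTorch can see:
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
If torch.cuda.is_available() returns False even though nvidia-smi works, the installed PyTorch build usually does not match the driver/CUDA version shown above.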