Remote
SSH to the server
To do distributed training, we usually create and modify the code on a local machine first and then communicate with the GPU servers through ssh.
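For example, with a hypothetical username and server address, the basic connection looks like:
ssh alice@gpu-server.example.com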
Create new user
Use the following commands to create a new user if necessary. Different users will usually share the same NVIDIA driver but may use different PyTorch versions.
sudo adduser [username]
sudo usermod -aG sudo [username]
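As a quick check (a sketch; [username] is the placeholder from above), one can switch to the new account and confirm its group membership:
su - [username]
groups    # should list "sudo" among the new user's groups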
Password-free login
On the local machine, generate an SSH key first:
ssh-keygen
ssh-copy-id user@hostname.example.com
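If the key was copied successfully, the following should now log in without prompting for a password:
ssh user@hostname.example.com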
Help
- Make sure the username on the remote server matches the one on the local machine if one would like to log in without typing the username explicitly. This is helpful when there are multiple users on the same machine.
- How to ssh into a machine by a customized name instead of the server's IP address? On Ubuntu, do vim /etc/hosts in the terminal and add a new line like [server_ip_address] [my_server] (see the example below). After saving the file, one can directly do ssh [my_server] to log into the remote server.
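For example, with the hypothetical address 128.238.9.109, the new line in /etc/hosts would be
128.238.9.109 my_server
after which the server can be reached by name:
ssh my_server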
Test bandwidth between servers
Install the iperf package, which is included in most Linux distributions’ repositories.
# install iperf on both servers
apt-get install iperf
# on the first server, start iperf in server mode
iperf -s
# on the second server, connect to the first server as a client
iperf -c [first_server_ip_address]
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 128 KByte (default)
------------------------------------------------------------
[ 4] local 128.238.9.108 port 5001 connected with 128.238.9.109 port 42488
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 6.58 GBytes 5.65 Gbits/sec
Network bandwidth
The bandwidth is reported in Gbits/sec, where 1 Byte = 8 bits. Common Ethernet usually supports a bandwidth of 1 Gbit/sec, which might be too slow for distributed training: transferring 1 GByte of gradients would take roughly 8 s at that speed, but only about 0.8 s at 10 Gbits/sec (ignoring protocol overhead). For distributed training, a minimum bandwidth of 10 Gbits/sec is therefore recommended, which requires upgrading the network cable, the network switch, and the network interface controller. Data centers might use InfiniBand directly, which is much faster but also more expensive.
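To check whether the network interface controller already negotiates the desired speed, one could inspect the link with ethtool (a sketch; the interface name eth0 is only an example):
sudo ethtool eth0 | grep -i speed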
Execute remote command
- Execute a remote command
The general form is ssh [my_server] '[command]'. For example,
ssh [my_server] 'pkill -9 python'
kills all Python processes on the remote server, which is useful in case the distributed training pipeline breaks.
- Run multiple commands (see the example below)
ssh [my_server] '[command1]; [command2]; [command3]'
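For instance, one might pull the latest code and check the GPUs in a single call (the path and commands here are only placeholders):
ssh [my_server] 'cd ~/project; git pull; nvidia-smi'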