Remote

SSH to the server

For distributed training, we usually write and edit code on a local machine first and then communicate with the GPU servers over SSH.

Create new user

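Use the following commands to create a new user if necessary. Users on the same machine usually share the same NVIDIA driver but may use different PyTorch versions. The second command adds the user to the sudo group, so skip it if the user should not have administrator rights.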

sudo adduser [username]
sudo usermod -aG sudo [username]

Password-free login

On the local machine, generate an SSH key first:

ssh-keygen

Then copy the public key to the destination server:

ssh-copy-id user@hostname.example.com

After these two steps, one should be able to log in to the remote server without entering a password. Logging in from the server back to the local machine still requires a password; to remove that prompt as well, repeat the same steps on the server side.
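To confirm that key-based login works before relying on it in scripts, a small check like the following can help. It is a minimal sketch: the function name check_passwordless is made up for illustration, and it relies on the OpenSSH client options BatchMode (fail instead of prompting for a password) and ConnectTimeout.

```shell
# Hypothetical helper: succeeds only if login to the target works
# without any password prompt.
# Usage: check_passwordless user@hostname.example.com
check_passwordless() {
    # BatchMode=yes makes ssh fail instead of asking for a password;
    # ConnectTimeout=5 avoids hanging on an unreachable host.
    # The remote command "true" does nothing and exits with status 0.
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}
```

For example, check_passwordless user@hostname.example.com && echo "key login works" prints the message only when no password is required.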

Help

  1. Make sure the user name on the server is the same as the one on the local machine if one would like to log in without typing the user name explicitly. This is helpful when there are multiple users on the same machine.
  2. How to ssh to a machine by a custom name instead of the server's IP address? On Ubuntu, run sudo vim /etc/hosts in the terminal and add a new line like [server_ip_address] [my_server]. After saving the file, one can directly run ssh [my_server] to log in to the remote server.
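As an alternative to editing /etc/hosts, both points above can also be handled per user in the OpenSSH client configuration file ~/.ssh/config, which needs no root access. This is a different technique from the one described above. The sketch below prints such an entry; the function name ssh_alias_entry and the example values are hypothetical.

```shell
# Hypothetical helper: print an OpenSSH client config entry that maps
# a short alias to a server address and a remote user name.
# Usage: ssh_alias_entry <alias> <ip_or_hostname> <remote_user>
ssh_alias_entry() {
    printf 'Host %s\n    HostName %s\n    User %s\n' "$1" "$2" "$3"
}
```

Running ssh_alias_entry my_server 192.0.2.10 alice >> ~/.ssh/config appends the entry, after which ssh my_server connects to 192.0.2.10 as user alice.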

Test bandwidth between servers

Install the iperf package, which is included in most Linux distributions' repositories, on both servers.

sudo apt-get install iperf

On the terminal of one server, start iperf in server mode:

iperf -s

On the terminal of the other server, run the client against the first server:

iperf -c [first_server_ip_address]

The output on the first server should look like this:
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  4] local 128.238.9.108 port 5001 connected with 128.238.9.109 port 42488
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  6.58 GBytes  5.65 Gbits/sec

Network bandwidth

Bandwidth is reported in Gbits/sec, where 1 byte = 8 bits. Common Ethernet usually supports 1 Gbit/sec, which might be too slow for distributed training. To improve distributed training performance, a bandwidth of at least 10 Gbits/sec is needed. Reaching it may require upgrading the network cables, the network switch, and the network interface controllers. Data centers might use InfiniBand instead, which is much faster but also more expensive.
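To get a feel for what these numbers mean, the sketch below estimates how long one transfer of a given payload takes at a given bandwidth. The function name transfer_seconds and the 400 MB payload are illustrative assumptions (400 MB is roughly 100 million float32 parameters at 4 bytes each). The estimate uses decimal units and ignores latency, protocol overhead, and the communication pattern used by the training framework.

```shell
# Hypothetical helper: estimate the seconds needed to send a payload
# once over a link, using decimal units (1 Gbit = 1000 Mbit).
# Usage: transfer_seconds <payload_megabytes> <bandwidth_gbits_per_sec>
transfer_seconds() {
    # megabytes * 8 = megabits; Gbits/sec * 1000 = Mbits/sec
    awk -v mb="$1" -v gbps="$2" \
        'BEGIN { printf "%.2f\n", (mb * 8) / (gbps * 1000) }'
}
```

With these assumptions, transfer_seconds 400 1 prints 3.20 and transfer_seconds 400 10 prints 0.32, so the same payload moves ten times faster on a 10 Gbits/sec link.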

Execute remote command

  • Execute remote command

    ssh [my_server] '[command]'
    
    For example, ssh [my_server] 'pkill -9 python' kills the Python processes on the remote server, which is useful when a distributed training pipeline breaks and leaves processes behind.

  • Run multiple commands

    ssh [my_server] '[command1]; [command2]; [command3]'
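When the same command has to run on several servers, the single-server form above can be wrapped in a loop. The following is a minimal sketch; the function name run_on_all and the server names in the usage note are hypothetical.

```shell
# Hypothetical helper: run the same command on every listed server,
# one after another.
# Usage: run_on_all '<command>' <server1> [<server2> ...]
run_on_all() {
    cmd=$1
    shift
    for host in "$@"; do
        echo "== $host =="
        # Report a non-zero exit status but continue with the next server.
        ssh "$host" "$cmd" || echo "non-zero exit status on $host" >&2
    done
}
```

For example, run_on_all 'pkill -9 python' gpu1 gpu2 runs the cleanup command on both servers. Note that pkill itself exits with a non-zero status when no process matched, so a warning on a clean server is expected.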