You can get started with FastMoE either with Docker or by installing it directly on your machine.
Docker #
Environment Setup #
On host machine #
First, you need to set up the environment on the host machine.
Then, we recommend the official PyTorch docker image, as the environment is already well set up there. Note that you should use an image with devel in its tag, rather than runtime. In principle, no PyTorch environment is needed on the host machine itself.
For example, you can run
docker pull pytorch/pytorch:1.8.0-cuda11.1-cudnn8-devel
to get a PyTorch docker image.
Inside the docker #
Run a docker container with commands like:
docker run --name pytorch -it pytorch/pytorch:1.8.0-cuda11.1-cudnn8-devel
And use bash to interact with it:
docker exec -ti pytorch /bin/bash
The distributed expert feature requires NCCL. Inside the container, first check whether NCCL is installed, for example:
$ apt list --installed | grep nccl
libnccl-dev/unknown,now 2.8.4-1+cuda11.2 amd64 [installed]
libnccl2/unknown,now 2.8.4-1+cuda11.2 amd64 [installed]
If it is not, follow the official documentation to install the version that matches the CUDA version in your container (which you can inspect with nvcc -V). After that, you need to set up NCCL in your conda environment, following this.
Finally, you can check NCCL from Python with torch.cuda.nccl.version(). Additionally, there is an official repo for testing NCCL; running it is optional.
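Depending on the PyTorch release, torch.cuda.nccl.version() may return either an integer code (such as 2804) or a tuple (such as (2, 8, 4)). The sketch below normalizes both forms into a (major, minor, patch) tuple. The helper names normalize_nccl_version and report_nccl are illustrative and not part of FastMoE or PyTorch, and the integer layout is assumed to follow NCCL's NCCL_VERSION_CODE convention.

```python
def normalize_nccl_version(raw):
    """Return an NCCL version as a (major, minor, patch) tuple.

    Accepts a tuple such as (2, 8, 4) or an integer code such as 2804.
    """
    if isinstance(raw, tuple):
        # Newer PyTorch releases already return a tuple; keep the first three parts.
        return tuple(int(part) for part in raw[:3])
    code = int(raw)
    # Assumption: integer codes follow NCCL_VERSION_CODE, which is
    # major * 1000 + minor * 100 + patch up to 2.8.x and
    # major * 10000 + minor * 100 + patch from 2.9 on.
    scale = 1000 if code < 10000 else 10000
    major, rest = divmod(code, scale)
    minor, patch = divmod(rest, 100)
    return (major, minor, patch)


def report_nccl():
    """Return the NCCL version PyTorch reports, or None if it is unavailable."""
    try:
        import torch
        return normalize_nccl_version(torch.cuda.nccl.version())
    except Exception:
        # PyTorch is missing or was built without NCCL support.
        return None


if __name__ == "__main__":
    print("NCCL version seen by PyTorch:", report_nccl())
```

A result of None means that PyTorch could not be imported or could not report an NCCL version, in which case the installation steps above are worth revisiting.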
Installation #
Enter the FastMoE repo directory inside the prepared docker container. By default, the distributed expert feature is disabled, so you need to set the environment variable USE_NCCL=1 to enable it. Run python setup.py install to install FastMoE. You can then check the installation with:
$ conda list | grep fastmoe
fastmoe 0.1.1 pypi_0 pypi
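If conda is not available in your environment, a similar check can be done from Python. This is a minimal sketch that assumes FastMoE installs a top-level module named fmoe; the helper name is_installed is illustrative only.

```python
import importlib.util


def is_installed(module_name="fmoe"):
    """Return True if a top-level module with this name can be found."""
    # Assumption: FastMoE's importable module is called "fmoe".
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("fmoe importable:", is_installed())
```

This only confirms that the module can be located; it does not load the compiled CUDA extension.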
Finally, enjoy using FastMoE for training!
Without Docker #
Preparations #
To use FastMoE, CUDA and PyTorch are required.
- CUDA Toolkit is available at https://developer.nvidia.com/cuda-downloads. Select your operating system and follow the instructions on the website to install CUDA. Note: the CUDA version must be compatible with your NVIDIA driver version. If you are not sure whether the NVIDIA driver is installed, or you do not know its version, run nvidia-smi to find out.
- Add CUDA to your environment variables. On Linux, run vi ~/.bashrc and add the following lines to the end of the file (replace X.X with the CUDA version you installed):
export PATH=$PATH:/usr/local/cuda-X.X/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-X.X/lib64
export CUDA_HOME=/usr/local/cuda-X.X
Then run source ~/.bashrc to apply the changes. CUDA is now installed, and you can run nvcc --version to check its version.
- PyTorch can be installed with pip. Version >=1.8.0 is required if you want to use Megatron. After installation, run the following Python code:
import torch
torch.cuda.is_available()
torch.cuda.device_count()
If torch.cuda.is_available() returns True and torch.cuda.device_count() returns the number of GPUs on your machine, then congratulations! CUDA and PyTorch are working on your machine.
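The checks in this section can be gathered into one script. The sketch below reports whether nvcc is on PATH, whether CUDA_HOME points to an existing directory, and what PyTorch reports about CUDA. The function name check_cuda_toolchain and the keys of the returned dictionary are illustrative choices, not part of FastMoE.

```python
import os
import shutil


def check_cuda_toolchain(environ=None):
    """Collect basic facts about the CUDA and PyTorch setup without raising."""
    environ = os.environ if environ is None else environ
    cuda_home = environ.get("CUDA_HOME")
    report = {
        # True when an nvcc executable is found on the given PATH.
        "nvcc_on_path": shutil.which("nvcc", path=environ.get("PATH")) is not None,
        "cuda_home": cuda_home,
        # True when CUDA_HOME is set and names an existing directory.
        "cuda_home_exists": bool(cuda_home) and os.path.isdir(cuda_home),
        # Left as None when PyTorch is not installed.
        "torch_cuda_available": None,
        "torch_device_count": None,
    }
    try:
        import torch
    except ImportError:
        return report
    report["torch_cuda_available"] = torch.cuda.is_available()
    report["torch_device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in check_cuda_toolchain().items():
        print(f"{key}: {value}")
```

If nvcc_on_path or cuda_home_exists is False, the exports in ~/.bashrc are the first thing to recheck.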
NCCL #
- If you want to enable the distributed expert feature, download NCCL from https://developer.nvidia.com/nccl/nccl-legacy-downloads. The NCCL version should be no lower than 2.7.5 and should match your PyTorch version. You can call torch.cuda.nccl.version() to see the NCCL version that your PyTorch build expects.
- Install the ‘deb’ file. On Ubuntu or Debian, use the following commands (nccl_repo_file is your downloaded file; XXX and X.X stand for the NCCL and CUDA versions):
sudo dpkg -i nccl_repo_file.deb
sudo apt update
sudo apt install libnccl2=XXX+cudaX.X libnccl-dev=XXX+cudaX.X
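To make the XXX and X.X placeholders concrete, the sketch below builds the apt command from an NCCL package version and a CUDA version, and rejects NCCL versions below the 2.7.5 minimum stated above. The helper names are illustrative, and the example values 2.8.4-1 and 11.2 are taken from the apt listing shown in the Docker section, so substitute the versions that match your system.

```python
MIN_NCCL = (2, 7, 5)  # minimum NCCL version stated in this guide


def parse_version(text):
    """Turn '2.8.4-1' into (2, 8, 4), ignoring the Debian revision suffix."""
    upstream = text.split("-", 1)[0]
    return tuple(int(part) for part in upstream.split("."))


def nccl_apt_command(nccl_version, cuda_version):
    """Return the apt install command as a list of arguments."""
    if parse_version(nccl_version) < MIN_NCCL:
        raise ValueError(f"NCCL {nccl_version} is older than the required 2.7.5")
    # Package pins have the form <nccl version>+cuda<cuda version>.
    pin = f"{nccl_version}+cuda{cuda_version}"
    return ["sudo", "apt", "install", f"libnccl2={pin}", f"libnccl-dev={pin}"]


if __name__ == "__main__":
    print(" ".join(nccl_apt_command("2.8.4-1", "11.2")))
```

The script only prints the command; it does not run apt, so the result can be reviewed before anything is installed.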
FastMoE Installation #
Clone the FastMoE repo from https://github.com/laekov/fastmoe, and use the following command to install it:
python3 setup.py install
If you need NCCL, set the environment variable USE_NCCL=1 before running the installation. For example:
export USE_NCCL=1
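Because the variable has to be visible to the install process itself, it can help to set it and launch the installer in one step. The following is a minimal sketch that does this from Python. The function names build_install_env and install_fastmoe are hypothetical, and removing USE_NCCL to get the default behaviour is an assumption based on the statement that the feature is disabled by default.

```python
import os
import subprocess
import sys


def build_install_env(use_nccl, base_env=None):
    """Return a copy of the environment with USE_NCCL set or removed."""
    env = dict(os.environ if base_env is None else base_env)
    if use_nccl:
        env["USE_NCCL"] = "1"
    else:
        # Assumption: leaving the variable unset keeps the default (disabled).
        env.pop("USE_NCCL", None)
    return env


def install_fastmoe(repo_dir, use_nccl=True):
    """Run setup.py install inside repo_dir with the chosen NCCL setting."""
    subprocess.run(
        [sys.executable, "setup.py", "install"],
        cwd=repo_dir,
        env=build_install_env(use_nccl),
        check=True,  # raise CalledProcessError if the build fails
    )


# Example usage (the path is a placeholder for your clone of the repo):
# install_fastmoe("/path/to/fastmoe", use_nccl=True)
```

Using sys.executable ensures the build runs with the same Python interpreter, and therefore the same PyTorch installation, as the calling script.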
Installation is now complete. Enjoy FastMoE! You can try executing benchmark_mlp.py in the tests directory.