[https://mxnet.incubator.apache.org/ Apache MXNet] is a deep learning framework designed for both efficiency and flexibility. It allows you to mix symbolic and imperative programming to maximize efficiency and productivity. At its core, MXNet contains a dynamic dependency scheduler that automatically parallelizes both symbolic and imperative operations on the fly. A graph optimization layer on top of that makes symbolic execution fast and memory-efficient. MXNet is portable and lightweight, and it scales to multiple GPUs and machines.

= Available wheels =

You can list available wheels using the <code>avail_wheels</code> command.

{{Command
|avail_wheels mxnet
|result=
name    version    python    arch
------  ---------  --------  ------
mxnet   1.9.1      cp39      avx2
mxnet   1.9.1      cp38      avx2
mxnet   1.9.1      cp310     avx2
}}

= Installing in a Python virtual environment =

1. Create and activate a Python virtual environment.

{{Commands
|module load python/3.10
|virtualenv --no-download ~/env
|source ~/env/bin/activate
}}

2. Install MXNet and its Python dependencies.

{{Command
|prompt=(env) [name@server ~]
|pip install --no-index mxnet
}}

3. Validate the installation.

{{Command
|prompt=(env) [name@server ~]
|python -c "import mxnet as mx;print((mx.nd.ones((2, 3))*2).asnumpy());"
|result=
[[2. 2. 2.]
 [2. 2. 2.]]
}}

= Running a job =

1. Write your MXNet Python script; the following example runs a single convolution layer:

{{File
|name=mxnet-conv-ex.py
|lang="python"
|contents=
#!/bin/env python

import mxnet as mx
import numpy as np

num_filter = 32
kernel = (3, 3)
pad = (1, 1)
shape = (32, 32, 256, 256)

# Declare the computation symbolically
x = mx.sym.Variable('x')
w = mx.sym.Variable('w')
y = mx.sym.Convolution(data=x, weight=w, num_filter=num_filter, kernel=kernel, no_bias=True, pad=pad)

# MKLDNN will be used on CPU, cuDNN on GPU
device = mx.gpu() if mx.context.num_gpus() > 0 else mx.cpu()

# Bind the symbolic graph to concrete input shapes, then execute it
exe = y.simple_bind(device, x=shape)
exe.arg_arrays[0][:] = np.random.normal(size=exe.arg_arrays[0].shape)
exe.arg_arrays[1][:] = np.random.normal(size=exe.arg_arrays[1].shape)

exe.forward(is_train=False)
o = exe.outputs[0]
t = o.asnumpy()
print(t)
}}
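As noted in the introduction, the same computation can also be written imperatively with the NDArray API, where each operation executes as soon as it is called instead of being bound into a graph first. Below is a minimal sketch of an imperative equivalent of the script above; the file name is only a suggestion.

{{File
|name=mxnet-conv-imperative.py
|lang="python"
|contents=
#!/bin/env python

# Minimal sketch (hypothetical file name): the convolution above, written
# imperatively with the NDArray API. Each call executes eagerly; no bind
# step is required.

import mxnet as mx

device = mx.gpu() if mx.context.num_gpus() > 0 else mx.cpu()

x = mx.nd.random.normal(shape=(32, 32, 256, 256), ctx=device)
w = mx.nd.random.normal(shape=(32, 32, 3, 3), ctx=device)  # (num_filter, channels, kernel_h, kernel_w)

y = mx.nd.Convolution(data=x, weight=w, num_filter=32, kernel=(3, 3), no_bias=True, pad=(1, 1))
print(y.asnumpy().shape)  # (32, 32, 256, 256), since pad=(1, 1) preserves the spatial size
}}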
2. Edit one of the following submission scripts according to your needs.

For a CPU job:

{{File
|name=mxnet-conv.sh
|lang="bash"
|contents=
#!/bin/bash

#SBATCH --job-name=mxnet-conv
#SBATCH --account=def-someprof  # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=01:00:00         # adjust this to match the walltime of your job
#SBATCH --cpus-per-task=2       # adjust this to match the number of cores
#SBATCH --mem=20G               # adjust this according to the memory you need

# Load module dependencies
module load python/3.10

# Generate your virtual environment in $SLURM_TMPDIR
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate

# Install MXNet and its dependencies
pip install --no-index mxnet==1.9.1

# Will use MKLDNN
python mxnet-conv-ex.py
}}

For a GPU job:

{{File
|name=mxnet-conv.sh
|lang="bash"
|contents=
#!/bin/bash

#SBATCH --job-name=mxnet-conv
#SBATCH --account=def-someprof  # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=01:00:00         # adjust this to match the walltime of your job
#SBATCH --cpus-per-task=2       # adjust this to match the number of cores
#SBATCH --mem=20G               # adjust this according to the memory you need
#SBATCH --gres=gpu:1            # request one GPU; use more than 1 only for distributed training

# Load module dependencies
module load python/3.10

# Generate your virtual environment in $SLURM_TMPDIR
virtualenv --no-download ${SLURM_TMPDIR}/env
source ${SLURM_TMPDIR}/env/bin/activate

# Install MXNet and its dependencies
pip install --no-index mxnet==1.9.1

# Will use cuDNN
python mxnet-conv-ex.py
}}

3. Submit the job to the scheduler.

{{Command
|sbatch mxnet-conv.sh
}}

= High performance with MXNet =

== MXNet with multiple CPUs or a single GPU ==

Like PyTorch and TensorFlow, MXNet contains both CPU- and GPU-based parallel implementations of operators commonly used in deep learning, such as matrix multiplication and convolution, built on OpenMP and MKLDNN (CPU) or CUDA and cuDNN (GPU). Whenever you run MXNet code that performs such operations, it will either automatically leverage multi-threading over as many CPU cores as are available to your job, or run on the GPU if your job requests one.

That said, when training small-scale models we strongly recommend using '''multiple CPUs instead of a GPU'''. While training will almost certainly run faster on a GPU (except when the model is very small), if your model and your dataset are not large enough, the speedup relative to CPU will likely not be significant and your job will end up using only a small portion of the GPU's compute capabilities. This might not be an issue on your own workstation, but in a shared environment like our HPC clusters it means you are unnecessarily blocking a resource that another user may need to run actual large-scale computations! Furthermore, you would be unnecessarily using up your group's allocation and affecting the priority of your colleagues' jobs. Simply put, '''you should not ask for a GPU''' if your code is not capable of making reasonable use of its compute capacity.
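To see how this plays out before committing to a GPU request, you can have your script report which device MXNet selected and how many OpenMP threads are available to it. The following is a minimal sketch, not part of the example below; the file name is hypothetical, and <code>OMP_NUM_THREADS</code> is the standard OpenMP environment variable, which you may set in your job script (e.g. to <code>$SLURM_CPUS_PER_TASK</code>).

{{File
|name=mxnet-device-check.py
|lang="python"
|contents=
#!/bin/env python

# Minimal sketch (hypothetical file name): report the device MXNet will
# use, following the same selection logic as the examples on this page.

import os
import mxnet as mx

num_gpus = mx.context.num_gpus()
ctx = mx.gpu() if num_gpus > 0 else mx.cpu()
print(f"GPUs visible to MXNet: {num_gpus}; using context: {ctx}")

# MXNet's CPU operators are parallelized with OpenMP, so OMP_NUM_THREADS
# (if set) bounds the number of cores they will use.
print("OMP_NUM_THREADS =", os.environ.get("OMP_NUM_THREADS", "not set"))
}}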
The following example illustrates how to train a convolutional neural network using MXNet, with or without a GPU:

{{File
|name=mxnet-example.sh
|lang="bash"
|contents=
#!/bin/bash

#SBATCH --nodes 1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1  # change this parameter to 2,4,6,... to see the effect on performance
#SBATCH --gres=gpu:1       # remove this line to run using CPU only
#SBATCH --mem=8G
#SBATCH --time=0:05:00
#SBATCH --output=%N-%j.out
#SBATCH --account=         # adjust this to match the accounting group you are using to submit jobs

module load python  # Using the default Python version - make sure to choose a version that suits your application
virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
pip install mxnet --no-index

echo "starting training..."
python mxnet-example.py
}}

{{File
|name=mxnet-example.py
|lang="python"
|contents=
import argparse
import time

import numpy as np
from mxnet import autograd, context, cpu, gpu
from mxnet.gluon import nn, Trainer
from mxnet.gluon.data import DataLoader
from mxnet.gluon.data.vision import transforms
from mxnet.gluon.data.vision.datasets import CIFAR10
from mxnet.gluon.loss import SoftmaxCrossEntropyLoss

parser = argparse.ArgumentParser(description='cifar10 classification models, cpu performance test')
parser.add_argument('--lr', type=float, default=0.1, help='learning rate')
parser.add_argument('--batch_size', type=int, default=512, help='batch size')
parser.add_argument('--num_workers', type=int, default=0, help='number of data-loading workers')


def main():
    args = parser.parse_args()

    # Use the GPU if the job has one, otherwise fall back to the CPU
    ctx = gpu() if context.num_gpus() > 0 else cpu()

    # A small LeNet-style convolutional network
    net = nn.Sequential()
    net.add(nn.Conv2D(channels=6, kernel_size=5, activation='relu'),
            nn.MaxPool2D(pool_size=2, strides=2),
            nn.Conv2D(channels=16, kernel_size=5, activation='relu'),
            nn.MaxPool2D(pool_size=2, strides=2),
            nn.Flatten(),
            nn.Dense(120, activation="relu"),
            nn.Dense(84, activation="relu"),
            nn.Dense(10))
    net.initialize(ctx=ctx)

    criterion = SoftmaxCrossEntropyLoss()
    trainer = Trainer(net.collect_params(), 'sgd', {'learning_rate': args.lr})

    transform_train = transforms.Compose([transforms.ToTensor(),
                                          transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

    dataset_train = CIFAR10(root='./data', train=True).transform_first(transform_train)
    train_loader = DataLoader(dataset_train, batch_size=args.batch_size, shuffle=True,
                              num_workers=args.num_workers)

    perf = []

    for inputs, targets in train_loader:
        inputs = inputs.as_in_context(ctx)
        targets = targets.as_in_context(ctx)

        start = time.time()

        with autograd.record():
            outputs = net(inputs)
            loss = criterion(outputs, targets)

        loss.backward()
        trainer.step(batch_size=args.batch_size)

        # Throughput for this batch, in images per second
        batch_time = time.time() - start
        images_per_sec = args.batch_size / batch_time
        perf.append(images_per_sec)

        print(f"Current Loss: {loss.mean().asscalar()}")

    print(f"Images processed per second: {np.mean(perf)}")


if __name__ == '__main__':
    main()
}}
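As mentioned in the introduction, MXNet lets you mix imperative and symbolic programming. A Gluon model written imperatively, like the one above, can be converted into a symbolic graph so that the graph optimization layer can make its execution faster and more memory-efficient. Below is a minimal sketch of this technique using <code>HybridSequential</code> and <code>hybridize()</code>; the file name is only a suggestion, and the network is a truncated version of the example above.

{{File
|name=mxnet-hybridize-ex.py
|lang="python"
|contents=
#!/bin/env python

# Minimal sketch (hypothetical file name): hybridizing a Gluon network.
# After hybridize(), MXNet traces the first forward pass into a symbolic
# graph, and subsequent calls run through that optimized graph.

from mxnet import nd
from mxnet.gluon import nn

net = nn.HybridSequential()
net.add(nn.Conv2D(channels=6, kernel_size=5, activation='relu'),
        nn.MaxPool2D(pool_size=2, strides=2),
        nn.Flatten(),
        nn.Dense(10))
net.initialize()
net.hybridize()  # switch from imperative to symbolic execution

# The first call builds the graph; later calls reuse it
out = net(nd.random.normal(shape=(1, 3, 32, 32)))
print(out.shape)  # (1, 10)
}}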