<languages />
[[Category:AI and Machine Learning]]
<translate>
<!--T:1-->
[https://wandb.ai Weights & Biases (wandb)] is a <i>meta machine learning platform</i> designed to help AI practitioners and teams build reliable machine learning models for real-world applications by streamlining the machine learning model lifecycle. By using wandb, you can track, compare, explain and reproduce machine learning experiments.

== Using wandb on Alliance clusters == <!--T:2-->

=== Availability on compute nodes === <!--T:3--> 


<!--T:4-->
Full usage of <tt>wandb</tt> on compute nodes requires internet access as well as access to Google Cloud Storage, both of which may not be available depending on the cluster:

<!--T:5-->
{| class="wikitable"
|-
! Cluster !! Wandb Availability !! Note
|-
| Béluga || rowspan="3"| Limited ❌  || rowspan="3"| Users from '''MILA and other eligible groups only''' via <tt>httpproxy</tt>
|-
| Narval
|-
|TamIA
|-
| Cedar || Yes ✅ || <tt>httpproxy</tt> not required
|-
| Graham || No ❌ || internet access is disabled on compute nodes
|-
|Vulcan || Yes ✅ || <tt>httpproxy</tt> not required
|}

== Users from MILA and other eligible groups == <!--T:50-->

<!--T:51-->
Members of the MILA Québec AI Institute may use <tt>wandb</tt> on any of our clusters with internet access, provided that they use a valid '''Mila-org''' Weights & Biases account to log into <tt>wandb</tt>. Please see the table above for more information on modules required for using <tt>wandb</tt> on each cluster.

<!--T:52-->
Other groups are known to have made arrangements with Weights & Biases to bypass calls to the Google Cloud Storage API. Please contact your PI to find out if your group has made such arrangements.

== Béluga, Narval and TamIA == <!--T:40-->

<!--T:41-->
While it is possible to upload basic metrics to Weights&Biases during a job on Béluga, narval and TamIA, the wandb package will automatically attempt to upload information about your environment to a Google Cloud Storage bucket, which is not allowed on the compute nodes of these clusters. This will result in a crash during or at the very end of a training run. Your job may also freeze until it reaches its wall time, thereby wasting resources. It is not currently possible to disable this behaviour. Note that uploading artifacts to W&B with <code>wandb.save()</code> also requires access to Google Cloud Storage and will cause your job to freeze or crash.

<!--T:42-->
You can still use wandb by enabling the [https://docs.wandb.ai/library/cli#wandb-offline <code>offline</code>] mode. In this mode, wandb will write all metrics, logs and artifacts to the local disk and will not attempt to sync anything to the Weights&Biases service on the internet. After your jobs finish running, you can sync their wandb content to the online service by running the command [https://docs.wandb.ai/ref/cli#wandb-sync <code>wandb sync</code>] on the login node.

<!--T:46-->
Note that [[Comet.ml]] is a product very similar to Weights & Biases, and works on Béluga, Narval and TamIA.

== Example == <!--T:6-->

<!--T:7-->
The following is an example of how to use wandb to track experiments in offline mode. To run in online mode, load the module <tt>httpproxy</tt> on applicable clusters and follow the comments on the example script below.

<!--T:8-->
{{File
  |name=wandb-test.sh
  |lang="bash"
  |contents=
#!/bin/bash
#SBATCH --account=YOUR_ACCOUNT
#SBATCH --cpus-per-task=2 # At least two cpus is recommended - one for the main process and one for the wandB process
#SBATCH --mem=4G       
#SBATCH --time=0-03:00
#SBATCH --output=%N-%j.out


<!--T:9-->
module load python
virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
pip install --no-index wandb

<!--T:43-->
### Save your wandb API key in your .bash_profile or replace $API_KEY with your actual API key. Uncomment the line below and comment out "wandb offline" if running in online mode ###

<!--T:44-->
#wandb login $API_KEY 

<!--T:45-->
wandb offline

<!--T:12-->
python wandb-test.py
}}

<!--T:13-->
The script wandb-test.py is a simple example of metric logging. See [https://docs.wandb.ai W&B's full documentation] for more options.

<!--T:14-->
{{File
  |name=wandb-test.py
  |lang="python"
  |contents=
import wandb

<!--T:47-->
wandb.init(project="wandb-pytorch-test", settings=wandb.Settings(start_method="fork"))

<!--T:48-->
for my_metric in range(10):
    wandb.log({'my_metric': my_metric})

<!--T:39-->
}}

<!--T:49-->
After a training run in offline mode, there will be a new folder <code>./wandb/offline-run*</code>. You can send the metrics to the server using the command <code>wandb sync ./wandb/offline-run*</code>. Note that using <code>*</code> will sync all runs.

</translate>