{| class="wikitable"
|-
| Availability: '''March 31, 2025'''
|-
| Login node: '''tamia.alliancecan.ca'''
|-
| Globus collection: [https://app.globus.org/file-manager?origin_id=72c3bca0-9281-4742-b066-333ba0fdef72 TamIA's Globus v5 Server]
|-
| Data transfer node (rsync, scp, sftp,...): '''tamia.alliancecan.ca'''
|-
| Portal: to be announced
|}
tamIA is a cluster dedicated to artificial intelligence for the Canadian scientific community. Located at [http://www.ulaval.ca/ Université Laval], tamIA is co-managed with [https://mila.quebec/ Mila] and [https://calculquebec.ca/ Calcul Québec]. The cluster is named for the [https://en.wikipedia.org/wiki/Tamias eastern chipmunk], a common species found in eastern North America.
tamIA is part of [https://alliancecan.ca/en/services/advanced-research-computing/pan-canadian-ai-compute-environment-paice PAICE, the Pan-Canadian AI Compute Environment].
==Site-specific policies==
* By policy, tamIA's compute nodes cannot access the internet. If you need an exception to this rule, contact [[Technical_support|technical support]] explaining what you need and why.
* crontab is not offered on tamIA.
* Each job should be at least one hour long (at least five minutes for test jobs) and you can't have more than 1000 jobs (running and pending) at a time.
* The maximum duration of a job is one day (24 hours).
* Each job must use 4 GPUs, that is, one full node (see the example job script below).
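Below is a minimal sketch of a job script that satisfies these policies, requesting one full GPU node (4 GPUs) for 12 hours. The account name aip-someprof and the script train.py are placeholders; replace them with your own AIP allocation and program.

<pre>
#!/bin/bash
#SBATCH --account=aip-someprof   # placeholder: your AIP allocation
#SBATCH --nodes=1                # one full GPU node
#SBATCH --gpus-per-node=4        # all 4 GPUs on the node, as required by policy
#SBATCH --cpus-per-task=48       # all CPU cores of a GPU node
#SBATCH --mem=0                  # all available memory on the node
#SBATCH --time=12:00:00          # must not exceed 24 hours

module load StdEnv/2023
python train.py                  # placeholder: your own program
</pre>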
==Access==
To access the cluster, each researcher must complete [https://ccdb.alliancecan.ca/me/access_services an access request in the CCDB]. Access to the cluster may take up to one hour after completing the access request.
Eligible principal investigators are members of an AIP-type RAP (prefix aip-).
The procedure for sponsoring other researchers is as follows:
* In the '''[https://ccdb.alliancecan.ca/ CCDB home page]''', go to the ''Resource Allocation Projects'' table
* Look for the RAPI of the aip- project and click on it to be redirected to the RAP management page
* At the bottom of the RAP management page, click on '''Manage RAP memberships'''
* To add a new member, go to ''Add Members'' and enter the CCRI of the user you want to add.
==Storage==
{| class="wikitable sortable"
|-
| HOME <br> Lustre file system ||
* Location of home directories, each of which has a small fixed quota.
* You should use the project
space for larger storage needs.
* Small per user [[Storage_and_file_management#Filesystem_quotas_and_policies|quota]].
* There is currently no backup of the home directories. (ETA Summer 2025)
|-
| SCRATCH <br> Lustre file system ||
* Large space for storing temporary files during computations.
* No backup system in place.
* Large [[Storage_and_file_management#Filesystem_quotas_and_policies|quota]] per user.
* There is an [[Scratch_purging_policy|automated purge]] of older files in this space.
|-
| PROJECT <br> Lustre file system ||
* This space is designed for sharing data among the members of a research group and for storing large amounts of data.
* Large and adjustable per group [[Storage and file management#Filesystem_quotas_and_policies|quota]].
* There is currently no backup of the project space. (ETA Summer 2025)
|}
For transferring data via [[Globus]], you should use the endpoint specified at the top of this page, while for tools like [[Transferring_data#Rsync|rsync]] and [[Transferring_data#SCP|scp]] you can use a login node.
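For example, a typical rsync transfer to tamIA might look like the following sketch; the user name and the local and remote paths are placeholders.

<pre>
# Copy a local directory to your project space on tamIA (user name and paths are placeholders)
rsync -avP --no-g --no-p my_dataset/ username@tamia.alliancecan.ca:projects/aip-someprof/username/my_dataset/
</pre>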
==High-performance interconnect==
The [https://fr.wikipedia.org/wiki/Bus_InfiniBand InfiniBand] [https://www.nvidia.com/en-us/networking/quantum2/ NVIDIA NDR] network links together all of the nodes of the cluster. Each H100 GPU is connected to a single NDR200 port through an NVIDIA ConnectX-7 HCA. Each GPU server has 4 NDR200 ports connected to the InfiniBand fabric.
The InfiniBand network is non-blocking for compute servers and is composed of two levels of switches in a fat-tree topology. Storage and management nodes are connected via four 400Gb/s connections to the network core.
==Node characteristics==
{| class="wikitable sortable"
! nodes !! cores !! available memory !! CPU !! storage !! GPU
|-
| 42 || 48 || 512GB || 2 x [https://www.intel.com/content/www/us/en/products/sku/232380/intel-xeon-gold-6442y-processor-60m-cache-2-60-ghz/specifications.html Intel Xeon Gold 6442Y 2.6 GHz, 24C] || 1 x 7.68TB SSD || 4 x NVIDIA HGX H100 SXM 80GB HBM3 700W, connected via NVLink
|-
| 4 || 64 || 512GB || 2 x [https://www.intel.com/content/www/us/en/products/sku/232398/intel-xeon-gold-6438m-processor-60m-cache-2-20-ghz/specifications.html Intel Xeon Gold 6438M 2.2 GHz, 32C/64T] || 1 x 7.68TB SSD || none
|}
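For quick testing on a GPU node, an interactive allocation can be requested with salloc; the sketch below uses a placeholder account name and asks for one full node, in line with the scheduling policies above.

<pre>
# Interactive session on one full GPU node (account name is a placeholder)
salloc --account=aip-someprof --nodes=1 --gpus-per-node=4 --cpus-per-task=48 --mem=0 --time=1:00:00
</pre>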
===Software environments===
[[Standard software environments|StdEnv/2023]] is the standard environment on tamIA.
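As a quick sketch, you can confirm the environment and see which software modules it provides with the usual module commands:

<pre>
module load StdEnv/2023   # standard environment, normally loaded by default
module avail              # list the software modules available under it
</pre>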
==Monitoring jobs==
{{note|type=reminder|The portal is not yet available.}}
From the [https://portail.tamia.calculquebec.ca/ tamIA portal], you can monitor your jobs using CPUs and GPUs in real time or examine jobs that have run in the past. This can help you to optimize resource usage and shorten wait time in the queue.
You can monitor your usage of
* compute nodes,
* memory,
* GPUs.
It is important to use the resources you have been allocated and to adjust your requests when compute resources are underused or not used at all. For example, if you request 4 cores (CPUs) but use only one, you should adjust your job script accordingly.
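As an illustrative sketch, this is the kind of adjustment to make in a job script when monitoring shows that only one of four requested cores is actually used:

<pre>
# Before: four cores requested but only one used
#SBATCH --cpus-per-task=4

# After: request only what the job actually uses
#SBATCH --cpus-per-task=1
</pre>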