name: inverse
class: center, middle, inverse

# Using the Jean Zay AI cluster

Loïc Estève & Mauricio Diaz

.medium[
.orange[Dev Meetup: 5 minutes talk]
]

.affiliations[
  ![Inria](imgs/inria-logo.png)
]

---
# Jean Zay - the super-computer of the future for AI users

- 1000+ GPUs
- half of the super-computer is for more traditional HPC usage

Jean Zay Hardware:
http://www.idris.fr/eng/jean-zay/cpu/jean-zay-cpu-hw-eng.html

Jean Zay Documentation: http://www.idris.fr/eng/jean-zay

---
# How to get access

TL;DR: https://github.com/willowsierra/jean-zay

Important details:

- everything is in French: filling in the form in English is fine
- ask for <= 10000 GPU hours and they will be granted easily
- 5 lines is enough for the project description
- "Directeur de la structure de recherche": Éric Fleury
- "Responsable sécurité informatique": Didier Benza

---
# Some general comments

- the cultural gap between HPC sys-admins and AI users is huge
  - example: no internet access on most traditional "serious" clusters
- there is a lot of goodwill from the Jean Zay sys-admins: let's build on it
  and make constructive comments so that Jean Zay becomes a nice place to work

---
# Some comments about the procedure

- Remember: for HPC sys-admins this is a "lightweight" procedure
- For AI users like us, it is rather long and painful
  - one data point: it took us 3 weeks to get access to Jean Zay, from filling
    out the first form to being able to ssh into Jean Zay
- this talk is an attempt to make the procedure shorter and smoother for the
  people coming after us

---
# Quick description

- Scheduler: .orange[Slurm]
- Interactive mode is available
- Modes: mono-GPU, multi-GPU and multi-GPU MPI (see the example job script at
  the end of these slides)
- Storage spaces:
  * `$HOME` .small[.blue[(3 GB)]]
  * `$WORK` .small[.blue[(per-project quota)]]
  * `$SCRATCH` .small[.blue[(large quota)]]
  * `$STORE` .small[.blue[(archive)]]
  * `$DSDIR` .small[.blue[(popular datasets)]]

---
# Quick description

- Libraries ready and optimized (via .blue[module], see the example at the end
  of these slides):
  * Python 2/3
  * Conda
  * TensorFlow (1.4, 1.8, 1.13, 1.14, 2.0.0-beta1)
  * PyTorch (1.1)
  * Caffe (1.0)
  * CUDA (10.1.1)
  * NCCL (2.4.2)
  * cuDNN (10.1-v7.5.1.10)
  * OpenACC
  * CUDA-aware MPI
  * GPUDirect
- Walltime: 20h for batch jobs .red[( ! )], 2h for dev jobs .red[(!!)]

---
# Early feedback

- .green[Easy and fast] access (at least during the Grand Challenge period).
  However, after opening to the public, ~10 min wait in the long queue.
- .orange[Perfect setup:] a local cluster (with Slurm) for dev -> then Jean Zay
- .red[Idea:] set up an outbound proxy for Inria (to be discussed with the DSI)
- Poor communication about maintenance (e.g. this morning)

.small[
```
upd53tc@jean-zay.idris.fr password:
***********************************************************************
*                                                                     *
*                      Maintenance sur Jean Zay                       *
*                                                                     *
*              Le mardi 29 octobre 2019 de 8h30 a 12h00               *
*                                                                     *
***********************************************************************
Connection closed by 130.84.132.17
Connection to 192.168.19.1 closed.
```
]

---
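# Example: a mono-GPU batch job

.small[A minimal sketch of an sbatch script, assuming a PyTorch module and a
training script sitting in `$WORK`; the module name and submission options
below are illustrative, check the IDRIS documentation and `module avail`
before copying this.]

.small[
```
#!/bin/bash
#SBATCH --job-name=train-gpu     # name shown by squeue
#SBATCH --nodes=1                # one node
#SBATCH --ntasks-per-node=1      # one process (mono-GPU mode)
#SBATCH --gres=gpu:1             # request a single GPU
#SBATCH --cpus-per-task=10       # CPU cores for data loading
#SBATCH --time=02:00:00          # walltime, must stay under the 20h batch limit
#SBATCH --output=train-%j.out    # stdout/stderr file (%j = job id)

module purge
module load pytorch-gpu/py3/1.1  # hypothetical module name, check `module avail`

cd $WORK/my-project              # work from $WORK or $SCRATCH, not the 3 GB $HOME
srun python train.py
```
]

Save it as e.g. `train.slurm` and submit with `sbatch train.slurm`. For the 2h
dev/interactive mode, something like `srun --gres=gpu:1 --time=02:00:00 --pty bash`
gives a shell on a GPU node.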
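---
# Example: picking a software environment with module

.small[A sketch of how the .blue[module] command is typically used; the module
names below are guesses based on the list of installed libraries, run
`module avail` on Jean Zay to see the real ones.]

.small[
```
# see what is installed (names/versions vary across clusters)
module avail

# start from a clean environment, then load e.g. TensorFlow
module purge
module load tensorflow-gpu/py3/2.0.0-beta1   # hypothetical name

# check what got loaded (CUDA, cuDNN, NCCL usually come in as dependencies)
module list
python -c "import tensorflow as tf; print(tf.__version__)"
```
]

The same `module load` lines go at the top of a batch script (see the previous
slide), so the batch job runs with exactly the same environment.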