name: inverse
class: center, middle, inverse

# Using the Jean Zay AI cluster

Loïc Estève & Mauricio Diaz

.medium[
.orange[Dev Meetup: 5 minutes talk]
]

.affiliations[
  ![Inria](imgs/inria-logo.png)
]

---
# Jean Zay - the super-computer of the future for AI users

- 1000+ GPUs
- half of the super-computer is for more traditional HPC usage

Jean Zay Hardware:
http://www.idris.fr/eng/jean-zay/cpu/jean-zay-cpu-hw-eng.html

Jean Zay Documentation: http://www.idris.fr/eng/jean-zay

---
# How to get access

TL;DR: https://github.com/willowsierra/jean-zay

Important details:

- everything is in French: filling in the form in English is fine
- ask for <= 10000 GPU hours and they will be granted easily
- 5 lines is enough for the project description
- "Directeur de la structure de recherche": Éric Fleury
- "Responsable sécurité informatique": Didier Benza

---
# Some general comments

- the cultural gap between HPC sys-admins and AI users is huge
  - example: no internet access on most traditional "serious" clusters
- there is a lot of goodwill from the Jean Zay sys-admins: let's build on it
  and make constructive comments so that Jean Zay becomes a nice place to work

---
# Some comments about the procedure

- Remember: for HPC sys-admins this is a "lightweight" procedure
- For AI users like us, it is rather long and painful
  - one data point: it took us 3 weeks to get access to Jean Zay, from filling
    out the first form to being able to ssh into Jean Zay
- this talk is an attempt to make the procedure shorter and smoother for the
  people coming after us

---
# Quick description

- Scheduler: .orange[Slurm]
- Interactive mode is available
- Modes: mono-GPU, multi-GPU and multi-GPU MPI (see the example job script at
  the end of these slides)
- Storage spaces:
  * `$HOME` .small[.blue[(3 GB)]]
  * `$WORK` .small[.blue[(per-project quota)]]
  * `$SCRATCH` .small[.blue[(large quota)]]
  * `$STORE` .small[.blue[(archive)]]
  * `$DSDIR` .small[.blue[(popular datasets)]]

---
# Quick description

- Libraries ready and optimized (via .blue[module], see the example at the end
  of these slides):
  * Python 2/3
  * Conda
  * TensorFlow (1.4, 1.8, 1.13, 1.14, 2.0.0-beta1)
  * PyTorch (1.1)
  * Caffe (1.0)
  * CUDA (10.1.1)
  * NCCL (2.4.2)
  * cuDNN (10.1-v7.5.1.10)
  * OpenACC
  * CUDA-aware MPI
  * GPUDirect
- Walltime: 20h for batch jobs .red[( ! )], 2h for dev jobs .red[(!!)]

---
# Early feedback

- .green[Easy and fast] access (at least during the Grand Challenge period).
  However, after opening to the public, ~10 min wait in the long queue.
- .orange[Perfect setup:] a local cluster (with Slurm) for dev -> then Jean Zay
- .red[Idea:] set up an outbound proxy for Inria (to be discussed with the DSI)
- Poor communication about maintenance (e.g. this morning)

.small[
```
upd53tc@jean-zay.idris.fr password:
***********************************************************************
*                                                                     *
*                      Maintenance sur Jean Zay                       *
*                                                                     *
*              Le mardi 29 octobre 2019 de 8h30 a 12h00               *
*                                                                     *
***********************************************************************
Connection closed by 130.84.132.17
Connection to 192.168.19.1 closed.
```
]

---
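# Example: a mono-GPU batch job

.small[A minimal sketch of an sbatch script, assuming a PyTorch module and a
training script sitting in `$WORK`; the module name and submission options
below are illustrative, check the IDRIS documentation and `module avail`
before copying this.]

.small[
```
#!/bin/bash
#SBATCH --job-name=train-gpu     # name shown by squeue
#SBATCH --nodes=1                # one node
#SBATCH --ntasks-per-node=1      # one process (mono-GPU mode)
#SBATCH --gres=gpu:1             # request a single GPU
#SBATCH --cpus-per-task=10       # CPU cores for data loading
#SBATCH --time=02:00:00          # walltime, must stay under the 20h batch limit
#SBATCH --output=train-%j.out    # stdout/stderr file (%j = job id)

module purge
module load pytorch-gpu/py3/1.1  # hypothetical module name, check `module avail`

cd $WORK/my-project              # work from $WORK or $SCRATCH, not the 3 GB $HOME
srun python train.py
```
]

Save it as e.g. `train.slurm` and submit with `sbatch train.slurm`. For the 2h
dev/interactive mode, something like `srun --gres=gpu:1 --time=02:00:00 --pty bash`
gives a shell on a GPU node.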
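---
# Example: picking a software environment with module

.small[A sketch of how the .blue[module] command is typically used; the module
names below are guesses based on the list of installed libraries, run
`module avail` on Jean Zay to see the real ones.]

.small[
```
# see what is installed (names/versions vary across clusters)
module avail

# start from a clean environment, then load e.g. TensorFlow
module purge
module load tensorflow-gpu/py3/2.0.0-beta1   # hypothetical name

# check what got loaded (CUDA, cuDNN, NCCL usually come in as dependencies)
module list
python -c "import tensorflow as tf; print(tf.__version__)"
```
]

The same `module load` lines go at the top of a batch script (see the previous
slide), so the batch job runs with exactly the same environment.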