-- MikeGore - 2018-12-03

tml.cs and tml2.cs hardware build notes and software configuration

Contacts

  • These machines are owned by Yaoliang Yu
  • Managed by Mike Gore

Notes

  • The main constraint in the design of this machine was to permit up to 4 liquid-cooled GPU cards in one chassis.
  • The biggest problem we faced is that most GPU cards are air cooled from the side, so putting 4 of them in every other slot would obstruct the fans.
  • i9 system with 64 GB RAM (expandable to 128 GB)
  • The motherboard needs at least 7 slots and the chassis must have 8 slots, because the GPU cards are two slots wide
    • GPU cards plug into slots: 1, 3, 5, 7
  • The build is only cost effective if you plan to add the 4 GPUs. Currently the CPU is 35% of the overall cost.
  • Overall this system runs quiet, as all of the fans are 120mm and, because of the large radiators, they do not need to run fast.
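
The slot requirement in the notes above can be checked with a little arithmetic. This is only an illustrative sketch: the function name slot_layout is made up here, and it assumes card i plugs into slot 2i-1 and also covers the neighbouring slot 2i.

```shell
#!/bin/bash
# Hypothetical helper: print which slots each two-slot-wide GPU card uses.
# Assumption: card i plugs into slot 2i-1 and also covers slot 2i.
slot_layout() {
    local cards="$1" i
    for ((i = 1; i <= cards; i++)); do
        echo "GPU $i: plugs into slot $((2 * i - 1)), also covers slot $((2 * i))"
    done
}

slot_layout 4
```

With 4 cards the last card plugs into slot 7 and covers slot 8, which is why the motherboard needs 7 slots and the chassis 8.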

Parts

Documentation

Fri Nov 30 15:14:29 CST 2018 IMPORTANT UPDATE: tml.cs.uwaterloo.ca has CUDA 9.0, cuDNN 7.05, TensorFlow and Torch installed.
The packages all depend on the specific CUDA version chosen.
We installed Anaconda to permit private Python environments. We then created an environment called "ml" for math learning.
FYI: TensorFlow and Torch use "ml" for their installation.
Place the following lines in any bash script you create:
    source "/root/gpu-setup"/install_env   - this sets search paths and library paths
    source activate ml                    - makes sure that you are in the ml workspace!
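
A slightly more defensive version of those two lines might look like the sketch below. The GPU_SETUP_DIR variable and the load_ml_env function are hypothetical names introduced here for illustration; only the install_env file and the "ml" environment come from the notes above.

```shell
#!/bin/bash
# Sketch of a script preamble. GPU_SETUP_DIR and load_ml_env are
# hypothetical names; the default path is the one given in these notes.
GPU_SETUP_DIR="${GPU_SETUP_DIR:-/root/gpu-setup}"

load_ml_env() {
    local env_file="$GPU_SETUP_DIR/install_env"
    if [ ! -r "$env_file" ]; then
        echo "cannot read $env_file" >&2
        return 1
    fi
    # Sets search paths and library paths
    source "$env_file" || return 1
    # Makes sure that we are in the ml workspace
    source activate ml
}

# In a real script, call it near the top and stop on failure:
#   load_ml_env || exit 1
```

The only difference from the two plain lines is that the script stops with a message when install_env is not readable, instead of carrying on with the wrong paths.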
To make it easy to share cuda and related tools between users I created a new system group called "ml"
    I added all users to the "ml" group
    You can run the script update_ml_users as root at any time to update all users to be part of the ml group
    Example ml group sharing:    chgrp -R ml /home/share;  chmod -R g+w /home/share
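
The sharing example above can be wrapped in a small function if it is needed for several directories. This is a sketch only; share_with_group is a hypothetical name and is not one of the scripts under gpu-setup.

```shell
#!/bin/bash
# share_with_group GROUP DIR
# Hypothetical wrapper around the chgrp/chmod example above: hands DIR
# (recursively) to GROUP and adds group write permission.
share_with_group() {
    local group="$1" dir="$2"
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    chgrp -R "$group" "$dir" && chmod -R g+w "$dir"
}

# Example from the notes (run as root):
#   share_with_group ml /home/share
```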
The following directories belong to the "ml" group and all their files have group write added to them:
    /usr/local/cuda*
    /usr/local/torch
    /usr/local/anaconda3
    "/root/gpu-setup"/cudnn_samples_v7

Installation scripts I used are under "/root/gpu-setup" and were run in the following order:
  install_1st
  install_2nd   - reboot after this
  install_cuda  - reboot after this
  install_cuDNN-7.05
  install_tensorflow
  install_torch
  Note: install_cuda, install_cuDNN-7.05,  install_tensorflow and install_torch can be rerun anytime
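
As a reminder of the order and the reboot points, the list above can be written down as a small dry-run helper. This is a sketch under assumptions: print_install_plan, INSTALL_STEPS and GPU_SETUP_DIR are hypothetical names, and the helper only prints the plan; it does not run any script or reboot the machine.

```shell
#!/bin/bash
# Hypothetical dry-run helper: prints the documented install order.
# It only echoes the steps; nothing is executed or rebooted.
GPU_SETUP_DIR="${GPU_SETUP_DIR:-/root/gpu-setup}"

# Steps in the documented order; a trailing ":reboot" marks a reboot point.
INSTALL_STEPS=(
    "install_1st"
    "install_2nd:reboot"
    "install_cuda:reboot"
    "install_cuDNN-7.05"
    "install_tensorflow"
    "install_torch"
)

print_install_plan() {
    local n=1 entry script
    for entry in "${INSTALL_STEPS[@]}"; do
        script="${entry%%:*}"    # strip the optional ":reboot" marker
        if [ "$entry" != "$script" ]; then
            echo "$n. $GPU_SETUP_DIR/$script (then reboot)"
        else
            echo "$n. $GPU_SETUP_DIR/$script"
        fi
        n=$((n + 1))
    done
}

print_install_plan
```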
Testing tensorflow:
  cd "/root/gpu-setup"
  ./test_tensorflow
Testing CUDA (system GPU benchmark):
  cd "/root/gpu-setup"
  ./benchmark_gpu
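
Before running either test it can help to confirm that the driver sees the cards at all. The sketch below is not part of the gpu-setup scripts; count_gpus is a hypothetical name, and it assumes the nvidia-smi tool that ships with the NVIDIA driver is on the PATH.

```shell
#!/bin/bash
# Hypothetical pre-check: count the GPUs reported by the NVIDIA driver.
# Assumes nvidia-smi is on the PATH; prints 0 and returns 1 if it is not.
count_gpus() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found" >&2
        echo 0
        return 1
    fi
    # "nvidia-smi --list-gpus" prints one "GPU n: ..." line per card
    nvidia-smi --list-gpus | grep -c '^GPU '
}

echo "GPUs visible: $(count_gpus)"
```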

Pictures

  • TML - with covers off power supply side view:
    IMG_20181203_102802.jpg

  • TML with covers off rear top view:
    IMG_20181203_102818.jpg

  • TML covers off CPU side view:
    IMG_20181203_102831.jpg

Install scripts

3 Dec 2018 - Mike Gore
  • Note: These scripts are in constant development. Please use the latest versions, which can be found on cscf-adm@asimov.uwaterloo.ca:/cscf-adm/src/gpu-setup

  • install_1st: initial install script - a few basic package installs

  • install_2nd: installs Anaconda, creates the Python environment "ml" for math learning, installs support packages

  • install_cuda: installs CUDA 9.0 and drivers using NVIDIA's site - removes any existing NVIDIA or CUDA drivers

  • install_env: source this file in your shell scripts to set up environment and library paths

Topic attachments
Attachment               Size      Date                Who       Comment
IMG_20181203_102802.jpg  949.7 K   2018-12-03 - 10:38  MikeGore  TML - with covers off power supply side view
IMG_20181203_102818.jpg  1014.8 K  2018-12-03 - 10:39  MikeGore  TML with covers off rear top view
IMG_20181203_102831.jpg  997.0 K   2018-12-03 - 10:40  MikeGore  TML covers off CPU side view
benchmark_gpu            0.2 K     2018-12-03 - 11:09  MikeGore  CUDA benchmarks
common_functions         84.0 K    2018-12-03 - 11:09  MikeGore  support shell functions used in all scripts
install_1st              8.7 K     2018-12-03 - 11:03  MikeGore  initial install script - a few basic package installs
install_2nd              1.3 K     2018-12-03 - 11:05  MikeGore  installs Anaconda, creates the Python environment "ml" for math learning, installs support packages
install_cuDNN-7.05       0.9 K     2018-12-03 - 11:06  MikeGore  install cuDNN 7.05
install_cuda             2.2 K     2018-12-03 - 11:06  MikeGore  installs CUDA 9.0 and drivers using NVIDIA's site - removes any existing NVIDIA or CUDA drivers
install_env              1.0 K     2018-12-03 - 11:08  MikeGore  source this file in your shell scripts to set up environment and library paths
install_pytorch          0.8 K     2018-12-03 - 11:07  MikeGore  install pytorch
install_tensorflow       1.8 K     2018-12-03 - 11:07  MikeGore  install tensorflow
test_tensorflow          0.2 K     2018-12-03 - 11:10  MikeGore  test tensorflow
update_ml_users          0.1 K     2018-12-03 - 11:11  MikeGore  add all users to the system group called ml
Topic revision: r5 - 2019-02-28 - MikeGore