SGI Altix System Administration

Dave Wright out of Eagan, Minnesota

daw@sgi.com

10.15.0.2-7 hostname: tng2-tng7

Course expects some amount of Red Hat background - SGI will be switching to Suse

   * SALE - SGI Advanced Linux Environment
   * EFI - goes with Itanium boxes
      * SGI 750 - early development boxes - not officially supported
   * ELILO - unique to IA-64
      * elilo.conf
      * startup and shutdown from a controller
      * Configuring ESP
      * Path conventions for SCSI and SATA
   * 2.4.21 - Propack 3
   * 2.6 - Propack 4
   * partitioning
      * fdisk (gone with Suse)
      * parted
   * XVM
      * stripes, mirrors
      * can partition a drive, but not recommended
      * works in a cluster, CXFS
   * XFS
      * from SGI
      * ships with standard Suse
      * Suse on SGI uses standard
   * Performance Co-Pilot - PCP
      * bundled on the Altix (licensed on Irix)
   * kdb - kernel debugger
      * kernel interrupt information
      * "survival training" - get basic diagnostic info
   * LKCD, lcrash
      * obtain a system dump
      * bundled with Suse now
   * module command
      * manipulate paths
   * ethernet interface (eth0)
   * SGI Advanced Linux Environment and Propack support issues
   * Suse updates (YOU)
      * go to sgi.com
   * System Administration
      * Suse - YAST, YAST2
      * Linux - setup

---++ SGI Installation
   * 4 SALE CDs
      * Disk 1 is rescue disk
   * 2 Propack CDs
   * Docs
      * /usr/src/linux/Documentation
      * http://techpubs.sgi.com
   * installer is slightly different from standard Red Hat CD
      * includes support for XFS
      * understand the PROM code

---++ Load the first CD
   * from L1 prompt, issue =reset= command
   * from boot menu, choose EFI Shell
   * choose fs# for CD-ROM
   * from fs#: =cd efi\boot=
   * =elilo=
   * skip testing CD
   * custom installation
   * Autopartition
      * ignore partition errors (SGI versions)
   * Remove all partitions
      * install on target1 (not target2) (on this system)
      * 500M - /boot/efi
      * 9G - swap - don't make too big (no longer needs to be size of physical memory)
      * 25G - root
      * remove everything on target2
   * network
      * pci - default internet
      * on-board disabled
      * 10.15.0.2 255.255.255.0
   * package selection - use lab
     manual
   * SGI ProPack installation
      * common problems with the 750 installation
         * tty - image - add dig
         * mouse - link to /dev/psaux
         * modules
      * /mnt/cdrom/INSTALL - install ALL
      * customize software as per lab manual

/.unconfigured
   * prompts to change root passwd
   * runs netconfig, timeconfig, kbdconfig, authconfig, ntsysv

/fastboot, /fsckoptions, /forcefsck, /halt
   * used by /etc/rc.sysinit

---++ System information
   * cat /etc/*rel*
      * /etc/sgi-release
      * LSB_Version 1.3
      * Red Hat Enterprise Linux AS release 3 (Taroon)
      * SGI ProPack 3
   * cd /proc/sgi_sn
      * cat system_serial_number
   * /etc/sysconfig/networking/eth0_persist
      * eth0 08:00:69:13:db:88
   * uname -a
   * lmhostid - FlexLM host ID

---++ Common rpm options
   * -qa - query for installed packages
   * -ivh - install packages
   * -ql - list files in a package
   * -qf - list the package that a file is from
   * -V - verify package
   * -U - upgrade and replace older package
   * -e - erase package
   * -qpil - query rpm file for information
   * -F - upgrade installed packages (only)
   * --force - downgrade
   * --last - history of packages

---++ User accounts
   * useradd - add users (adduser also works in Red Hat - not Suse)
   * usermod
   * userdel
   * passwd
   * chage
   * pwconv - creates /etc/shadow
   * pwunconv
   * useradd
      * creates home directory
      * copies files from /etc/skel into home
      * a group with the same name as the username is also created, called a User Private Group
   * /etc/default/useradd gives defaults
   * /etc/profile.d scripts are run

---++ Host info
   * /etc/sysconfig/network (Red Hat)
      * use =setup= in Red Hat
      * use =yast= in Suse
   * /etc/sysconfig/network-scripts
      * ifcfg-eth0

---++ Hardware
   * I/O slots
      * 6 buses, 2 slots per bus
      * rotate through buses first
      * don't mix card types on same bus
   * IO9 card
   * bandwidth vs latency

---++ Kernel modules
   * keeps kernel small
   * modules located in /lib/modules/<kernelname>
      * kernel name must exist

Day 2
=====

---++ Application Performance Tuning

Resources
   * CPU
   * memory
   * disk
   * cache
   * network
   * IPC

Health of System - "shrink"
   * sar
   * pcp
   * pmchart
   * topdisk

Quality of Service - "accountant"
   * time to solution
   * top codes, top users

Profiling - "efficiency expert"
   * we'll
     focus on CPU time primarily
   * Floating point, Integer, Branch, inefficiencies
   * top application for profiling:
      * histx http://www.sgi.com/products/software/histx.html
         * similar to "SpeedShop" in Irix
         * SGI working on an open source version of SpeedShop
      * Intel: Vtune
      * Linux community: PAPI http://icl.cs.utk.edu/papi/
      * NCSA: psrun (similar to histx)
      * SGI Propack utility: profile.pl
      * avoid strace
      * Ed: prof / gprof - "garbage"
      * pfmon

---++ Application Behavioural Problems

---+++ Cache misses
   * Cache thrash
      * set associativity
      * avoid with padding
      * avoid array sizes that are a power of 2
   * Stride - how we stride through the data
      * TLB misses (http://www.cs.umass.edu/~weems/CmpSci635A/Lecture11/L11.18.html)
      * cache misses
      * columns vs rows - going by rows vs columns, or vice versa, can make a significant difference (e.g. switch i and j: 1000 secs vs 1700 secs)
      * avoid with:
         * larger pages
         * change stride i,j / j,i
         * transposing
         * re-organize array
   * Cache busting - data larger than cache
      * avoid with:
         * blocking (chunking data into cache-sized pieces)
         * multithreading
   * System cache thrash - sharing the caches
      * swapping between processes, reloading caches
      * avoid with:
         * dplace - place/pin processes into that CPU set
         * cpuset - private CPU set
         * page coloring (software solution)
         * do not have processes/threads share a CPU
   * TLB misses
   * Floating Point errors - show up as system time
   * software pipelining - multiple instruction pipes
      * make sure that every pipe has something to do
      * compiler will do that, but may need clues / directives
   * False cache sharing - unique to multi-processor systems
      * where threads "step on each other" and cause the CPUs to refresh their caches
      * 2 CPUs writing to the same boundary area
   * Barrier synchronization - "#1 problem out there"
      * app taking 40 secs vs 3 weeks ...
      * environment variables to guide how that is done
   * CFQ - Completely Fair Queuing - Robert Love

Time categories
   * User
   * System
   * Memory Use
   * I/O wait

---++ Module 8 - Application User Time
   * viewing stack - idb, gdb, totalview
   * histx - shareware
   * Application Tuning - top cpu, top i/o, etc
      * csacms
      * top, then use options: C, i, I (Irix mode), fu
         * Irix mode shows % of CPU
   * Application Tuning steps
      * let the compiler do the work
      * use existing libraries
      * profile the application
      * recode the expensive algorithms
      * resolve software pipelining
      * tune single threaded first
      * multi-thread and run with dplace or a cpuset
         * SGI's MPT's MPI is NUMA topology aware
      * fix barrier synchronization / load balance problems
      * resolve false cache sharing and data placement
      * do friendly, well formed I/O
   * Compiler choices
      * Intel Compilers
         * ifort - Fortran 77, 90 and 95
         * icc - C and C++
         * guideefc, guidec (for OpenMP programs)
         * KAP/PRO OpenMP directives
         * parallel
         * Vtune analyzer (GUI)
      * Other alternatives
         * gnu tools: gcc, g77 and g++ (from Red Hat)
         * ORC - the Open Research Compiler, based on SGI's Pro64 http://ipf-orc.sourceforge.net/
   * Compiler Optimization - better runtime performance, but longer compile
      * -O0 - no optimization
      * -O1 - local (just within routine)
      * -O2 - extensive but conservative - some swp - default
      * -O3 - aggressive (prefetch, IPO, LNO) - inter-procedure
   * Profiling Tools
      * gprof
      * profile.pl (Propack)
         * does not work in multi-user environment
         * samples a single CPU
      * pfmon - runs as parent of application and monitors hardware counters (only 4 counters at a time)
      * histx - evaluation software
         * runs as parent of application
         * iprep, csrep, lipfpm, samppm - report tools (after running histx)
      * strace - trace system calls, I/O characteristics
      * dlook - what node the pages are on - very verbose
      * top, pmap, ps -l
         * pmap: memory map
      * lsof - lists open files

Performance Tuning and Optimization Guide
http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi?coll=0650&db=bks&cmd=toc&pth=/SGI_Developer/OrOn2_PfTune

---++ Compiler Setup
   * cat /etc/motd - read notes
   * cat /local_pilatus/UsageNotes/Intel_Compilers - read notes
   * bash
   * source /opt/intel_cc_80/bin/iccvars.sh

---++ Histx Setup
   * cd /usr/local/histx
   * source histx+.sh
   * INTEL's VTUNE GUI
   * /usr/local/histx/doc/doc.txt - doc file
   * watching all nodes with a graphical display: pmshub

---++ GNU gprof experiment
   * info gprof
   * g77 -pg -O3 -o prog prog.f
   * ./prog
   * gprof prog gmon.out
   * more gmon.sum

Brian Sumner @ SGI - bls@sgi.com for latest histx

---++ Application Tuning Lab
   * histx not working
   * now working - histx 1.2a

---++ Processes
   * ASE 2.0
      * /proc/pid/*
      * /proc/tid/*
   * ASE 3.0 NPTL
      * /proc/pid/* (processes)
      * /proc/.tid/* (threads)
      * NPTL - Native POSIX Thread Library
   * 2.6
      * /proc/pid/task/tid/*

---++ Multi-threading techniques
   * Tightly coupled
      * OpenMP - micro-tasking - SMP aware, not cluster aware - preprocessor
      * Auto-tasking - compiler detects - SMP aware - parallel
      * LD_ASSUME_KERNEL=2.4.19 - pre PP3.0 OpenMP
      * MPI (Message Passing Interface) - macro-tasking - cluster aware
      * SHMEM - put/get message passing
      * pthreads - POSIX standard
      * clone - kernel system call - was sproc in older SGI (Irix)
   * Loosely coupled
      * InterProcess Communication (IPC)
   * Other multi-threading techniques
      * LINDA - compiler language
      * PVM - Parallel Virtual Machine
      * MLP - NASA Ames multitasking libraries

Sample app:
   * single thread - Real: 23s, User: 23s
   * -parallel (64 processors) - Real: 35s, User: 26m24s (!)

Counting Threads - ps, top
   * H option in top shows threads
   * ps -m shows threads

---++ Multi-threading issues
   * Data Locality
   * Partitioning Data (Chunk Scheduling)
      * load balancing - static, dynamic, guided
   * Orchestration (Thread Scheduling) - static or dynamic
   * Communication (Barrier Synchronization) - spin or yield
   * Data and Thread Placement

---++ Dplace
   * dplace -c16-31 -x2
      * places the job on CPUs 16-31
      * -x takes a skip mask, not a CPU stride: per the dplace man page, if bit N of the mask is set, the N+1th process forked is not placed, so -x2 leaves the second process unplaced (it does not skip every 2nd CPU)
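The "Partitioning Data (Chunk Scheduling)" item above is easier to picture with the index arithmetic written out. The following is only a sketch of static chunking, assuming one contiguous chunk per thread; =static_chunks= is a hypothetical helper name, not part of OpenMP, MPT or any SGI tool, and Python is used here purely to show the arithmetic (the course codes are Fortran).

```python
def static_chunks(n_items, n_threads):
    """Split range(n_items) into n_threads contiguous [start, stop) chunks.

    Hypothetical helper for illustration only. The first
    (n_items % n_threads) threads get one extra item, so chunk sizes
    differ by at most one.
    """
    if n_threads <= 0:
        raise ValueError("n_threads must be positive")
    base, extra = divmod(n_items, n_threads)
    chunks = []
    start = 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks


if __name__ == "__main__":
    # 10 loop iterations shared by 4 threads
    for thread_id, (lo, hi) in enumerate(static_chunks(10, 4)):
        print("thread", thread_id, "handles iterations", lo, "to", hi - 1)
```

Static chunks like these are fixed before the loop runs, which is cheap but only balances the load when every iteration costs about the same. Dynamic and guided scheduling hand out work at run time instead, trading some overhead for better balance.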
Trying to determine the nature of a por

   * ifort -O3 -g -o code2o3 code2.f -ldl - compile with symbol table
   * histx -l -e pm:L2_MISSES@50000 ./code2o3 - sample on every 50000th L2_MISSES event
   * iprep *14034 - show report with source lines
   * vi code2.f

See: aappl1.0.pdf for documentation on handling these kinds of issues

Summary: Kinds of Cache misses:
   * cache thrash - array results step on cache line
      * set associativity of the chip
      * avoid power-of-2 sizes / use padding
      * D(i) = A(i)+B(i)+C(i)
   * stride - TLB misses, walk through array, columns vs rows (i,j vs j,i)
      * transposing array
   * cache busting
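The stride item in the summary (i,j vs j,i) comes down to how far apart consecutive accesses are in memory. The sketch below assumes a Fortran-style column-major layout with 0-based indices. =column_major_offset= and =access_strides= are hypothetical names used only for this illustration, and Python is used just for the arithmetic; the timing difference itself would be seen in the compiled Fortran code.

```python
def column_major_offset(i, j, n_rows):
    """Element offset of A(i, j) in a column-major (Fortran-style) array, 0-based."""
    return i + j * n_rows


def access_strides(n_rows, n_cols, inner):
    """Return the set of offset jumps between consecutive inner-loop accesses.

    inner="i": the inner loop runs over i (walks down a column).
    inner="j": the inner loop runs over j (walks along a row).
    """
    strides = set()
    if inner == "i":
        for j in range(n_cols):
            for i in range(1, n_rows):
                strides.add(column_major_offset(i, j, n_rows)
                            - column_major_offset(i - 1, j, n_rows))
    elif inner == "j":
        for i in range(n_rows):
            for j in range(1, n_cols):
                strides.add(column_major_offset(i, j, n_rows)
                            - column_major_offset(i, j - 1, n_rows))
    else:
        raise ValueError("inner must be 'i' or 'j'")
    return strides


if __name__ == "__main__":
    print("inner loop over i:", access_strides(1000, 1000, "i"))
    print("inner loop over j:", access_strides(1000, 1000, "j"))
```

With the inner loop over i, every access is one element after the previous one, so each cache line and each page is used fully before moving on. With the inner loop over j, each access jumps a whole column (n_rows elements), which is the pattern behind the extra cache and TLB misses noted above.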
-- LawrenceFolland - 08 Jun 2005