Initial Planning - Monday 19 December 2005

  • First thing to do: look at the upgrade to ALE 3 and see how that went. It looks like I took things like backups for granted. (I believe what I did is just ensure we had good Legato backups, and then I rsync'ed everything to fireblade, which had about 400gb of free disk space at the time. Now fireblade is down, so I'll need to do something else if I want to duplicate that plan.)
  • Itemize configurations we'll need to have immediately handy, ie separate backup:
    1. SSH keys and config - /etc/ssh
    2. /etc/{passwd,shadow,group}
    3. Stuff for XVM modules to load: /etc/flexlm has license.dat
    4. Are there any other XVM-related configs? Scott says that the xvm disk headers should just get re-read with Suse (provided we've taken care of the license.dat thing).
  • Documentation we'll need handy:
    1. Release notes for Pro Pack 4 and any subsequent service packs
    2. Administrator manual for XVM (just in case!)
    3. Other documents as noted on page 13 of the ProPack 4 Start Here book
  • Things to do pre-upgrade:
    1. Kill user processes, kick users off, sync disks, and run last second rsync (if we have a place to which we can rsync)
    2. Mirror the root / system disk so we can revert - last time this hung the machine :(
    3. Install PROM 4.33 - per SLES SP2 release notes for Altix (page 20, when printed)
  • Considerations during installation:
    1. We'll want to use the kernel-sn2 image (per the Start Here book, p11)
  • Items to install and test post-upgrade:
    1. sn2 kernel modules for debugging per Start Here p32 (?)
    2. Legato backups and restores
    3. Any service packs: looks like the install kit came with SP2 and that's the most recent.
    4. System accounting
    5. Reconcile account userids with core region uids (inasmuch as possible) and also fix ownership on all filesystems (ick)
    6. histx / modules for icc
    7. Java - required for Maple anyway
    8. Maple 9.5 and 10
    9. ilog / CPLEX (talk to Chris Calzonetti)
    10. Intel compilers: versions 8 and 9 of ifort and icc
    11. Matlab 7
    12. ATLAS - may need to have the gcc version rebuilt
    13. postgresql 7.4.5 in case anybody wants at the gelato project's benchmarks

There's documentation for integrating these CD sets with a SuSE install mirror. I wonder if it's worthwhile looking at putting this stuff on mirror.cs.

Actual Process - 20 December 2005

I stashed stuff on my laptop in ~/pilatus. In there I put:

  • install Notes from SLES SP2
  • /etc/{group,shadow,passwd}
  • /etc/flexlm/license.dat
  • /etc/ssh/*
  • list of userids and their numerics from the core region (for later reconciliation) in a file called pilatus_users
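
The stash above could be gathered with something like the following. This is a minimal sketch, not the commands that were actually run: the helper name stash_configs, its two arguments, and the idea of staging the files in a directory before copying them off the machine are all assumptions. Only the file list and the ~/pilatus name come from these notes.

```shell
#!/bin/sh
# Sketch only: copy the configs listed above into a staging directory.
# stash_configs and both of its arguments are hypothetical.
#   $1  prefix to read from (empty means the live system)
#   $2  staging directory (defaults to ~/pilatus)
stash_configs() {
    root=${1:-}
    dest=${2:-$HOME/pilatus}
    mkdir -p "$dest/etc/flexlm" "$dest/etc/ssh" || return 1
    for f in group shadow passwd; do
        cp -p "$root/etc/$f" "$dest/etc/" || return 1
    done
    cp -p "$root/etc/flexlm/license.dat" "$dest/etc/flexlm/" || return 1
    # "ssh/." copies the directory contents, host keys included
    cp -pR "$root/etc/ssh/." "$dest/etc/ssh/" || return 1
    # shadow and the host keys should not be readable by anyone else
    chmod -R go-rwx "$dest"
}
```

Run as root with no arguments it would read the live /etc. The Intel license files that had to be recovered from backups on 21 December would be an obvious addition to the list.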

I also kept handy hardcopies of:

  • this web page
  • release notes for SLES SP2
  • the Getting Started book
  • SGI PROM rescue CD

So, first things first.

  1. Log in at the system console and run init 1 to chase everybody off. There were still people logged in, but I tried to get them to log off themselves. That didn't seem to actually log them off though. Fine. I'll just use halt then.
  2. Bring the system back up, immediately drop back to single user (init 1).
  3. Run /usr/local/maintenance/clone_script and cross fingers. It barfed out a bunch of errors, but I think that's normal - the target disk already had junk on it. It seems to take about 20 minutes to run.
  4. Reboot again, this time off the CD - so we can flash the PROM. On pilatus, the CDROM can be accessed from the EFI Shell as fs0:. See note at bottom.
  5. Power system off, and back on again. (L2 pwr d, then pwr u).
  6. I had to add an option to the EFI Boot Manager: "Boot from CDROM".
  7. Boot from SP2 CD1 and choose Install. No joy with text-based install, so use vnc. See Notes for details.
  8. Installer warned that it couldn't read /dev/sdc with parted. The Notes say to wipe the partitions out and start again (ick). It also complained about sdd, sde, sdf, right up to sdn. See Notes: Installing. Careful, it looks like it's seeing the TP9100 arrays as a JBOD. It didn't ask for all the CDs to be fed to it.
  9. Set root password, network configuration: static IP. Leave "SGI Cross Partition Network adapter" unconfigured.
  10. Enable VNC remote administration on the theory that if it's good for the install, it's good for the post-install too.
  11. I let it run the post-install upgrade, but it wants logins that I don't think I have right now (at all?). So abort that for now.
  12. CA settings - skip these for now.
  13. Add a user mpatters, just in case.
  14. System is responsive, but:
    • it's not fully patched up - need to fix that
    • hostname never got set - fixed with yast2
  15. I can log in as root via ssh (for now!) so delete my user account for now.
  16. mkdir /fsys1 in preparation for getting the TP9100s back on line.
  17. Scott helped me get the TP9100s back online. That involved installing the proprietary and open source SGI software (ProPack 4). esp is crapping out, segfaulting at startup. Likely the config files are hosed - will need to recover those from backups.
  18. Scott also helped me to set up the Online Update in yast - point it at SGI's server.
  19. The installer mounted each of the TP9100 devices it "saw" as /data*. Thanks guys, absolutely the wrong thing. Fix up fstab to remove those and to add the TPs.
  20. I changed the mount point for /dev/lxvm/raid2home to be /fsys1 (instead of /fsys1/home). Other home directories fixups listed under Notes. I chowned /home/oldtng to root:root.
  21. Discover that the system clock was not UTC. Fix that with date to set gross date, ntpdate to set it more or less exactly. sigh.
  22. Intermission: Scott wanted us to try cloning the disk again before applying patches. Clear off partition table on sdb and run clone_disk. That barfed, tell Scott.
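
The fstab cleanup in step 19 can be sketched as a filter that drops the installer's /data* entries. This is an illustration under assumptions, not what was typed at the time: drop_data_mounts is a made-up helper name, and it matches any mount point beginning with /data, which is only safe if nothing legitimate lives under that prefix.

```shell
#!/bin/sh
# Sketch only: print an fstab with the /data* entries removed.
# drop_data_mounts is a hypothetical helper. It writes to stdout and never
# edits the input file, so the result can be reviewed before it is installed.
drop_data_mounts() {
    awk '
        /^[ \t]*#/ || NF < 2 { print; next }   # keep comments and blank lines
        $2 ~ /^\/data/       { next }          # drop mount points under /data*
                             { print }
    ' "${1:-/etc/fstab}"
}
```

Something like drop_data_mounts /etc/fstab > /tmp/fstab.new, followed by a diff against the original, leaves the TP9100 entries to be added by hand.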

21 December 2005

  1. finish off reconciliation of userids and gids.
  2. Find Legato client for Altix. Grab 7.2.1 for ia64 from Legato's download site. RPM installed it just fine, fix /nsr/res/servers.
  3. Recover Intel license files from backups. Don't forget to take separate copies of them next time.
  4. While that's going, fix up /etc/sudoers - we'll allow people in group wheel to sudo.
  5. We got hosed on our hostid - it's changed again. So matlab is busted. Maple works fine. We'll need to talk to Intel about the FlexLM licenses, I think.
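
For the sudoers change in step 4, the usual rule for letting group wheel use sudo is the single line %wheel ALL=(ALL) ALL. A minimal sketch of adding it idempotently follows; ensure_wheel_rule is a hypothetical helper name, and working on a copy of the file rather than on /etc/sudoers directly is an assumption about how one would want to do this safely.

```shell
#!/bin/sh
# Sketch only: append the wheel rule to a sudoers file if it is not there yet.
# ensure_wheel_rule is a hypothetical helper. Run it on a copy and check the
# result with "visudo -c -f <file>" before installing it as /etc/sudoers.
ensure_wheel_rule() {
    file=$1
    rule='%wheel ALL=(ALL) ALL'
    # -x: whole-line match, -F: fixed string, so the parentheses are literal
    grep -qxF "$rule" "$file" 2>/dev/null || printf '%s\n' "$rule" >> "$file"
}
```

Calling it twice on the same file leaves a single copy of the rule.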

22 December 2005

  1. I realized that maybe I'm just not licensed for matlab. Romy Shioda is working on that with me.
  2. I blew away /local_pilatus by mistake during the install. Maybe I'll move that to fsys1. Meantime, recover from backups.
  3. Fix up SSH keys.
  4. Discover that it only allows 8 character passwords, because I'd taken the default DES encryption at install time. The setting is in yast2, under Security and Users, then Security Settings.
  5. Maple 10 doesn't want to start its FlexLM daemon. Details in the ST#50343, I don't want to give away our hostid and such.
  6. Get modules working for icc 9.0 at least.

23 December 2005

  1. Install ifort 9.0 and module file. The installation script hung at the end (after telling me the install was successful), but a ^C didn't seem to hurt - ifort still works.

Notes

Flashing PROM

On pilatus, the CDROM is fs0: in the EFI shell:

EFI Shell version 1.10 [14.62]
Device mapping table
  fs0  : Acpi(PNP0A03,1)/Pci(1|0)/Ata(Primary,Master)/CDROM(Entry0)
so:
fs0:\> flash -V snprom433.bin
SGI PROM Flashing Utility
Version of prom image in file: 4.33  
  SGI SAL Version 4.33 rel050928 IP41 built 01:41:32 PM Sep 28, 2005
fs0:\>      
Now I do a flash -a snprom433.bin and off it goes. Seems to take 5-10 minutes, so bring a book.

Installation method

Text-based looks like a loser, at least with our serial console setup: Boot from SP2 CD1 and choose Install. That went for a long time and eventually stopped at "Console: colour dummy device 80x25". Guessing that was down to the accumulated weirdness of the various terminal layers (Terminal.app + ssh to intermediate host + screen there + ssh to console server), I eliminated screen from the equation after getting a wired connection. That crapped out too.

Scott Wilson from SGI recommends trying a VNC install. So I'll reset and try again with that. Details are on pp 13-14 of the Getting Started book. It will look like it has hung for a while, but it hasn't, and it eventually comes up to a text console saying "Make sure that CD number 1 is in your drive." Release Notes for SP2 say to put in the first CD from "SLES 9 GA". They mean the first disk of "SP1". If you leave in the SP2 boot CD it'll tell you it couldn't find the appropriate CD and go to Manual Installation.

At some point it will also totally bugger up your terminal settings! Getting Started says to use UTF-8 encoding. Nice. ssh to intermediate point and xterm -sb -en UTF-8 -geometry 130x40 -title "SGI console" -name "SGI console". Then use that xterm to ssh to console server.

Even if you tell it a vnc password to use when you boot the installer, it'll prompt you for one afterwards. You can then DHCP or manually set the machine's IP and such. Since I'd just updated DNS, I used manual. I couldn't get Chicken of VNC to connect, it kept wanting to go to port 5901, so I used Safari to connect to port 5801 like it said. That worked, but it seemed to hang when I tried to accept the license agreement - probably will have to make liberal use of the "Refresh" button.
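
On the two port numbers: by the usual VNC convention (stated here from general VNC behaviour, not from the SLES or SGI documentation), display N is served on TCP port 5900+N for native viewers and on 5800+N for the built-in Java/HTTP viewer. A tiny sketch, with vnc_ports as a made-up helper name:

```shell
#!/bin/sh
# Sketch only: print the two TCP ports that go with a VNC display number.
# vnc_ports is a hypothetical helper; the display defaults to 1.
vnc_ports() {
    display=${1:-1}
    echo "viewer=$((5900 + display)) http=$((5800 + display))"
}
```

For display 1 that gives 5901 and 5801, so a native viewer heading for 5901 is what one would expect. Why that connection failed while the browser on 5801 worked is not clear from these notes.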

Installing

  1. Choose New Installation.
  2. Partitioning: fix up what it wants to do to our partitions, choose Expert partitioning, then sda. Tell it to format sda3 with XFS, and reformat sda1 and sda2. sda1 should be mounted at /boot/efi and type FAT32. It'll apparently see each separate TP9100 as a different JBOD, I guess - 500gb apiece.
  3. Software: choose Full Install.
  4. Set time zone. System Clock is UTC.
  5. The Boot Loader Setup looks weird, but I'll leave it alone. It seems to be appending the same stuff I had used to boot (ie, with vncpasswd=).
  6. Default Runlevel: it wants to use 5 (start xdm). We don't have a graphics head, so change that to 3: "Full multiuser with network".
  7. Then it'll go "I have everything I need," and ask for "SUSE SLES 9 Service-Pack Version 2 CD 1" so I gave it the SP2 CD1. That seemed to satisfy it.
  8. Swap CDs. Again and again. Bring a book, 17 CDs x 5-10m per CD is a long time. Cute, it gives an ETA. It started at about 2 hours, I think, but doesn't seem to be very accurate as it dropped 20 minutes in about 10 minutes of real time.
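
The arithmetic behind "a long time" in step 8, as a quick sketch (cd_minutes is just an illustrative helper name):

```shell
#!/bin/sh
# Sketch only: total swap time in minutes for a CD count and a per-CD time.
cd_minutes() {
    echo $(( $1 * $2 ))
}
echo "best case:  $(cd_minutes 17 5) min"
echo "worst case: $(cd_minutes 17 10) min"
```

That is 85 to 170 minutes, so the installer's opening estimate of about 2 hours sits inside the range.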

Home directories and user accounts

We had a bunch of tng accounts left over from the SGI training. I made a /home/oldtng and stuffed them all in there. Good news: looks like we can copy and paste old /etc/shadow passwd entries and they'll work.

My old uid:gid was 504:504; my core region uid:gid is 1633:1633. I'll experiment with myself. It's not a currently allocated uid, so first, I'll create a user mpatters:

pilatus:~ # useradd -u 1633 -g 1633 mpatters
useradd: Unknown group `1633'.
pilatus:~ # 
That failed miserably. Do this:
pilatus:~ # groupadd -g 1633 mpatters
pilatus:~ # useradd -u 1633 -g 1633 mpatters
Now I have a /home/mpatters with the wrong permissions, and maybe files elsewhere on the filesystem too. I think doing something like find / -uid 504 is too time-consuming, considering we've got 50-odd users. Try it on just /fsys1:
pilatus:/fsys1 # find . -uid 504 -gid 504 -exec chown mpatters:mpatters {} \;

That works, except symlinks keep their old uid/gids. Ugh. I'll have to, one by one, recreate user accounts and then do chowns and chmods. Do the same thing in /scratch.
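
With 50-odd users, the per-account steps above could be generated rather than typed. The following is a minimal sketch under stated assumptions, not the procedure that was actually used: gen_reconcile_cmds is a made-up name, the map file format (name, old uid, new uid on each line) is an assumption and not necessarily the layout of pilatus_users, and the gid is assumed equal to the uid as in the mpatters example. It only prints commands, so they can be read before anything is run. The chown uses -h, which changes a symlink itself rather than its target and so addresses the symlink problem noted above.

```shell
#!/bin/sh
# Sketch only: print (not run) the commands to move accounts to new uids.
# Assumed map format, one account per line:  name old_uid new_uid
# The gid is assumed equal to the uid, as in the mpatters example.
#   $1        map file
#   $2 ...    directories to re-own, e.g. /fsys1 /scratch
gen_reconcile_cmds() {
    map=$1
    shift
    while read -r name old new; do
        case $name in ''|'#'*) continue ;; esac   # skip blanks and comments
        echo "groupadd -g $new $name"
        echo "useradd -u $new -g $new $name"
        for dir in "$@"; do
            # -h re-owns the symlink itself instead of following it
            echo "find $dir -xdev -uid $old -gid $old -exec chown -h $name:$name {} +"
        done
    done < "$map"
}
```

A line such as "mpatters 504 1633" with the directories /fsys1 and /scratch yields one groupadd, one useradd and two find commands. If any new uid collides with another account's old uid, the order of the chowns matters; this sketch does not handle that.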

-- MikePatterson - 19,20 Dec 2005

Topic revision: r11 - 2012-09-06 - BillInce
 