-- MikePatterson - 16 Dec 2004

Things we definitely need to do:

  1. ensure all disks are in use
  2. ensure user data integrity

-- MikePatterson - 20 Dec 2004

First, backup:

  1. From mudge: rsync -a --progress --exclude '/tmp/*' --exclude '/proc/*' / root@tumbo.cs:/fsys2/ (I've already verified that /fsys2 on tumbo has enough room to hold all of mudge.) This gives us both a backup and a "live" copy of the data from /etc, so I can duplicate the old setup. For instance, mudge uses softbase as a mail handler... Except this wasn't working at first, because for some reason tumbo had decided to hate mudge. After I tried different methods three or four times apiece, it suddenly started working again, which is rather distressing. I logged in as mpatters, ran screen, then suw. Once the rsync is complete, run it again to catch any files that changed during the first pass.
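The two-pass backup can be sketched as a small wrapper. This is only an illustration: the DRY_RUN switch and the backup_pass name are my own additions, not part of the original procedure, and with the default DRY_RUN=1 it only prints the command instead of running it.

```shell
#!/bin/sh
# Sketch: two-pass backup of a live machine with rsync.
# DEST is the target used in the log; DRY_RUN and backup_pass are hypothetical helpers.
DEST=${DEST:-root@tumbo.cs:/fsys2/}
DRY_RUN=${DRY_RUN:-1}

backup_pass() {
    # -a preserves ownership, permissions and times.
    # The excludes skip the contents of /tmp and /proc but keep the directories.
    set -- rsync -a --progress --exclude '/tmp/*' --exclude '/proc/*' / "$DEST"
    if [ "$DRY_RUN" = 1 ]; then
        echo "$@"      # show what would run
    else
        "$@"
    fi
}

backup_pass   # first pass: the bulk copy
backup_pass   # second pass: only files that changed during the first
```

The second pass is cheap because rsync skips files whose size and modification time already match on the destination.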

While that runs, shut down unneeded machines in the mudge region (gooch, quadra). First do "dpkg --get-selections > quadrapkgs" on quadra, just to get the latest package selections. That file will be over on tumbo thanks to the backup, so we'll be able to get at it when we restore files.
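Before powering quadra off it may be worth a quick sanity check that the saved selections file is not empty or truncated. A minimal sketch, assuming the usual "package, whitespace, state" layout of dpkg --get-selections output; count_state and the sample package list are hypothetical stand-ins for the real quadrapkgs.

```shell
#!/bin/sh
# Sketch: count entries per state in a saved `dpkg --get-selections` file.
# Each line is "<package><whitespace><state>"; states are install, hold, deinstall, purge.

count_state() {
    # $1 = selections file, $2 = state to count
    awk -v want="$2" '$2 == want { n++ } END { print n + 0 }' "$1"
}

# Hypothetical sample standing in for quadrapkgs.
sample=$(mktemp)
printf '%s\t%s\n' rsync install mdadm install exim4 deinstall > "$sample"

echo "install:   $(count_state "$sample" install)"
echo "deinstall: $(count_state "$sample" deinstall)"
rm -f "$sample"
```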

  1. Now ensure synchronization of u. On tumbo: "umount /u" (just to be safe). Then on mudge: "rsync -a --progress u/ tumbo:/fsys2/u"

We got:

2>(root)@mudge[118]% rsync -a --progress u/ tumbo:/fsys2/u
         138 100%    0.00kB/s    0:00:00
        1920 100%    0.00kB/s    0:00:00
        1953 100%    0.00kB/s    0:00:00
        7724 100%    0.00kB/s    0:00:00
2>(root)@mudge[119]% 

That's a reasonable number of files to have changed, given that I was faffing about as my own user on several of the machines during the first rsync.

Now, I'll leave tumbo alive while mudge is reinstalled. That way I can refer back if necessary. It may be safest to leave tumbo's network off while I'm in the machine room and able to physically access the console.

Just to be sure I also put a copy of mudge's dpkg --get-selections on mpatters@torres.

Like GoochSargeUpgrade, I needed to start out with a 2.4 kernel to make it see the onboard SCSI controller.

It would appear we were misled: there's no hardware RAID controller in mudge, just two SCSI controllers (one of which isn't even being used, nice). So I'll have to cheat and use software RAID for the /u partition. 5 x 36GB(ish) disks.

  • sda will be /, /var, /tmp, and swap.
    • / - 13gb - sda1
    • /var - 13gb - sda5
    • /tmp - 2gb - sda6
    • swap - 8gb - sda7-9 (4gb physical ram, we'll need 4 x 2gb)

This leaves 427.7MB of free space, according to the sarge partitioner. I'll just leave that as-is for now. Next, let's use those other disks. I told the partitioner to use each as a RAID volume. It did some stuff and then just started installing the base system! That's ok, I didn't want to actually USE those partitions or anything. I'm not clear on the difference between what the sarge installer calls "Software RAID" and "LVM" anyway. Two ways of accomplishing the same thing? Well, worry about that later.

Copy the old ssh host keys back from tumbo. Also install rsync, rsh-client, and rsh-server; I know I'm going to need those. Also copy over the kernel-image I built on quadra. Go into dselect, install module-init-tools and everything else it wants.

mdadm package asked about starting RAID volumes automatically (I told it no). I told it to email alerts to root. Install the kernel-image I made, then reboot.

  1. Copy sources.list from tumbo
  2. apt-get update
  3. dpkg --purge --force-all exim4 exim4-config exim4-daemon-light exim4-base
  4. userdel Debian-exim
  5. apt-get -f dist-upgrade (now we have exim), configure it:
    • listen only on 127.0.0.1
    • host mailname mudge.cs.uwaterloo.ca
  6. before I forget, copy /etc/exports and /etc/ssh/*key* back from tumbo
  7. in /etc: "sed -e 's/math/cs/g' exports > whappa && mv whappa exports"
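Note that 's/math/cs/g' replaces every occurrence of "math", including inside path names. A slightly more cautious sketch follows, assuming the intent was to move hostnames from the math domain to the cs domain. The exports content is invented for illustration (the real file is not shown here), and rewrite_domain is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: rewrite only the ".math." domain component, and review before replacing.

rewrite_domain() {
    # Matching ".math." leaves unrelated strings that merely contain "math" alone.
    sed -e 's/\.math\./.cs./g' "$1"
}

work=$(mktemp -d)
# Hypothetical exports content, for illustration only.
printf '%s\n' '/u tumbo.math.uwaterloo.ca(rw)' \
              '/fsys1/mathlib gooch.math.uwaterloo.ca(ro)' > "$work/exports"

rewrite_domain "$work/exports" > "$work/exports.new"
diff "$work/exports" "$work/exports.new" || true   # look at the change first
mv "$work/exports.new" "$work/exports"
cat "$work/exports"
rm -rf "$work"
```

In the sample, the path /fsys1/mathlib is left alone, whereas the unanchored pattern would have turned it into /fsys1/cslib.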

Guess what I forgot - /fsys1 partition. sigh. OK, we'll just make that a directory off of / - I was thinking about xhier when I made the partition that big, honest. So, create the rest of the xhier tree. Won't use it right away. Now I need to figure out software RAID. I want to RAID-0 sd{b,c,d,e}.

Looks like this is what I want to do:

mudge:~# mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
(note that ought to be all one line)

Indeed:

mudge:~# cat /proc/mdstat 
Personalities : [raid0] 
md0 : active raid0 sde1[3] sdd1[2] sdc1[1] sdb1[0]
      142238976 blocks 64k chunks
      
unused devices: <none>
mudge:~# 
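As a cross-check on the size, the block count in /proc/mdstat is in 1 KiB units, so it can be converted by hand or with a few lines of shell. A rough sketch: md_blocks is a hypothetical helper, and it reads a saved copy of the output above rather than the live /proc/mdstat.

```shell
#!/bin/sh
# Sketch: pull an array's size out of mdstat-style text and convert units.

md_blocks() {
    # $1 = array name (e.g. md0), $2 = file in /proc/mdstat format.
    # The size is the first field on the line following the "md0 : ..." line.
    awk -v dev="$1" '$1 == dev { getline; print $1 }' "$2"
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
Personalities : [raid0]
md0 : active raid0 sde1[3] sdd1[2] sdc1[1] sdb1[0]
      142238976 blocks 64k chunks

unused devices: <none>
EOF

blocks=$(md_blocks md0 "$sample")     # 1 KiB blocks
echo "total GiB (rounded down): $((blocks / 1024 / 1024))"
echo "blocks per member:        $((blocks / 4))"
rm -f "$sample"
```

That works out to roughly 135 GiB in total, or about 36 GB (decimal) per member, which matches four of the 36GB(ish) disks striped together.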

So now I can make a filesystem on it: "mke2fs -j /dev/md0". Hopefully the defaults aren't too slow.

mudge:~# df -h /u
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0              134G   33M  127G   1% /u
mudge:~# 
And that's about what I expect to see. Now I ought to be able to copy the user data back from tumbo. "rsync -a --progress root@tumbo.cs:/fsys2/u /u". Whups, that wasn't quite what I wanted: with no trailing slash on the source, rsync copies the directory itself, so it's putting everything under /u/u. Oh well, I just mv'ed the contents of /u/u up one level and rmdir'ed /u/u.
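The trailing-slash rule is easy to forget, so here is a small model of it. This is only a sketch of where rsync places a copied directory, not rsync itself; landing_dir is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: predict where the contents of a source directory end up.
# Source ends in "/"  -> contents go directly into the destination.
# No trailing slash   -> the directory itself is created inside the destination.

landing_dir() {
    src=$1
    dest=${2%/}                    # drop one trailing slash from the destination
    case $src in
        */) printf '%s\n' "${dest:-/}" ;;
        *)  printf '%s/%s\n' "$dest" "${src##*/}" ;;
    esac
}

landing_dir root@tumbo.cs:/fsys2/u  /u    # what was run
landing_dir root@tumbo.cs:/fsys2/u/ /u    # what was wanted
```

The same rule explains the /fsys1/watform/watform surprise later in this log.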

  1. debian31: xh-first-time -v mudge.cs
  2. forgot to change the crontab gid to none (sigh)
  3. adduser --home /software/accounts/home --shell /bin/csh --no-create-home --uid 1000 --gid 103 --disabled-login accounts
  4. xh-install everything is clean, xh-local-maintenance is too
  5. debian31: xh-dist2 mudge.cs cf-specific cscf-specific
  6. (stop for a pause from xhier: apt-get install nfs-kernel-server, it complains about /etc/exports but runs anyway, and tumbo can mount /u again, just for a test)
  7. debian31: xh-dist2 mudge.cs accounts accounts_client
  8. cd /etc && make backup copies {passwd,shadow,group}.mpatters (this will give us more-or-less stock files in case something goes aspodin)
  9. copy /etc/{passwd,shadow,group} from tumbo
  10. xh-install all the packages that didn't get installed (except for accounts and accounts_client)
  11. fix up stuff for accounts package
  12. xh-install accounts (and it complained about stuff in /etc/passwd)
  13. xh-install accounts_client
  14. on capo, change cf-specific files (cscf_sponsored, cscf_clients) and dist to debian31 and sun580/cscf.cs

Hrm, one thing I noticed: no oddities with xh-first-time barfing a few times and then mysteriously working. It Just Worked the first time.

Now the accounts software is busted. :( Bill and I did a bunch of stuff on cscf.cs and now everything that's broken is broken on mudge.

  1. Copy old /software/accounts/data/users back from tumbo.
  2. xh-dist2 debian-1 package (should have done this before, argh) and xh-local-maintenance on mudge, and now xh-local-maintenance is clean
  3. cscf.cs: accounts-client host=mudge.cs (for good luck - looks good!)
  4. dist2 graveyard man nameserver-3 pine_config resolv-config security service-request watform
  5. cd /fsys1/watform && rsync root@tumbo:/fsys2/fsys1/watform . (damn, it made /fsys1/watform/watform, sigh, fix that)
  6. cd /u/mpatterson && dpkg --set-selections < quadrapkgs
  7. apt-get -f dselect-upgrade (installed 120 new packages, removed none, hurray)
  8. rsync the mail back over from tumbo
  9. xh-local-maintenance

Hrm, ssh doesn't seem to honour my client key any more. I wonder if sshd_config is up to snuff. Ah, turns out it was permissions on my home directory. Fixed some sshd_config options too though.
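For next time: with StrictModes enabled, sshd ignores authorized keys when the home directory, ~/.ssh, or the key file is writable by group or others. The log doesn't record exactly which permission bit was wrong, so treat the following as a guess at the usual cause. loose_paths is a hypothetical helper that just lists offending paths.

```shell
#!/bin/sh
# Sketch: report paths that are group- or world-writable.

loose_paths() {
    for p in "$@"; do
        [ -e "$p" ] || continue
        # -prune stops find from descending; only the named path is tested.
        find "$p" -prune \( -perm -020 -o -perm -002 \) -print
    done
}

loose_paths "$HOME" "$HOME/.ssh" "$HOME/.ssh/authorized_keys"
```

Anything it prints can be tightened with "chmod go-w" on that path.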

  1. put super_users back into regional config - yup, I can suw now on both mudge and tumbo.


Post-mortem (25 April 2005):

The above procedure did not create an /etc/mdadm/mdadm.conf, nor did it create an entry in /etc/fstab for /u. As a result, when the machine was rebooted it did not mount user home directories (nor did it know how to configure the RAID). I created mdadm.conf by hand and used mdadm to re-assemble the RAID. See ST#48324.
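For reference, one way the two missing pieces could have been put in place right after creating the array. This is a sketch rather than what was actually done (mdadm.conf was written by hand): the DRY_RUN switch and fstab_line helper are hypothetical, the mount options are an assumption, and the filesystem type is ext3 because mke2fs -j was used.

```shell
#!/bin/sh
# Sketch: make the array definition and its mount survive a reboot.
# DRY_RUN=1 (the default here) only prints what would be done.
DRY_RUN=${DRY_RUN:-1}

fstab_line() {
    # $1 = device, $2 = mount point, $3 = filesystem type
    printf '%s\t%s\t%s\tdefaults\t0\t2\n' "$1" "$2" "$3"
}

if [ "$DRY_RUN" = 1 ]; then
    echo "would run: mdadm --detail --scan >> /etc/mdadm/mdadm.conf"
    echo "would append to /etc/fstab:"
    fstab_line /dev/md0 /u ext3
else
    # Record the running array so it can be assembled at boot.
    mdadm --detail --scan >> /etc/mdadm/mdadm.conf
    fstab_line /dev/md0 /u ext3 >> /etc/fstab
fi
```

A test reboot while tumbo still held the backup copy would have caught the omission early.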

Topic revision: r6 - 2012-09-06 - BillInce
 