PilatusCrashed (2005-04-15, MikePatterson)

First, make sure you know how to get at the console. (Hint: cts2.cscf, line 20.) You may or may not need the root password.

Scott Wilson from SGI sent me the following directions (I've edited somewhat for wikifying). They apply to systems without an L3 controller, which is the state pilatus is currently in. We're looking at making consort be an L3 controller for pilatus as well as flexor.

1 Start up a script(1) session.
2 Connect to the L2 controller (through the console server).
3 Ensure console selection is set up correctly:

              ctrl-t                  (Escape to the L2)
              ?-001-L2>l2             (Select the L2)
              ?-001-L2>sel reset      (Reset selections to defaults)
              console input: 001c11 console0
              console output: not filtered

              ?-001-L2>sel          (ensure that the selection is correct)

              known system consoles (nonpartitioned)

              001c11 - L2 detected

              current system console

              console input: 001c11 console0
              console output: not filtered

4 Here one would select the desired partition, if the system were partitioned. pilatus isn't.
5 Determine which L1 controllers are functional:

              ?-001-L2>cfg

              L2 163.154.17.66: - 001 (LOCAL)
              L1 163.154.17.66:0:0     - 001c11
              L1 163.154.17.66:0:1     - 002i01
              L1 163.154.17.66:0:5     - 001c14

NOTE: In systems with routers, each C-brick may show up twice. This is normal.

              ?-001-L2>pwr
              001c10:
              power appears on
              001c13:
              power appears on
              001r16:
              power appears on

If some L1 controllers are missing, reseat the USB connections between the R-bricks and the L2 controller.
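One way to spot a missing L1 controller is to compare the bricks listed by cfg against the bricks you expect to see. The following is a minimal Python sketch, assuming the cfg output has been copied out of the script(1) log into a string and looks like the sample above. The function name and the expected-brick list are made up for illustration.

```python
import re

# Matches lines such as "L1 163.154.17.66:0:1     - 002i01" in saved cfg output.
L1_LINE = re.compile(r"^\s*L1\s+\S+\s+-\s+(\S+)\s*$")


def missing_l1_bricks(cfg_output, expected):
    """Return the expected brick IDs that do not appear as L1 lines in cfg output."""
    seen = set()
    for line in cfg_output.splitlines():
        match = L1_LINE.match(line)
        if match:
            seen.add(match.group(1))
    return [brick for brick in expected if brick not in seen]


sample_cfg = """\
L2 163.154.17.66: - 001 (LOCAL)
L1 163.154.17.66:0:0     - 001c11
L1 163.154.17.66:0:1     - 002i01
L1 163.154.17.66:0:5     - 001c14
"""

# Hypothetical expected list; substitute the bricks actually installed.
print(missing_l1_bricks(sample_cfg, ["001c11", "001c14", "001r16"]))
```

Any brick the function reports is a candidate for the USB reseat described above.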

6 Record the LED and port status:

              ctrl-t                  (escape to the L2)
              ?-001-L2>l2             (select the L2)
              ?-001-L2>leds
              001c11:
              CPU 0A: 0x3c:   SAL calling OS_INIT
              CPU 0C: 0x3c:   SAL calling OS_INIT

              CPU 1A: 0x3c:   SAL calling OS_INIT
              CPU 1C: 0x3c:   SAL calling OS_INIT

              001c14:
              CPU 0A: 0x3c:   SAL calling OS_INIT
              CPU 0C: 0x3c:   SAL calling OS_INIT

              CPU 1A: 0x3c:   SAL calling OS_INIT
              CPU 1C: 0x3c:   SAL calling OS_INIT


              ?-001-L2>port           (record the link LED status;
                                       a missing link can cause a hang)
              001c11:
              Port Stat Remote Pwr Local Pwr  Link LED SW LED
              ---- ---- ---------- ---------- -------- --------
                 A 0x0f       okay       okay       on       on
                 B 0x0f       okay       okay       on       on
                 C 0x0f       okay       okay       on       on
                 D 0x02       none       okay      off      off
              001c14:
              Port Stat Remote Pwr Local Pwr  Link LED SW LED
              ---- ---- ---------- ---------- -------- --------
                 A 0x0f       okay       okay       on       on
                 B 0x02       none       okay      off      off
                 C 0x0f       okay       okay       on       on
                 D 0x02       none       okay      off      off
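Since a missing link can cause a hang, it may help to pull the ports with the link LED off out of the recorded port output. This is a rough Python sketch that assumes the six-column table layout shown above; the function name is made up, and the sketch has no idea which ports are supposed to be cabled, so an "off" port is only something to check against the actual wiring.

```python
def ports_without_link(port_output):
    """Return (brick, port) pairs whose Link LED column reads 'off'."""
    down = []
    brick = None
    for line in port_output.splitlines():
        fields = line.split()
        if len(fields) == 1 and fields[0].endswith(":"):
            # A brick heading such as "001c14:"
            brick = fields[0].rstrip(":")
        elif len(fields) == 6 and fields[1].startswith("0x"):
            # A data row: Port, Stat, Remote Pwr, Local Pwr, Link LED, SW LED
            port, link_led = fields[0], fields[4]
            if link_led == "off":
                down.append((brick, port))
    return down


sample_port = """\
001c14:
Port Stat Remote Pwr Local Pwr  Link LED SW LED
---- ---- ---------- ---------- -------- --------
   A 0x0f       okay       okay       on       on
   B 0x02       none       okay      off      off
   C 0x0f       okay       okay       on       on
   D 0x02       none       okay      off      off
"""

print(ports_without_link(sample_port))
```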

7 Determine whether the system has hung or crashed. Try to ping the system. Also hit ctrl-d to get to console mode and type '#' followed by Enter a few times; you should get some sort of response. A linefeed means the kernel is still running. You may also get something like "no response from 001c10 console, system not responding". If you get a kdb> prompt, you're at the kernel debugger: it crashed.
8 If you're not at the kernel debugger already, drop to it by typing ESC KDB. You may see a message like "127 out of 128 cpus in kdb, waiting for the rest"; be patient if so. If the system does not respond to the KDB command, issue an NMI from the L2 controller:

              ctrl-t
              ?-001-L2>nmi

9 If you see a message of the form "1 cpu is not in kdb, its state is unknown", issue the "cpu" command to determine which CPU(s) are hung:

              [0]kdb> cpu
              Currently on cpu 0
              Available cpus: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
              14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
              30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45,
              46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61,
              62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73*, 74, 75, 76, 77,
              78, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110,
              111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123,
              124, 125, 126, 127

The CPU(s) which are hung will be marked with a '*'. Issue an init to those CPU(s):

              [0]kdb> init 73

Note: Issuing an init does not remove the '*' when you issue the "cpu" command again.
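On a 128-CPU listing the '*' is easy to miss by eye. As a small Python sketch, assuming the cpu output has been saved from the script(1) log and uses the comma-separated format shown above, the starred CPUs can be extracted and turned into init commands. The function name is made up and the sample listing is abbreviated.

```python
import re


def hung_cpus(cpu_output):
    """Return the CPU numbers marked with '*' in kdb 'cpu' output."""
    # Only look at the text after "Available cpus:", so other numbers are ignored.
    _, _, listing = cpu_output.partition("Available cpus:")
    return [int(number) for number in re.findall(r"(\d+)\*", listing)]


# Abbreviated sample; a real listing runs to many more CPUs.
sample_cpu = """\
Currently on cpu 0
Available cpus: 0, 1, 2, 72, 73*, 74, 75
"""

for cpu in hung_cpus(sample_cpu):
    print(f"init {cpu}")
```

The printed lines are meant to be typed at the kdb> prompt by hand; the sketch does not talk to the console itself.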

10 Capture and print FRU information:

              ctrl-t
              ?-001-L2>fru capture
              354/436 MMRs captured (status 0
              ctrl-t
              ?-001-L2>fru print

11 Record information from KDB. At the kdb> prompt, run:

              [0]kdb> sn2kdb

This should produce lots of output.

12 Use LKCD to take a crash dump:

              [0]kdb> sr c

You should see output similar to the following:

                   Start a Crash Dump (If Configured)
                   Dumping from interrupt handler !
                   Uncertain scenario - but will try my best

If the dump is successful, the system will reset, and you will see the following message as it boots:

                   Configuring system to save crash dumps [ OK ]
                   Generating crash report - this may take a few minutes

The crash dump will be saved in a numbered directory under /var/log/dump.
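To find the dump that was just written, one option is to pick the highest-numbered directory. This is a Python sketch under the assumption that a higher number means a more recent dump, which the text above does not confirm; the function name is made up.

```python
import os


def latest_dump_dir(base="/var/log/dump"):
    """Return the path of the highest-numbered subdirectory of base, or None."""
    numbered = [
        name
        for name in os.listdir(base)
        if name.isdigit() and os.path.isdir(os.path.join(base, name))
    ]
    if not numbered:
        return None
    # Compare as integers so that "10" sorts after "9".
    return os.path.join(base, max(numbered, key=int))
```

Checking the directory's timestamp against the time of the crash is a sensible cross-check before sending anything off.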

When opening a case, please provide copies of console output from the above procedures, /var/log/messages, crash dumps from /var/log/dump, and SAL records from /var/log/salinfo.
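Gathering those files into one archive could look roughly like the Python sketch below. The three paths come from the paragraph above; the function name, the archive name, and the console log file name in the usage line are assumptions for illustration. Crash dumps can be very large, so check free space first.

```python
import os
import tarfile

# Paths named in the paragraph above.
CASE_PATHS = ["/var/log/messages", "/var/log/dump", "/var/log/salinfo"]


def bundle_case_files(archive_path, paths=CASE_PATHS, extra=()):
    """Write a gzipped tar of every path that exists; return the paths added."""
    added = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in list(paths) + list(extra):
            if os.path.exists(path):
                tar.add(path)  # directories are added recursively
                added.append(path)
    return added
```

A call such as bundle_case_files("/tmp/pilatus-case.tar.gz", extra=["console.log"]) would add a saved script(1) log (here given the hypothetical name console.log) alongside the system files. Reading /var/log/messages will probably require root.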

-- MikePatterson - 15 Apr 2005

Note that things seem to have changed somewhat since the upgrade to SuSE 10: there is no sn2kdb, and the SGI docs seem to suggest sr d instead of sr c. However, here's the output:

              [0]kdb> sr d
              SysRq : Starting crash dump
              [0]kdb> sr c
              SysRq : HELP : loglevel0-8 reBoot Dump tErm Full kIll saK showMem Nice powerOff showPc unRaw Sync showTasks Unmount
              [0]kdb> sr showtasks
              SysRq : Emergency Sync
              [0]kdb>

Doesn't seem to be as friendly: sr c now only prints the SysRq help, and sr showtasks appears to have been read as just s (Sync).

Topic revision: r1 - 2005-04-15 - MikePatterson

Information in this area is meant for use by CSCF staff and is not official documentation, but anybody who is interested is welcome to use it if they find it useful.
