Scott Wilson from SGI sent me the following directions (I've edited somewhat for wikifying). They apply to systems without an L3 controller, which is the state pilatus is currently in. We're looking at making consort be an L3 controller for pilatus as well as flexor.
script(1) session.
ctrl-t (Escape to the L2)
?-001-L2>l2 (Select the L2)
?-001-L2>sel reset (Reset selections to defaults)
console input: 001c11 console0
console output: not filtered
?-001-L2>sel (ensure that the selection is correct)
known system consoles (nonpartitioned)
001c11 - L2 detected
current system console
console input: 001c11 console0
console output: not filtered
?-001-L2>cfg
L2 163.154.17.66: - 001 (LOCAL)
L1 163.154.17.66:0:0 - 001c11
L1 163.154.17.66:0:1 - 002i01
L1 163.154.17.66:0:5 - 001c14

NOTE: In systems with routers, each C-brick may show up twice. This is normal.
?-001-L2>pwr
001c10: power appears on
001c13: power appears on
001r16: power appears on

If some L1 controllers are missing, reseat the USB connections between the R-bricks and the L2 controller.
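If you have captured the cfg output in a file, checking it against the bricks you expect can be scripted. The following is a minimal sketch, not an SGI tool: it assumes the L1 lines look like the sample above, and the helper names and the expected-brick list are hypothetical. Because it collects brick IDs into a set, a C-brick that shows up twice is counted once.

```python
import re

# Matches lines such as "L1 163.154.17.66:0:0 - 001c11" (format assumed
# from the sample cfg output on this page).
L1_LINE = re.compile(r"^L1\s+\S+\s+-\s+(\S+)\s*$")


def l1_bricks(cfg_output):
    """Return the set of brick IDs that appear on L1 lines of cfg output."""
    bricks = set()
    for line in cfg_output.splitlines():
        match = L1_LINE.match(line.strip())
        if match:
            bricks.add(match.group(1))
    return bricks


def missing_bricks(cfg_output, expected):
    """Return the expected brick IDs that cfg did not report, sorted."""
    return sorted(set(expected) - l1_bricks(cfg_output))


SAMPLE_CFG = """\
L2 163.154.17.66: - 001 (LOCAL)
L1 163.154.17.66:0:0 - 001c11
L1 163.154.17.66:0:1 - 002i01
L1 163.154.17.66:0:5 - 001c14
"""

if __name__ == "__main__":
    # The expected list is hypothetical; 001r16 is included to show a miss.
    expected = ["001c11", "001c14", "002i01", "001r16"]
    print(missing_bricks(SAMPLE_CFG, expected))  # prints ['001r16']
```

Any brick reported as missing is a candidate for the USB reseat described above.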
ctrl-t (escape to the L2)
?-001-L2>l2 (select the L2)
?-001-L2>leds
001c11:
CPU 0A: 0x3c: SAL calling OS_INIT
CPU 0C: 0x3c: SAL calling OS_INIT
CPU 1A: 0x3c: SAL calling OS_INIT
CPU 1C: 0x3c: SAL calling OS_INIT
001c14:
CPU 0A: 0x3c: SAL calling OS_INIT
CPU 0C: 0x3c: SAL calling OS_INIT
CPU 1A: 0x3c: SAL calling OS_INIT
CPU 1C: 0x3c: SAL calling OS_INIT
?-001-L2>port (record the link LED status; a missing link can cause a hang)
001c11:
Port  Stat  Remote Pwr  Local Pwr   Link LED  SW LED
----  ----  ----------  ----------  --------  --------
A     0x0f  okay        okay        on        on
B     0x0f  okay        okay        on        on
C     0x0f  okay        okay        on        on
D     0x02  none        okay        off       off
001c14:
Port  Stat  Remote Pwr  Local Pwr   Link LED  SW LED
----  ----  ----------  ----------  --------  --------
A     0x0f  okay        okay        on        on
B     0x02  none        okay        off       off
C     0x0f  okay        okay        on        on
D     0x02  none        okay        off       off
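Recording the link LED status by hand is error-prone on a large system, so here is a rough sketch that lists every port whose Link LED is off in a saved copy of the port output. It assumes the column layout shown above; the function name and regular expressions are my own, not part of the L2 firmware. An "off" link may simply be an unconnected port, so treat the result as something to record, not as a diagnosis.

```python
import re

# Brick header such as "001c11:" and port rows such as
# "A 0x0f okay okay on on" (layout assumed from the sample above).
BRICK = re.compile(r"^(\d{3}[a-z]\d{2}):$")
PORT = re.compile(
    r"^([A-Z])\s+(0x[0-9a-fA-F]+)\s+(\S+)\s+(\S+)\s+(on|off)\s+(on|off)$"
)


def ports_without_link(port_output):
    """Return (brick, port, stat) tuples for ports whose Link LED is off."""
    down = []
    brick = None
    for raw in port_output.splitlines():
        line = raw.strip()
        header = BRICK.match(line)
        if header:
            brick = header.group(1)
            continue
        row = PORT.match(line)
        if row and row.group(5) == "off":
            down.append((brick, row.group(1), row.group(2)))
    return down


SAMPLE_PORT = """\
001c11:
Port  Stat  Remote Pwr  Local Pwr   Link LED  SW LED
----  ----  ----------  ----------  --------  --------
A     0x0f  okay        okay        on        on
D     0x02  none        okay        off       off
001c14:
Port  Stat  Remote Pwr  Local Pwr   Link LED  SW LED
----  ----  ----------  ----------  --------  --------
A     0x0f  okay        okay        on        on
B     0x02  none        okay        off       off
"""

if __name__ == "__main__":
    for brick, port, stat in ports_without_link(SAMPLE_PORT):
        print(f"{brick} port {port}: link off (stat {stat})")
```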
no response from 001c10 console, system not responding
If you get a kdb> prompt, you're at the kernel debugger - it crashed.
ESC KDB
You may see a message like "127 out of 128 cpus in kdb, waiting for the rest" - be patient if so.

If the system does not respond to the KDB command, issue an NMI from the L2 controller:
ctrl-t
?-001-L2>nmi
[0]kdb> cpu
Currently on cpu 0
Available cpus: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73*, 74, 75, 76, 77, 78, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127

The CPU(s) which are hung will be marked with a '*'. Issue an init to those CPU(s):
[0]kdb> init 73

Note: Issuing an init does not remove the '*' when you issue the "cpu" command again.
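With 128 CPUs the '*' is easy to miss in the list. As a small sketch, assuming the comma-separated "Available cpus:" format shown above (other kdb versions may print the list differently), the hung CPUs can be picked out of a saved console log like this. The function name is hypothetical and the sample is an abbreviated version of the list above.

```python
import re


def hung_cpus(cpu_output):
    """Return the CPU numbers marked with '*' in kdb 'cpu' output."""
    match = re.search(r"Available cpus:\s*(.*)", cpu_output, re.DOTALL)
    if not match:
        return []
    # A hung CPU is printed as its number followed directly by '*'.
    return [int(num) for num in re.findall(r"(\d+)\*", match.group(1))]


# Abbreviated sample; the real list on this page runs up to 127.
SAMPLE_CPU = """\
Currently on cpu 0
Available cpus: 0, 1, 2, 71, 72, 73*, 74, 75, 127
"""

if __name__ == "__main__":
    for cpu in hung_cpus(SAMPLE_CPU):
        # Print the kdb command to type for each hung CPU.
        print(f"init {cpu}")  # prints "init 73" for the sample
```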
ctrl-t
?-001-L2>fru capture
354/436 MMRs captured (status 0
ctrl-t
?-001-L2>fru print
kdb> prompt:
[0]kdb> sn2kdb

which should produce lots of output.
[0]kdb> sr c

You should see output similar to the following:
Start a Crash Dump (If Configured)
Dumping from interrupt handler !
Uncertain scenario - but will try my best

If the dump is successful, the system will reset, and you will see the following message as it boots:
Configuring system to save crash dumps [ OK ]
Generating crash report - this may take a few minutes

The crash dump will be saved in a numbered directory under /var/log/dump.
When opening a case, please provide copies of console output from the above procedures, /var/log/messages, crash dumps from /var/log/dump, and SAL records from /var/log/salinfo.
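Gathering those files can be scripted. This is only a sketch of one way to do it: the function name and archive location are made up, it must run with enough privilege to read the logs, and crash dumps can be large, so check free space first. Console output captured on another machine still has to be attached separately.

```python
import os
import tarfile

# Paths named above; adjust if your system logs elsewhere.
DEFAULT_PATHS = ("/var/log/messages", "/var/log/dump", "/var/log/salinfo")


def bundle_case_files(archive_path, paths=DEFAULT_PATHS):
    """Write a gzip tarball of the paths that exist.

    Returns (added, skipped) so the caller can see what was missing.
    Directories are added recursively.
    """
    added, skipped = [], []
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in paths:
            if os.path.exists(path):
                tar.add(path)
                added.append(path)
            else:
                skipped.append(path)
    return added, skipped


# Example call (hypothetical archive name):
#   added, skipped = bundle_case_files("/tmp/sgi-case.tar.gz")
```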
-- MikePatterson - 15 Apr 2005
Note that things seem to have changed somewhat since the upgrade to SuSE 10 - no sn2kdb, and the SGI docs seem to suggest sr d instead of sr c. Except here's the output:
[0]kdb> sr d
SysRq : Starting crash dump
[0] kdb> sr c
SysRq : HELP : loglevel0-8 reBoot Dump tErm Full kIll saK showMem Nice powerOff showPc unRaw Sync showTasks Unmount
[0]kdb> sr showtasks
SysRq : Emergency Sync
[0]kdb>
Doesn't seem to be as friendly.