First, make sure you know how to get at the console. (Hint: cts2.cscf, line 20.) You may or may not need the root password.
Scott Wilson from SGI sent me the following directions (I've edited somewhat for wikifying). They apply to systems without an L3 controller, which is the state pilatus is
currently in. We're looking at making consort be an L3 controller for pilatus as well as flexor.
- 1 Start up a script(1) session.
- 2 Connect to the L2 controller (through the console server).
- 3 Ensure console selection is set up correctly:
ctrl-t (Escape to the L2)
?-001-L2>l2 (Select the L2)
?-001-L2>sel reset (Reset selections to defaults)
console input: 001c11 console0
console output: not filtered
?-001-L2>sel (ensure that the selection is correct)
known system consoles (nonpartitioned)
001c11 - L2 detected
current system console
console input: 001c11 console0
console output: not filtered
- 4 Here one would select the desired partition, if the system were partitioned. pilatus isn't.
- 5 Determine which L1 controllers are functional:
?-001-L2>cfg
L2 163.154.17.66: - 001 (LOCAL)
L1 163.154.17.66:0:0 - 001c11
L1 163.154.17.66:0:1 - 002i01
L1 163.154.17.66:0:5 - 001c14
NOTE: In systems with routers, each C-brick may show up twice. This is normal.
?-001-L2>pwr
001c10:
power appears on
001c13:
power appears on
001r16:
power appears on
If some L1 controllers are missing, reseat the USB connections between the R-bricks and the L2 controller.
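The check in step 5 can be made less error-prone by comparing the cfg listing against a list of the bricks you expect to see. The following is only a sketch: it assumes you have saved the cfg output from your script(1) log as text, and the EXPECTED inventory and the function names are made up for illustration. Using a set also takes care of C-bricks that show up twice.

```python
import re

# Matches lines such as "L1 163.154.17.66:0:1 - 002i01" from the L2 cfg command.
L1_LINE = re.compile(r"^\s*L1\s+\S+\s+-\s+(\S+)\s*$")

def bricks_from_cfg(cfg_output):
    """Return the set of brick IDs that the L2 lists as L1 controllers."""
    bricks = set()
    for line in cfg_output.splitlines():
        match = L1_LINE.match(line)
        if match:
            bricks.add(match.group(1))
    return bricks

def missing_bricks(cfg_output, expected):
    """Return the expected brick IDs that are absent from the cfg output, sorted."""
    return sorted(set(expected) - bricks_from_cfg(cfg_output))

# The cfg output shown in step 5.
SAMPLE_CFG = """\
L2 163.154.17.66: - 001 (LOCAL)
L1 163.154.17.66:0:0 - 001c11
L1 163.154.17.66:0:1 - 002i01
L1 163.154.17.66:0:5 - 001c14
"""

# EXPECTED is a hypothetical inventory; replace it with the real brick list.
EXPECTED = ["001c11", "001c14", "002i01", "001r16"]

print(missing_bricks(SAMPLE_CFG, EXPECTED))  # prints ['001r16']
```

Any brick it reports as missing is a candidate for the USB reseat described above.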
- 6 Record the LED and port status:
ctrl-t (escape to the L2)
?-001-L2>l2 (select the L2)
?-001-L2>leds
001c11:
CPU 0A: 0x3c: SAL calling OS_INIT
CPU 0C: 0x3c: SAL calling OS_INIT
CPU 1A: 0x3c: SAL calling OS_INIT
CPU 1C: 0x3c: SAL calling OS_INIT
001c14:
CPU 0A: 0x3c: SAL calling OS_INIT
CPU 0C: 0x3c: SAL calling OS_INIT
CPU 1A: 0x3c: SAL calling OS_INIT
CPU 1C: 0x3c: SAL calling OS_INIT
?-001-L2>port (record the link LED status; a missing link can cause a hang)
001c11:
Port Stat Remote Pwr Local Pwr Link LED SW LED
---- ---- ---------- ---------- -------- --------
A 0x0f okay okay on on
B 0x0f okay okay on on
C 0x0f okay okay on on
D 0x02 none okay off off
001c14:
Port Stat Remote Pwr Local Pwr Link LED SW LED
---- ---- ---------- ---------- -------- --------
A 0x0f okay okay on on
B 0x02 none okay off off
C 0x0f okay okay on on
D 0x02 none okay off off
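If you want a quick summary of the port tables from step 6, something like the sketch below will pick out the ports whose link LED is off. It assumes the port output has been saved as text in the column layout shown above; the function name is made up for illustration. An "off" entry may simply be an unused port, so compare the result with a recording taken while the system was healthy rather than treating every entry as a fault.

```python
def ports_without_link(port_output):
    """Return (brick, port) pairs whose Link LED column reads 'off'."""
    down = []
    brick = None
    for line in port_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # A brick header looks like "001c11:".
        if len(fields) == 1 and fields[0].endswith(":"):
            brick = fields[0].rstrip(":")
            continue
        # Data rows have six columns: port, status, remote power,
        # local power, link LED, switch LED. The status starts with "0x",
        # which excludes the column header and the dashed separator.
        if len(fields) == 6 and fields[1].startswith("0x"):
            port, _stat, _remote, _local, link_led, _sw_led = fields
            if link_led == "off":
                down.append((brick, port))
    return down

# The port output shown in step 6.
SAMPLE_PORT = """\
001c11:
Port Stat Remote Pwr Local Pwr Link LED SW LED
---- ---- ---------- ---------- -------- --------
A 0x0f okay okay on on
B 0x0f okay okay on on
C 0x0f okay okay on on
D 0x02 none okay off off
001c14:
Port Stat Remote Pwr Local Pwr Link LED SW LED
---- ---- ---------- ---------- -------- --------
A 0x0f okay okay on on
B 0x02 none okay off off
C 0x0f okay okay on on
D 0x02 none okay off off
"""

for brick, port in ports_without_link(SAMPLE_PORT):
    print(brick, port)
```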
- 7 Determine whether the system has hung or crashed. Try to ping the system. Also hit ctrl-d to get to console mode and type '#' followed by Enter a few times; you should get some sort of response. A linefeed means the kernel is still running. You may also get a message like this:
no response from 001c10 console, system not responding
If you get a kdb> prompt, you're at the kernel debugger: the system has crashed.
- 8 If you're not at the kernel debugger already, drop to it by typing ESC KDB. You may see a message like "127 out of 128 cpus in kdb, waiting for the rest"; be patient if so. If the system does not respond to the KDB command, issue an NMI from the L2 controller:
ctrl-t
?-001-L2>nmi
- 9 If you see a message of the form "1 cpu is not in kdb, its state is unknown", issue the "cpu" command to determine which CPU(s) are hung:
[0]kdb> cpu
Currently on cpu 0
Available cpus: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66,
67, 68, 69, 70, 71, 72, 73*, 74, 75, 76, 77, 78, 99, 100, 101, 102,
103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115,
116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127
The CPU(s) which are hung will be marked with a '*'. Issue an init to those CPU(s):
[0]kdb> init 73
Note: Issuing an init does not remove the '*' when you issue the "cpu" command again.
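On a 128-CPU machine it is easy to miss a '*' in that list. Assuming the cpu output has been copied out of the script(1) log, a sketch like the following finds the marked CPUs and prints the matching init commands for you to type at the kdb prompt. The function name and the shortened sample are for illustration only.

```python
import re

def hung_cpus(cpu_output):
    """Return the CPU numbers that kdb marks with '*' in its 'Available cpus' list."""
    # Keep only the text after "Available cpus:"; the list may wrap over several lines.
    _, marker, cpu_list = cpu_output.partition("Available cpus:")
    if not marker:
        return []
    return [int(number) for number in re.findall(r"(\d+)\*", cpu_list)]

# A shortened version of the cpu output shown in step 9.
SAMPLE_CPU = """\
[0]kdb> cpu
Currently on cpu 0
Available cpus: 0, 1, 2, 3, 72, 73*, 74, 75,
76, 77, 78
"""

for cpu in hung_cpus(SAMPLE_CPU):
    print(f"init {cpu}")  # prints "init 73"
```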
- 10 Capture and print FRU information:
ctrl-t
?-001-L2>fru capture
354/436 MMRs captured (status 0
ctrl-t
?-001-L2>fru print
- 11 Record information from KDB. At the kdb> prompt:
[0]kdb> sn2kdb
This should produce a lot of output.
- 12 Use LKCD to take a crash dump.
[0]kdb> sr c
You should see output similar to the following:
Start a Crash Dump (If Configured)
Dumping from interrupt handler !
Uncertain scenario - but will try my best
If the dump is successful, the system will reset, and you will see the following message as it boots:
Configuring system to save crash dumps [ OK ]
Generating crash report - this may take a few minutes
The crash dump will be saved in a numbered directory under /var/log/dump.
When opening a case, please provide copies of the console output from the above procedures, /var/log/messages, crash dumps from /var/log/dump, and SAL records from /var/log/salinfo.
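Gathering those files can be scripted once the system is back up. This is a minimal sketch using Python's standard tarfile module; the archive name and the function name are hypothetical, and you would add the path of your script(1) log to the list yourself. Reading /var/log usually needs root, and crash dumps can be large, so check free space first.

```python
import os
import tarfile

# The three locations named in the text above.
DEFAULT_PATHS = ["/var/log/messages", "/var/log/dump", "/var/log/salinfo"]

def bundle_case_files(archive_path, paths):
    """Write a gzip-compressed tar archive of the paths that exist; return those paths."""
    included = []
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in paths:
            if os.path.exists(path):
                archive.add(path)  # directories are added recursively
                included.append(path)
    return included

# Example call (hypothetical archive name), left commented out on purpose:
# bundle_case_files("pilatus-case.tar.gz", DEFAULT_PATHS + ["console.log"])
```

Paths that do not exist are skipped instead of raising an error, and the return value tells you what actually went into the archive.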
-- MikePatterson - 15 Apr 2005
Note that things seem to have changed somewhat since the upgrade to SuSE 10: there is no sn2kdb, and the SGI docs seem to suggest sr d instead of sr c. Except here's the output:
[0]kdb> sr d
SysRq : Starting crash dump
[0]kdb> sr c
SysRq : HELP : loglevel0-8 reBoot Dump tErm Full kIll saK showMem Nice powerOff showPc unRaw Sync showTasks Unmount
[0]kdb> sr showtasks
SysRq : Emergency Sync
[0]kdb>
Doesn't seem to be as friendly.