First, make sure you know how to get at the console. (Hint: cts2.cscf, line 20.) You may or may not need the root password.

Scott Wilson from SGI sent me the following directions (lightly edited for the wiki). They apply to systems without an L3 controller, which is the state pilatus is currently in. We're looking at making consort an L3 controller for pilatus as well as flexor.

  • 1 Start up a script(1) session.
  • 2 Connect to the L2 controller (through the console server).
  • 3 Ensure console selection is set up correctly:
              ctrl-t                  (Escape to the L2)
              ?-001-L2>l2             (Select the L2)
              ?-001-L2>sel reset      (Reset selections to defaults)
              console input: 001c11 console0
              console output: not filtered

              ?-001-L2>sel          (ensure that the selection is correct)

              known system consoles (nonpartitioned)

              001c11 - L2 detected

              current system console

              console input: 001c11 console0
              console output: not filtered
  • 4 Here one would select the desired partition, if the system were partitioned. pilatus isn't.
  • 5 Determine which L1 controllers are functional:
              ?-001-L2>cfg

              L2 163.154.17.66: - 001 (LOCAL)
              L1 163.154.17.66:0:0     - 001c11
              L1 163.154.17.66:0:1     - 002i01
              L1 163.154.17.66:0:5     - 001c14
NOTE: In systems with routers, each C-brick may show up twice. This is normal.
              ?-001-L2>pwr
              001c10:
              power appears on
              001c13:
              power appears on
              001r16:
              power appears on
If some L1 controllers are missing, reseat the USB connections between the R-bricks and the L2 controller.
  • 6 Record the LED and port status:
              ctrl-t                  (escape to the L2)
              ?-001-L2>l2             (select the L2)
              ?-001-L2>leds
              001c11:
              CPU 0A: 0x3c:   SAL calling OS_INIT
              CPU 0C: 0x3c:   SAL calling OS_INIT

              CPU 1A: 0x3c:   SAL calling OS_INIT
              CPU 1C: 0x3c:   SAL calling OS_INIT

              001c14:
              CPU 0A: 0x3c:   SAL calling OS_INIT
              CPU 0C: 0x3c:   SAL calling OS_INIT

              CPU 1A: 0x3c:   SAL calling OS_INIT
              CPU 1C: 0x3c:   SAL calling OS_INIT


              ?-001-L2>port           (record the link LED status;
                                       a missing link can cause a hang)
              001c11:
              Port Stat Remote Pwr Local Pwr  Link LED SW LED
              ---- ---- ---------- ---------- -------- --------
                 A 0x0f       okay       okay       on       on
                 B 0x0f       okay       okay       on       on
                 C 0x0f       okay       okay       on       on
                 D 0x02       none       okay      off      off
              001c14:
              Port Stat Remote Pwr Local Pwr  Link LED SW LED
              ---- ---- ---------- ---------- -------- --------
                 A 0x0f       okay       okay       on       on
                 B 0x02       none       okay      off      off
                 C 0x0f       okay       okay       on       on
                 D 0x02       none       okay      off      off
  • 7 Determine whether the system has hung or crashed. Try to ping the system. Also hit ctrl-d to get to console mode and type '#' followed by Enter a few times; you should get some sort of response. A linefeed means the kernel is still running. You may instead get a message like "no response from 001c10 console, system not responding". If you get a kdb> prompt, you're at the kernel debugger: the system has crashed.
  • 8 If you're not at the kernel debugger already, drop to it by typing ESC KDB. You may see a message like "127 out of 128 cpus in kdb, waiting for the rest"; be patient if so. If the system does not respond to the KDB command, issue an NMI from the L2 controller:
              ctrl-t
              ?-001-L2>nmi
  • 9 If you see a message of the form "1 cpu is not in kdb, its state is unknown", issue the "cpu" command to determine which CPU(s) are hung:
              [0]kdb> cpu
              Currently on cpu 0
              Available cpus: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
              14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
              30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45,
              46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61,
              62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73*, 74, 75, 76, 77,
              78, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110,
              111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123,
              124, 125, 126, 127
The CPU(s) which are hung will be marked with a '*'. Issue an init to those CPU(s):
              [0]kdb> init 73
Note: Issuing an init does not remove the '*' when you issue the "cpu" command again.
  • 10 Capture and print FRU information:
              ctrl-t
              ?-001-L2>fru capture
              354/436 MMRs captured (status 0
              ctrl-t
              ?-001-L2>fru print
  • 11 Record information from KDB. At the kdb> prompt:
              [0]kdb> sn2kdb
which should produce lots of output.
  • 12 Use LKCD to take a crash dump.
               [0]kdb> sr c
You should see output similar to the following:
                   Start a Crash Dump (If Configured)
                   Dumping from interrupt handler !
                   Uncertain scenario - but will try my best
If the dump is successful, the system will reset, and you will see the following message as it boots:
                   Configuring system to save crash dumps [ OK ]
                   Generating crash report - this may take a few minutes
The crash dump will be saved in a numbered directory under /var/log/dump.
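
Since the whole session is being recorded with script(1), the hung CPUs from step 9 can also be pulled out of the saved log afterwards. The following is only a minimal sketch: the function name hung_cpus is made up for illustration, and it assumes the "Available cpus:" list looks like the comma-separated sample above (possibly wrapped across lines), not any other format kdb may print.

```python
import re
from typing import List


def hung_cpus(kdb_output: str) -> List[int]:
    """Return the CPU numbers marked with '*' in kdb 'cpu' output.

    Assumes the comma-separated format shown in step 9; this is an
    illustrative helper, not part of kdb or any SGI tool.
    """
    # Collapse all whitespace so a list wrapped across lines reads as one line.
    text = " ".join(kdb_output.split())
    match = re.search(r"Available cpus:(.*)", text)
    if not match:
        return []
    # A hung CPU appears as a number immediately followed by '*'.
    return [int(n) for n in re.findall(r"(\d+)\*", match.group(1))]


sample = """[0]kdb> cpu
Currently on cpu 0
Available cpus: 0, 1, 2, 72,
73*, 74, 75"""
print(hung_cpus(sample))
```

Each number it returns is a candidate for the init command in step 9.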

When opening a case, please provide copies of console output from the above procedures, /var/log/messages, crash dumps from /var/log/dump, and SAL records from /var/log/salinfo.
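
Gathering those files into one archive could look something like the sketch below. The function name bundle_case_files and the archive name are assumptions for illustration only; the three paths are the ones listed above. Paths that don't exist are skipped rather than treated as errors, and reading /var/log will most likely need root.

```python
import tarfile
from pathlib import Path

# Paths named in the procedure above.
CASE_PATHS = ["/var/log/messages", "/var/log/dump", "/var/log/salinfo"]


def bundle_case_files(archive, paths):
    """Write the existing paths into a gzip-compressed tar archive.

    Returns the list of paths that were actually added. Illustrative
    helper only; not an SGI-provided tool.
    """
    added = []
    with tarfile.open(archive, "w:gz") as tar:
        for p in paths:
            path = Path(p)
            if path.exists():
                # Directories (such as the dump directory) are added recursively.
                tar.add(str(path))
                added.append(str(path))
    return added
```

For example, bundle_case_files("case.tar.gz", CASE_PATHS + ["typescript"]) would also pick up the console log, assuming script(1) was left writing to its default file name, typescript.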

-- MikePatterson - 15 Apr 2005

Note that things seem to have changed somewhat since the upgrade to SuSE 10 - no sn2kdb, and the SGI docs seem to suggest sr d instead of sr c. Except here's the output:

[0]kdb> sr d
SysRq : Starting crash dump
[0]kdb> sr c
SysRq : HELP : loglevel0-8 reBoot Dump tErm Full kIll saK showMem Nice powerOff showPc unRaw Sync showTasks Unmount
[0]kdb> sr showtasks
SysRq : Emergency Sync
[0]kdb> 

Doesn't seem to be as friendly.
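
One way to read that HELP line: in the usual Linux SysRq convention, the capital letter in each word is the key for that action, so the line lists d (Dump) but no c. The "Emergency Sync" reply to sr showtasks also looks as if only the first character, s, was used, though that is my guess from the output rather than something I've confirmed for this kdb version. A small sketch to pull the keys out of the help text (sysrq_keys is a made-up name):

```python
HELP = ("loglevel0-8 reBoot Dump tErm Full kIll saK showMem Nice "
        "powerOff showPc unRaw Sync showTasks Unmount")


def sysrq_keys(help_line):
    """Map each SysRq key letter to the help word that advertises it.

    Assumes the convention that the single uppercase letter in a word is
    its key; words without exactly one uppercase letter (such as
    loglevel0-8) are skipped.
    """
    keys = {}
    for word in help_line.split():
        caps = [c for c in word if c.isupper()]
        if len(caps) == 1:
            keys[caps[0].lower()] = word
    return keys


keys = sysrq_keys(HELP)
for key in sorted(keys):
    print(key, keys[key])
```

On that reading, sr t would be the way to ask for showTasks, and sr d the way to start a dump.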

Topic revision: r1 - 2005-04-15 - MikePatterson
 