One of the most useful resources when troubleshooting a hang or similar issue is the memory dump file Windows writes out during a blue screen. If a system is hung and you can't get to it locally, pressing Ctrl+ScrollLock, ScrollLock isn't a feasible solution. If the server is an HP server with an iLO (Integrated Lights-Out) card, and you've set a registry key in Windows ahead of time, you can force the system to bluescreen, write the memory dump, and restart.

The key to doing this is generating what's called a nonmaskable interrupt, or NMI. Windows has a concept of interrupt request levels, or IRQLs. An interrupt at a higher IRQL is always serviced first, preempting any lower-level interrupts which are currently being serviced. While that is happening, the lower-level interrupts are held off until the level drops again; this is called masking the interrupt. An NMI, as the name says, cannot be masked, so it has to be serviced immediately. Generally you get an NMI when there's a major hardware fault that prevents the operating system from continuing. This is exactly what happens if we trigger one manually in the iLO.
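As a rough illustration of the masking idea, here is a toy Python model. It is only a sketch: the function name `is_delivered` and the numeric levels are made up for this example, and it says nothing about how Windows or the hardware actually implement IRQLs.

```python
# Toy model of interrupt masking. Hypothetical names and levels,
# not how Windows or the hardware implement IRQLs.

NMI = "NMI"  # marker for a nonmaskable interrupt


def is_delivered(current_level, interrupt_level):
    """Return True if the interrupt would be serviced right away.

    A maskable interrupt is serviced only when its level is higher than
    the level the processor is currently running at. An NMI ignores the
    current level entirely.
    """
    if interrupt_level == NMI:
        return True
    return interrupt_level > current_level


if __name__ == "__main__":
    print(is_delivered(5, 3))    # lower-level interrupt is masked
    print(is_delivered(5, 7))    # higher-level interrupt preempts
    print(is_delivered(31, NMI)) # an NMI always gets through
```

The point of the last case is the one that matters here: even a system that appears completely hung still has to respond to an NMI.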

The first step to getting this functionality working is setting a registry key outlined in KB 927069. Don't mind the part about this only applying to HP blades or to Windows 2000. This works on Windows 2000 and Windows Server 2003, and it works with hardware other than HP blades. Here's the registry key info:

Path: HKLM\System\CurrentControlSet\Control\CrashControl

Value: NMICrashDump

Data: 1

Type: REG_DWORD

You'll need to reboot for the change to take effect.
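If you need to set this on more than one server, a .reg file is handy. The following is a minimal Python sketch that just builds the text of such a file from the path, value, and data listed above. The function name `build_reg_file` is made up for this example, and the sketch assumes you would save the output as a .reg file and import it on the server yourself; it does not touch the registry.

```python
# Minimal sketch: build the text of a .reg file for the NMICrashDump value.
# It only generates text; importing the file on the server is up to you.

KEY_PATH = r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl"


def build_reg_file(key_path=KEY_PATH, value_name="NMICrashDump", data=1):
    """Return .reg file text that sets a REG_DWORD value under key_path."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key_path}]",
        # REG_DWORD values are written as eight hex digits in .reg files.
        f'"{value_name}"=dword:{data:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(build_reg_file())
```

The reboot is still required after importing the file.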

If you have HP's Automatic Server Recovery (ASR) functionality enabled on the server and you need to get a full memory dump, you will need to turn it off as it can interfere with this process. This is a BIOS setting which I don't have the steps to change easily available. If there's demand (leave a comment), I can track them down.

To crash the box, these are the steps. I shot these screens on a DL360 G4 which is fairly recent hardware. I suspect the screens and locations of options may vary a bit by age (and especially on older legacy Compaq stuff), but the basic process is the same.

1. Log in to the iLO and then proceed to the "Server and iLO Diagnostics" link on the left hand navigation:

    

2. Select the Virtual NMI Button option on the toolbar:

    

Warning! I can't guarantee that this button generates a warning when you click it on all versions of the iLO firmware. Generating an NMI will HALT your system. Don't click this button just to see what happens!

3. Generate the NMI. This button is towards the bottom of the page, so if your browser doesn't automatically scroll down to it, you'll have to scroll down yourself:

    

4. You will get a warning dialog to make sure you're really certain this is what you want to happen. Remember, doing this will HALT your system!

    

5. The iLO will write a status message to the status bar in IE:

    

6. At this point Windows will crash with a 0x80 (NMI_HARDWARE_FAILURE) bugcheck and reboot (assuming your machine is configured to automatically reboot after a bluescreen). With any luck, you can then use the memory dump to troubleshoot the problem at hand.
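Once the server is back up, it's worth a quick check that a dump was actually written. The following Python sketch is one way to do that, under two assumptions that are not from the steps above: that the dump is at the default location of %SystemRoot%\MEMORY.DMP, and that kernel and complete dump files begin with the 8-byte signature `PAGEDUMP` (32-bit) or `PAGEDU64` (64-bit). The function names are made up for this example.

```python
import os

# Assumed signatures at the start of a kernel or complete memory dump.
DUMP_SIGNATURES = {
    b"PAGEDUMP": "32-bit",
    b"PAGEDU64": "64-bit",
}


def classify_header(header):
    """Return '32-bit' or '64-bit' if the bytes start with a dump signature,
    otherwise None."""
    return DUMP_SIGNATURES.get(bytes(header[:8]))


def classify_dump_file(path):
    """Read the first 8 bytes of the file at path and classify them.

    Returns None if the file does not exist or has no known signature.
    """
    if not os.path.isfile(path):
        return None
    with open(path, "rb") as dump:
        return classify_header(dump.read(8))


if __name__ == "__main__":
    # Assumed default dump location; adjust if the server is configured differently.
    default_path = os.path.join(os.environ.get("SystemRoot", r"C:\Windows"), "MEMORY.DMP")
    print(default_path, classify_dump_file(default_path))
```

This only tells you that something resembling a dump is there. To actually analyze it you'll still want a debugger.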

Note that this capability is also present in the Dell DRAC cards (at least certain versions). I'm trying to find out what happened to the option in the latest versions of the cards as it seems to have gone AWOL. I'll post the directions whenever I find out.
