Well, your question takes me back about 20 years to when I was working on embedded systems using 8-bit processors. Until I did a quick check just now, I didn't even realize x86 had an NMI! So without further searching I can only give you a general idea.
It sounds to me like the only thing the card does is generate an NMI (non-maskable interrupt) when the button is pushed, and that anything useful for actual debugging is external to the card. As far as I know, on x86 the CPU itself only pushes the flags, code segment and instruction pointer onto the stack when it takes an interrupt; it is up to the NMI handler to save the remaining registers so that the state of the machine is preserved. According to this Wikipedia article:
Quote:
With the introduction of Windows 2000, Microsoft allowed the use of an NMI to cause a system to either break into a debugger, or dump the contents of memory to disk and reboot.
I am not aware of any such capability within Linux, but perhaps there is. You could search the Internet to try to find out. If it doesn't already exist, you would have to write your own software and tie it to the NMI. If, for example, the NMI was handled by a debugger, the debugger could look at the saved state and tell you the contents of all of the registers and the program counter at the time you interrupted things. If the debugger didn't disturb any critical memory, then the whole state of the system would be preserved and you could, in principle, trace back how you got to the hung state. I have worked with debugger/emulators that could then even single-step the program that had been running, but I believe that requires additional hardware which may not be available to you. (My memory is very vague and, anyway, this was all on 8-bit processors.)
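As a starting point for that search: x86 Linux kernels do list an "NMI:" line in /proc/interrupts and expose the sysctls kernel.unknown_nmi_panic and kernel.panic_on_unrecovered_nmi, which control whether an NMI the kernel cannot attribute to a known source causes a panic. Below is a rough Python sketch that reads those values so you can see whether pushing the button actually increments the NMI count. The helper names nmi_counts and read_sysctl are my own, and I am assuming (not certain) that this particular card's NMI would show up as an "unknown" one.

```python
from pathlib import Path


def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts found in the text of /proc/interrupts."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            # Per-CPU counters follow the label; the trailing
            # description ("Non-maskable interrupts") is not numeric.
            return [int(f) for f in fields[1:] if f.isdigit()]
    return []


def read_sysctl(name):
    """Read a sysctl through /proc/sys; return None if absent or unreadable."""
    path = Path("/proc/sys") / name.replace(".", "/")
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    try:
        text = Path("/proc/interrupts").read_text()
    except OSError:
        text = ""  # not on Linux, or /proc is not mounted
    print("NMI counts per CPU:", nmi_counts(text))
    for name in ("kernel.unknown_nmi_panic",
                 "kernel.panic_on_unrecovered_nmi"):
        print(name, "=", read_sysctl(name))
```

Run it once, push the button, and run it again: if the counts go up, the NMI is at least reaching the kernel. Whether a panic then gives you a usable memory dump depends on having a crash-dump mechanism configured, which this sketch does not check.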
However, troubleshooting these situations can be very complicated. I have never attempted it on anything close to the complexity of a running Linux system, and my recollection is that it was as much art as science, so I don't know of any general, systematic way to proceed. Just look at the information about where you are, possibly single-step if you have that capability (and if other interrupts and real-time events don't turn that into a meaningless exercise), and try to deduce what it might make sense to do or look at next.
Alternatively, you could have software that simply dumps all of memory along with the registers and program counter, and then try to figure it out from there. I have never done anything like that and so can't give you any pointers. Perhaps somebody else with experience with large(r) systems can give you some advice.
As far as figuring out hardware issues, all I know is to deduce or hypothesize it based on what you observe the software has done. If it is flaky hardware rather than a solid failure, things can get quite tricky.