LinuxQuestions.org

damgar 12-16-2009 02:34 PM

Safe way to force load on cpu/gpu?
 
I am experiencing stability issues that seem to be related to my nvidia drivers. Is there a way to force a load on the cpu and the gpu independently of each other, to help determine whether the problem is truly as random as it seems? When I crash there are no logs generated and no recovery possible; I just have to do a hard reset (not happy about that). Nothing seems to be grabbing resources except possibly at the time of the crash (which is why I want to force a load). I've run memtest86 for 4 hours without a single error, blown out the case, and reseated everything. I'm confident in the general area that I'm looking at; I'm just trying to narrow that down.

ShadowCat8 12-16-2009 07:27 PM

Hmmm...

A question that quickly comes to mind: which version of Xorg are you running? The newer versions of Xorg no longer enable the Ctrl+Alt+Backspace key sequence by default. However, you can re-enable it. :) It just depends on how the X server is set to run: with HAL or without. If the X server doesn't need HAL to run, then you can add the following line to your xorg.conf inside the keyboard "InputDevice" section:
Code:

Option "XkbOptions" "terminate:ctrl_alt_bksp"

... and in the "ServerFlags" section, find the DontZap option, uncomment it and add "False" to the end of it, like so:
Code:

Option "DontZap"  "False"

Of course, if your X server *does* require HAL to run, then you need to put a different line in /etc/hal/fdi/policy/10-keymap.fdi:
Code:

<merge key="input.xkb.options" type="string">terminate:ctrl_alt_bksp</merge>

Yet another option is to set it up for the user. If you are using HAL, you can add the following line to ~/.xinitrc:
Code:

setxkbmap -option terminate:ctrl_alt_bksp

If you do not have an /etc/X11/xorg.conf file, then your X server needs HAL to run.

==============

Another possibility for not having to do the hard reset is if you have an ssh server running on the box, can you log in from another box and just kill the X server?

And, does /var/log/Xorg.0.log not have *anything* regarding the crashes? That seems strange to me. Have you checked /var/log/Xorg.0.log.old as well?
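
If paging through those two logs by hand is tedious, something like the following might help. It is only a rough sketch: it relies on Xorg tagging error lines with (EE) and warning lines with (WW), and the helper names `flagged_lines` and `scan_log` are made up for this example.

```python
from pathlib import Path

# Xorg prefixes log lines with a marker: (EE) is an error, (WW) a warning.
MARKERS = ("(EE)", "(WW)")

def flagged_lines(text, markers=MARKERS):
    """Return (line_number, line) pairs for lines containing any marker."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(marker in line for marker in markers):
            hits.append((number, line.strip()))
    return hits

def scan_log(path):
    """Return the flagged lines of one log file, or [] if it cannot be read."""
    log = Path(path)
    try:
        return flagged_lines(log.read_text(errors="replace"))
    except OSError:
        # Missing or unreadable file: report nothing instead of crashing.
        return []

# Example use (paths taken from the post above):
# for path in ("/var/log/Xorg.0.log", "/var/log/Xorg.0.log.old"):
#     for number, line in scan_log(path):
#         print(f"{path}:{number}: {line}")
```

The marker legend near the top of the log may also show up in the results, since it mentions both markers; that line can be ignored.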

Let us know.

PTrenholme 12-16-2009 08:01 PM

If you're using a newer X server (where Ctrl+Alt+Backspace doesn't reboot the server), try setting the server options like this in your xorg.conf file:
Code:

Section "ServerFlags"
#       Option      "AIGLX" "on"
#       Option      "Xinerama" "0"
        Option      "DontZap" "off"
        Option      "DontVTSwitch" "off"
#       Option      "HandleSpecialKeys" "Always"
#       Option      "AllowEmptyInput" "off"
EndSection

That's not to say that ShadowCat8's comment is incorrect, just that the man page for xorg.conf suggests the "DontZap" option as the way to go.

damgar 12-17-2009 12:24 AM

Ctrl+Alt+Backspace restarts X and takes me to the console under normal circumstances. I can't find any reference to the crashes in any of the logs (that I can make out, anyway). When the system freezes it's like a kernel panic: keyboard LEDs blink, the screen freezes, and audio skips in place.

PTrenholme 12-17-2009 10:16 PM

Hum. I (obviously) don't know how every distribution works, but - in my limited experience - blinking keyboard lights indicate a BIOS error condition, not a Linux/GNU problem. If I were you, I'd turn on your BIOS logging and check that log (assuming, of course, that you're using a BIOS that supports logging). If your BIOS doesn't support logging, check your BIOS manual for methods for debugging BIOS issues.

Again, in my limited experience, the most common hardware problems causing the BIOS to "barf" are loose hard drive connectors or failing memory chips. If your system is one that uses cabled hard drives (most desktop systems do), try unplugging and replugging the cables (at each end) to "clean" the contacts from any minor corrosion and reset the connectors. If that fails to help, try running the memtest program from the boot prompt. (If it's not on your boot menu, download a rescue CD and boot from it. SystemRescueCD is a good choice.)

<edit>
Note: A full memtest of a 4GB memory block can take as much as eight hours - or more - to run.
</edit>

It might also be useful to pull any "cards" in your computer and reseat them at the same time as you check the drive cables.

Of course, if your system is a laptop, then the memtest is the easiest thing to try. The other connections seldom go bad on a laptop before it breaks for some other reason. (I believe that few laptops are designed to survive for more than a year or two of moderate use.)

damgar 12-18-2009 12:57 PM

All of those things were suspect and were done (pulling cards and cables, even removing additional cards and hard drives that aren't being used at present, in case it was a power issue). I've gone far enough to know that I can run a 2.6.30.10 kernel with the NVIDIA proprietary drivers without an issue, but starting at 2.6.31, with any version of the NVIDIA drivers (including the just-released 190.53), I die in under 5 hours. I will have to look closer at the kernel options starting at 2.6.31. I assume that whatever it is has to be a default answer in that kernel and higher, because building with my .config file from 2.6.30.10 and accepting only the default answers to make oldconfig doesn't work.

I haven't tried bios logging, but I did try changing bios settings. I'll have to look into bios logging to find out if it's an option.

Thanks for the reply.

PTrenholme 12-18-2009 02:27 PM

I just looked back at your posts in this thread (and noticed that you mentioned doing the obvious things in your first post - sorry :redface:), but I don't see any mention of which distribution you're using when the problem occurs. (You list three distributions in your "member info" section, but it's not clear that you're using the 2.6.31 kernel with all of them.)

I had a lot of problems with the nVidia chipset drivers on this laptop when Fedora switched to loading the nouveau driver with the kernel, since that driver does not (correctly) support the MCP67 chips. (Which worries me, since Linus has accepted the nouveau driver for inclusion in the 2.6.33 kernel. But that's a different issue.) Anyhow, you may want to consider reviewing your initial RAM drive (initrd) image file to see if another driver for your card is being loaded with the kernel. (This is not likely, since the X server would, naturally, fail to load the nvidia driver if another driver was already loaded, and would then fail to start.)

For what it's worth, I'm now using the nVidia 190.42 driver with the Fedora 2.6.31.6 kernel (and MCP67 chipset) with no problems. But, of course, you're using a different chipset, so that's probably irrelevant.

Anyway, to answer the question you asked when you started this thread, one simple way to load up your system is to start several instances of the glxgears app. (On this laptop, three instances running at the same time bumped CPU usage to 99% on both processors.)
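
If you'd rather not start and stop the instances by hand, a small wrapper along these lines could do it. Treat it as a sketch: the function names (`spawn_instances`, `stop_instances`, `run_load`) are invented here, and it assumes glxgears is installed and on your PATH.

```python
import subprocess
import time

def spawn_instances(command, count):
    """Start `count` copies of `command` and return the Popen handles."""
    return [
        subprocess.Popen(command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        for _ in range(count)
    ]

def stop_instances(processes, grace=5.0):
    """Ask every process to terminate, then kill any that ignore the request."""
    for proc in processes:
        if proc.poll() is None:
            proc.terminate()
    for proc in processes:
        try:
            proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()

def run_load(command, count, seconds):
    """Run `count` copies of `command` for `seconds`, then clean up.

    Returns the exit codes of the stopped processes.
    """
    processes = spawn_instances(command, count)
    try:
        time.sleep(seconds)
    finally:
        # Always clean up, even if the wait is interrupted with Ctrl+C.
        stop_instances(processes)
    return [proc.returncode for proc in processes]

# Example: three instances for ten minutes (both numbers are arbitrary).
# run_load(["glxgears"], count=3, seconds=600)
```

Raise the count until the load is where you want it; the right number will depend on the machine.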

damgar 12-18-2009 03:48 PM

Thank you for that tip. I'm currently running slackware64 current. I got the same issue with slackware13 (32-bit) stable when I used the 2.6.31.6 kernel as I now get with 2.6.32 or 2.6.32.1 on current_64. I've been through probably 15 rolls in the last 2 weeks or so and it's just not happening. I thought that surely it was me learning to roll kernels that was causing the problem, so I rolled versions starting at 2.6.29.6 (shipped with slack13 and slack64) and I'm good up to 2.6.30.10, but when I go to 2.6.31.x the issue kicks in. I'm downloading kernels from kernel.org and nouveau isn't supposed to be included until 2.6.33 (I was HOPING that would be a good open source alternative, but after your post I'm not so sure) and in a few rolls I've removed every video driver except VESA to make sure there was no weird loading of conflicting modules. My most recent troubleshooting path has been to start with the stock .config file -> new kernel -> make oldconfig -> default answers -> make -> test till break. I now believe it to be either a bug (I don't think I know enough about the problem to report it) or, more likely, some nuance in the default options beginning in 2.6.31. I don't run any exotic hardware; it's a 3 year old gigabyte motherboard with an intel chipset and an e6600 core2duo with 6GB ram.
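
One way to narrow the search might be to compare the working .config against the one make oldconfig produces for the failing kernel and list only what was added, removed, or changed. The sketch below assumes the usual .config layout (`CONFIG_NAME=value` lines and `# CONFIG_NAME is not set` comments); the function names and the file names in the usage comment are hypothetical.

```python
NOT_SET_SUFFIX = " is not set"

def parse_config(text):
    """Map option names to values from kernel .config text.

    Lines such as `# CONFIG_FOO is not set` are recorded with the value "n".
    """
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(NOT_SET_SUFFIX):
            # Drop the leading "# " and the trailing " is not set".
            name = line[2:-len(NOT_SET_SUFFIX)]
            options[name] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options

def diff_configs(old_text, new_text):
    """Return (added, removed, changed) dictionaries between two .config texts."""
    old, new = parse_config(old_text), parse_config(new_text)
    added = {name: new[name] for name in new.keys() - old.keys()}
    removed = {name: old[name] for name in old.keys() - new.keys()}
    changed = {
        name: (old[name], new[name])
        for name in old.keys() & new.keys()
        if old[name] != new[name]
    }
    return added, removed, changed

# Example (hypothetical file names):
# with open("config-2.6.30.10") as a, open("config-2.6.31.6") as b:
#     added, removed, changed = diff_configs(a.read(), b.read())
# for name in sorted(added):
#     print("NEW", name, added[name])
```

The "added" set is the interesting one here, since it holds exactly the options that got a default answer in the newer kernel.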

It's daunting and time consuming to wade through even just the NEW options in a new version config, but I have tried. I have come to appreciate just how much work is going on in the kernel development community that I never would have understood just sticking with a stock installation. I'm learning A LOT through this, I just really would like to learn THE ANSWER to my problem at some point! LOL

PTrenholme 12-18-2009 09:59 PM

Is your nVidia hardware a separate card or one "built in" to your mother board?

Have you looked at the changes made in the kernel between the 2.6.30.10 that works for you and the next kernel that fails? (Was that 2.6.30.11 - if that even exists - or did you jump from 2.6.30 to 2.6.31?)

Have you asked the nVidia Linux support people for help? (Last time I did that I received an automated response suggesting I try the procedure that I told them had failed in my message, so contacting them may not be as responsive as you might wish it would be. I never bothered to get back to them after receiving the "response" their system generated.) But, hey, they might know what the problem is, and how to fix it. :)

damgar 12-18-2009 11:46 PM

Quote:

in my limited experience - blinking keyboard lights indicate a BIOS error condition, not a Linux/GNU problem

I'm currently testing 2.6.32.1 and it's gone longer than any other incarnation so far that I know of. I did a little digging around to check for changes in the kernel that seemed likely, but didn't find much (I prefer hammers to books LOL). I found some error reports in outside forums that described kernel panics with 2.6.31.6 (the first kernel version above 2.6.29.6 I'd tried, which didn't work), but they didn't seem to be the same problem exactly. Then I saw their motherboards were gigabyte like mine, but for AMD instead of intel. Then I remembered your post. So I went through the bios settings again when the machine froze with yet another kernel and found the setting for "pci express frequency" (I'm thinking that was what it was called....I'll edit later if that proves to be incorrect), which was set to "auto". I set it manually to 100 (the bios menu said there were no guarantees above 100) and rebooted into 2.6.32.1.

I then used your suggestion with glxgears and got 6 instances of glxgears going, then opened up firefox, konqueror, and google chrome, pointed them all to youtube, and started videos going while watching top. I then threw amarok on top of that to get cpu% up to a consistent 93 and kept that going for 10 minutes or so with no problems except the expected stuttering video. I've been up for going on 3 hours so far.
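
For anyone who wants to record the load instead of eyeballing top, here is a rough sketch that computes system-wide CPU utilization from two readings of the aggregate `cpu` line in /proc/stat. The function names are invented for this example, and it counts idle plus iowait time as "not busy", which is one common convention but not the only one.

```python
import time

def parse_cpu_line(line):
    """Return (busy, total) jiffies from the aggregate `cpu` line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Columns: user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time is already counted inside user, so later columns are skipped.
    values = [int(value) for value in fields[1:9]]
    total = sum(values)
    # Treat idle and iowait (when the kernel reports it) as "not busy".
    idle = sum(values[3:5])
    return total - idle, total

def utilization(before, after):
    """Percent of CPU time spent busy between two (busy, total) samples."""
    busy = after[0] - before[0]
    total = after[1] - before[1]
    return 100.0 * busy / total if total > 0 else 0.0

def read_sample(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the all-CPU aggregate."""
    with open(path) as stat:
        return parse_cpu_line(stat.readline())

def sample_utilization(interval=1.0):
    """Measure system-wide CPU utilization over `interval` seconds."""
    before = read_sample()
    time.sleep(interval)
    return utilization(before, read_sample())

# Example: print one reading every five seconds while the load is running.
# while True:
#     print(f"{sample_utilization(5.0):.1f}%")
```

Logging these readings to a file on disk might also show what the load looked like just before a freeze, assuming the last lines reach the disk in time.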

If this is the answer then the only thing left to do is figure out what changed from 2.6.30.10 to 2.6.31.6 that caused a bios setting that has worked unnoticed by me for years to go silly.

damgar 12-19-2009 10:07 AM

I'm going to call this solved. glxgears is a great tool I wasn't aware of. Thanks for that.

Also, thanks for mentioning the bios as ultimately you were dead on. I feel like an idiot for not finding this earlier, but I guess with all the things I learned that DIDN'T fix the problem I am now a SMARTER IDIOT than when I started! LOL

Thanks again.


All times are GMT -5.