LinuxQuestions.org


danch 10-12-2018 08:10 PM

video corruption when reading from disk, Debian jessie
 
1 Attachment(s)
I recently upgraded a machine from Debian wheezy to Debian jessie. After the upgrade, if I run a command like

Code:

find . -type f -exec cat {} \; > /dev/null
from a large directory, I get video corruption after 1 or 2 seconds. It's completely reproducible.
It happens under X or even in a virtual terminal with X not running and the nvidia module never loaded.
It happens when reading from an ext4 partition on sdb or on an xfs partition on sdd.
It happens under the default jessie kernel (3.16.0-7-686-pae), under the still-installed wheezy kernel (3.2.0-6-686-pae), and under the 4.9 backports kernel available for jessie (4.9.0-0.bpo.7-686).
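The read load from the find command above can also be wrapped in a small function, which makes it easier to point the same test at different partitions. This is only a sketch: the function name read_all and the example mount points are made up for illustration and are not part of the original setup.

```shell
#!/bin/sh
# read_all DIR: read every regular file under DIR and discard the data.
# Hypothetical helper; it generates the same kind of read load as the
# find/cat one-liner, but batches file names through xargs.
read_all() {
    dir=${1:-.}
    # -print0 / -0 keep file names with spaces or newlines intact.
    # -r skips running cat when no files are found.
    find "$dir" -type f -print0 | xargs -0 -r cat > /dev/null
}

# Example (paths are assumptions):
#   read_all /mnt/ext4-on-sdb
#   read_all /mnt/xfs-on-sdd
```

Because xargs passes many file names to each cat process, this probably reads faster than one cat per file, so it may trigger the corruption sooner or differently than the original command.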

The machine has been in regular use as a backup server and media server for many years running wheezy, without problems. The problems started right after the first boot into jessie.

I booted into System Rescue CD 3.9.2 on a thumb drive. I got similar vt corruption at the start, while it read the USB drive before starting to boot. But then as soon as the kernel started booting, the corruption was gone, and my find test didn't cause further problems.

When the corruption happens in a virtual terminal, nothing shows up in dmesg. If X is running, then there are error messages in dmesg and the machine freezes up soon after. I'll paste these messages below.

One time, I got read errors from the hard drive, but SMART tests and later checks showed that the drives were fine.

Hardware:

Gigabyte GA-E7AUM-DS2H motherboard, with onboard NVIDIA GeForce 9400 graphics
Intel Core 2 Duo E7400 2.8GHz 65W
2x2GB Kingston 800MHz RAM
4 SATA hard drives

I ran memtest directly from grub for several hours without a problem.

dmesg output when corruption happens and X is running:

Code:

[  360.980157] NVRM: GPU at PCI:0000:02:00: GPU-579d39ac-2eaa-3c97-d407-7d020ce553e2
[  360.980164] NVRM: Xid (PCI:0000:02:00): 8, Channel 0000007e
[  362.980103] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[  366.980179] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[  368.981729] [sched_delayed] sched: RT throttling activated
[  424.682600] perf interrupt took too long (2513 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[  436.996006] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 57.899 msecs
[  436.996006] perf interrupt took too long (454980 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
[  437.924150] perf interrupt took too long (451438 > 10000), lowering kernel.perf_event_max_sample_rate to 12500
[  438.792378] perf interrupt took too long (447921 > 20000), lowering kernel.perf_event_max_sample_rate to 6250
[  439.718485] perf interrupt took too long (896752 > 38461), lowering kernel.perf_event_max_sample_rate to 3250
[  440.470959] perf interrupt took too long (889756 > 71428), lowering kernel.perf_event_max_sample_rate to 1750
[  441.281296] perf interrupt took too long (882819 > 125000), lowering kernel.perf_event_max_sample_rate to 1000
[  442.149523] perf interrupt took too long (875931 > 250000), lowering kernel.perf_event_max_sample_rate to 500
[  443.017769] perf interrupt took too long (869098 > 500000), lowering kernel.perf_event_max_sample_rate to 250
[  443.885975] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 115.795 msecs
[  444.518713] hpet1: lost 6 rtc interrupts
[  444.692362] hpet1: lost 10 rtc interrupts
[  444.866006] hpet1: lost 10 rtc interrupts
[  455.747847] NVRM: Xid (PCI:0000:02:00): 1, Channel 00000001 Method 00000000 Data 00006861
[  457.992006] INFO: rcu_sched self-detected stall on CPU { 0}  (t=5250 jiffies g=10451 c=10450 q=1182)
[  457.992006] sending NMI to all CPUs:
[  457.992006] NMI backtrace for cpu 0
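To pick the NVIDIA Xid events out of a long log like the one above, the output can be saved to a file and filtered. The following is a minimal sketch, assuming the log was saved with something like dmesg > dmesg.txt; the function name xid_events and the file name are hypothetical. It only extracts the numeric codes and does not interpret them; their meaning would have to be looked up in NVIDIA's Xid documentation.

```shell
#!/bin/sh
# xid_events FILE: print the NVIDIA Xid codes recorded in a saved dmesg log.
# Hypothetical helper; FILE is assumed to come from `dmesg > FILE`.
xid_events() {
    # Matches lines such as "NVRM: Xid (PCI:0000:02:00): 8, Channel 0000007e"
    # and prints only the numeric Xid code, one per line.
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p' "$1"
}

# Example (file name is an assumption):
#   dmesg > dmesg.txt
#   xid_events dmesg.txt
```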

I'm happy to provide any other info or try any suggestions, but I thought I'd start with this.

Thanks for any help trying to figure this out!

Mara 10-13-2018 07:04 AM

It looks like this is the nvidia module. It is loaded, because it complains (the NVRM prefix comes from this driver):
Quote:

[ 362.980103] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Please try removing or blacklisting it and check whether the problem still happens.
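On Debian, one common way to blacklist a module is a small file under /etc/modprobe.d/. The sketch below is only an illustration of that approach: the helper name blacklist_module and the file name it writes are arbitrary choices, and the optional second argument exists just so the function can be tried against a scratch directory first.

```shell
#!/bin/sh
# blacklist_module MODULE [CONFDIR]: write a modprobe blacklist entry.
# Hypothetical helper; CONFDIR defaults to /etc/modprobe.d (needs root).
blacklist_module() {
    module=$1
    confdir=${2:-/etc/modprobe.d}
    # The file name is an arbitrary choice; modprobe reads every *.conf here.
    printf 'blacklist %s\n' "$module" > "$confdir/blacklist-$module.conf"
}

# Typical use as root, followed by a reboot (not run here):
#   blacklist_module nvidia
#   update-initramfs -u      # Debian: rebuild the initramfs
#   lsmod | grep nvidia      # after reboot: no output means not loaded
```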

danch 10-13-2018 07:33 AM

The problem also happens if I rename the nvidia modules and reboot, but then there are no messages in dmesg, and the system remains stable other than the corruption of the virtual terminal.

pan64 10-16-2018 08:40 AM

I had to remove all the X-related packages from my Ubuntu after a dist-upgrade and reinstall them. That will probably help you too.
But I do not really understand: is the virtual terminal something running inside X, or what?

What about switching to console (Ctrl-Alt-F1) and back again?

danch 10-16-2018 08:48 AM

Quote:

Originally Posted by pan64 (Post 5915399)
I had to remove all the X-related packages from my Ubuntu after a dist-upgrade and reinstall them. That will probably help you too.
But I do not really understand: is the virtual terminal something running inside X, or what?

What about switching to console (Ctrl-Alt-F1) and back again?

By a "virtual terminal" I mean the console you get with Ctrl-Alt-F1. I get corruption in that console when X is not even running and the nvidia driver has not been loaded into the kernel. So reinstalling packages related to X shouldn't affect that.

pan64 10-16-2018 08:55 AM

Anyway, it looks like the upgrade was not really successful.
Did you try running the reset command? Does it help?

danch 10-16-2018 09:25 AM

Typing the
Code:

reset
command in the virtual terminal doesn't improve anything. I think the corruption is at a much lower level than terminal settings. See the screenshot I attached to the original question for how it typically looks. Many of the coloured boxes are flashing.

pan64 10-17-2018 01:09 AM

That is corruption of the buffer where the "content" is stored. Usually reset forces it to be cleared. But as you said, this is at a lower level. I would suggest booting from a live CD, or something similar, to see whether that works. I still think a clean reinstall may help.

danch 12-01-2018 08:03 PM

Although the corruption problems first appeared after an OS upgrade, I'm pretty sure they were caused by a faulty motherboard in the end. I replaced the motherboard, CPU and RAM, but kept the same hard drives and the same installation of Debian, and the problem went away.


All times are GMT -5.