[SOLVED] rcu_process_callbacks panic on file download

rpedrica · 03-03-2018, 04:10 PM

I've had a very odd panic occurring over the last few days, specifically when downloading files in a browser (both Firefox and Chrome seem to trigger it). The problem is 100% reproducible and results in the panic below.

Quote:

Mar 3 23:46:35 googly kernel: [ 1220.152555] ------------[ cut here ]------------
Mar 3 23:46:35 googly kernel: [ 1220.152562] WARNING: CPU: 3 PID: 0 at kernel/rcu/tree.c:2725 rcu_process_callbacks+0x47f/0x4a0
Mar 3 23:46:35 googly kernel: [ 1220.152563] Modules linked in: nvidia_uvm(PO) cifs fscache bridge stp llc ipv6 fuse joydev hid_generic hid_microsoft usbhid uas usb_storage hid nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) i2c_dev mxm_wmi kvm_amd drm_kms_helper evdev kvm drm irqbypass agpgart snd_cmipci crct10dif_pclmul ipmi_devintf ipmi_msghandler crc32_pclmul snd_mpu401_uart fam15h_power crc32c_intel snd_opl3_lib fb_sys_fops k10temp syscopyarea snd_rawmidi ghash_clmulni_intel hwmon sysfillrect snd_seq_device xhci_pci snd_hda_codec_realtek snd_hda_codec_hdmi sysimgblt i2c_piix4 gameport r8169 ohci_pci snd_hda_codec_generic xhci_hcd mii i2c_core button ehci_pci ohci_hcd ehci_hcd wmi snd_hda_intel snd_hda_codec snd_hda_core shpchp snd_hwdep snd_pcm snd_timer snd soundcore acpi_cpufreq loop
Mar 3 23:46:35 googly kernel: [ 1220.152607] CPU: 3 PID: 0 Comm: swapper/3 Tainted: P O 4.14.22 #2
Mar 3 23:46:35 googly kernel: [ 1220.152608] Hardware name: MSI MS-7640/990FXA-GD65 (MS-7640), BIOS V20.3 09/26/2013
Mar 3 23:46:35 googly kernel: [ 1220.152609] task: ffff8ca1fea65100 task.stack: ffff9f128007c000
Mar 3 23:46:35 googly kernel: [ 1220.152611] RIP: 0010:rcu_process_callbacks+0x47f/0x4a0
Mar 3 23:46:35 googly kernel: [ 1220.152612] RSP: 0018:ffff8ca1fecc3f10 EFLAGS: 00010002
Mar 3 23:46:35 googly kernel: [ 1220.152614] RAX: ffffffffffffd800 RBX: ffff8ca1fecdfd40 RCX: 000000000005cf01
Mar 3 23:46:35 googly kernel: [ 1220.152615] RDX: 0000000000000001 RSI: ffff8ca1fecc3f18 RDI: ffff8ca1fecdfd78
Mar 3 23:46:35 googly kernel: [ 1220.152616] RBP: ffffffff8c83d640 R08: 00000000000222c0 R09: ffffffff8b0ff68b
Mar 3 23:46:35 googly kernel: [ 1220.152617] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ca1fecdfd78
Mar 3 23:46:35 googly kernel: [ 1220.152618] R13: 0000000000000246 R14: 7fffffffffffffff R15: fffffffffffffff1
Mar 3 23:46:35 googly kernel: [ 1220.152620] FS: 0000000000000000(0000) GS:ffff8ca1fecc0000(0000) knlGS:0000000000000000
Mar 3 23:46:35 googly kernel: [ 1220.152621] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 3 23:46:35 googly kernel: [ 1220.152622] CR2: 000000000210a000 CR3: 00000003ab774000 CR4: 00000000000406e0
Mar 3 23:46:35 googly kernel: [ 1220.152623] Call Trace:
Mar 3 23:46:35 googly kernel: [ 1220.152625] <IRQ>
Mar 3 23:46:35 googly kernel: [ 1220.152629] __do_softirq+0xe0/0x2dc
Mar 3 23:46:35 googly kernel: [ 1220.152633] irq_exit+0xae/0xb0
Mar 3 23:46:35 googly kernel: [ 1220.152635] smp_apic_timer_interrupt+0x7a/0x130
Mar 3 23:46:35 googly kernel: [ 1220.152638] apic_timer_interrupt+0x7d/0x90
Mar 3 23:46:35 googly kernel: [ 1220.152639] </IRQ>
Mar 3 23:46:35 googly kernel: [ 1220.152642] RIP: 0010:cpuidle_enter_state+0xb4/0x2e0
Mar 3 23:46:35 googly kernel: [ 1220.152643] RSP: 0018:ffff9f128007fec0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
Mar 3 23:46:35 googly kernel: [ 1220.152644] RAX: ffff8ca1fecdf080 RBX: 0000011c16c23816 RCX: 000000000000001f
Mar 3 23:46:35 googly kernel: [ 1220.152645] RDX: 0000011c16c23816 RSI: 0000000020d18838 RDI: 0000000000000000
Mar 3 23:46:35 googly kernel: [ 1220.152646] RBP: ffff8ca1eadd3600 R08: 0000000000000172 R09: 0000000000000121
Mar 3 23:46:35 googly kernel: [ 1220.152647] R10: ffff9f128007fea0 R11: 0000000000000124 R12: 0000000000000002
Mar 3 23:46:35 googly kernel: [ 1220.152648] R13: 0000011c16ba4557 R14: 0000000000000000 R15: ffffffff8c8e2120
Mar 3 23:46:35 googly kernel: [ 1220.152651] do_idle+0x181/0x1e0
Mar 3 23:46:35 googly kernel: [ 1220.152653] cpu_startup_entry+0x5f/0x70
Mar 3 23:46:35 googly kernel: [ 1220.152655] start_secondary+0x18a/0x1b0
Mar 3 23:46:35 googly kernel: [ 1220.152658] secondary_startup_64+0xa5/0xb0
Mar 3 23:46:35 googly kernel: [ 1220.152660] Code: 8b 1d 06 62 86 01 48 85 db 74 1b 48 8b 03 48 8b 7b 08 48 83 c3 18 48 89 ee e8 ae 36 f0 00 48 8b 03 48 85 c0 75 e8 e9 b9 fb ff ff <0f> 0b e9 df fd ff ff 0f 0b e9 d9 fc ff ff 4c 89 ee 4c 89 e7 e8
Mar 3 23:46:35 googly kernel: [ 1220.152691] ---[ end trace 544ca9c607d57d54 ]---
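For anyone triaging a similar trace: the most useful single line is the WARNING header, which names the source file, line number and function. Below is a minimal Python sketch that pulls those fields out of a syslog line like the one quoted above. The regex and the `parse_warning` name are illustrative assumptions, not part of any kernel tool, and other kernel versions may format the header differently.

```python
import re

# Matches the "WARNING: CPU: n PID: n at file:line func+offset/size" header.
# Illustrative pattern only; written against the 4.14.22 line quoted above.
WARNING_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at "
    r"(?P<file>\S+):(?P<line>\d+) "
    r"(?P<func>\w+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)


def parse_warning(log_line):
    """Return a dict of fields from a kernel WARNING header, or None if absent."""
    match = WARNING_RE.search(log_line)
    if match is None:
        return None
    info = match.groupdict()
    # Convert the purely decimal fields to integers; offsets stay as hex strings.
    for key in ("cpu", "pid", "line"):
        info[key] = int(info[key])
    return info


sample = (
    "Mar 3 23:46:35 googly kernel: [ 1220.152562] WARNING: CPU: 3 PID: 0 "
    "at kernel/rcu/tree.c:2725 rcu_process_callbacks+0x47f/0x4a0"
)
print(parse_warning(sample))
```

The file and line (`kernel/rcu/tree.c:2725` here) are what to search for in bug trackers, since the function name alone returns few relevant hits.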

I didn't at first realise what was happening but then saw a trend - within a minute or 2 of clicking a download link, and while the file is downloading in the browser, the panic above occurs at the console and the download stops. It also appears that all disk access stops and it's almost impossible to run anything else. A shutdown proceeds only partially and I have to hard stop/reset the machine.

There were no changes to the OS directly before the issue started; however, I have been keeping my OS fairly up to date, with downloads and updates on an almost weekly basis.

slackware64 -current
athlon fx 6350 - kernel 4.14.20 and then updated to 4.14.22
nvidia gt1030 - binary 384.111 and then updated to 390.25
other running software: dolphin, thunderbird, sickrage, chrome, qbittorrent

The system had no updates in the week before the problem started, so there were quite a few days with no issue. I've tried to google this, but there's not much on rcu_process_callbacks panics that relates.

Hopefully someone can shed light on this ...

Regards, Robby

rpedrica · 03-03-2018, 04:26 PM

Additional information: the problem happens with wget at a console as well. I just tried downloading VirtualBox 5.2.8 with wget and it stopped about 3/4 of the way through the download. I then tried to take a screenshot and save it, which appeared to work, but after restarting the machine the saved screenshot is no longer on the drive. Another oddity is that when the problem happens, the machine's drive light comes on solid although there is no drive usage noise.

bassmadrigal · 03-03-2018, 04:46 PM

First thing I would do is run a check on the drive. It could be the start of the drive failing. Check your SMART data and run an fsck on the drive/partition.

abga · 03-03-2018, 05:59 PM

@rpedrica

If you cannot find anything faulty with your hard drive after following bassmadrigal's advice, then I'd suggest booting a live Linux image with an older kernel (pre-4.14.20), mounting your hard drive (a partition that is not used by the system would be ideal) and trying to replicate the issue you reported.
It might be a kernel bug:
https://bugs.debian.org/cgi-bin/bugr...cgi?bug=891467
https://bugzilla.kernel.org/show_bug.cgi?id=198861

rpedrica · 03-04-2018, 01:54 AM

Hi @abga and @bassmadrigal, Thanks for your responses.

So first, this machine is as it was last night - no issues overnight (with no downloading).

I had already run fsck on all drives as a first step and found nothing of interest. But I think your 2nd link above is pretty much spot on for what's happening here. This feels like it started after the 4.14.20 kernel update.

I've just checked smart and everything is looking good:

sda = 500GB wd black (passed)
sdb = 1TB wd black (passed, with one item of interest: 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 13)
sdc = SanDisk Ultra II 240GB - passed but I'm not sure I'm reading this one right:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   ---    Old_age   Always       -       14097
 12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       71
165 Total_Write/Erase_Count 0x0032   100   100   ---    Old_age   Always       -       1966098
166 Min_W/E_Cycle           0x0032   100   100   ---    Old_age   Always       -       0
167 Min_Bad_Block/Die       0x0032   100   100   ---    Old_age   Always       -       14
168 Maximum_Erase_Cycle     0x0032   100   100   ---    Old_age   Always       -       1
169 Total_Bad_Block         0x0032   100   100   ---    Old_age   Always       -       149
170 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
171 Program_Fail_Count      0x0032   100   100   ---    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   ---    Old_age   Always       -       0
173 Avg_Write/Erase_Count   0x0032   100   100   ---    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0032   100   100   ---    Old_age   Always       -       46
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   071   066   ---    Old_age   Always       -       29 (Min/Max 19/66)
199 SATA_CRC_Error          0x0032   100   100   ---    Old_age   Always       -       0
230 Perc_Write/Erase_Count  0x0032   100   100   ---    Old_age   Always       -       21474836485
232 Perc_Avail_Resrvd_Space 0x0033   100   100   004    Pre-fail  Always       -       100
233 Total_NAND_Writes_GiB   0x0032   100   100   ---    Old_age   Always       -       162
234 Perc_Write/Erase_Ct_BC  0x0032   100   100   ---    Old_age   Always       -       188
241 Total_Writes_GiB        0x0030   253   253   ---    Old_age   Offline      -       184
242 Total_Reads_GiB         0x0030   253   253   ---    Old_age   Offline      -       579
244 Thermal_Throttle        0x0032   000   100   ---    Old_age   Always       -       0
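When reading tables like this by eye, it can help to filter for the few attributes that hint at link or media trouble. Here is a rough Python sketch that splits attribute rows in the layout shown above and flags watched attributes with a non-zero raw value. The `WATCHED` set and both function names are my own choices for illustration; vendors name and scale attributes differently, so treat any hit as a prompt to look closer rather than a verdict. As a rule of thumb, a non-zero CRC error count is often associated with cabling or connector trouble rather than the media itself.

```python
# Attribute names picked for illustration only; not an authoritative list.
WATCHED = {
    "Reallocated_Sector_Ct",
    "UDMA_CRC_Error_Count",
    "SATA_CRC_Error",
    "Reported_Uncorrect",
    "Command_Timeout",
}


def parse_smart_row(row):
    """Split one attribute row into (id, name, raw_value), or None for non-rows."""
    fields = row.split()
    # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE...
    if len(fields) < 10 or not fields[0].isdigit():
        return None
    raw = fields[9]  # first token of RAW_VALUE; skips extras like "(Min/Max 19/66)"
    return int(fields[0]), fields[1], int(raw) if raw.isdigit() else raw


def flag_rows(rows):
    """Return watched attributes whose raw value is a non-zero integer."""
    flagged = []
    for row in rows:
        parsed = parse_smart_row(row)
        if parsed is None:
            continue
        _attr_id, name, raw = parsed
        if name in WATCHED and isinstance(raw, int) and raw > 0:
            flagged.append(parsed)
    return flagged


rows = [
    "ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE",
    "199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 13",  # sdb
    "5 Reallocated_Sector_Ct 0x0032 100 100 --- Old_age Always - 0",    # sdc
    "199 SATA_CRC_Error 0x0032 100 100 --- Old_age Always - 0",         # sdc
]
print(flag_rows(rows))  # [(199, 'UDMA_CRC_Error_Count', 13)]
```

Splitting on whitespace means the same code works whether the columns are aligned or collapsed to single spaces.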

Anyway, I'm going to find a bootable memory stick and test/confirm the issue. Does anyone here know where I can get a pre-4.14.20 kernel set (besides building it myself)? The last ISO I made of -current is from Dec 2017 and its kernel is 4.9.53 - I'd like to have something that at least includes Spectre/Meltdown support ... I also see that 4.14.23 has been released by PV ... maybe I should test that as well.

rpedrica · 03-04-2018, 06:44 AM

OK, I worked off Eric's liveslak for half an hour with no problems - that's with 4.14.18. Luckily I took a quick look at the dmesg output, and here is the culprit for the 1TB WD Black:

Quote:

[ 2.193018] ata2: SATA max UDMA/133 abar m1024@0xfe30b000 port 0xfe30b180 irq 19
[ 2.663322] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 2.664014] ata2.00: ATA-9: WDC WD1003FZEX-00MK2A0, 01.01A01, max UDMA/133
[ 2.664091] ata2.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
[ 2.664975] ata2.00: configured for UDMA/133
[ 776.176289] ata2.00: exception Emask 0x10 SAct 0xf800000 SErr 0x400000 action 0x6 frozen
[ 776.176292] ata2.00: irq_stat 0x08000000, interface fatal error
[ 776.176296] ata2: SError: { Handshk }
[ 776.176300] ata2.00: failed command: WRITE FPDMA QUEUED
[ 776.176310] ata2.00: cmd 61/00:b8:b0:b3:9f/06:00:13:00:00/40 tag 23 ncq dma 786432 out
[ 776.176313] ata2.00: status: { DRDY }
[ 776.176316] ata2.00: failed command: WRITE FPDMA QUEUED
[ 776.176325] ata2.00: cmd 61/00:c0:b0:b9:9f/0a:00:13:00:00/40 tag 24 ncq dma 1310720 ou
[ 776.176327] ata2.00: status: { DRDY }
[ 776.176329] ata2.00: failed command: WRITE FPDMA QUEUED
[ 776.176338] ata2.00: cmd 61/50:c8:b0:c3:9f/04:00:13:00:00/40 tag 25 ncq dma 565248 out
[ 776.176340] ata2.00: status: { DRDY }
[ 776.176342] ata2.00: failed command: WRITE FPDMA QUEUED
[ 776.176351] ata2.00: cmd 61/00:d0:00:c8:9f/0a:00:13:00:00/40 tag 26 ncq dma 1310720 ou
[ 776.176353] ata2.00: status: { DRDY }
[ 776.176355] ata2.00: failed command: WRITE FPDMA QUEUED
[ 776.176363] ata2.00: cmd 61/48:d8:00:d2:9f/04:00:13:00:00/40 tag 27 ncq dma 561152 out
[ 776.176404] ata2.00: status: { DRDY }
[ 776.176410] ata2: hard resetting link
[ 776.639238] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 776.640549] ata2.00: configured for UDMA/133
[ 776.640635] ata2: EH complete
[ 799.454308] ata2.00: exception Emask 0x10 SAct 0x3fc0 SErr 0x400000 action 0x6 frozen
[ 799.454312] ata2.00: irq_stat 0x08000000, interface fatal error
[ 799.454315] ata2: SError: { Handshk }
[ 799.454319] ata2.00: failed command: WRITE FPDMA QUEUED
[ 799.454329] ata2.00: cmd 61/10:30:f0:d0:a8/07:00:13:00:00/40 tag 6 ncq dma 925696 out
[ 799.454332] ata2.00: status: { DRDY }
[ 799.454335] ata2.00: failed command: WRITE FPDMA QUEUED
[ 799.454344] ata2.00: cmd 61/90:38:00:d8:a8/08:00:13:00:00/40 tag 7 ncq dma 1122304 ou
[ 799.454346] ata2.00: status: { DRDY }
[ 799.454349] ata2.00: failed command: WRITE FPDMA QUEUED
[ 799.454357] ata2.00: cmd 61/70:40:90:e0:a8/07:00:13:00:00/40 tag 8 ncq dma 974848 out
[ 799.454359] ata2.00: status: { DRDY }
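The hex fields in the "exception" lines above can be decoded rather than read as noise. SAct is a bitmask of the NCQ tags that were in flight, and SErr is a bitmask of link-level error flags. Here is a small Python sketch of that decoding. The function names are hypothetical, and the `SERR_NAMES` table is deliberately partial and written from my recollection of the names libata prints, so check it against the kernel source before relying on it.

```python
import re

# Partial, from-memory mapping of SError bit positions to the names libata prints.
# Bit 22 <-> "Handshk" is at least consistent with the log above (SErr 0x400000).
SERR_NAMES = {21: "BadCRC", 22: "Handshk"}

EXCEPTION_RE = re.compile(
    r"(ata\d+\.\d+): exception Emask (0x[0-9a-f]+) "
    r"SAct (0x[0-9a-f]+) SErr (0x[0-9a-f]+)"
)


def set_bits(mask):
    """Return the positions of the bits set in mask, lowest first."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


def decode_exception(line):
    """Decode SAct and SErr from a libata exception line; None if it is not one."""
    match = EXCEPTION_RE.search(line)
    if match is None:
        return None
    device, _emask, sact, serr = match.groups()
    return {
        "device": device,
        "ncq_tags": set_bits(int(sact, 16)),
        # Bits missing from the partial table are reported by position.
        "serr": [SERR_NAMES.get(b, f"bit{b}") for b in set_bits(int(serr, 16))],
    }


first = "[ 776.176289] ata2.00: exception Emask 0x10 SAct 0xf800000 SErr 0x400000 action 0x6 frozen"
second = "[ 799.454308] ata2.00: exception Emask 0x10 SAct 0x3fc0 SErr 0x400000 action 0x6 frozen"
print(decode_exception(first))
print(decode_exception(second))
```

For the first line this gives tags 23 to 27, which matches the five failed WRITE FPDMA QUEUED commands listed after it (tag 23 through tag 27). The same handshake flag repeating across different tags points at the link rather than at one bad sector.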

So possibly a drive, cable or drive interface issue. I'm going to change sata and power cables to see which.

rpedrica · 03-04-2018, 08:18 AM

Hmm, so the problem is solved - a good dust clean-out of the case and a swap of SATA cables between the 1TB WD Black and the 240GB SanDisk Ultra SSD, and no more errors. I'm going to put this down to a loose cable ...

@abga, thanks very much for those links - got me onto the right track!

Regards, Robby

abga · 03-04-2018, 06:23 PM

Always happy to help!

On your confusion about interpreting the SMART data: regrettably, none of the HDD manufacturers respect the SMART standard anymore. Instead they have their own "recipe" and "internal values" that only their diagnostic software can interpret. Furthermore, modern hard disks push technology and materials to their limits: they pack more data onto the same magnetic surface, and because of the constant internal errors the manufacturers have implemented internal wear leveling and error correction in firmware. The SMART field Reallocated Sector Count, which used to be a good indicator of a failing drive, doesn't tell you much today. If it does, you'll get some astronomical values that the manufacturer considers "normal" and still within the threshold.
Depending on how SMART was implemented in the hard disk, there is still a way to look for failures: check the detailed SMART error log, again, if there is one:
https://www.thomas-krenn.com/en/wiki...using_Smartctl
Or, run some self-tests:
https://www.thomas-krenn.com/en/wiki..._with_smartctl
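As a quick reference for the two suggestions above, here is a small Python sketch that only assembles the relevant smartctl invocations for a given device and prints them. The helper name `smart_commands` and the `/dev/sdb` device are assumptions for illustration. The commands are printed rather than executed because smartctl normally needs root, and a self-test should be started deliberately.

```python
import shlex


def smart_commands(device):
    """Build smartctl argument lists for common checks on one device."""
    return {
        "health": ["smartctl", "-H", device],              # overall health verdict
        "attributes": ["smartctl", "-A", device],          # vendor attribute table
        "error_log": ["smartctl", "-l", "error", device],  # detailed error log
        "short_test": ["smartctl", "-t", "short", device],        # start a short self-test
        "selftest_log": ["smartctl", "-l", "selftest", device],   # read self-test results
    }


# /dev/sdb is only an example device name; substitute the drive under suspicion.
for name, argv in smart_commands("/dev/sdb").items():
    print(f"{name:13s} {shlex.join(argv)}")
```

To actually run one of them, pass the list to `subprocess.run(argv, check=False)` as root. Keeping each command as a list avoids shell quoting problems.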

On your odd experience: I'm concerned that since the 4.14.20 kernel, any hardware-related issue with the hard disk will result in a kernel panic, which isn't really useful. The two patches that were submitted to resolve this issue have not yet been accepted; one has the status "not applicable" and one is "deferred":
https://bugzilla.kernel.org/show_bug.cgi?id=198861#c1

rpedrica · 03-05-2018, 01:13 AM

Agreed - it was not immediately obvious that the panic was hardware-related, although that probably should have been the first stop. But which hardware? It was only by luck, after booting a pre-4.14.20 kernel, that I saw the ATA errors. This will make troubleshooting more difficult.