[SOLVED] Recurring Kernel Problem

Ellster · 10-26-2020, 06:55 AM

Hi all,

I'm dealing with a recurring problem and I just thought someone might be able to tell me what sensible next steps would be to figure out what's causing it.

A few weeks ago I got a Lenovo SL500 and installed Debian 9 with KDE. Around a week ago when I woke it up from standby it gave me the standard login screen and I logged in like normal. Then an error flashed up (too fast to read) and after that I only got a black screen. When I booted it up again, the GUI wouldn't load, and after some googling and experimentation I figured out the hard drive was full. I booted it with a Puppy disc and found that the kernel log was pretty big, chalked it up to a kernel error and reinstalled Debian.

Now on Debian 10 it's been working well for the past week, until yesterday, while I was doing some work for uni, a notice popped up that my drive was almost full. It was late, so I just saved and closed everything, and when it wouldn't let me call up the shutdown menu, I forced a shutdown. Now when I boot it up (big surprise) it won't boot into the GUI.

I don't have a problem just reinstalling Debian again, but if I have to do it every week it's going to get pretty time consuming, so I'd like to figure out what's causing this. If I recall correctly, the first time I looked through the drive with Puppy Linux there was also around 30GB of used space unaccounted for, but I kind of just stopped looking once I saw the full kernel log. I did try to read the kernel log, but after an hour of stuff scrolling by too fast for me to read, with no end in sight, I just gave up.

So, what should I do / look for now / after I reinstall Debian to figure out what's causing this?

Thanks everyone!

wpeckham · 10-26-2020, 09:03 AM

How about if you STOP reinstalling and actually address the problem?

One way to do that (my way; there may well be a better one):
Boot with a live CD and examine that log for the issue. Once you have that determined, clean up the log space so you can boot normally.
Then check for logrotate. If it is installed, it needs configuration; if not, install and configure it. That solves the space issue going forward.

If you record the specific errors or messages that have been filling the log, post them here and we can see if we can help you deal with that.

I will be watching for any update. Please let me know what you find.
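
For the clean-up step, here is a minimal sketch. It assumes the installed system's root partition is mounted from the live CD and that there is somewhere outside the full partition to keep an excerpt. The function name `trim_log` and both paths in the usage comment are made up for illustration.

```shell
# Save the last N lines of an oversized log to another location,
# then empty the original file to free the space.
# The log is only emptied if saving the excerpt succeeded.
trim_log() {
    log="$1"            # the oversized log file
    dest="$2"           # where to keep the excerpt (not on the full partition)
    keep="${3:-1000}"   # number of trailing lines to keep
    tail -n "$keep" "$log" > "$dest" && : > "$log"
}

# Example (hypothetical paths):
# trim_log /mnt/root/var/log/messages.1 /tmp/messages.tail 2000
```

The excerpt goes somewhere other than the full partition because a write to a full disk would fail. It is best to empty the file only after you have seen enough of it to know what is filling it.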

sgosnell · 10-26-2020, 09:43 AM

That is not a kernel error. Either your root partition is too small, or something is writing a huge amount of data to files. It could be log files, or it could be something else. Backup files written to the internal drive can take a lot of space, and backups should not be done to the root drive. An incorrect mount to /media can cause files to be written to the root drive instead of to an external drive. There are multiple possibilities, and you need to sort out the cause instead of just reinstalling.

When you boot from the Puppy drive, you can read the logs with a text editor, or pipe the output through more to read them at your leisure if you want to use the terminal. In addition to wpeckham's good advice, check the drive for directories that might be larger than normal. Some obvious places to start are /media, /var, and /usr. There should be no files in /media in most cases, just mountpoints for external drives. If all USB drives are removed and there are still files in /media, you may have found the problem.

All this can take time and effort, but it's worth it to solve the problem. Reinstalling every time it happens will never solve it.
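
To check directory sizes from the live system, here is a minimal sketch, assuming `du`, `sort`, and `head` are available there. The name `biggest_dirs` and the mount point in the usage comment are made up for illustration.

```shell
# List the largest entries directly under a directory, biggest first.
# Sizes are in kilobytes; the first output line is the total for the directory itself.
# -x keeps du on one filesystem, so mounted external drives are not counted.
biggest_dirs() {
    dir="${1:-/}"        # directory to inspect
    count="${2:-15}"     # how many lines to show
    du -x -k -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Example (hypothetical mount point of the installed system):
# biggest_dirs /mnt/root/var
```

Running it on the top level first and then on the largest subdirectory narrows things down quickly.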

Ellster · 10-26-2020, 11:38 AM

Much thanks!

I know that my habit of just reinstalling whenever I get stuck or frustrated isn't the best approach, but I mostly work by trying things out until they work, and I manage to accidentally shoot my system to bits a lot along the way, which is what I thought had happened the first time. Only when it showed up again now, when I hadn't done any experimenting since the latest reinstall, did I realize there is a bigger problem here.

Anyway, I booted with puppy now and the big files seem to be var/log/kernel.log.1 and var/log/messages.1 with around 30GB each, which still leaves some 30GB unaccounted for. I'm currently trying to read those files, but they're still loading.

But I was just told it might be because my installation is not SSD optimized. I have never worked with an SSD before and therefore didn't consider this. Might that be the problem?

sgosnell · 10-26-2020, 12:07 PM

Something is wrong somewhere. There should not be 60GB of log files.

wpeckham · 10-26-2020, 06:55 PM

Quote:

Originally Posted by Ellster

But I was just told it might be because my installation is not SSD optimized. I have never worked with an SSD before and therefore didn't consider this. Might that be the problem?

It is faintly possible. What is the filesystem format? Are you using EXT4, BTRFS, XFS, or something else?

beachboy2 · 10-27-2020, 10:11 AM

Ellster,

As wpeckham has mentioned, once you have discovered exactly what is causing this problem, you can progress to setting up logrotate.

Scroll down to “This is an old question...” on:
https://stackoverflow.com/questions/...files/35658810

You could also install and then run ncdu:
https://www.binarytides.com/check-di...ge-linux-ncdu/

ondoho · 10-27-2020, 04:17 PM

Quote:

Originally Posted by Ellster

Anyway, I booted with puppy now and the big files seem to be var/log/kernel.log.1 and var/log/messages.1 with around 30GB each, which still leaves some 30GB unaccounted for. I'm currently trying to read those files, but they're still loading.

Yes, you need to look at these files.
I hope you are not trying this in a GUI editor...
Try

Code:

less var/log/kernel.log.1
# or
more var/log/kernel.log.1

instead.
Even so it will probably take many minutes.
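
Since paging through 30GB takes a long time, another option is to summarize the file instead of reading it. This is a sketch, assuming the repeating entries include a `WARNING: ... at <file>:<line>` line, as kernel warnings normally do. The name `warning_sources` is made up.

```shell
# Count how often each distinct "WARNING: ... at <file>:<line>" source
# appears in a log file, most frequent first.
warning_sources() {
    grep -o 'WARNING: .* at [^ ]*' "$1" \
        | sed 's/.* at //' \
        | sort | uniq -c | sort -rn
}

# Example:
# warning_sources var/log/kernel.log.1 | head
# To look at just the end of the file without opening all of it:
# tail -n 200 var/log/kernel.log.1
```

This still has to read the whole file once, so it is not instant, but it does not try to hold the file in memory the way an editor would.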

Ellster · 10-28-2020, 10:18 AM

Thanks all!
Life got in the way the last few days, but now logrotate is set up.

Quote:

Originally Posted by wpeckham

It is faintly possible. What is the filesystem format? Are you using EXT4, BTRFS, XFS, or something else?

Ext4

Also, I had a look into the logs, and while there was a lot, most entries seemed to be roughly one of these two warnings:

Code:

WARNING: CPU: 0 PID: 0 at drivers/mtd/nand/raw/r852.c:746 r852_irq.cold.25+0xc/0x13 $
Modules linked in: rfcomm fuse ctr ccm arc4 bnep ath5k btusb btrtl btbcm btintel ath$
i2c_algo_bit drm_kms_helper sdhci_pci cqhci sdhci drm uhci_hcd ehci_pci scsi_mod mm$
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W         4.19.0-12-amd64 #1 Debian $
Hardware name: LENOVO                         2746F2G/2746F2G   , BIOS 6AET58WW 05/2$
RIP: 0010:r852_irq.cold.25+0xc/0x13 [r852]
Code: f0 ff ff 48 c7 c7 e0 41 ac c0 89 74 24 04 e8 c7 f3 c1 cb 0f 0b 8b 74 24 04 e9 $
RSP: 0018:ffff9aa2fb803ef0 EFLAGS: 00010046
RAX: 0000000000000024 RBX: ffff9aa2f8856300 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffff9aa2fb8166b0
RBP: ffff9aa2f88563dc R08: 000000000000044a R09: 0000000000000004
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000086
R13: ffff9aa2fb803f5c R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff9aa2fb800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc38068e000 CR3: 000000005d60a000 CR4: 00000000000406f0
Call Trace:
 <IRQ>
 __handle_irq_event_percpu+0x46/0x190
 handle_irq_event_percpu+0x30/0x80
 handle_irq_event+0x3c/0x5c
 handle_fasteoi_irq+0xa3/0x160
 handle_irq+0x1f/0x30
 do_IRQ+0x49/0xe0
 common_interrupt+0xf/0xf
 </IRQ>
RIP: 0010:cpuidle_enter_state+0xb6/0x320
Code: 90 31 ff e8 dc b3 b0 ff 80 7c 24 0b 00 74 17 9c 58 66 66 90 66 90 f6 c4 02 0f 85 3b 02 00 00 31 ff e8 0e a6 b$
RSP: 0018:ffffffff8d603e70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd8
RAX: ffff9aa2fb8220c0 RBX: 000000be1aa63efc RCX: 000000be1aa63efc
RDX: 000000be1aa63efc RSI: 000000be1aa63e88 RDI: 0000000000000000
RBP: ffff9aa2fa494800 R08: ffffffffffc2f723 R09: 0000000000021980
R10: 000000009f80bc26 R11: ffff9aa2fb8210a8 R12: 0000000000000003
R13: ffffffff8d6bbd18 R14: 0000000000000003 R15: 0000000000000000
 do_idle+0x228/0x270
 cpu_startup_entry+0x6f/0x80
 start_kernel+0x507/0x52a
 secondary_startup_64+0xa4/0xb0
---[ end trace 36debad57a9a426e ]---
------------[ cut here ]------------

Code:

WARNING: CPU: 0 PID: 213 at drivers/mtd/nand/raw/r852.c:746 r852_irq.cold.25+0xc/0x13 [r852]
Modules linked in: rfcomm fuse ctr ccm arc4 bnep ath5k btusb btrtl btbcm btintel ath bluetooth snd_hda_codec_hdmi s$
 i2c_algo_bit drm_kms_helper sdhci_pci cqhci sdhci drm uhci_hcd ehci_pci scsi_mod mmc_core psmouse firewire_ohci eh$
CPU: 0 PID: 213 Comm: systemd-journal Tainted: G        W         4.19.0-12-amd64 #1 Debian 4.19.152-1
Hardware name: LENOVO                         2746F2G/2746F2G   , BIOS 6AET58WW 05/29/2009
RIP: 0010:r852_irq.cold.25+0xc/0x13 [r852]
Code: f0 ff ff 48 c7 c7 e0 41 ac c0 89 74 24 04 e8 c7 f3 c1 cb 0f 0b 8b 74 24 04 e9 16 f2 ff ff 48 c7 c7 e0 41 ac c$
RSP: 0018:ffff9aa2fb803ef0 EFLAGS: 00010046
RAX: 0000000000000024 RBX: ffff9aa2f8856300 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9aa2fb8166b8 RDI: ffff9aa2fb8166b8
RBP: ffff9aa2f88563dc R08: 0000000000000472 R09: 0000000000000004
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000086
R13: ffff9aa2fb803f5c R14: 0000000000000000 R15: 0000000000000000
FS:  00007fc382af2940(0000) GS:ffff9aa2fb800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc380691000 CR3: 0000000076552000 CR4: 00000000000406f0
Call Trace:
 <IRQ>
 __handle_irq_event_percpu+0x46/0x190
 handle_irq_event_percpu+0x30/0x80
 handle_irq_event+0x3c/0x5c
 handle_fasteoi_irq+0xa3/0x160
 handle_irq+0x1f/0x30
 do_IRQ+0x49/0xe0
 common_interrupt+0xf/0xf
 </IRQ>
RIP: 0010:___bpf_prog_run+0x25b/0xf20
Code: 43 01 48 0f bf 53 02 48 83 c3 08 48 89 c1 c0 e8 04 83 e1 0f 0f b6 c0 48 8b 4c cd 00 48 8b 44 c5 00 88 04 11 e$
RSP: 0018:ffffbb198054bce8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd8
RAX: ffffffff8c7850fa RBX: ffff9aa2f30eed40 RCX: 0000000000000000
RDX: 0000000040000000 RSI: 0000000000000095 RDI: 000000007fff0000
RBP: ffffbb198054bd28 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffffffff8d23b700 R14: 0000000000000000 R15: 0000000000000000
 ? ___bpf_prog_run+0x25a/0xf20
 ? __bpf_prog_run32+0x39/0x60
 ? seccomp_run_filters+0x5c/0xb0
 ? generic_update_time+0xb6/0xd0
 ? file_update_time+0xed/0x130
 ? __seccomp_filter+0x44/0x4a0
 ? __handle_mm_fault+0xdcf/0x11f0
 ? syscall_trace_enter+0x192/0x2b0
 ? do_syscall_64+0xf0/0x110
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace 36debad57a9a426f ]---
------------[ cut here ]------------

From what I found online, it seems a card reader might be at fault, which I don't need anyway, so I would just disable it now.
But since I don't actually know anything I'll wait for what you think before I make a mess of things again.
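
If the BIOS offers no switch for the card reader, one software-side option is to stop the `r852` module named in the warnings from loading automatically. This is a minimal sketch, assuming a standard modprobe setup. The function name and the `.conf` file name are arbitrary choices, not anything from the logs.

```shell
# Write a modprobe blacklist entry for a kernel module into the given file.
# A blacklisted module is no longer loaded automatically for matching hardware.
blacklist_module() {
    module="$1"   # module name, e.g. r852
    conf="$2"     # target file, normally under /etc/modprobe.d/
    printf 'blacklist %s\n' "$module" > "$conf"
}

# Example, as root (hypothetical file name):
# blacklist_module r852 /etc/modprobe.d/blacklist-r852.conf
```

On Debian, running `update-initramfs -u` afterwards and rebooting should make sure the change is picked up early in boot as well.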

wpeckham · 10-29-2020, 07:07 AM

That does not enlighten me, so I will wait with you.

ondoho · 10-31-2020, 04:32 AM

If that is what is causing your troubles.

Consider the age of the kernel version vs. the age of the hardware: for the kernel to support it, the hardware must be significantly older than the kernel!
Debian Stable is very stable, but also very conservative (some might say outdated).

In other words, a backported kernel might recognize new hardware.

Also, the log file tells you which kernel modules and which hardware are involved, so you can pinpoint this a little better. Start with:

Code:

lspci -vv
dpkg -L <insert:current_kernel_version>

And see if you find things that relate to the first lines of that recurring log entry.

Ellster · 11-09-2020, 08:08 AM

Thanks!

The very last lines I got from that were:

Code:

Subsystem: Lenovo xD-Picture Card Controller
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin B routed to IRQ 17
        Region 0: [virtual] Memory at febfe800 (32-bit, non-prefetchable) [size=256]
        Capabilities: [80] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=2 PME-
        Kernel driver in use: r852
        Kernel modules: r852

Since the log contained a warning about the r852 driver, I'm thinking this is related, which would point at the card reader again?

The dpkg command returned a "package not installed" message. I used the current kernel version as determined with uname -r. Not sure if I did something wrong here?
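
A bare version string is not a package name, which is the likely reason for the "not installed" message. On Debian the kernel image package is called `linux-image-` followed by the release string from `uname -r`. Here is a small sketch of that; the helper name `kernel_pkg` is made up.

```shell
# Build the Debian kernel image package name from a kernel release string.
# Defaults to the running kernel's release as reported by uname -r.
kernel_pkg() {
    printf 'linux-image-%s\n' "${1:-$(uname -r)}"
}

# List the files that package installed and look for the r852 module:
# dpkg -L "$(kernel_pkg)" | grep r852
```

With the release from the log above, `kernel_pkg 4.19.0-12-amd64` gives `linux-image-4.19.0-12-amd64`, which is the name to pass to `dpkg -L`.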

ondoho · 11-10-2020, 12:52 AM

No no no no no, that's not what I meant at all.
Please re-read my instructions. It's not well structured, each sentence stands for itself, some of it refers to output you already provided etc.
In any case, posting only the last lines of output, without saying which command produced them, is pointless.

Ellster · 11-16-2020, 07:54 AM

Apologies. Life has been busy so I posted a bit hastily last time.

The above readout was what I got from the lspci command. I also checked the rest of that readout but couldn't identify any other paragraph that mentioned something from the logs. That one just immediately jumped out at me as the last on the page (so first visible) and because of the r852 driver also mentioned in the logs.

It is possible that this is an age issue, since it's a fairly new computer (I had only worked with hardware that was at least 10 years old before, so I wasn't aware until now that this might be an issue), so I'll be looking into backports now.

Anyway I have now disabled the card reader through BIOS and it seems to have stopped the overflow.

Thank you all for your help!