Directory has become inaccessible

lucmove · 09-22-2020, 05:09 PM

I was downloading mail from all my mailboxes when the email application crashed just after downloading mail from one of the mailboxes.

Actually, the email application froze so I had to kill it.

I ran it again and checked the mailboxes again, and the email application froze again. And again. And again.

Upon investigation, I found that the directory where the messages of that mailbox are stored is inaccessible:

Code:

# /home/luc/Mail> ls -ls gmail1
[very long list of files]

# /home/luc/Mail> ls -ls gmail2
Killed

"Killed" is the actual output. I don't know what it means.

Note the #, i.e. it is inaccessible even as root.

SpaceFM (file manager) can open gmail1, but not gmail2. It says it's opening, but it takes forever and I give up. But right-clicking those directories and selecting Properties seems to work:

gmail1: 540MB, 11664 files, 14 folders
gmail2: 463MB, 6249 files, 7 folders

I also tried PCManFM and had the same results.

What is happening to that directory and how can I fix it?

TIA

dugan · 09-22-2020, 05:15 PM

Anything interesting in "dmesg" after "ls" gets killed?
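If the log is long, something like the sketch below can narrow it down. It assumes a kernel fault shows up with a "BUG: " or "Oops: " headline, as it usually does on x86; the function names are made up for illustration. As far as I know, the "[#N]" on the Oops line is the kernel's running count of oopses since boot, so a large N suggests earlier faults have already scrolled by.

```shell
# Keep only the headline lines of kernel oops reports read from stdin.
oops_summary() {
    grep -E 'BUG: |Oops: '
}

# Print the oops counter (the N in "[#N]") from the last Oops line on stdin.
oops_count() {
    sed -n 's/.*Oops: .*\[#\([0-9][0-9]*\)\].*/\1/p' | tail -n 1
}

# Usage (reading the kernel log may require root):
#   dmesg | oops_summary
#   dmesg | oops_count
```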

lucmove · 09-22-2020, 05:34 PM

Quote:

Originally Posted by dugan

Anything interesting in "dmesg" after "ls" gets killed?

Code:

[35335.564211] BUG: unable to handle kernel paging request at 000000000003f8ca
[35335.568435] IP: [<ffffffffa661c433>] __d_lookup_rcu+0x63/0x180
[35335.572625] PGD 0 
[35335.576780] Oops: 0000 [#37] SMP
[35335.581025] Modules linked in: [all my modules]
[35335.621144] task: ffff9b06d8a63100 task.stack: ffffb58142fec000
[35335.625414] RIP: 0010:[<ffffffffa661c433>]  [<ffffffffa661c433>] __d_lookup_rcu+0x63/0x180
[35335.625415] RSP: 0018:ffffb58142fefc60  EFLAGS: 00010202
[35335.625416] RAX: 0000000000000004 RBX: 000000000003f8ce RCX: ffffb5814001c000
[35335.625417] RDX: ffffb58142fefcc4 RSI: ffffb58142fefd90 RDI: ffff9b0512006180
[35335.625418] RBP: 0000000000000000 R08: ffffb58142fefcc4 R09: 0000000000000004
[35335.625419] R10: 000000006a3fda9f R11: 0000000000000000 R12: ffff9b0512006180
[35335.625420] R13: 000000046a3fda9f R14: ffff9b052e040024 R15: ffff9b07156e13e0
[35335.625421] FS:  00007fa192aa7f40(0000) GS:ffff9b071fb00000(0000) knlGS:0000000000000000
[35335.625422] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[35335.625423] CR2: 000000000003f8ca CR3: 0000000177ec8000 CR4: 00000000001406e0
[35335.625424] Stack:
[35335.625427]  ffffb58142fefd90 ffffb58142fefd80 0000000000000000 ffffb58142fefd80
[35335.625429]  0000000000000000 ffffb58142fefd18 ffff9b0512006180 ffffb58142fefd10
[35335.625434]  ffff9b07156e13e0 ffffffffa660d2e2 ffffb58142fefd0c ffffb58142fefd80
[35335.625434] Call Trace:
[35335.625449]  [<ffffffffa660d2e2>] ? lookup_fast+0x52/0x2e0
[35335.625451]  [<ffffffffa660e394>] ? walk_component+0x44/0x320
[35335.625454]  [<ffffffffa660f647>] ? path_lookupat+0x67/0x120
[35335.625456]  [<ffffffffa6612001>] ? filename_lookup+0xb1/0x180
[35335.625458]  [<ffffffffa65fdeca>] ? __check_object_size+0xfa/0x1d8
[35335.625462]  [<ffffffffa6756908>] ? strncpy_from_user+0x48/0x160
[35335.625464]  [<ffffffffa6611c3a>] ? getname_flags+0x6a/0x1e0
[35335.625466]  [<ffffffffa6606d49>] ? vfs_fstatat+0x59/0xb0
[35335.625467]  [<ffffffffa66072fd>] ? SYSC_newlstat+0x2d/0x60
[35335.625469]  [<ffffffffa660bdf2>] ? path_put+0x12/0x20
[35335.625472]  [<ffffffffa6629005>] ? path_getxattr+0x75/0xb0
[35335.625476]  [<ffffffffa6a0637b>] ? system_call_fast_compare_end+0xc/0x9b
[35335.625497] Code: 48 83 e3 fe 0f 84 92 00 00 00 4c 89 e8 45 89 ea 49 89 d0 48 c1 e8 20 48 89 34 24 49 89 fc 49 89 c1 eb 08 48 8b 1b 48 85 db 74 71 <8b> 6b fc 4c 3b 63 10 75 ef 48 83 7b 08 00 74 e8 83 e5 fe 41 f6 
[35335.625500] RIP  [<ffffffffa661c433>] __d_lookup_rcu+0x63/0x180
[35335.625500]  RSP <ffffb58142fefc60>
[35335.625501] CR2: 000000000003f8ca
[35335.625547] ---[ end trace 7c7d972ec4895c48 ]---

dugan · 09-22-2020, 05:48 PM

Uh, wow. I'd certainly say that counts as interesting...

I'd definitely run a memtest after seeing this, and not try to do any more debugging until I've actually ruled out bad RAM.

lucmove · 09-22-2020, 05:51 PM

What kind of test would you run and, assuming memory is bad, why does it affect that directory only?

dugan · 09-22-2020, 05:54 PM

https://www.memtest86.com/

This was the standard test for faulty RAM last time I checked, which admittedly was a while ago.

lucmove · 09-22-2020, 05:57 PM

assuming memory is bad, why does it affect that directory only?

dugan · 09-22-2020, 05:59 PM

Quote:

Originally Posted by lucmove

assuming memory is bad, why does it affect that directory only?

That question is unanswerable.

If it turns out that your memory is good, then "why does it affect that directory only" becomes extremely interesting.

dugan · 09-22-2020, 06:23 PM

You also need to scan the hard drive for bad sectors/bad blocks. That's definitely another potential cause, and it's more directly relevant to "why is this directory the only one affected?"

I'll just take the liberty of posting one link:

https://www.tecmint.com/check-linux-...rs-bad-blocks/

This is much more likely to be a hardware issue than to be a kernel bug (which, really, is the alternative explanation).
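As a quicker first look than a full surface scan, the drive's own SMART counters can be read. This is only a sketch: it assumes smartmontools is installed, that the disk is an ATA drive reporting the usual "Reallocated_Sector_Ct" attribute with the raw value in the tenth column of "smartctl -A" output, and /dev/sdX is a placeholder for the real device. The function name is hypothetical.

```shell
# Print the raw reallocated-sector count from `smartctl -A` output on stdin.
# Prints nothing if the drive does not report that attribute.
realloc_count() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Usage (as root; replace /dev/sdX with the actual disk):
#   smartctl -H /dev/sdX                    # overall health verdict
#   smartctl -A /dev/sdX | realloc_count    # non-zero is worth watching
```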

lucmove · 09-22-2020, 06:50 PM

Thanks. I am very familiar with badblocks. I don't like it because I once bought a new hard disk and decided to check it with badblocks, which reported thirty-odd bad blocks. I had the disk returned and replaced, and the new one had many bad blocks too! The vendor refused to replace it again, so I sucked it up, but I ended up using that hard disk for more than ten years without a single problem. I still have it and it works. I just hardly ever use it anymore because I bought much larger disks and outgrew it.

Now, I just rebooted and the directory is working normally. The mail application isn't freezing anymore either, all the messages are there. Nothing turns up in dmesg either. The problem seems to be gone.

I should just point out that the machine froze when I issued the reboot command and I had to hard reset it. I ran memtest and no errors were detected. Then I finally logged in again and everything seems normal.

I have no idea what happened.

MadeInGermany · 09-23-2020, 06:33 PM

With a bad disk block you should get an I/O error message. But you got a paging error, which has to do with virtual memory; bad RAM, for example, can cause it.
Do you have a swapfile? Was it manipulated while in use by the kernel?

Code:

free -m

scasey · 09-23-2020, 06:38 PM

Quote:

Originally Posted by lucmove

Now, I just rebooted and the directory is working normally. The mail application isn't freezing anymore either, all the messages are there. Nothing turns up in dmesg either. The problem seems to be gone.

I should just point out that the machine froze when I issued the reboot command and I had to hard reset it. I ran memtest and no errors were detected. Then I finally logged in again and everything seems normal.

I have no idea what happened.

My guess would be that the memory got corrupted when you aborted the copy. Rebooting flushed the memory.

lucmove · 09-24-2020, 01:01 AM

Quote:

Originally Posted by MadeInGermany

With a bad disk block you should get an I/O error message. But you got a paging error, which has to do with virtual memory; bad RAM, for example, can cause it.
Do you have a swapfile? Was it manipulated while in use by the kernel?

Code:

free -m

I haven't had a swap file or partition for almost 10 years. The system has been running smoothly all this time. If this problem was caused by lack of swap, it was the very first one.

MadeInGermany · 09-24-2020, 02:30 AM

Then I suspect a bug in the kernel, most likely the driver for your disk.
Look for updates!

Last but not least, there are possible hardware faults like a distortion on a power line, leading to random corruptions...

Once I encountered a batch of hard disks with faulty embedded SRAM cache. Very nasty: all kinds of hangs, malfunctions, and corruption occurred. Eventually we found a bit flip in a corrupted data file. We contacted the vendor, who had already suspected the bit-flip error and was examining it. We got new disks.

dugan · 09-24-2020, 04:31 PM

There are a lot of hardware issues that could have caused this. Overheating is another possibility.
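For the overheating angle, a rough way to read the current temperatures, assuming the kernel exposes thermal zones under /sys/class/thermal with values in millidegrees Celsius (not every board does, and the function name is just for illustration):

```shell
# Print each thermal zone's temperature in whole degrees Celsius.
# The optional argument overrides the sysfs base directory.
print_temps() {
    base=${1:-/sys/class/thermal}
    for f in "$base"/thermal_zone*/temp; do
        # Skip silently if the glob matched nothing or the file is unreadable.
        [ -r "$f" ] || continue
        milli=$(cat "$f")
        echo "$(basename "$(dirname "$f")"): $((milli / 1000)) C"
    done
}

# Usage:
#   print_temps
```

A reading taken while the machine is idle says little; it is more telling to check it under the same load that preceded the crash.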