Kernel and Ext3 Filesystem Errors at the same time -- difficult to diagnose.
Linux - ServerThis forum is for the discussion of Linux Software used in a server related context.
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
Kernel and Ext3 Filesystem Errors at the same time -- difficult to diagnose.
Thank you for reading.
Server is Slackware 12.2 using kernel 2.6.30.4 and working as a PDC Samba Server
4 gigs. of DDR2 800 RAM
Main HD with OS and user data is WDC 150 gig. Western Digital SATA II Velociraptor
Secondary HD holding archived backup is 640 gig. Western Digital SATA II.
We had storms in my area over the weekend and when I came in Monday morn. the server was down. The server is plugged into a nice APC UPS/surge protector so I wasn't too worried.
I started up the server and everything ran fine all day.
Monday evening after 5 pm when the employees left, I wanted to test whether I could recover data from my compressed backup archives.
This is when all the trouble began.
Everytime I tried to decompress my gzipped and/or bzipped archives, the process would fail. Example:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@a1:/archive/2009Oct19# ls -lh
total 11G
-rw-r--r-- 1 root root 11G 2009-10-19 21:46 mw.tar.bz2
root@a1:/archive/2009Oct19# tar -xjf mw.tar.bz2
tar: Skipping to next header
bzip2: Data integrity error when decompressing.
Input file = (stdin), output file = (stdout)
It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.
You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
tar: Child returned status 2
tar: Exiting with failure status due to previous errors
root@a1:/archive/2009Oct19#
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I tried to decompress a few other archives from other day's backups and all of them failed to complete, as well. Ugh, all my daily backups are useless.
So I decided to use scp to copy our live user data to another linux box in order to have at least one safe copy of the company's data.
Then I tried to write zeros to the secondary drive and was surprised to see this operation fail too:
root@a1:/# dd if=/dev/zero of=/dev/sdb bs=4096
Message from syslogd@a1 a1 kernel: Oops: 0002 [#2] SMP
Message from syslogd@a1
a1 kernel: last sysfs file: /sys/devices/virtual/block/ram9/range
Message from syslogd@a1
a1 kernel: Process cp (pid: 4783, ti=e0634000 task=f699f660 task.ti=e0634000)
Message from syslogd@a1
a1 kernel: EIP: [<c10f431b>] bit_spin_lock+0xf/0x1a SS:ESP 0068:e0635ab0
Message from syslogd@a1
a1 kernel: CR2: 0000000000000000
AND THE SERVER WAS FROZEN REQUIRING POWER OFF/ON. A few days later after successfully writing zeros to the first 10 megs of /dev/sdb , I figured I'd try reinstall the ext3 filesystem.
More failure:
oot@a1:/# mkfs -t ext3 /dev/sdb1
mke2fs 1.41.3 (12-Oct-2008)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
39075840 inodes, 156282320 blocks
7814116 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=0
4770 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000
Writing inode tables: 189/4770
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ... a1 kernel: Oops: 0000 [#1] SMP
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:11.0/host1/target1:0:0/1:0:0:0/block/sdb/size
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: Process kswapd0 (pid: 357, ti=f71fc000 task=f71b60a0 task.ti=f71fc000)
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: Stack:
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: 00000024 f1c07f4c c1081282 f71fdeac f1c07e9c 00000000 c1081683 0000000e
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: 00000018 ffffffff 00000000 00000000 00000000 c2a210e0 c2a210c0 f1c05bf8
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: Call Trace:
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: [<c107bfc0>] ? find_get_pages+0x50/0x7b
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: [<c1081282>] ? pagevec_lookup+0x17/0x1d
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: [<c1081683>] ? __invalidate_mapping_pages+0xdb/0xf0
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: [<c101a329>] ? common_interrupt+0x29/0x30
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: [<c140ade3>] ? _spin_lock+0x5/0x7
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: [<c108169f>] ? invalidate_mapping_pages+0x7/0x9
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: [<c10ac3ab>] ? shrink_icache_memory+0xc6/0x1a2
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: [<c1083308>] ? shrink_slab+0xdc/0x139
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: [<c108395f>] ? kswapd+0x2fe/0x3f5
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: [<c1081db2>] ? isolate_pages_global+0x0/0x186
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: [<c104fb30>] ? autoremove_wake_function+0x0/0x2b
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: [<c10379ed>] ? complete+0x28/0x36
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: [<c1083661>] ? kswapd+0x0/0x3f5
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: [<c104fa6d>] ? kthread+0x42/0x69
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: [<c104fa2b>] ? kthread+0x0/0x69
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: [<c101a6e7>] ? kernel_thread_helper+0x7/0x10
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: Code: 5f c3 8b 10 83 c8 ff f6 c2 01 0f 44 c2 c3 53 89 c1 8b 59 48 f6 c3 01 74 04 f3 90 eb f4 8b 41 40 8b 51 44 39 59 48 75 e9 5b c3 53 <8b> 50 04 8d 48 04 85 d2 74 13 8d 5a 01 89 d0 f0 0f b1 19 39 d0
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: EIP: [<c1079f8e>] page_cache_get_speculative+0x1/0x29 SS:ESP 0068:f71fde58
Message from syslogd@a1 at Wed Oct 21 22:53:49 2009 ...
a1 kernel: CR2: 0000000002000004
190/4770
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
At this point, I'm getting numb. Can't figure out what this is all about.
/var/log/syslog timestamps show the kernel definitely not happy:
Oct 21 22:53:49 a1 kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:11.0/host1/target1:0:0/1:0:0:0/block/sdb/size
Oct 21 22:53:49 a1 kernel: Modules linked in:
Oct 21 22:53:49 a1 kernel:
Oct 21 22:53:49 a1 kernel: Pid: 357, comm: kswapd0 Not tainted (2.6.30.4 #2) GA-MA78GM-S2H
Oct 21 22:53:49 a1 kernel: EIP: 0060:[<c1079f8e>] EFLAGS: 00010213 CPU: 0
Oct 21 22:53:49 a1 kernel: EIP is at page_cache_get_speculative+0x1/0x29
Oct 21 22:53:49 a1 kernel: EAX: 02000000 EBX: 02000000 ECX: c2a210c4 EDX: 02000000
Oct 21 22:53:49 a1 kernel: ESI: 00000002 EDI: 00000002 EBP: f71fdeac ESP: f71fde58
Oct 21 22:53:49 a1 kernel: DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Oct 21 22:53:49 a1 kernel: Process kswapd0 (pid: 357, ti=f71fc000 task=f71b60a0 task.ti=f71fc000)
Oct 21 22:53:49 a1 kernel: Stack:
Oct 21 22:53:49 a1 kernel: 02000000 c107bfc0 f1c07f50 0000000e 00000000 00000003 00000000 f71fdea4
Oct 21 22:53:49 a1 kernel: 00000024 f1c07f4c c1081282 f71fdeac f1c07e9c 00000000 c1081683 0000000e
Oct 21 22:53:49 a1 kernel: 00000018 ffffffff 00000000 00000000 00000000 c2a210e0 c2a210c0 f1c05bf8
Oct 21 22:53:49 a1 kernel: Call Trace:
Oct 21 22:53:49 a1 kernel: [<c107bfc0>] ? find_get_pages+0x50/0x7b
Oct 21 22:53:49 a1 kernel: [<c1081282>] ? pagevec_lookup+0x17/0x1d
Oct 21 22:53:49 a1 kernel: [<c1081683>] ? __invalidate_mapping_pages+0xdb/0xf0
Oct 21 22:53:49 a1 kernel: [<c101a329>] ? common_interrupt+0x29/0x30
Oct 21 22:53:49 a1 kernel: [<c140ade3>] ? _spin_lock+0x5/0x7
Oct 21 22:53:49 a1 kernel: [<c108169f>] ? invalidate_mapping_pages+0x7/0x9
Oct 21 22:53:49 a1 kernel: [<c10ac3ab>] ? shrink_icache_memory+0xc6/0x1a2
Oct 21 22:53:49 a1 kernel: [<c1083308>] ? shrink_slab+0xdc/0x139
Oct 21 22:53:49 a1 kernel: [<c108395f>] ? kswapd+0x2fe/0x3f5
Oct 21 22:53:49 a1 kernel: [<c1081db2>] ? isolate_pages_global+0x0/0x186
Oct 21 22:53:49 a1 kernel: [<c104fb30>] ? autoremove_wake_function+0x0/0x2b
Oct 21 22:53:49 a1 kernel: [<c10379ed>] ? complete+0x28/0x36
Oct 21 22:53:49 a1 kernel: [<c1083661>] ? kswapd+0x0/0x3f5
Oct 21 22:53:49 a1 kernel: [<c104fa6d>] ? kthread+0x42/0x69
Oct 21 22:53:49 a1 kernel: [<c104fa2b>] ? kthread+0x0/0x69
Oct 21 22:53:49 a1 kernel: [<c101a6e7>] ? kernel_thread_helper+0x7/0x10
Oct 21 22:53:49 a1 kernel: Code: 5f c3 8b 10 83 c8 ff f6 c2 01 0f 44 c2 c3 53 89 c1 8b 59 48 f6 c3 01 74 04 f3 90 eb f4 8b 41 40 8b 51$
Oct 21 22:53:49 a1 kernel: EIP: [<c1079f8e>] page_cache_get_speculative+0x1/0x29 SS:ESP 0068:f71fde58
Oct 21 22:53:49 a1 kernel: CR2: 0000000002000004
Oct 21 22:53:49 a1 kernel: ---[ end trace bfac5149c665b21a ]---
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
And to make a trifecta, I wondered if at this point I could even copy the live data-set on the main drive to a different directory on the same drive. No! --
root@a1:/home/foo# cp -r /mw /home/foo/2009Oct21/
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ... a1 kernel: Oops: 0002 [#2] SMP
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: last sysfs file: /sys/devices/virtual/block/ram9/range
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: Process cp (pid: 4783, ti=e0634000 task=f699f660 task.ti=e0634000)
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: Stack:
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: 00000000 e0635cc0 c29dd880 f25a854c 001200d2 e0635cc0 c1079fee c29dd880
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: f25a854c 00000001 c1082a42 e0635b54 00000000 e0635c30 010e2bbe 00000000
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: Call Trace:
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c10f4356>] ? journal_grab_journal_head+0xb/0x28
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c10f0917>] ? journal_try_to_free_buffers+0x49/0x16a
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1a1 kernel: [<c1082a42>] ? shrink_page_list+0x3b4/0x4fekernel: [<c1079fee>] ? try_to_release_page+0x38/0x4c
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c1081e4b>] ? isolate_pages_global+0x99/0x186
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c1082d80>] ? shrink_list+0x1f4/0x417
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c10e2aac>] ? ext3_mark_iloc_dirty+0x2a3/0x30c
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1a1 kernel: [<c10e3a2d>] ? ext3_get_blocks_handle+0x78d/0x7a9kernel: [<c10eddcf>] ? __ext3_journal_dirty_metadata+0x13/0x32
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c140ae5a>] ? _spin_lock_irq+0x12/0x15
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c10831b4>] ? shrink_zone+0x211/0x289
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c10834e1>] ? try_to_free_pages+0x17c/0x292
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c1081db2>] ? isolate_pages_global+0x0/0x186
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c107ea66>] ? __alloc_pages_internal+0x1f0/0x34b
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c107a054>] ? __page_cache_alloc+0xe/0x10
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c107c16a>] ? grab_cache_page_write_begin+0x4d/0x87
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c10e4965>] ? ext3_write_begin+0x67/0x19e
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c107b0f3>] ? generic_file_buffered_write+0xe1/0x272
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c107b68e>] ? __generic_file_aio_write_nolock+0x262/0x28c
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c10afca3>] ? mnt_want_write+0x43/0x4b
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c107bb4a>] ? generic_file_aio_read+0x492/0x4d9
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c107bdde>] ? generic_file_aio_write+0x5b/0xb2
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c10e14e1>] ? ext3_file_write+0x19/0x83
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c109e0c1>] ? do_sync_write+0xbe/0x105
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c10a0c02>] ? cp_new_stat64+0xc5/0xd7
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c104fb30>] ? autoremove_wake_function+0x0/0x2b
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c116ba89>] ? security_file_permission+0xc/0xd
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c109e003>] ? do_sync_write+0x0/0x105
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c109e76f>] ? vfs_write+0x73/0xa1
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c109e831>] ? sys_write+0x3c/0x60
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: [<c1019d4d>] ? syscall_call+0x7/0xb
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: EIP: [<c10f431b>] bit_spin_lock+0xf/0x1a SS:ESP 0068:e0635ab0
Message from syslogd@a1 at Wed Oct 21 23:02:21 2009 ...
a1 kernel: CR2: 0000000000000000
Killed
root@a1:/home/foo#
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Now it's Thurs. Oct. 22 and the server has been behaving strangely. Several times a day users will suddenly lose authentication with the samba server and not be able to access shares and data stores used by an accounting application.
What is the best course of action?
Do I simply need to do a fresh re-install?
Does this appear to be a hardware issue?
filesystem issue?
kernel issue?
I've been running a samba server on Slackware boxes since version 8 or 9 and I've never seen anything like this.
Any insights or guidance very much appreciated.
I don't see any evidence of a fsck or a memory check.
Did the UPS gracefully shutdown the server or did it fade out and (effectively) pull the rug out from under the server(s) ?. The oops on mkfs and problems whilst subsequently running would worry me - a lot.
I'd want to run tests for days to eliminate hardware - got a spare server or two ?.
I don't see any evidence of a fsck or a memory check.
Did the UPS gracefully shutdown the server or did it fade out and (effectively) pull the rug out from under the server(s) ?.
That's what I was wondering, but I may never know at this point. I don't have linux-based monitoring software that works with my UPS.
Quote:
Originally Posted by syg00
The oops on mkfs and problems whilst subsequently running would worry me - a lot. I'd want to run tests for days to eliminate hardware - got a spare server or two ?.
I should have posted about that but I thought I was already scaring too many away with the length of the post. Since I was unable to write zeros or build ext3 filesystem on /dev/sdb storage drive, I disconnected it and commented "out" the device mapping in /etc/fstab.
After disconnecting the storage drive, I rebooted the server using the Slack 12.2 Install Disc 1, and ran fsck /dev/sda2 on / . There were many fixes performed on superblocks, inodes, blocks, "multiply-shared" this and that --- too much to write down under time constraints. Yesterday (Friday), after all employees left, I did the same fsck 5 times in a row and it finishes in 1 or 2 seconds stating "Clean" and listed the number blocks and sectors on /dev/sda2 / (root filesystem).
As I write today (Saturday morn.), I'm ssh'ed into the server and thus far completing a successful tar -cvjf /home/foo/2009Oct24/samba_serv.tar.bz2 samba_dir/ .
It would appear everything is running normally again, but I have no idea why. Most would say, "It's obviously a problem with the storage drive." But I'm still left wondering why the fsck on the / (root partition on main drive) demonstrated a need to fix many errors there too. And why all the kernel errors too? I may never know.
But thanks Syg00, I really appreciate you taking the time to review the problem. If there are any other tests you think may be useful, please let me know.
Best regards.
I don't have linux-based monitoring software that works with my UPS.
I'd presume that meant a hard crash - especially if there was any work (I/O) going on.
Quote:
... ran fsck /dev/sda2 on / . There were many fixes performed on superblocks, inodes, blocks, "multiply-shared" this and that ...
That spells a broken filesystem or two. Unfortunately you'll never know how broken. Lots can be recovered by fsck, other things like cross linked chains are merely truncated, and the evidence tossed away (if you tell fsck to proceed). I don't even back the data up after such an event, because you can't have any confidence that all the data inode chains are valid.
Rebuild from backup would be my normal answer, but after a period of days that may be untenable.
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.