LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Slackware (http://www.linuxquestions.org/questions/slackware-14/)
-   -   Slackware64: cp from JFS volume to NFS share = jfsCommit: page allocation failure. (http://www.linuxquestions.org/questions/slackware-14/slackware64-cp-from-jfs-volume-to-nfs-share-%3D-jfscommit-page-allocation-failure-757168/)

granth 09-23-2009 12:48 AM

Slackware64: cp from JFS volume to NFS share = jfsCommit: page allocation failure.
 
I recently upgraded to Slackware64 and I decided to connect it to my Solaris 10 NFS server. I'm using the default NFSv3 options and everything appears to be mounted properly, etc. I wanted to test the performance over my new gigabit switch, so I started with some simple tests.

Using KDE4's file manager, I got 25MiB/s over gigabit ether, copying a 5.4GB folder. This seemed decent; nothing superb, but acceptable. However, the machine did seem a bit sluggish while it was copying. The load was low, and the cpu was not pegged, but the RAM was about full, and the desktop apps were choppy yet responsive. I also noticed that deleting the file took a while and the local disk was very active, which was unexpected.

Next, I copied a 535MB folder using cp. System load was low, cpu was basically idle, and RAM usage seemed about the same. Desktop performance was stellar. I timed the copy, and it took
15m50s. WHAT!? That's ~0.56 MiB/s. Something is definitely borked here...

I checked dmesg and the error is shown below. It repeats itself many times. Any ideas what this means?
The transfer finished, and the files seemed valid. I didn't do a md5 check, but du reported the proper size.


Code:

jfsCommit: page allocation failure. order:0, mode:0x4020
Pid: 3106, comm: jfsCommit Tainted: P          2.6.29.6 #1
Call Trace:
 <IRQ>  [<ffffffff802847dd>] __alloc_pages_internal+0x3dd/0x4e0
 [<ffffffff802a9dfb>] __slab_alloc+0x6cb/0x720
 [<ffffffff804c71f7>] __netdev_alloc_skb+0x17/0x40
 [<ffffffff804c71f7>] __netdev_alloc_skb+0x17/0x40
 [<ffffffff802ab0eb>] __kmalloc_track_caller+0xbb/0x110
 [<ffffffff804c6472>] __alloc_skb+0x72/0x150
 [<ffffffff804c71f7>] __netdev_alloc_skb+0x17/0x40
 [<ffffffffa00d160f>] skge_poll+0x47f/0x7d0 [skge]
 [<ffffffff804cb282>] net_rx_action+0xe2/0x1a0
 [<ffffffffa066b7a4>] rm_isr+0x95/0x10a [nvidia]
 [<ffffffff80242103>] __do_softirq+0x93/0x140
 [<ffffffff8020ca3c>] call_softirq+0x1c/0x30
 [<ffffffff8020e495>] do_softirq+0x35/0x80
 [<ffffffff8020e6f3>] do_IRQ+0x83/0x110
 [<ffffffff8020c313>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff80286160>] __set_page_dirty_nobuffers+0x0/0x1e0
 [<ffffffff80560034>] _spin_unlock_irqrestore+0x4/0x10
 [<ffffffffa0d2980a>] dbUpdatePMap+0x28a/0x460 [jfs]
 [<ffffffffa0d3585a>] txAllocPMap+0xda/0x120 [jfs]
 [<ffffffffa0d35e36>] txUpdateMap+0x196/0x270 [jfs]
 [<ffffffffa0d38390>] jfs_lazycommit+0x130/0x270 [jfs]
 [<ffffffff80236490>] default_wake_function+0x0/0x10
 [<ffffffffa0d38260>] jfs_lazycommit+0x0/0x270 [jfs]
 [<ffffffff80252527>] kthread+0x47/0x90
 [<ffffffff8020c93a>] child_rip+0xa/0x20
 [<ffffffff802524e0>] kthread+0x0/0x90
 [<ffffffff8020c930>] child_rip+0x0/0x20
Mem-Info:
DMA per-cpu:
CPU    0: hi:    0, btch:  1 usd:  0
CPU    1: hi:    0, btch:  1 usd:  0
DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd: 182
CPU    1: hi:  186, btch:  31 usd:  60
Active_anon:154516 active_file:6003 inactive_anon:71347
 inactive_file:245227 unevictable:1 dirty:23351 writeback:4223 unstable:0
 free:2498 slab:13787 mapped:10070 pagetables:6107 bounce:0
DMA free:8008kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:4kB active_file:1460kB inactive_file:528kB unevictable:0kB present:8008kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 2003 2003 2003
DMA32 free:1984kB min:5712kB low:7140kB high:8568kB active_anon:618064kB inactive_anon:285384kB active_file:22552kB inactive_file:980380kB unevictable:4kB present:2051180kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 4*4kB 3*8kB 8*16kB 7*32kB 5*64kB 3*128kB 3*256kB 2*512kB 1*1024kB 0*2048kB 1*4096kB = 8008kB
DMA32: 429*4kB 6*8kB 2*16kB 0*32kB 1*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1988kB
258440 total pagecache pages
7022 pages in swap cache
Swap cache stats: add 130674, delete 123652, find 95674/98733
Free swap  = 2342640kB
Total swap = 2723008kB
523984 pages RAM
9543 pages reserved
45650 pages shared
489348 pages non-shared


Source files were stored on my /home JFS partition. My machine is an AMD X2, NF4 mobo, 2GB RAM, and WD Raptor hard drive. I'm not sure if these errors came up during the KDE copy, or the cp copy. I just assumed that the cp operation produced these faults since it ran so much slower. Removing the folder with rm -rf took 5m15 sec and no local disk activity was noticed.

I'll do some more testing tomorrow to isolate/reproduce the problem.
Any suggestions/ideas/advice?

granth 09-26-2009 11:52 PM

OK, it looks like the problem is using cp from JFS to NFS, and from NFS to JFS.

Maybe my system is running out of memory? I'm stumped. Has anyone else tried this?

disturbed1 09-27-2009 01:00 AM

Quote:

Originally Posted by granth (Post 3698219)
OK, it looks like the problem is using cp from JFS to NFS, and from NFS to JFS.

Maybe my system is running out of memory? I'm stumped. Has anyone else tried this?

I'm using JFS and NFS. NFS <--> JFS, no kernel dumps. Using stock Slackware Huge kernel, 2.6.30 and 2.6.31.x Never done Slackware kernel <--> Slackware kernel. Currently I'm only using 2.6.31.1, and one machine still at 2.6.31. As far as NFS/JFS goes, I noticed no difference between kernel versions.

If your machine is hitting a high load, I've seen the kernel tank before. For some reason, with recent kernels, there's been a huge load placed on the NFS server during writes (Writing to the NFS server) with pretty low throughput (30-40MB/s) compared to CIFS(Samba) (70-110MB/s). When I write to the NFS server I'm seeing loads of load average: 7.88, 6.99, 5.12. Reading from the NFS server is much better (70-110MB/s) with minimal load. I'd use only CIFS if I did more NFS writes instead of NFS reads.

RAM usage seems to be fine
Mem: 1541508k total, 1528264k used, 13264k free,
1370336k cached

NFSD will spawn as many threads as it thinks it needs. I've read some articles recently on this, and the opinions were all the same - don't mess with it :)

I Googled your oops, the results came up with some older bug reports that have all been reported as being fixed. The JFS devs are still active and very helpful. I'd fire a message off to them on their mailing list.


----edit----
I'm sure you've already done this, but -
Run a smart test on the drives.
Double check your mount options (noatime helps ;))

granth 09-27-2009 02:30 PM

Ok, I ran another test from a different disk and I got a different kernel dump. This was from swapper, not JFS. However, both times it was a page allocation error on the root disk. I guess I should note that I have an encrypted swap partition.

Also, the drive appears to pass the smart tests, but I don't understand the entire output, so I will post it below the kernel dump.



Code:

swapper: page allocation failure. order:0, mode:0x4020                                               
Pid: 0, comm: swapper Tainted: P          2.6.29.6 #1                                               
Call Trace:                                                                                         
 <IRQ>  [<ffffffff802847dd>] __alloc_pages_internal+0x3dd/0x4e0                                     
 [<ffffffff802a9dfb>] __slab_alloc+0x6cb/0x720                                                       
 [<ffffffff804c71f7>] __netdev_alloc_skb+0x17/0x40                                                   
 [<ffffffff804c71f7>] __netdev_alloc_skb+0x17/0x40                                                   
 [<ffffffff802ab0eb>] __kmalloc_track_caller+0xbb/0x110                                             
 [<ffffffff804c6472>] __alloc_skb+0x72/0x150                                                         
 [<ffffffff804c71f7>] __netdev_alloc_skb+0x17/0x40                                                   
 [<ffffffffa00ce60f>] skge_poll+0x47f/0x7d0 [skge]                                                   
 [<ffffffff804cb282>] net_rx_action+0xe2/0x1a0                                                       
 [<ffffffff80242103>] __do_softirq+0x93/0x140                                                       
 [<ffffffff8020ca3c>] call_softirq+0x1c/0x30                                                         
 [<ffffffff8020e495>] do_softirq+0x35/0x80                                                           
 [<ffffffff8020e6f3>] do_IRQ+0x83/0x110                                                             
 [<ffffffff8020c313>] ret_from_intr+0x0/0xa                                                         
 <EOI>  [<ffffffff80213472>] default_idle+0x42/0x50                                                 
 [<ffffffff8020a9a0>] cpu_idle+0x60/0xa0                                                             
Mem-Info:                                                                                           
DMA per-cpu:                                                                                         
CPU    0: hi:    0, btch:  1 usd:  0                                                               
CPU    1: hi:    0, btch:  1 usd:  0                                                               
DMA32 per-cpu:                                                                                       
CPU    0: hi:  186, btch:  31 usd: 152                                                               
CPU    1: hi:  186, btch:  31 usd: 162                                                               
Active_anon:4080 active_file:4093 inactive_anon:1392                                                 
 inactive_file:479203 unevictable:0 dirty:24346 writeback:145 unstable:0                             
 free:2529 slab:9294 mapped:342 pagetables:987 bounce:0                                             
DMA free:8008kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:8kB inactive_file:2060kB unevictable:0kB present:8008kB pages_scanned:0 all_unreclaimable? no                   
lowmem_reserve[]: 0 2003 2003 2003                                                                   
DMA32 free:2108kB min:5712kB low:7140kB high:8568kB active_anon:16320kB inactive_anon:5568kB active_file:16364kB inactive_file:1914752kB unevictable:0kB present:2051180kB pages_scanned:32 all_unreclaimable? no                                                                                               
lowmem_reserve[]: 0 0 0 0                                                                           
DMA: 0*4kB 3*8kB 5*16kB 7*32kB 4*64kB 4*128kB 5*256kB 3*512kB 0*1024kB 0*2048kB 1*4096kB = 8008kB   
DMA32: 82*4kB 9*8kB 2*16kB 0*32kB 1*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2032kB 
483968 total pagecache pages                                                                         
636 pages in swap cache                                                                             
Swap cache stats: add 51410, delete 50774, find 29665/31891                                         
Free swap  = 2702032kB                                                                               
Total swap = 2723008kB                                                                               
523984 pages RAM                                                                                     
9543 pages reserved                                                                                 
6040 pages shared                                                                                   
507209 pages non-shared                                                                             
swapper: page allocation failure. order:0, mode:0x4020                                               
Pid: 0, comm: swapper Tainted: P          2.6.29.6 #1                                               
Call Trace:                                                                                         
 <IRQ>  [<ffffffff802847dd>] __alloc_pages_internal+0x3dd/0x4e0                                     
 [<ffffffff802a9dfb>] __slab_alloc+0x6cb/0x720                                                       
 [<ffffffff804c71f7>] __netdev_alloc_skb+0x17/0x40                                                   
 [<ffffffff804c71f7>] __netdev_alloc_skb+0x17/0x40                                                   
 [<ffffffff802ab0eb>] __kmalloc_track_caller+0xbb/0x110                                             
 [<ffffffff804c6472>] __alloc_skb+0x72/0x150                                                         
 [<ffffffff804c71f7>] __netdev_alloc_skb+0x17/0x40                                                   
 [<ffffffffa00ce60f>] skge_poll+0x47f/0x7d0 [skge]                                                   
 [<ffffffff804cb282>] net_rx_action+0xe2/0x1a0                                                       
 [<ffffffff80242103>] __do_softirq+0x93/0x140                                                       
 [<ffffffff8020ca3c>] call_softirq+0x1c/0x30                                                         
 [<ffffffff8020e495>] do_softirq+0x35/0x80                                                           
 [<ffffffff8020e6f3>] do_IRQ+0x83/0x110                                                             
 [<ffffffff8020c313>] ret_from_intr+0x0/0xa                                                         
 <EOI>  [<ffffffff80213472>] default_idle+0x42/0x50
 [<ffffffff8020a9a0>] cpu_idle+0x60/0xa0
Mem-Info:
DMA per-cpu:
CPU    0: hi:    0, btch:  1 usd:  0
CPU    1: hi:    0, btch:  1 usd:  0
DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd: 152
CPU    1: hi:  186, btch:  31 usd: 162
Active_anon:4080 active_file:4093 inactive_anon:1392
 inactive_file:479203 unevictable:0 dirty:24346 writeback:145 unstable:0
 free:2529 slab:9294 mapped:342 pagetables:987 bounce:0
DMA free:8008kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:8kB inactive_file:2060kB unevictable:0kB present:8008kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 2003 2003 2003
DMA32 free:2108kB min:5712kB low:7140kB high:8568kB active_anon:16320kB inactive_anon:5568kB active_file:16364kB inactive_file:1914752kB unevictable:0kB present:2051180kB pages_scanned:32 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 0*4kB 3*8kB 5*16kB 7*32kB 4*64kB 4*128kB 5*256kB 3*512kB 0*1024kB 0*2048kB 1*4096kB = 8008kB
DMA32: 82*4kB 9*8kB 2*16kB 0*32kB 1*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2032kB
483968 total pagecache pages
636 pages in swap cache
Swap cache stats: add 51410, delete 50774, find 29665/31891
Free swap  = 2702032kB
Total swap = 2723008kB
523984 pages RAM
9543 pages reserved
6040 pages shared
507209 pages non-shared


UPDATE:
I ran the following commands after a smartctl "long" and "offline" test.

# smartctl -l selftest /dev/sda
Code:

smartctl version 5.38 [x86_64-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/                               

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error      00%    25099

-

# smartctl -a /dev/sda
Code:

smartctl version 5.38 [x86_64-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/                               

=== START OF INFORMATION SECTION ===
Model Family:    Western Digital Raptor family
Device Model:    WDC WD1500ADFD-00NLR1       
Serial Number:    WD-WMAP41240594             
Firmware Version: 20.07P20                   
User Capacity:    150,039,945,216 bytes       
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  7                                             
ATA Standard is:  ATA/ATAPI-7 published, ANSI INCITS 397-2005   
Local Time is:    Sun Sep 27 13:28:57 2009 MDT                 
SMART support is: Available - device has SMART capability.     
SMART support is: Enabled                                       

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.         
Self-test execution status:      (  0) The previous self-test routine completed       
                                        without error or no self-test has ever         
                                        been run.                                       
Total time to complete Offline                                                         
data collection:                (4783) seconds.                                       
Offline data collection                                                                 
capabilities:                    (0x7b) SMART execute Offline immediate.               
                                        Auto Offline data collection on/off support.   
                                        Suspend Offline collection upon new             
                                        command.                                       
                                        Offline surface scan supported.                 
                                        Self-test supported.                           
                                        Conveyance Self-test supported.                 
                                        Selective Self-test supported.                 
SMART capabilities:            (0x0003) Saves SMART data before entering               
                                        power-saving mode.                             
                                        Supports SMART auto save timer.                 
Error logging capability:        (0x01) Error logging supported.                       
                                        General Purpose Logging supported.             
Short self-test routine                                                                 
recommended polling time:        (  2) minutes.                                       
Extended self-test routine                                                             
recommended polling time:        (  72) minutes.                                       
Conveyance self-test routine                                                           
recommended polling time:        (  5) minutes.                                       
SCT capabilities:              (0x103f) SCT Status supported.                           
                                        SCT Feature Control supported.                 
                                        SCT Data Table supported.                       

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds: 
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000b  200  199  051    Pre-fail  Always      -      0       
  3 Spin_Up_Time            0x0007  182  165  021    Pre-fail  Always      -      3933   
  4 Start_Stop_Count        0x0032  100  100  040    Old_age  Always      -      69     
  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0       
  7 Seek_Error_Rate        0x000a  200  200  051    Old_age  Always      -      0       
  9 Power_On_Hours          0x0032  066  066  000    Old_age  Always      -      25100   
 10 Spin_Retry_Count        0x0012  100  253  051    Old_age  Always      -      0       
 11 Calibration_Retry_Count 0x0012  100  253  051    Old_age  Always      -      0       
 12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      61     
194 Temperature_Celsius    0x0022  108  099  000    Old_age  Always      -      39     
196 Reallocated_Event_Count 0x0032  200  200  000    Old_age  Always      -      0       
197 Current_Pending_Sector  0x0012  200  199  000    Old_age  Always      -      0       
198 Offline_Uncorrectable  0x0012  200  200  000    Old_age  Always      -      0       
199 UDMA_CRC_Error_Count    0x000a  200  253  000    Old_age  Always      -      0       
200 Multi_Zone_Error_Rate  0x0008  200  199  051    Old_age  Offline      -      0       

SMART Error Log Version: 1
ATA Error Count: 504 (device log contains only the most recent five errors)
        CR = Command Register [HEX]                                       
        FR = Features Register [HEX]                                     
        SC = Sector Count Register [HEX]                                 
        SN = Sector Number Register [HEX]                                 
        CL = Cylinder Low Register [HEX]                                 
        CH = Cylinder High Register [HEX]                                 
        DH = Device/Head Register [HEX]                                   
        DC = Device Command Register [HEX]                               
        ER = Error register [HEX]                                         
        ST = Status register [HEX]                                       
Powered_Up_Time is measured from power on, and printed as                 
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,                     
SS=sec, and sss=millisec. It "wraps" after 49.710 days.                   

Error 504 occurred at disk power-on lifetime: 17652 hours (735 days + 12 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH                             
  -- -- -- -- -- -- --                             
  40 51 00 7a ed 2d 40                             

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 a8 00 7a ed 2d 40 00  7d+01:32:55.750  READ FPDMA QUEUED 
  60 18 00 9a 45 3d 40 00  7d+01:32:55.750  READ FPDMA QUEUED 
  60 08 00 5a 2f 3d 40 00  7d+01:32:55.750  READ FPDMA QUEUED 
  60 08 00 12 2c 3d 40 00  7d+01:32:55.750  READ FPDMA QUEUED 
  60 08 00 12 0f 55 40 00  7d+01:32:55.750  READ FPDMA QUEUED 

Error 503 occurred at disk power-on lifetime: 17652 hours (735 days + 12 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH                             
  -- -- -- -- -- -- --                             
  40 51 00 7a ed 2d 40                             

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 a8 00 7a ed 2d 40 00  7d+01:24:08.550  READ FPDMA QUEUED 
  60 18 00 9a 45 3d 40 00  7d+01:24:08.550  READ FPDMA QUEUED 
  60 08 00 5a 2f 3d 40 00  7d+01:24:08.550  READ FPDMA QUEUED 
  60 08 00 12 2c 3d 40 00  7d+01:24:08.550  READ FPDMA QUEUED 
  60 08 00 12 0f 55 40 00  7d+01:24:08.550  READ FPDMA QUEUED 

Error 502 occurred at disk power-on lifetime: 17644 hours (735 days + 4 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH                             
  -- -- -- -- -- -- --                             
  40 51 00 7a ed 2d 40                             

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 a8 00 7a ed 2d 40 00  6d+17:12:25.950  READ FPDMA QUEUED 
  60 18 00 9a 45 3d 40 00  6d+17:12:25.950  READ FPDMA QUEUED 
  60 08 00 5a 2f 3d 40 00  6d+17:12:25.950  READ FPDMA QUEUED 
  60 08 00 12 2c 3d 40 00  6d+17:12:25.950  READ FPDMA QUEUED 
  60 08 00 12 0f 55 40 00  6d+17:12:25.950  READ FPDMA QUEUED 

Error 501 occurred at disk power-on lifetime: 17639 hours (734 days + 23 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH                             
  -- -- -- -- -- -- --                             
  40 51 00 7a ed 2d 40                             

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 a8 00 7a ed 2d 40 00  6d+11:55:20.000  READ FPDMA QUEUED 
  60 18 00 9a 45 3d 40 00  6d+11:55:20.000  READ FPDMA QUEUED 
  60 08 00 5a 2f 3d 40 00  6d+11:55:20.000  READ FPDMA QUEUED 
  60 08 00 12 2c 3d 40 00  6d+11:55:20.000  READ FPDMA QUEUED 
  60 08 00 12 0f 55 40 00  6d+11:55:20.000  READ FPDMA QUEUED

Error 500 occurred at disk power-on lifetime: 17639 hours (734 days + 23 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 18 52 ee 2d 40

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 18 52 ee 2d 40 00  6d+11:51:04.000  READ FPDMA QUEUED
  60 50 10 ea ed 2d 40 00  6d+11:51:04.000  READ FPDMA QUEUED
  60 08 08 b2 ed 2d 40 00  6d+11:51:04.000  READ FPDMA QUEUED
  60 10 00 82 ed 2d 40 00  6d+11:51:04.000  READ FPDMA QUEUED
  60 08 00 e7 b8 88 47 00  6d+11:51:04.000  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error      00%    25099        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



I hope I'm not losing another drive, but at least that would explain these errors. I lost a different storage drive two weeks ago, but that was obvious, as I was getting I/O errors and hard resets in the kernel log.

H_TeXMeX_H 09-27-2009 02:57 PM

You know it would really help if you would put that data in /code tags. So far it doesn't look like the drive is failing, if you wanted to make sure I would run:

Code:

smartctl -t long /dev/sda
Then wait for the test to run and check the results.

I would not use 'cp' to copy files over like that, try using tar instead. Also, if you still suspect a hardware problem check all the cables.

disturbed1 09-27-2009 03:55 PM

Something else that can cause this is by having extreme aggressive tweaks to the net filter. Too high of a receive/send window and memory allocation, too high of an MTU. This will cause the networking stack to work overtime, and flood the swapper while it's waiting for commits. Once it gets full with no where to put the data - poof.

If you look at -
PHP Code:

[<ffffffff804c71f7>] __netdev_alloc_skb+0x17/0x40
[<ffffffff804c71f7>] __netdev_alloc_skb+0x17/0x40
[<ffffffff804cb282>] net_rx_action+0xe2/0x1a0 


granth 09-27-2009 05:12 PM

Quote:

Something else that can cause this is by having extreme aggressive tweaks to the net filter.
No tweaks here. This is basically a stock Slackware64 system, running generic-2.6.29.6 kernel.

I am using an onboard skge NIC w/ SMC unmanaged gigabit switch.

disturbed1 09-27-2009 05:24 PM

Everything I've read points to network configuration problems.

Perhaps read through this thread (link) and give A.Morton's solution a shot.

granth 09-29-2009 05:16 PM

Before I try the solution mentioned above...

What about NFS mount options? Could rsize/wsize be the culprit?

nfs rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys
(these are defaults)

disturbed1 09-30-2009 04:36 AM

Quote:

Originally Posted by granth (Post 3700994)
Before I try the solution mentioned above...

What about NFS mount options? Could rsize/wsize be the culprit?

nfs rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys
(these are defaults)

Those are not defaults. NFSv3 has a max r/wsize of 1048576. It wouldn't hurt to try a lower value of 32768, or omit it altogether for troubleshooting.

Quote:

Originally Posted by man nfs
If an rsize value is not specified, or if the specified rsize value is larger than the maximum that either client or
server can support, the client and server negotiate the largest rsize value that they can both support.

About Andrew Morton's suggestion -
You type this in to get the value
sysctl vm.min_free_kbytes

And this to change it
sysctl vm.min_free_kbytes=65536

Or cat /proc/sys/vm/min_free_kbytes
echo 65536 > /proc/sys/vm/min_free_kbytes

These options do not stick after a reboot. To make them reload, you can save the value in /etc/sysctl.conf which is then called from the rc scripts.

granth 10-01-2009 05:49 PM

I left the NFS settings alone, and set vm.min_free_kbytes=16384

I haven't seen the warning/error since the change. If it comes up in the future, I will increase the vm.min_free_kbytes value again.


All times are GMT -5. The time now is 10:55 AM.