[SOLVED] Slackware64: cp from JFS volume to NFS share = jfsCommit: page allocation failure.
I recently upgraded to Slackware64 and I decided to connect it to my Solaris 10 NFS server. I'm using the default NFSv3 options and everything appears to be mounted properly, etc. I wanted to test the performance over my new gigabit switch, so I started with some simple tests.
Using KDE4's file manager, I got 25MiB/s over gigabit ether copying a 5.4GB folder. That seemed decent; nothing superb, but acceptable. However, the machine did seem a bit sluggish while it was copying. The load was low and the CPU was not pegged, but RAM was nearly full, and the desktop apps were choppy though still responsive. I also noticed that deleting the copied folder took a while and the local disk was very active, which was unexpected.
Next, I copied a 535MB folder using cp. System load was low, the CPU was basically idle, and RAM usage seemed about the same. Desktop performance was stellar. I timed the copy, and it took 15m50s. WHAT!? That's ~0.56 MiB/s. Something is definitely borked here...
I checked dmesg and the error is shown below. It repeats itself many times. Any ideas what this means?
The transfer finished, and the files seemed valid. I didn't do a md5 check, but du reported the proper size.
The source files were stored on my /home JFS partition. My machine is an AMD X2 on an NF4 mobo with 2GB RAM and a WD Raptor hard drive. I'm not sure whether these errors came up during the KDE copy or the cp copy; I just assumed the cp operation produced them since it ran so much slower. Removing the folder with rm -rf took 5m15s, and no local disk activity was noticed.
I'll do some more testing tomorrow to isolate/reproduce the problem.
Any suggestions/ideas/advice?
OK, it looks like the problem is using cp from JFS to NFS, and from NFS to JFS.
Maybe my system is running out of memory? I'm stumped. Has anyone else tried this?
I'm using JFS and NFS. NFS <--> JFS: no kernel dumps, using the stock Slackware huge kernel, 2.6.30 and 2.6.31.x. I've never done Slackware kernel <--> Slackware kernel. Currently I'm on 2.6.31.1, with one machine still at 2.6.31. As far as NFS/JFS goes, I noticed no difference between kernel versions.
If your machine is hitting a high load, I've seen the kernel tank before. For some reason, with recent kernels, there's been a huge load placed on the NFS server during writes (writing to the NFS server) with pretty low throughput (30-40MB/s) compared to CIFS (Samba) (70-110MB/s). When I write to the NFS server I see load averages like 7.88, 6.99, 5.12. Reading from the NFS server is much better (70-110MB/s) with minimal load. If I did more NFS writes than reads, I'd use CIFS only.
RAM usage seems to be fine:
Code:
Mem: 1541508k total, 1528264k used, 13264k free, 1370336k cached
NFSD will spawn as many threads as it thinks it needs. I've read some articles on this recently, and the opinions were all the same: don't mess with it.
I Googled your oops; the results turned up some older bug reports that have all been marked as fixed. The JFS devs are still active and very helpful. I'd fire a message off to them on their mailing list.
----edit----
I'm sure you've already done this, but -
Run a smart test on the drives.
Double-check your mount options (noatime helps).
Last edited by disturbed1; 09-27-2009 at 01:03 AM.
OK, I ran another test from a different disk and got a different kernel dump. This one came from swapper rather than jfsCommit, but both were page allocation failures on the root disk. I should note that I have an encrypted swap partition.
Also, the drive appears to pass the SMART tests, but I don't understand the entire output, so I will post it below the kernel dump.
UPDATE:
I ran the following commands after a smartctl "long" and "offline" test.
# smartctl -l selftest /dev/sda
Code:
smartctl version 5.38 [x86_64-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 25099
-
# smartctl -a /dev/sda
Code:
smartctl version 5.38 [x86_64-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Raptor family
Device Model: WDC WD1500ADFD-00NLR1
Serial Number: WD-WMAP41240594
Firmware Version: 20.07P20
User Capacity: 150,039,945,216 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 published, ANSI INCITS 397-2005
Local Time is: Sun Sep 27 13:28:57 2009 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x85) Offline data collection activity
was aborted by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (4783) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 72) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x103f) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 200 199 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0007 182 165 021 Pre-fail Always - 3933
4 Start_Stop_Count 0x0032 100 100 040 Old_age Always - 69
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x000a 200 200 051 Old_age Always - 0
9 Power_On_Hours 0x0032 066 066 000 Old_age Always - 25100
10 Spin_Retry_Count 0x0012 100 253 051 Old_age Always - 0
11 Calibration_Retry_Count 0x0012 100 253 051 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 61
194 Temperature_Celsius 0x0022 108 099 000 Old_age Always - 39
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 200 199 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0012 200 200 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 199 051 Old_age Offline - 0
SMART Error Log Version: 1
ATA Error Count: 504 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 504 occurred at disk power-on lifetime: 17652 hours (735 days + 12 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 7a ed 2d 40
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 a8 00 7a ed 2d 40 00 7d+01:32:55.750 READ FPDMA QUEUED
60 18 00 9a 45 3d 40 00 7d+01:32:55.750 READ FPDMA QUEUED
60 08 00 5a 2f 3d 40 00 7d+01:32:55.750 READ FPDMA QUEUED
60 08 00 12 2c 3d 40 00 7d+01:32:55.750 READ FPDMA QUEUED
60 08 00 12 0f 55 40 00 7d+01:32:55.750 READ FPDMA QUEUED
Error 503 occurred at disk power-on lifetime: 17652 hours (735 days + 12 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 7a ed 2d 40
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 a8 00 7a ed 2d 40 00 7d+01:24:08.550 READ FPDMA QUEUED
60 18 00 9a 45 3d 40 00 7d+01:24:08.550 READ FPDMA QUEUED
60 08 00 5a 2f 3d 40 00 7d+01:24:08.550 READ FPDMA QUEUED
60 08 00 12 2c 3d 40 00 7d+01:24:08.550 READ FPDMA QUEUED
60 08 00 12 0f 55 40 00 7d+01:24:08.550 READ FPDMA QUEUED
Error 502 occurred at disk power-on lifetime: 17644 hours (735 days + 4 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 7a ed 2d 40
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 a8 00 7a ed 2d 40 00 6d+17:12:25.950 READ FPDMA QUEUED
60 18 00 9a 45 3d 40 00 6d+17:12:25.950 READ FPDMA QUEUED
60 08 00 5a 2f 3d 40 00 6d+17:12:25.950 READ FPDMA QUEUED
60 08 00 12 2c 3d 40 00 6d+17:12:25.950 READ FPDMA QUEUED
60 08 00 12 0f 55 40 00 6d+17:12:25.950 READ FPDMA QUEUED
Error 501 occurred at disk power-on lifetime: 17639 hours (734 days + 23 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 7a ed 2d 40
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 a8 00 7a ed 2d 40 00 6d+11:55:20.000 READ FPDMA QUEUED
60 18 00 9a 45 3d 40 00 6d+11:55:20.000 READ FPDMA QUEUED
60 08 00 5a 2f 3d 40 00 6d+11:55:20.000 READ FPDMA QUEUED
60 08 00 12 2c 3d 40 00 6d+11:55:20.000 READ FPDMA QUEUED
60 08 00 12 0f 55 40 00 6d+11:55:20.000 READ FPDMA QUEUED
Error 500 occurred at disk power-on lifetime: 17639 hours (734 days + 23 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 18 52 ee 2d 40
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 18 52 ee 2d 40 00 6d+11:51:04.000 READ FPDMA QUEUED
60 50 10 ea ed 2d 40 00 6d+11:51:04.000 READ FPDMA QUEUED
60 08 08 b2 ed 2d 40 00 6d+11:51:04.000 READ FPDMA QUEUED
60 10 00 82 ed 2d 40 00 6d+11:51:04.000 READ FPDMA QUEUED
60 08 00 e7 b8 88 47 00 6d+11:51:04.000 READ FPDMA QUEUED
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 25099 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
I hope I'm not losing another drive, but at least that would explain these errors. I lost a different storage drive two weeks ago, but that was obvious, as I was getting I/O errors and hard resets in the kernel log.
You know, it would really help if you put that data in [code] tags. So far it doesn't look like the drive is failing; if you want to make sure, I would run:
Code:
smartctl -t long /dev/sda
Then wait for the test to run and check the results.
I would not use 'cp' to copy files over like that; try using tar instead. Also, if you still suspect a hardware problem, check all the cables.
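For example, a tar pipe streams the whole tree in one pass instead of cp's file-by-file open/write/close cycle (the paths below are placeholders):

```shell
# Pack on the source side, unpack on the destination side, connected
# by a pipe; /home/user/src and /mnt/nfs/dest are placeholder paths.
( cd /home/user/src && tar cf - . ) | ( cd /mnt/nfs/dest && tar xf - )
```

This tends to behave better over NFS because the writes arrive as one continuous stream rather than thousands of small per-file operations.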
Something else that can cause this is having extremely aggressive tweaks to the network stack: too large a receive/send window and memory allocation, or too high an MTU. That makes the networking stack work overtime and floods the swapper while it's waiting for commits. Once it fills up with nowhere to put the data - poof.
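To see whether any such tweaks are in effect, the current values can be read straight out of /proc and /sys (eth0 below is an assumed interface name; use your NIC's):

```shell
#!/bin/sh
# Socket buffer ceilings; stock kernels keep these modest, so very
# large numbers here usually mean someone has been tuning.
cat /proc/sys/net/core/rmem_max
cat /proc/sys/net/core/wmem_max
cat /proc/sys/net/ipv4/tcp_rmem
cat /proc/sys/net/ipv4/tcp_wmem

# MTU of the interface (eth0 is an assumption; 1500 is the
# untweaked default, 9000 would indicate jumbo frames).
cat /sys/class/net/eth0/mtu
```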
What about NFS mount options? Could rsize/wsize be the culprit?
nfs rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys
(these are defaults)
Those are not defaults. NFSv3 has a max r/wsize of 1048576. It wouldn't hurt to try a lower value like 32768, or to omit it altogether for troubleshooting.
Quote:
Originally Posted by man nfs
If an rsize value is not specified, or if the specified rsize value is larger than the maximum that either client or
server can support, the client and server negotiate the largest rsize value that they can both support.
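A remount along these lines would test that (server:/export/home and /mnt/nfs are placeholders for your Solaris export and mount point):

```shell
#!/bin/sh
# Unmount, then remount with a conservative 32 KiB block size.
umount /mnt/nfs
mount -t nfs -o vers=3,proto=tcp,hard,rsize=32768,wsize=32768 \
    server:/export/home /mnt/nfs

# Check what the client and server actually negotiated: print the
# options field (column 4) of the /proc/mounts entry for the mount.
awk '$2 == "/mnt/nfs" { print $4 }' /proc/mounts
```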
About Andrew Morton's suggestion -
To read the current value:
Code:
sysctl vm.min_free_kbytes
To change it:
Code:
sysctl vm.min_free_kbytes=65536
Or, equivalently via /proc:
Code:
cat /proc/sys/vm/min_free_kbytes
echo 65536 > /proc/sys/vm/min_free_kbytes
These settings do not stick after a reboot. To make them persist, save the value in /etc/sysctl.conf, which is read by the rc scripts at boot.