LinuxQuestions.org
Old 12-08-2011, 12:37 AM   #1
spwnt
Member
 
Registered: Jun 2006
Distribution: Linux Mint/Debian/Arch
Posts: 73

Rep: Reputation: 12
NAS Server Box keeps crashing


Alright, so I have six 3 TB drives in a software RAID 5. The box is used as a NAS/seedbox/whatever other small thing I feel like using it for at the time. The OS (Ubuntu 10.04 Server, 64-bit) runs from a 16 GB USB flash drive, and the RAID 5 is mounted at /media/stuff. The CPU is a dual-core 1.6 GHz AMD Fusion. Both the RAID 5 and / are dm-crypt encrypted (/boot obviously isn't).

Anyway, it has been crashing very frequently, but only when I'm writing to the RAID 5. For example, I was extracting a bunch of very large archive files and it crashed about 30 seconds in. The archives were on the RAID 5 and I was extracting them to another location on the same array.

At first I thought the CPU couldn't handle it, since it's only a 1.6 GHz dual core and it has to calculate all of the parity and so on, but then I ran mprime for about 20 minutes and it didn't crash or overheat. As soon as I start doing very heavy writes to the RAID 5, though, it crashes again.

I've even gone so far as to completely reinstall the OS from scratch, and it is still happening. The other odd thing is that this is a very recent problem; it never used to happen.

Obviously you guys are going to need some log output, but I'm not sure exactly what to show, so just tell me what you need and I'll post it.

Also, I don't know if it matters, but this server is mostly headless, so I've been doing most of this through CIFS mounts and SSH.

Code:
@ubuntu-server:~$ sudo mdadm --detail /dev/md0 
/dev/md0:
        Version : 01.02
  Creation Time : Fri Oct 21 16:27:29 2011
     Raid Level : raid5
     Array Size : 14651325440 (13972.59 GiB 15002.96 GB)
  Used Dev Size : 5860530176 (5589.04 GiB 6001.18 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Dec  7 22:34:49 2011
          State : clean
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : debian-server:0
           UUID : e967892d:e5006f45:8c97fdb4:9e3eab2d
         Events : 182

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       2       8       65        2      active sync   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1
       4       8        1        4      active sync   /dev/sda1
       5       8       17        5      active sync   /dev/sdb1
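
A quick sanity check on the sizes in that output, as a rough sketch: RAID 5 gives (N - 1) members' worth of usable space. The numbers line up if the Used Dev Size figure is read as 512-byte sectors rather than KiB. That reading is an assumption about how this mdadm version prints the field, not something the output states.

```shell
#!/bin/sh
# RAID 5 usable capacity = (members - 1) * per-member size.
# Assumption: "Used Dev Size" above is in 512-byte sectors, so it is
# halved here to get KiB before multiplying.
raid5_capacity_kib() {
    members=$1
    dev_sectors=$2
    echo $(( (members - 1) * (dev_sectors / 2) ))
}

# 6 members, Used Dev Size taken from the mdadm output above.
raid5_capacity_kib 6 5860530176
```

Under that assumption the result is 14651325440, which matches the reported Array Size.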
 
Old 12-08-2011, 10:22 AM   #2
spwnt
Member
 
Registered: Jun 2006
Distribution: Linux Mint/Debian/Arch
Posts: 73

Original Poster
Rep: Reputation: 12
So now I've tried decompressing the files on my NAS over the network, with the output going to my desktop instead of back onto the server to cut down the I/O, but it's still crashing. Could it be that I'm running out of /tmp space or RAM? I have 8 GB of RAM, so I don't see how that could be the problem, and I also have about 9 GB left on my USB stick for /tmp. If it matters, I don't have a swap partition. I'm going to try making a swap on a spare USB stick I have.

I wonder if the I/O from the OS running on a USB stick can't keep up with the I/O from the RAID 5? But I would imagine the RAID 5 would just throttle itself if that were the case. And if I run out of /tmp space, would it just fall back to RAM? Because between /tmp and RAM I have a total of about 18 GB, so... yeah.
 
Old 12-08-2011, 10:25 AM   #3
_bsd
Member
 
Registered: Jan 2010
Location: Velveeta, USA
Distribution: Xen, Gentoo,Ubuntu,openSUSE,Debian,pfSense
Posts: 98

Rep: Reputation: 8
What do you mean by "crashing"? Locking up? Kernel panic?

Try booting from either a live CD of your distro or SystemRescueCd, and examine the integrity
of your RAID and of your filesystems.
 
Old 12-08-2011, 11:17 AM   #4
spwnt
Member
 
Registered: Jun 2006
Distribution: Linux Mint/Debian/Arch
Posts: 73

Original Poster
Rep: Reputation: 12
I don't know what type of crash it is. I can post any logs if you tell me what to post, but what happens is that it will be running fine, then it will just shut off completely and immediately while I'm doing heavy writes to the RAID 5. It only crashes when writing to the RAID 5. mdadm says there is no problem with the array.

I'm running an fsck on the RAID 5 as we speak. It's ext4 on top of an encrypted LVM. I'll post the outcome. After that I'm going to fsck my / partition.
 
Old 12-08-2011, 11:38 AM   #5
JonathanWilson
Member
 
Registered: Aug 2009
Location: Ilkeston, England
Distribution: ubuntu, xp, embeded
Posts: 74

Rep: Reputation: 1
So when you run a heavy workload on the RAID, the machine powers off? If so, maybe the power supply isn't up to the task, although I'd assume the fsck would cause the same problem. It might be that a SATA cable is failing, although I'd expect that to show up as a failed drive in the array as opposed to a shutdown. I suppose the RAID controller could be failing and bringing down the system, or a failing SATA link could be taking the controller down with it.

One place to look is the /var/log/messages file, although if the machine is powering off it might not have time to log the error.
 
Old 12-08-2011, 12:04 PM   #6
spwnt
Member
 
Registered: Jun 2006
Distribution: Linux Mint/Debian/Arch
Posts: 73

Original Poster
Rep: Reputation: 12
fsck completed without any errors on my RAID 5.

/var/log/messages doesn't report anything relevant; it just says that eth0 is working, or something to that effect.

I don't think it's the power supply, because I ran mprime without any problems at all. What I'm going to try next is to create an 8 GB image file on the RAID 5 for swap and another 25 GB image file for /tmp. If it stops crashing, I'll know the problem and can find a better solution. I'll post back.
 
Old 12-08-2011, 12:26 PM   #7
spwnt
Member
 
Registered: Jun 2006
Distribution: Linux Mint/Debian/Arch
Posts: 73

Original Poster
Rep: Reputation: 12
So here's the latest news. On my desktop I ran

dd if=/dev/zero of=~/swap.swp bs=1024 count=8000k

and it completed, but when I tried to copy the file over the network to the NAS, it crashed about 34 MB in. Then I ran the same command on the server directly (creating the file on the RAID 5) and it completed at an average of 34 MB/s with no problems whatsoever. So it doesn't seem to be a power supply or I/O problem, because at this point it only crashes when I start transferring files over the network.
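
For repeating the local write test with different sizes, a rough sketch along these lines might help. `write_test` is a made-up helper name, and it assumes GNU dd (for `bs=1M` and `conv=fsync`); it only exercises the disks, not the network path.

```shell
#!/bin/sh
# write_test is a hypothetical helper, not a standard tool.
# It writes size_mb MiB of zeros into target_dir, forces the data to disk,
# prints how many bytes were written and how long it took, then cleans up.
write_test() {
    target_dir=$1
    size_mb=$2
    test_file="$target_dir/write_test.$$"
    start=$(date +%s)
    # conv=fsync makes dd flush the file before exiting, so the page cache
    # does not hide the real write time (GNU dd assumed).
    dd if=/dev/zero of="$test_file" bs=1M count="$size_mb" conv=fsync 2>/dev/null || return 1
    end=$(date +%s)
    bytes=$(( $(wc -c < "$test_file") ))
    rm -f "$test_file"
    echo "wrote $bytes bytes in $(( end - start ))s"
}
```

Something like `write_test /media/stuff 8000` would roughly mirror the dd run above. If that survives while a network copy of the same size does not, the difference is somewhere in the network path rather than in raw disk writes.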
 
Old 12-08-2011, 12:47 PM   #8
_bsd
Member
 
Registered: Jan 2010
Location: Velveeta, USA
Distribution: Xen, Gentoo,Ubuntu,openSUSE,Debian,pfSense
Posts: 98

Rep: Reputation: 8
Intermittent "crashes" under load indicate a possible PS issue. Mprime doesn't do much except draw current on the CPU.
The RAID drives are on the +12V rail, and running all disks adds load; don't dismiss this as a possible cause.

An actual lockup/black screen/freeze type crash should not occur under usual OS error conditions.

Since you're using software raid, you could also run memtest and verify no memory errors.

On the rare occasion when I've seen this kind of failure I start at the beginning, and follow the current.

Power Supply - test and or replace
CPU - re-seat, reattach cooler with fresh thermal compound
RAM - test, re-seat

Check all cabling, make sure everything's tight.

Clear out the old logs in /var/log (dmesg, kern.log, syslog, messages).
Try again.
If it crashes again, post those logs as attachments.
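
A minimal sketch for grabbing those four logs after the next crash, so they can be attached as .txt files. `save_logs` is a made-up helper name; reading kern.log and friends may need root on some setups.

```shell
#!/bin/sh
# save_logs is a hypothetical helper: copy the four logs named above out of
# log_dir into a fresh timestamped directory under dest_dir, adding a .txt
# suffix, then print the path of that directory.
save_logs() {
    log_dir=$1
    dest_dir=$2
    snap="$dest_dir/crash-logs-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$snap" || return 1
    for name in dmesg kern.log syslog messages; do
        # Skip any log this distro does not write instead of failing.
        [ -f "$log_dir/$name" ] && cp "$log_dir/$name" "$snap/$name.txt"
    done
    echo "$snap"
}
```

Usage would be something like `save_logs /var/log "$HOME"` right after the box comes back up.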
 
Old 12-09-2011, 09:00 PM   #9
spwnt
Member
 
Registered: Jun 2006
Distribution: Linux Mint/Debian/Arch
Posts: 73

Original Poster
Rep: Reputation: 12
Well, I deleted them and forced a crash on purpose. I tried to put blank lines where I think the crash took place so it's easy to find.

I ran memtest for about 45 minutes with 0 errors.
Attached Files
File Type: txt dmesg.txt (55.4 KB, 9 views)
File Type: txt kern.log.txt (196.0 KB, 4 views)
File Type: txt messages.txt (164.7 KB, 5 views)
 
Old 12-09-2011, 09:00 PM   #10
spwnt
Member
 
Registered: Jun 2006
Distribution: Linux Mint/Debian/Arch
Posts: 73

Original Poster
Rep: Reputation: 12
Can only upload 3 files per post.
Attached Files
File Type: txt syslog.txt (218.3 KB, 3 views)
 
Old 12-10-2011, 07:03 AM   #11
_bsd
Member
 
Registered: Jan 2010
Location: Velveeta, USA
Distribution: Xen, Gentoo,Ubuntu,openSUSE,Debian,pfSense
Posts: 98

Rep: Reputation: 8
You have call traces in the jbd2 module, which I don't believe are normal. They could be a sign of a failing disk.

I would download the diagnostics CD from whichever manufacturer made the hard disks.
Boot from the diagnostics CD and test each drive, quick test first, then long test.

Since your problem only occurs when writing to the disks, that's a pretty good indication there's something amiss with the disks.
 
Old 02-07-2012, 05:47 PM   #12
spwnt
Member
 
Registered: Jun 2006
Distribution: Linux Mint/Debian/Arch
Posts: 73

Original Poster
Rep: Reputation: 12
So I'm still having the problem. I went out and bought a new power supply, and it still crashes randomly. It also crashes even when the RAID isn't mounted, but only when there is high I/O on the hard disk (or hard disks, depending on whether the RAID is mounted; it crashes either way). So at this point I'm thinking it's either the distro I'm using, a certain package I'm using, or the motherboard/CPU failing. I'm going to try another distro to see if that fixes it, and if it doesn't, I'll be back.
 
Old 02-07-2012, 09:00 PM   #13
spwnt
Member
 
Registered: Jun 2006
Distribution: Linux Mint/Debian/Arch
Posts: 73

Original Poster
Rep: Reputation: 12
It's still crashing... Could it be that a 500 watt power supply isn't enough to power this box? It's running 7 hard disk drives, and it seems to fail when all of the drives are working at full load. So maybe it isn't an I/O problem; could it be that my power supply can't handle it?

Edit: or could the problem be that I have 3 of my hard drives on one cable (rail?) coming out of the PSU, and then 2 plus a Molex-to-SATA splitter powering the other 4 (on another cable group (rail?))?

I've been running mprime blend for about 20 minutes straight now with no problems, so I don't think it's the CPU.
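
For the cable/rail question, a back-of-the-envelope sketch of the 12 V load on one cable group. `rail_load` is a made-up helper, and the per-drive current used below is a placeholder, not a measured or datasheet figure for these drives; the real number should come from the drive datasheet and be compared against the rating on the PSU label.

```shell
#!/bin/sh
# Rough 12 V load estimate for a group of drives sharing one PSU cable.
# Both inputs are placeholders supplied by the caller.
rail_load() {
    drives=$1
    amps_each=$2
    awk -v n="$drives" -v a="$amps_each" \
        'BEGIN { printf "%.1f A, %.0f W at 12 V\n", n * a, n * a * 12 }'
}

# Hypothetical example: 4 drives on one cable at 2.0 A each.
rail_load 4 2.0
```

With those placeholder inputs it prints `8.0 A, 96 W at 12 V`.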

Last edited by spwnt; 02-07-2012 at 09:32 PM.
 
Old 02-08-2012, 09:03 AM   #14
_bsd
Member
 
Registered: Jan 2010
Location: Velveeta, USA
Distribution: Xen, Gentoo,Ubuntu,openSUSE,Debian,pfSense
Posts: 98

Rep: Reputation: 8
Have you run the diagnostics on the disks? SMART quick and long tests, and the manufacturer's diagnostics.
smartctl runs the SMART tests; hdparm might be of some use as well.
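
As a sketch of what that could look like across all six array members: the helper below only prints the smartctl commands so they can be reviewed before running them as root. `smart_plan` is a made-up name; the device list is taken from the mdadm output earlier in the thread.

```shell
#!/bin/sh
# smart_plan is a hypothetical helper that only PRINTS smartctl commands;
# it does not touch any disk itself.
smart_plan() {
    test_type=$1   # "short" or "long"
    shift
    for dev in "$@"; do
        echo "smartctl -t $test_type $dev"   # start the self-test
    done
    for dev in "$@"; do
        echo "smartctl -l selftest $dev"     # read the self-test log afterwards
    done
}

# Array members as listed in the mdadm --detail output.
smart_plan short /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
```

The self-tests run inside the drive, so the log commands are only worth running once the tests have had time to finish.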
 
Old 02-10-2012, 09:56 PM   #15
spwnt
Member
 
Registered: Jun 2006
Distribution: Linux Mint/Debian/Arch
Posts: 73

Original Poster
Rep: Reputation: 12
Code:
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     Hitachi HDS5C3030ALA630
Serial Number:    MJ1311YNG3VG3A
Firmware Version: MEAOA5C0
User Capacity:    3,000,592,982,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed Feb  8 23:10:15 2012 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85)	Offline data collection activity
					was aborted by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 (36368) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       103
  3 Spin_Up_Time            0x0007   199   199   024    Pre-fail  Always       -       277 (Average 425)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       108
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   132   132   020    Pre-fail  Offline      -       32
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       2628
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       108
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       162
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       162
194 Temperature_Celsius     0x0002   136   136   000    Old_age   Always       -       44 (Lifetime Min/Max 17/45)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 1
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 21 hours (0 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  37 d0 01 af a3 50 e0 00      08:31:30.645  SET MAX ADDRESS EXT
  27 d0 00 00 00 00 e0 00      08:31:30.645  READ NATIVE MAX ADDRESS EXT
  27 d0 00 00 00 00 e0 00      08:31:30.626  READ NATIVE MAX ADDRESS EXT
  b0 d0 00 00 4f c2 a0 00      08:31:24.209  SMART READ DATA
  b0 d4 00 7f 4f c2 a0 00      08:31:23.709  SMART EXECUTE OFF-LINE IMMEDIATE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      2628         -
# 2  Short offline       Completed without error       00%      2618         -
# 3  Short offline       Completed without error       00%      2618         -
# 4  Short offline       Completed without error       00%      2597         -
# 5  Short offline       Completed without error       00%      1282         -
# 6  Short offline       Completed without error       00%        21         -
# 7  Short offline       Aborted by host               50%        21         -
# 8  Short offline       Completed without error       00%        12         -
# 9  Short offline       Completed without error       00%        12         -
#10  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
This drive is the only one that gave an error, but it still says it passed.
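
Going by the output above, the single logged error dates from hour 21 of the drive's life, and the drive is now at 2628 power-on hours. To compare the counters that usually matter across all the drives, a small filter like this might help. `smart_flags` is a made-up helper; it assumes the attribute table layout shown above, with the raw value in the tenth column.

```shell
#!/bin/sh
# smart_flags is a hypothetical helper: read smartctl attribute rows on stdin
# and print the name and raw value for IDs 5, 197, 198 and 199
# (reallocated, pending, uncorrectable, CRC errors).
# NF >= 10 skips shorter rows such as the selective self-test table.
smart_flags() {
    awk 'NF >= 10 && ($1 == 5 || $1 == 197 || $1 == 198 || $1 == 199) { print $2, $10 }'
}
```

Usage would be along the lines of `sudo smartctl -A /dev/sda | smart_flags`, repeated per drive. Non-zero raw values on any of those four would be worth a closer look.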
 
  

