[SOLVED] Errors reported by the SMART Disk Monitoring Daemon

FEL · 11-17-2013, 04:30 AM

Hello friends,

I was a good boy and configured smartd to monitor my drives and send an email upon errors/warnings.

Well, as you might have guessed, the first email has arrived and I need your help to figure out what to do.

My setup is:

2 x Seagate SV35 Series (3TB) in a RAID-1 array hosted by a 3ware SAS 9750-8i hardware RAID controller.

The email(s) contain:

Code:

SMART error (CurrentPendingSector) detected on host: 

The following warning/error was logged by the smartd daemon:

Device: /dev/twl0 [3ware_disk_03], 56 Currently unreadable (pending) sectors

[...]

SMART error (OfflineUncorrectableSector) detected on host:

The following warning/error was logged by the smartd daemon:

Device: /dev/twl0 [3ware_disk_03], 56 Offline uncorrectable sectors

[...]

SMART error (ErrorCount) detected on host:

The following warning/error was logged by the smartd daemon:

Device: /dev/twl0 [3ware_disk_03], ATA error count increased from 0 to 18

[...]

SMART error (CurrentPendingSector) detected on host:

The following warning/error was logged by the smartd daemon:

Device: /dev/twl0 [3ware_disk_03], 59 Currently unreadable (pending) sectors

[...]

SMART error (OfflineUncorrectableSector) detected on host:

The following warning/error was logged by the smartd daemon:

Device: /dev/twl0 [3ware_disk_03], 59 Offline uncorrectable sectors

... etc ...

SYSLOG obviously has the corresponding info:

Code:

Nov 17 06:54:54 host smartd[3667]: Device: /dev/twl0 [3ware_disk_02], self-test in progress, 40% remaining
Nov 17 06:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 59 Currently unreadable (pending) sectors
Nov 17 06:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 59 Offline uncorrectable sectors
Nov 17 06:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 58 to 53
Nov 17 06:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], SMART Usage Attribute: 194 Temperature_Celsius changed from 42 to 47
Nov 17 06:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], self-test in progress, 90% remaining
Nov 17 07:24:54 host smartd[3667]: Device: /dev/twl0 [3ware_disk_02], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 54 to 53
Nov 17 07:24:54 host smartd[3667]: Device: /dev/twl0 [3ware_disk_02], SMART Usage Attribute: 194 Temperature_Celsius changed from 46 to 47
Nov 17 07:24:54 host smartd[3667]: Device: /dev/twl0 [3ware_disk_02], self-test in progress, 30% remaining
Nov 17 07:24:54 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 59 Currently unreadable (pending) sectors
Nov 17 07:24:54 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 59 Offline uncorrectable sectors
Nov 17 07:24:54 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], self-test in progress, 80% remaining
Nov 17 07:54:54 host smartd[3667]: Device: /dev/twl0 [3ware_disk_02], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 53 to 54
Nov 17 07:54:54 host smartd[3667]: Device: /dev/twl0 [3ware_disk_02], SMART Usage Attribute: 194 Temperature_Celsius changed from 47 to 46
Nov 17 07:54:54 host smartd[3667]: Device: /dev/twl0 [3ware_disk_02], self-test in progress, 20% remaining
Nov 17 07:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 67 Currently unreadable (pending) sectors (changed +8)
Nov 17 07:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 67 Offline uncorrectable sectors (changed +8)
Nov 17 07:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 53 to 56
Nov 17 07:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], SMART Usage Attribute: 194 Temperature_Celsius changed from 47 to 44
Nov 17 07:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], previous self-test completed with error (read test element)
Nov 17 07:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], Self-Test Log error count increased from 0 to 1
Nov 17 07:54:55 host smartd[3667]: Sending warning via mail to x.y@gmail.com ...
Nov 17 07:54:56 host smartd[3667]: Warning via mail to x.y@gmail.com: successful
Nov 17 08:24:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_02], self-test in progress, 10% remaining
Nov 17 08:24:56 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 67 Currently unreadable (pending) sectors
Nov 17 08:24:56 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 67 Offline uncorrectable sectors
Nov 17 08:24:56 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 56 to 58
Nov 17 08:24:56 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], SMART Usage Attribute: 194 Temperature_Celsius changed from 44 to 42
Nov 17 08:54:54 host smartd[3667]: Device: /dev/twl0 [3ware_disk_02], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 54 to 55
Nov 17 08:54:54 host smartd[3667]: Device: /dev/twl0 [3ware_disk_02], SMART Usage Attribute: 194 Temperature_Celsius changed from 46 to 45
Nov 17 08:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 67 Currently unreadable (pending) sectors
Nov 17 08:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 67 Offline uncorrectable sectors
Nov 17 09:24:56 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 67 Currently unreadable (pending) sectors
Nov 17 09:24:56 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 67 Offline uncorrectable sectors
Nov 17 09:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_02], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 55 to 58
Nov 17 09:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_02], SMART Usage Attribute: 194 Temperature_Celsius changed from 45 to 42
Nov 17 09:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_02], previous self-test completed without error
Nov 17 09:54:56 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 67 Currently unreadable (pending) sectors
Nov 17 09:54:56 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 67 Offline uncorrectable sectors
Nov 17 09:54:56 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 58 to 59
Nov 17 09:54:56 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], SMART Usage Attribute: 194 Temperature_Celsius changed from 42 to 41
Nov 17 10:24:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_02], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 58 to 59
Nov 17 10:24:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_02], SMART Usage Attribute: 194 Temperature_Celsius changed from 42 to 41
Nov 17 10:24:56 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 67 Currently unreadable (pending) sectors
Nov 17 10:24:56 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 67 Offline uncorrectable sectors
Nov 17 10:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 67 Currently unreadable (pending) sectors
Nov 17 10:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 67 Offline uncorrectable sectors
Nov 17 11:24:56 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 67 Currently unreadable (pending) sectors
Nov 17 11:24:56 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 67 Offline uncorrectable sectors
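To see at a glance whether the pending-sector count is growing, the smartd lines above can be reduced to a short history. This is only a minimal sketch: the regular expression is an assumption fitted to the exact log format shown here, and the function names are made up for illustration.

```python
import re

# Matches smartd lines such as:
# "... [3ware_disk_03], 67 Currently unreadable (pending) sectors (changed +8)"
# The pattern is fitted to the syslog format quoted in this post (assumption).
PENDING_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+smartd\[\d+\]:\s+"
    r"Device:\s+(?P<dev>\S+)\s+\[(?P<disk>[^\]]+)\],\s+"
    r"(?P<count>\d+)\s+Currently unreadable \(pending\) sectors"
)


def pending_history(lines):
    """Return (timestamp, disk, count) for every pending-sector log line."""
    history = []
    for line in lines:
        match = PENDING_RE.match(line)
        if match:
            history.append(
                (match.group("stamp"), match.group("disk"), int(match.group("count")))
            )
    return history


def is_growing(history):
    """True if the last recorded count is higher than the first one."""
    counts = [count for _, _, count in history]
    return len(counts) >= 2 and counts[-1] > counts[0]


# Three lines copied from the syslog excerpt above.
SAMPLE = [
    "Nov 17 06:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 59 Currently unreadable (pending) sectors",
    "Nov 17 06:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 59 Offline uncorrectable sectors",
    "Nov 17 07:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], 67 Currently unreadable (pending) sectors (changed +8)",
]

for stamp, disk, count in pending_history(SAMPLE):
    print(stamp, disk, count)
print("growing:", is_growing(pending_history(SAMPLE)))
```

Only the "Currently unreadable (pending)" lines are picked up; the "Offline uncorrectable" lines carry the same number in this log and are skipped on purpose.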

Edit: ... And here's the output of smartctl:

Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST3000VX000-9YW166
Serial Number:    -
LU WWN Device Id: 5 000c50 04e5521f2
Firmware Version: CV13
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sun Nov 17 11:55:46 2013 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 119)	The previous self-test completed having
					the read element of the test failed.
Total time to complete Offline 
data collection: 		(  575) seconds.
Offline data collection
capabilities: 			 (0x73) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x10b9)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   103   099   006    Pre-fail  Always       -       179120384
  3 Spin_Up_Time            0x0003   095   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       36
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       83068023
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       9516
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       36
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   085   085   000    Old_age   Always       -       15
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       4295032833
189 High_Fly_Writes         0x003a   001   001   000    Old_age   Always       -       1152
190 Airflow_Temperature_Cel 0x0022   059   050   045    Old_age   Always       -       41 (Min/Max 39/48)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       35
193 Load_Cycle_Count        0x0032   058   058   000    Old_age   Always       -       85847
194 Temperature_Celsius     0x0022   041   050   000    Old_age   Always       -       41 (0 19 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       67
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       67
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 25 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 25 occurred at disk power-on lifetime: 9481 hours (395 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 ff ff ff 4f 00  13d+03:08:52.223  READ FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:33.173  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:33.138  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:33.118  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:33.092  WRITE FPDMA QUEUED

Error 24 occurred at disk power-on lifetime: 9481 hours (395 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 ff ff ff 4f 00  13d+03:08:32.257  READ FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:13.213  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:13.183  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:13.160  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:13.139  WRITE FPDMA QUEUED

Error 23 occurred at disk power-on lifetime: 9481 hours (395 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 ff ff ff 4f 00  13d+03:08:32.257  READ FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:13.213  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:13.183  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:13.160  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:13.139  WRITE FPDMA QUEUED

Error 22 occurred at disk power-on lifetime: 9481 hours (395 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 ff ff ff 4f 00  13d+03:08:09.239  READ FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:07:50.192  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:07:50.165  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:07:50.144  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:07:50.119  WRITE FPDMA QUEUED

Error 21 occurred at disk power-on lifetime: 9481 hours (395 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 ff ff ff 4f 00  13d+03:07:47.002  READ FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:07:27.953  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:07:27.917  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:07:27.897  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:07:27.872  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       70%      9512         1810718288
# 2  Short offline       Completed without error       00%      9486         -
# 3  Short offline       Completed without error       00%      9462         -
# 4  Short offline       Completed without error       00%      9438         -
# 5  Short offline       Completed without error       00%      9414         -
# 6  Short offline       Completed without error       00%      9390         -
# 7  Short offline       Completed without error       00%      9366         -
# 8  Extended offline    Completed without error       00%      9348         -
# 9  Short offline       Completed without error       00%      9318         -
#10  Short offline       Completed without error       00%      9294         -
#11  Short offline       Completed without error       00%      9270         -
#12  Short offline       Completed without error       00%      9246         -
#13  Short offline       Completed without error       00%      9222         -
#14  Short offline       Completed without error       00%      9198         -
#15  Extended offline    Completed without error       00%      9179         -
#16  Short offline       Completed without error       00%      9150         -
#17  Short offline       Completed without error       00%      9126         -
#18  Short offline       Completed without error       00%      9102         -
#19  Short offline       Completed without error       00%      9078         -
#20  Short offline       Completed without error       00%      9054         -
#21  Short offline       Completed without error       00%      9030         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
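As a rough aid for reading the self-test log, the LBA_of_first_error can be translated into a byte offset and a physical sector. The sketch below uses only numbers from the smartctl output above (512-byte logical sectors, 4096-byte physical sectors, the User Capacity line); the function name is hypothetical.

```python
LOGICAL_SECTOR_BYTES = 512               # "512 bytes logical" in the output above
PHYSICAL_SECTOR_BYTES = 4096             # "4096 bytes physical"
USER_CAPACITY_BYTES = 3_000_592_982_016  # "User Capacity" line


def locate_lba(lba):
    """Translate a logical block address into byte and physical-sector terms."""
    byte_offset = lba * LOGICAL_SECTOR_BYTES
    physical_sector, offset_in_sector = divmod(byte_offset, PHYSICAL_SECTOR_BYTES)
    return {
        "byte_offset": byte_offset,
        "physical_sector": physical_sector,
        "offset_in_physical_sector": offset_in_sector,
        "fraction_of_disk": byte_offset / USER_CAPACITY_BYTES,
    }


# LBA_of_first_error reported by self-test #1 (Extended offline, read failure).
first_error = locate_lba(1810718288)
print(first_error)
```

The failing LBA sits roughly 31% of the way into the disk, which lines up with the extended test stopping at "70% remaining".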

1) Basically, what is the next move here? Is this something that I could ignore, correct somehow, or should I simply replace the drive?

2) "SMART overall-health self-assessment test result: PASSED", isn't this kind of misleading?

3) Anyone know Seagate's RMA policy on SMART errors?

Thanks in advance,

Thomas

Ser Olmy · 11-17-2013, 07:42 AM

Use the 3ware management tool to force a verify. That should cause the offending sectors to be rewritten and thus reallocated. When it's done, the Current_Pending_Sector count should be 0, and there would typically be a corresponding increase in the Reallocated_Sector_Ct attribute.

If the number of bad sectors keeps growing, the drive should be replaced.
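One way to check this is to compare attributes 5, 197 and 198 between two `smartctl -a` dumps. The following is a sketch only: it assumes the ten-column attribute table layout that smartctl 5.41 prints in this thread, the helper names are invented for illustration, and the sample rows are copied from the two dumps posted here.

```python
# Attribute IDs worth watching after a verify or rebuild.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def parse_attributes(smartctl_output):
    """Map attribute ID -> raw value (int) from a `smartctl -a` attribute table."""
    attrs = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows: numeric ID, name, hex FLAG, ..., RAW_VALUE in column 10.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            attrs[int(fields[0])] = int(fields[9])
    return attrs


def compare(before, after):
    """Return the change in raw value for each watched attribute."""
    return {attr_id: after[attr_id] - before[attr_id] for attr_id in WATCHED}


BEFORE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       67
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       67
"""

AFTER = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

changes = compare(parse_attributes(BEFORE), parse_attributes(AFTER))
for attr_id, delta in changes.items():
    print(f"{attr_id:>3} {WATCHED[attr_id]:<24} {delta:+d}")
```

A pending count that drops while Reallocated_Sector_Ct stays flat suggests the sectors were rewritten in place rather than remapped; a pending count that keeps climbing is the case for replacing the drive.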

AFAIK, the S.M.A.R.T. status will remain as "PASSED" until a critical error occurs or one of the self-tests fails.

As for warranty replacements, I know that if the Seagate diagnostic tool reports an error, they will replace the drive. A few bad sectors that can be reallocated are typically not considered errors.

TobiSGD · 11-17-2013, 09:35 AM

Your case/airflow temperature is way too high; you should take care of that.

rknichols · 11-17-2013, 09:56 AM

Temperature specs for modern drives are nowhere near as restrictive as they were in the past. The ST3000VX000 is rated for operation with a maximum case temperature of 60C. Since the SMART attributes are showing the same number for Airflow temperature (attribute 190) and drive temperature (attribute 194), I doubt that it is really indicating the incoming air temperature. A temperature of 41C (indicated maximum of 48C) is certainly getting warm, but should not be a cause for major alarm, IMO.
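A side note on reading the syslog numbers: the values smartd logs for attributes 190 and 194 in this thread always add up to 100 (58/42, 53/47, 59/41), and the smartctl table shows attribute 190 with VALUE 059 next to a raw value of 41. That suggests the logged figure for 190 is a normalized value equal to 100 minus the temperature, not the temperature itself. The sketch below checks that reading against the pairs in the log; the "100 minus" rule is an assumption inferred from this drive's output, and the 60C limit is simply the figure quoted above.

```python
RATED_MAX_C = 60  # maximum case temperature quoted above for the ST3000VX000


def airflow_value_to_celsius(normalized_190):
    """Assumed mapping for this drive: attribute 190 VALUE = 100 - degrees C."""
    return 100 - normalized_190


# (attribute 190 value, attribute 194 value) pairs taken from the syslog excerpt.
LOGGED_PAIRS = [(58, 42), (53, 47), (54, 46), (56, 44), (59, 41)]

for value_190, value_194 in LOGGED_PAIRS:
    celsius = airflow_value_to_celsius(value_190)
    agrees = celsius == value_194
    print(f"190={value_190} -> {celsius}C, 194={value_194}, agree={agrees}, "
          f"headroom={RATED_MAX_C - celsius}C")
```

Under that assumption the hottest reading in the log is 47C, consistent with the Min/Max 39/48 shown in the raw value column.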

FEL · 11-19-2013, 06:15 AM

Hello all,

Thanks for all the advice.

For some reason that I can't remember, the 3ware CLI application (tw_cli) did not allow me to "verify" the disk. But I did manage to rebuild the array and now things are looking a lot better. With the reservation that I might have done things slightly wrong, here are my actions:

Check current status of units and drives:

Code:

sudo ./tw_cli /c0 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-1    OK             -       -       -       111.748   RiW    ON
u1    RAID-1    DEGRADED       -       -       -       2793.96   RiW    ON

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u0   111.79 GB SATA  0   -            INTEL SSDSA2CW120G3
p1    OK             u0   111.79 GB SATA  1   -            INTEL SSDSA2CW120G3
p2    OK             u1   2.73 TB   SATA  2   -            ST3000VX000-9YW166
p3    ECC-ERROR      u1   2.73 TB   SATA  3   -            ST3000VX000-9YW166

Make sure autorebuild is actually on:

Code:

sudo ./tw_cli /c0 set autorebuild=on
Setting Auto-Rebuild Policy on /c0 to on ... Done.

Remove the faulty drive:

Code:

sudo ./tw_cli /c0/p3 remove
Removing /c0/p3 will take the disk offline.
Do you want to continue ? Y|N [N]: y
Removing port /c0/p3 ... Done.

Check that the drive is actually removed from the unit:

Code:

sudo ./tw_cli /c0 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-1    OK             -       -       -       111.748   RiW    ON
u1    RAID-1    DEGRADED       -       -       -       2793.96   RiW    ON

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u0   111.79 GB SATA  0   -            INTEL SSDSA2CW120G3
p1    OK             u0   111.79 GB SATA  1   -            INTEL SSDSA2CW120G3
p2    OK             u1   2.73 TB   SATA  2   -            ST3000VX000-9YW166

Issue a rescan to re-add the drive:

Code:

sudo ./tw_cli /c0 rescan
Rescanning controller /c0 for units and drives ...Done.
Found the following unit(s): [none].
Found the following drive(s): [/c0/p3].

Check current units/drives again:

Code:

sudo ./tw_cli /c0 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-1    OK             -       -       -       111.748   RiW    ON
u1    RAID-1    REBUILDING     1%      -       -       2793.96   RiW    ON

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u0   111.79 GB SATA  0   -            INTEL SSDSA2CW120G3
p1    OK             u0   111.79 GB SATA  1   -            INTEL SSDSA2CW120G3
p2    OK             u1   2.73 TB   SATA  2   -            ST3000VX000-9YW166
p3    DEGRADED       u1   2.73 TB   SATA  3   -            ST3000VX000-9YW166

... Wait a few hours and check again:

Code:

sudo ./tw_cli /c0 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-1    OK             -       -       -       111.748   RiW    ON     
u1    RAID-1    OK             -       -       -       2793.96   RiW    ON     

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u0   111.79 GB SATA  0   -            INTEL SSDSA2CW120G3 
p1    OK             u0   111.79 GB SATA  1   -            INTEL SSDSA2CW120G3 
p2    OK             u1   2.73 TB   SATA  2   -            ST3000VX000-9YW166  
p3    OK             u1   2.73 TB   SATA  3   -            ST3000VX000-9YW166
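Instead of eyeballing the table every few hours, the unit and port status can be pulled out of the `tw_cli /c0 show` text with a few lines of code. This is a minimal sketch that assumes the whitespace-separated layout shown in this post; the function names are made up, and the sample is the mid-rebuild output from above.

```python
import re


def parse_units(show_output):
    """Map unit name (u0, u1, ...) -> status and rebuild progress."""
    units = {}
    for line in show_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and re.fullmatch(r"u\d+", fields[0]):
            units[fields[0]] = {"status": fields[2], "rebuild": fields[3]}
    return units


def parse_ports(show_output):
    """Map port name (p0, p1, ...) -> status and owning unit."""
    ports = {}
    for line in show_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and re.fullmatch(r"p\d+", fields[0]):
            ports[fields[0]] = {"status": fields[1], "unit": fields[2]}
    return ports


def all_ok(show_output):
    """True only when every unit and every port reports OK."""
    statuses = [u["status"] for u in parse_units(show_output).values()]
    statuses += [p["status"] for p in parse_ports(show_output).values()]
    return bool(statuses) and all(status == "OK" for status in statuses)


# Output captured during the rebuild (copied from this post).
SAMPLE = """\
Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-1 OK - - - 111.748 RiW ON
u1 RAID-1 REBUILDING 1% - - 2793.96 RiW ON

VPort Status Unit Size Type Phy Encl-Slot Model
------------------------------------------------------------------------------
p0 OK u0 111.79 GB SATA 0 - INTEL SSDSA2CW120G3
p1 OK u0 111.79 GB SATA 1 - INTEL SSDSA2CW120G3
p2 OK u1 2.73 TB SATA 2 - ST3000VX000-9YW166
p3 DEGRADED u1 2.73 TB SATA 3 - ST3000VX000-9YW166
"""

print(parse_units(SAMPLE))
print(parse_ports(SAMPLE))
print("all OK:", all_ok(SAMPLE))
```

Feeding it the captured text keeps the sketch independent of the controller; on a live system the same functions could be given whatever `tw_cli /c0 show` prints.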

Syslog is now looking a lot better:

Code:

sudo grep smartd /var/log/syslog
Nov 19 06:54:55 host smartd[3667]: Device: /dev/twl0 [3ware_disk_03], previous self-test completed without error

According to the product manual, the drives offer "Uncompromising reliability supports flexible surveillance design with case temperatures up to 70º C". So even though ~60º is high, and I agree that it is, it should be OK.

And for the record, here's the current output of smartctl:

Code:

sudo smartctl -a -d 3ware,3 /dev/twl0
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST3000VX000-9YW166
Serial Number:    -
LU WWN Device Id: 5 000c50 04e5521f2
Firmware Version: CV13
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Nov 19 13:23:39 2013 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  575) seconds.
Offline data collection
capabilities: 			 (0x73) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x10b9)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   103   099   006    Pre-fail  Always       -       179136296
  3 Spin_Up_Time            0x0003   095   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       36
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       83086100
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       9565
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       36
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   085   085   000    Old_age   Always       -       15
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       4295032833
189 High_Fly_Writes         0x003a   001   001   000    Old_age   Always       -       1556
190 Airflow_Temperature_Cel 0x0022   059   050   045    Old_age   Always       -       41 (Min/Max 39/48)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       35
193 Load_Cycle_Count        0x0032   058   058   000    Old_age   Always       -       85943
194 Temperature_Celsius     0x0022   041   050   000    Old_age   Always       -       41 (0 19 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 25 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 25 occurred at disk power-on lifetime: 9481 hours (395 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 ff ff ff 4f 00  13d+03:08:52.223  READ FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:33.173  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:33.138  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:33.118  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:33.092  WRITE FPDMA QUEUED

Error 24 occurred at disk power-on lifetime: 9481 hours (395 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 ff ff ff 4f 00  13d+03:08:32.257  READ FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:13.213  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:13.183  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:13.160  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:13.139  WRITE FPDMA QUEUED

Error 23 occurred at disk power-on lifetime: 9481 hours (395 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 ff ff ff 4f 00  13d+03:08:32.257  READ FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:13.213  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:13.183  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:13.160  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:08:13.139  WRITE FPDMA QUEUED

Error 22 occurred at disk power-on lifetime: 9481 hours (395 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 ff ff ff 4f 00  13d+03:08:09.239  READ FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:07:50.192  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:07:50.165  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:07:50.144  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:07:50.119  WRITE FPDMA QUEUED

Error 21 occurred at disk power-on lifetime: 9481 hours (395 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 ff ff ff 4f 00  13d+03:07:47.002  READ FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:07:27.953  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:07:27.917  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:07:27.897  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00  13d+03:07:27.872  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      9558         -
# 2  Short offline       Completed without error       00%      9534         -
# 3  Extended offline    Completed: read failure       70%      9512         1810718288
# 4  Short offline       Completed without error       00%      9486         -
# 5  Short offline       Completed without error       00%      9462         -
# 6  Short offline       Completed without error       00%      9438         -
# 7  Short offline       Completed without error       00%      9414         -
# 8  Short offline       Completed without error       00%      9390         -
# 9  Short offline       Completed without error       00%      9366         -
#10  Extended offline    Completed without error       00%      9348         -
#11  Short offline       Completed without error       00%      9318         -
#12  Short offline       Completed without error       00%      9294         -
#13  Short offline       Completed without error       00%      9270         -
#14  Short offline       Completed without error       00%      9246         -
#15  Short offline       Completed without error       00%      9222         -
#16  Short offline       Completed without error       00%      9198         -
#17  Extended offline    Completed without error       00%      9179         -
#18  Short offline       Completed without error       00%      9150         -
#19  Short offline       Completed without error       00%      9126         -
#20  Short offline       Completed without error       00%      9102         -
#21  Short offline       Completed without error       00%      9078         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I have filed an RMA with the retailer. Perhaps they will offer a replacement.

smartd is scheduled to run tests on the drive, and if more offline uncorrectable sectors appear I will replace it.