LinuxQuestions.org
Old 06-21-2020, 05:14 AM   #1
bodge99
Member
 
Registered: Oct 2018
Location: Ashington, Northumberland
Distribution: Artix, Slackware, Devuan etc. No systemd!
Posts: 368

Rep: Reputation: 66
Shingled hard drive problems. Total data loss


Hi,

I've recently had a few problems with shingled drives.. specifically:

Seagate BarraCuda 2 TB. Intel i7 laptop running Artix Linux.

Toshiba L200 1TB. Intel i3 laptop running Slackware Linux.

Western Digital Red 4TB 3.5" NAS (RAID 5).

Both laptops have a primary SSD system drive with a secondary (data) drive placed in a DVD style internal caddy. I've been using these caddies for a few years now, with no problems whatsoever. The currently used drives were installed within the last year as I wanted more space.

Anyway..

The WD drive has been kicked out of the array and reported as "failing". This drive has had about 200 hours use. The drive data appears to be intact and perfectly valid.
It would appear that this drive is shingled whereas the other drives in the array are not. The drive tests as "perfect" with excellent SMART data.

Seagate Barracuda: I was reading a PDF when the system froze. On restart, the drive appeared empty when viewed with gparted.

Toshiba L200: This is used in a bedroom based media computer. I was watching a film when the playback just froze. As with the Seagate drive, the drive now appears totally empty.

I'm not concerned about losing the media collection as I have decent backups.. but I am concerned that both drives appear to have lost everything.

I next removed the Toshiba drive from the laptop and connected it up to one of my development machines for analysis.
Using several tools the best I could recover was the partition & directory structure (Ext4 on GPT).. No files whatsoever.

I then repeated the exercise with the Seagate drive. As before, the best that I could recover was the partition & directory structure. Again no files recoverable at all.. Not even a single file name...

Both laptop drives test perfectly O.K. I've replaced the laptop data drives with what was fitted earlier. These are working perfectly.

Has anybody else seen this behaviour??

Bodge99
 
Old 06-21-2020, 05:54 AM   #2
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,511

Rep: Reputation: 2377
I consider myself a hardware guy used to dealing with hardware failures, but
  • I have never heard of a shingled drive. Whassat?
  • I have never seen such a list of hardware failures. The last drive jettisoned here was 11 or 12 years old and failed because of an argument between itself and the socket it was plugged into. There is a 1TB whose case went intermittent but that had lived a very hard life in software houses.
  • I'd nearly expect a floppy disk to last 200 hours. Certainly a USB drive should.
  • I would look for a common cause - unusual settings, containers, heat, anything nonstandard, possible malware or BIOS bug.
  • How do your regular disk checks turn out?

The failure mode is also unusual - you lose the filesystem. There's no disk error. The drive didn't really die. Rather the directory structure was surgically extracted, costing you the data. That sounds like malware. Is there another filesystem you can use with better recovery options?
 
Old 06-21-2020, 06:49 AM   #3
bodge99
Member
 
Registered: Oct 2018
Location: Ashington, Northumberland
Distribution: Artix, Slackware, Devuan etc. No systemd!
Posts: 368

Original Poster
Rep: Reputation: 66
Hi,

Thanks for the reply..

See https://www.theregister.com/2016/02/..._the_shingles/

and https://www.extremetech.com/computin...etic-recording

I now understand that shingled drives are not really recommended for NAS use. The potential (re)write delays can cause the drive to appear as "failing". This drive is going to be returned to the vendor... It is actually rated as a NAS drive!

The laptop working "environments" are perfectly normal. I'm paranoid about operating temperatures. I check regularly and log everything.

I'm confident that there are no unusual settings, heat problems or anything else non-standard.. Definitely no malware..
Both laptops were installed from secure sources and are updated from an airgapped NAS (no direct net access for this hardware or anything on its local network).

This leaves machine or drive firmware bugs.. I'm reasonably happy with the quality of the laptop UEFI firmwares.. I've got some experience with modding/correcting errors in UEFI firmware etc. I haven't found any particular problems with either laptop's firmware.. But similar failures on two different makes of shingled drive?? umm..!!

I've run further tests on both drives (yes, I did make drive images before performing any tests).. No files are recoverable with anything..
I even tried some Windows tools on a clean Windows 7 install (a **real** blast from the past for me!). These either found just the partition and directory structure or nothing at all.

I've now reformatted both drives (Ext4 on GPT) and am currently stress testing both of them.

Finally, normal (weekly) manual disk checks indicated no problems at all and the drive firmwares are the latest available..

Bodge99

Last edited by bodge99; 06-21-2020 at 12:37 PM.
 
Old 06-21-2020, 06:53 AM   #4
Ser Olmy
Senior Member
 
Registered: Jan 2012
Distribution: Slackware
Posts: 3,347

Rep: Reputation: Disabled
I'd have to agree with business_kid when it comes to your mysteriously disappearing partition tables and file systems. It doesn't sound like any issue I've ever seen in my 35+ years of experience with hard drive storage and failures. Something very odd is going on.

With regards to the WD drive being kicked from the array, would this by any chance be a software RAID?

It's not unusual for timeouts to occur with such setups. This can happen when the drive's internal error recovery spends a long time retrying a bad sector, or possibly when an SMR (shingled) drive has to rewrite several overlapping sectors.

For non-SMR drives, limiting the drive's automatic error recovery time ("SCT Error Recovery Control", also known as "Time-Limited Error Recovery") with smartctl might do the trick. Incidentally, that is what all hardware RAID controllers do.

Not all desktop drives support this, however, and in that case increasing the timeout value in /sys/block/<device_name>/device/timeout is the only workaround. This may cause a system to become unresponsive for longer when a disk issue occurs, but it dramatically decreases the chances of a busy drive being ejected from an array.
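As a rough illustration of those two knobs, the Python sketch below only builds the smartctl command line and the sysfs path; it does not touch any drive unless set_kernel_timeout() is called as root. The 7-second recovery limit, the 180-second kernel timeout and the device name "sdb" are assumptions picked for the example, not recommendations from this thread. smartctl's "-l scterc,READ,WRITE" option takes its values in tenths of a second.

```python
def scterc_command(device: str, seconds: float = 7.0) -> list[str]:
    """Build a smartctl call that caps read/write error recovery time.

    smartctl expects the limits in deciseconds (tenths of a second).
    """
    deciseconds = round(seconds * 10)
    return ["smartctl", "-l", f"scterc,{deciseconds},{deciseconds}", f"/dev/{device}"]


def kernel_timeout_path(device: str) -> str:
    """Return the sysfs file holding the kernel's command timeout (seconds)."""
    return f"/sys/block/{device}/device/timeout"


def set_kernel_timeout(device: str, seconds: int = 180) -> None:
    """Raise the kernel timeout; needs root and a real block device."""
    with open(kernel_timeout_path(device), "w") as handle:
        handle.write(f"{seconds}\n")


if __name__ == "__main__":
    # "sdb" is a placeholder device name.
    print(" ".join(scterc_command("sdb")))
    print(kernel_timeout_path("sdb"))
```

Neither setting normally survives a reboot or power cycle, so a setup relying on them would usually reapply them at boot.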

SMR drives have the additional issue that multiple sectors must be rewritten during normal sector rewrites, and if one or more of the sectors in question have errors, well, expect things to take time. This article claims that delays of up to 10 minutes(!) can occur while an SMR drive sorts out issues with overlapping sectors.

SMR drives really only work well in "write once" (archive) scenarios.
 
Old 06-21-2020, 06:58 AM   #5
bodge99
Member
 
Registered: Oct 2018
Location: Ashington, Northumberland
Distribution: Artix, Slackware, Devuan etc. No systemd!
Posts: 368

Original Poster
Rep: Reputation: 66
Hi,

Ser Olmy: The NAS is a hardware one. I built it from a "gifted" server motherboard. It has had flawless uptime for more than 4 years..

Bodge99
 
Old 06-21-2020, 07:04 AM   #6
Ser Olmy
Senior Member
 
Registered: Jan 2012
Distribution: Slackware
Posts: 3,347

Rep: Reputation: Disabled
What type of RAID controller does this motherboard use?

Anyway, I believe even the best RAID controllers will choke on the excessive recovery delays of some SMR drives. WD has a particularly shoddy reputation in that regard.
 
Old 06-21-2020, 07:17 AM   #7
bodge99
Member
 
Registered: Oct 2018
Location: Ashington, Northumberland
Distribution: Artix, Slackware, Devuan etc. No systemd!
Posts: 368

Original Poster
Rep: Reputation: 66
Hi,

The controller card is a Dell PERC storage controller.. I think it's a H720 or H730 (unsure without checking).
This was also a gift from someone after I'd managed to recover his daughter's PhD thesis from a Windows crash two days before it was due to be submitted..

Bodge99
 
Old 06-22-2020, 04:46 AM   #8
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,511

Rep: Reputation: 2377
Might I suggest that while we're posting you make the journal external and hop on eBay for a USB caddy to put your drive into? It might be a better way.
 
Old 06-22-2020, 07:37 AM   #9
bodge99
Member
 
Registered: Oct 2018
Location: Ashington, Northumberland
Distribution: Artix, Slackware, Devuan etc. No systemd!
Posts: 368

Original Poster
Rep: Reputation: 66
Hi,

I've got several USB caddies.. I do find them quite useful for "sneakernet" purposes..

I've had a closer look at the drive images at sector level.. Superblocks seem to exist but everything after the total inode count onwards is set to 'FF'.. Sectors that should contain file data are zeroed out.
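A check of this kind can be repeated on a raw image with a few lines of Python. The sketch below is an illustration only, not necessarily how the images in this thread were examined. It assumes the standard ext2/3/4 layout: the primary superblock starts 1024 bytes into the partition, the total inode count is a little-endian 32-bit value at offset 0, and the magic number 0xEF53 sits at offset 0x38. The function names and the returned keys are made up for the example, and partition_offset has to be supplied by the reader (on a GPT disk the first partition commonly starts at sector 2048, but that is an assumption too).

```python
import struct

SUPERBLOCK_OFFSET = 1024   # bytes from the start of the partition
SUPERBLOCK_SIZE = 1024
EXT_MAGIC = 0xEF53         # s_magic, little-endian 16-bit value at offset 0x38


def inspect_superblock(raw: bytes) -> dict:
    """Summarise one 1024-byte ext2/3/4 superblock."""
    if len(raw) < SUPERBLOCK_SIZE:
        raise ValueError("need a full 1024-byte superblock")
    raw = raw[:SUPERBLOCK_SIZE]
    (inodes_count,) = struct.unpack_from("<I", raw, 0x00)
    (magic,) = struct.unpack_from("<H", raw, 0x38)
    tail = raw[4:]  # everything after the total inode count
    return {
        "inodes_count": inodes_count,
        "magic_ok": magic == EXT_MAGIC,
        "tail_all_ff": tail == b"\xff" * len(tail),
    }


def read_primary_superblock(image_path: str, partition_offset: int = 0) -> bytes:
    """Read the primary superblock from a raw image file."""
    with open(image_path, "rb") as image:
        image.seek(partition_offset + SUPERBLOCK_OFFSET)
        return image.read(SUPERBLOCK_SIZE)


if __name__ == "__main__":
    # Synthetic example: an inode count followed by nothing but 0xFF bytes.
    damaged = struct.pack("<I", 61054976) + b"\xff" * 1020
    print(inspect_superblock(damaged))
```

A superblock matching the description above would report tail_all_ff as true and magic_ok as false, since the magic number lies inside the overwritten region.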

Both drives have been stress tested for nearly 24 hrs now.. (I'll leave them for 48 hrs.). Everything is fine, so far.

Bodge99
 
Old 06-22-2020, 12:19 PM   #10
bodge99
Member
 
Registered: Oct 2018
Location: Ashington, Northumberland
Distribution: Artix, Slackware, Devuan etc. No systemd!
Posts: 368

Original Poster
Rep: Reputation: 66
Hi,

I've looked further at the backup superblocks on both disk images.. Some of the "later" superblocks show some (apparently) valid data..
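Where the backup superblocks should sit can be calculated rather than hunted for. The sketch below assumes the usual ext4 defaults, none of which are confirmed for these particular drives: the sparse_super feature, 4 KiB blocks and 32768 blocks per group. Under those assumptions the backups are at the start of group 1 and of every group whose number is a power of 3, 5 or 7. The function names are illustrative.

```python
def backup_superblock_groups(group_count: int) -> list[int]:
    """Block groups holding a backup superblock under sparse_super.

    Group 0 holds the primary copy, so it is not listed.
    """
    groups = set()
    if group_count > 1:
        groups.add(1)
    for base in (3, 5, 7):
        power = base
        while power < group_count:
            groups.add(power)
            power *= base
    return sorted(groups)


def backup_superblock_blocks(group_count: int, blocks_per_group: int = 32768) -> list[int]:
    """First block of each backup group (valid when the block size is above 1 KiB)."""
    return [group * blocks_per_group for group in backup_superblock_groups(group_count)]


if __name__ == "__main__":
    # 100 groups is an arbitrary example size (about 12.5 GiB at the defaults).
    print(backup_superblock_blocks(100))
```

Block numbers from such a list are the kind of value that e2fsck's -b option takes when the primary superblock is unusable.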

O.K. I've now made a decision. I now don't trust shingled drives **AT ALL**. I've double checked the type of drives that I have in everything here. Every large drive that I own (about 14 drives >2TB) is shingled, apart from my NAS drives, and even one of those is.. I'll be removing these over the next couple of months and replacing them with either non-shingled drives or SSDs. The costs will be a little "painful" but I prefer data security over capacity, even if this means that I have to use a greater number of devices overall.

The WD NAS drive has been accepted for return & replacement by the seller as "not fit for purpose". Unfortunately, this means that my primary NAS is out of action for a week or so.

Bodge99
 
  

