LinuxQuestions.org


moisespedro 05-20-2015 10:16 AM

Critical SSD/RAID0 (?) bug under kernel 4.0.0, 4.0.1, 4.0.2 and 3.18.14 (?)
 
It seems those versions have a major bug that causes ext4 data corruption, resulting in data loss.

4.0.3 fixes this issue.

EDIT: After a follow-up on new posts I've had to change the title, please read the whole thread. Previously the title stated it was only an ext4 bug and only under 4.0.0, 4.0.1, 4.0.2.

drmozes 05-20-2015 11:33 AM

Quote:

Originally Posted by moisespedro (Post 5365062)
It seems those versions have a major bug that causes ext4 data corruption, resulting in data loss.

4.0.3 fixes this issue.

Yep, I can testify - it destroyed one of my ARM build machines. I thought it was perhaps the SSD but suspected otherwise.

GazL 05-20-2015 04:12 PM

Anyone got any hard details on this? Most of the articles seem to be linking back to the reports on the Arch and Phoronix forums.

The ext4 fix in 4.0.3 was also included in the 3.10.78 patch, and the issue seems to date back quite a way, so I don't know why it only seems to be reported for 4.0.y. Judging by this post by Ted Ts'o, he seems to be suggesting that this ext4 fix isn't the likely culprit. What I did notice from a quick scan of the forum posts where people are reporting problems is that most appear to be using SSDs or RAID (or both). Someone on the Arch forum has suggested that this TRIM-related fix might have something to do with the issue, but who knows. It's all a little vague right now.

I've used all the kernels from 4.0.0 -> 4.0.4 and not had any issues here, but then I'm not using an SSD and I'm not doing anything fancy that would likely trigger the issue Ted was talking about. Hopefully things will become clearer.

GazL 05-21-2015 06:59 AM

This one seems likely:
https://bugzilla.kernel.org/show_bug.cgi?id=98501

The original commit for "md/raid0: fix bug with chunksize not a power of 2." which looks to be at fault also made it into at least 3.14.41 and 3.19.7, in addition to 4.0.2. This issue has not been addressed in any stable branches as of yet.

I couldn't see any sign of this making it into 3.10.y, so anyone still following that branch ought to be safe (at least, as far as you can be, given that 'stable' doesn't seem to mean a great deal when talking about Linux kernel branches). Grabbing 3.10.79 for the ext4 corruption bug fix might be worth considering if you're still on 3.10.y.
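
For anyone wanting to check a machine quickly, here is a rough Python sketch of the version comparison described above. It only encodes the three releases named in this post as the first in their series to carry the "md/raid0: fix bug with chunksize not a power of 2." commit. The table and the function names are illustrative assumptions, not an authoritative list, and the sketch knows nothing about later releases that may carry a fix.

```python
import re

# First patch level in each stable series that, per this post, picked up the
# md/raid0 chunksize commit. Assumption: only as complete as the post itself,
# and it does not know which later releases contain a fix.
FIRST_WITH_RAID0_COMMIT = {
    (3, 14): 41,
    (3, 19): 7,
    (4, 0): 2,
}


def parse_release(release):
    """Turn a release string such as '4.0.2' or '3.14.41-smp' into (major, minor, patch)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def carries_raid0_commit(release):
    """True if the release is at or past the first listed patch level of its series."""
    major, minor, patch = parse_release(release)
    first = FIRST_WITH_RAID0_COMMIT.get((major, minor))
    return first is not None and patch >= first
```

On a running system the input would typically be the string returned by `platform.release()` from the standard library. A `False` result only means the release is not in the table above, not that the kernel is known to be safe.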


Stu, were you using raid0 by any chance?

drmozes 05-21-2015 07:10 AM

Quote:

Originally Posted by GazL (Post 5365491)

Stu, were you using raid0 by any chance?

Nope, just a regular ext4 FS on an SSD. It caused data corruption twice, which made me suspect the SSD (since it's brand new), but the SMART check reports all suggested that it was healthy, so I suspected it must be a kernel issue (it's not the first time I've seen an FS corruption line up with a recent kernel upgrade and lots of FS activity).
I have other build machines running Linux 4.0.2 that were mostly idle - there were no issues on those.

mlslk31 05-21-2015 09:13 AM

There's a fix noted in the ChangeLog for kernel 3.18.14. It's best to search for "ext4" and read what's there. Any summary by me would get something wrong.

moesasji 05-21-2015 03:21 PM

Quote:

Originally Posted by mlslk31 (Post 5365548)
There's a fix noted in the ChangeLog for kernel 3.18.14. It's best to search for "ext4" and read what's there. Any summary by me would get something wrong.

If you look at the explanation from the main kernel maintainer for ext4, it doesn't appear to be a bug that is easy to hit. See http://www.gossamer-threads.com/list...176274#2176274, so the changelog sounds a lot worse than it is in reality.

moisespedro 05-21-2015 03:53 PM

It seems the problem affects RAID0 more.

j_v 05-21-2015 06:21 PM

@moisespedro
Thanks for highlighting this issue. This is definitely worth investigating.

Also, thanks to everyone who has given input here. It's nice to have some starting points and initial reading to get familiar with the problem(s).

veerain 05-22-2015 04:49 AM

The corruption occurs if ext4 is used with RAID0. It has been fixed by the upstream developer, and the fix should be available with 4.1 or 4.0.4 and other LTS kernels.

GazL 05-22-2015 04:51 AM

Except Stuart wasn't using raid0 and he was hit by corruption, as posted above. This is why I'm inclined to believe that there are multiple issues at play. Most people affected do seem to be using SSDs, though.

BTW, 4.0.4 has been out a few days. I think you probably mean 4.0.5.

moesasji 05-23-2015 02:38 AM

Quote:

Originally Posted by veerain (Post 5365935)
The corruption occurs if ext4 is used with RAID0. It has been fixed by the upstream developer, and the fix should be available with 4.1 or 4.0.4 and other LTS kernels.

To be precise: according to the kernel bug report (https://bugzilla.kernel.org/show_bug.cgi?id=98501), this appears to be a bug in raid0 that surfaces during a trim operation, on any filesystem that supports trim. That is why it primarily shows up on ext4 combined with an SSD.
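
To see whether a box matches that combination (md raid0 plus online discard), something like the following Python sketch could be used. It assumes the usual `/proc/mdstat` layout ("md0 : active raid0 ...") and the `/proc/mounts` column order (device, mount point, type, options). The function names are made up for illustration, and the device matching is deliberately naive: it will miss arrays hidden behind LVM or dm-crypt, partitioned arrays such as md0p1, and discards issued by a periodic fstrim run rather than the `discard` mount option.

```python
def raid0_arrays(mdstat_text):
    """Return the names of md arrays listed with the raid0 personality."""
    names = []
    for line in mdstat_text.splitlines():
        fields = line.split()
        # Array lines look like: "md0 : active raid0 sdb1[1] sda1[0]"
        if (len(fields) >= 4 and fields[0].startswith("md")
                and fields[1] == ":" and "raid0" in fields[2:]):
            names.append(fields[0])
    return names


def discard_mounts(mounts_text):
    """Return (device, mountpoint) pairs mounted with the 'discard' option."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Columns: device, mount point, fs type, comma-separated options, ...
        if len(fields) >= 4 and "discard" in fields[3].split(","):
            hits.append((fields[0], fields[1]))
    return hits


def raid0_with_discard(mdstat_text, mounts_text):
    """Return discard mounts whose device name is directly a raid0 md array."""
    arrays = set(raid0_arrays(mdstat_text))
    return [(dev, mnt) for dev, mnt in discard_mounts(mounts_text)
            if dev.rsplit("/", 1)[-1] in arrays]
```

On a live system the two arguments would be the contents of `/proc/mdstat` and `/proc/mounts`. An empty result is not proof of safety, for the reasons listed above.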

Anyway it looks like multiple ext4 bugs are getting fixed. This one doesn't appear to have been fixed in the LTS branch yet if I search for the commit-id in the bugreport.

Didier Spaier 05-23-2015 03:02 AM

@moisespedro: assuming this is the same bug, I suggest that you change this thread's title to something like:

Quote:

WARNING: Software Raid 0 on SSD's and discard corrupts data

since it probably doesn't affect only ext4 file systems. I am referring to this message from Holger Kiehl on LKML.

Also, at the time of writing, according to this comment from Eric Work, the fix for this bug has been merged into Linus' tree as commit a81157768a00e8cf8a7b43b5ea5cac931262374f, but that doesn't mean that the 4.0 branch has been fixed.

But if it's not actually the same problem, then being aware of both is a good thing...

GazL 05-23-2015 05:14 AM

Softpedia have just posted an item saying that 3.18.14 LTS is out with fixes for these issues. Unless I'm mistaken, 3.18.14 is actually introducing the April 10th md/raid0 commit that appears to be behind much of the trouble.

I don't know if any slackers are following the 3.18 branch, but if you are, be very, very wary.

moesasji 05-23-2015 05:34 AM

Quote:

Originally Posted by GazL (Post 5366340)
Softpedia have just posted an item saying that 3.18.14 LTS is out with fixes for these issues. Unless I'm mistaken, 3.18.14 is actually introducing the April 10th md/raid0 commit that appears to be behind much of the trouble.

I don't know if any slackers are following the 3.18 branch, but if you are, be very, very wary.

For clarity: the commit that is supposed to fix this issue is a81157768a00e8cf8a7b43b5ea5cac931262374f. As far as I can tell, that commit has not yet appeared in the 3.18 support branch, so Softpedia appears to be wrong here.

BTW: anyone on current would have this problem, as the current branch runs the 3.18 kernel.
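
For anyone wanting to repeat the commit-id search, here is a small Python sketch that wraps `git log --grep`. Stable backports get a new hash of their own, but their commit messages normally cite the upstream commit id, so grepping the branch log for the upstream id is a reasonable test. The function names, the repository path and the branch name passed in are assumptions for illustration; it needs a local clone of a stable kernel tree to do anything useful.

```python
import subprocess

# Upstream fix commit named in this thread.
FIX_COMMIT = "a81157768a00e8cf8a7b43b5ea5cac931262374f"


def grep_log_command(branch, sha=FIX_COMMIT):
    """Build a git command listing commits on `branch` whose message mentions `sha`."""
    return ["git", "log", "--oneline", "--grep=" + sha, branch]


def branch_mentions_commit(repo_dir, branch, sha=FIX_COMMIT):
    """True if any commit message on `branch` in the clone at `repo_dir` mentions `sha`."""
    result = subprocess.run(
        grep_log_command(branch, sha),
        cwd=repo_dir,
        check=True,            # raise if git itself fails (bad path, unknown branch)
        capture_output=True,
        text=True,
    )
    return bool(result.stdout.strip())
```

A call would look like `branch_mentions_commit("/path/to/linux-stable", "linux-3.18.y")`, where both arguments are placeholders for whatever your local clone and branch are called. A `False` result means no commit message on that branch cites the id, which is suggestive rather than conclusive.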


All times are GMT -5.