Old 08-16-2016, 03:44 PM   #61
jpollard
Senior Member
 
Registered: Dec 2012
Location: Washington DC area
Distribution: Fedora, CentOS, Slackware
Posts: 4,912

Rep: Reputation: 1513

Quote:
Originally Posted by chrismurphy View Post
Please elaborate. First off, the drive is supposed to detect its own errors and report them to Btrfs, which then might be able to fix the problem depending on what the error is. There is actually a common misconfiguration these days: drive manufacturers have changed drives to have rather obscenely long recovery times for marginally bad sectors, which can exceed the kernel command timer and cause a link reset rather than waiting long enough for the drive to actually present the data or report a read error. Second, Btrfs does have the ability to detect metadata and data errors, and correct for them if the metadata or data in question is replicated (raid1,5,6). Otherwise, no.
It didn't under my testing. But then, I already knew btrfs was experimental. A simple test is to zero some data blocks on a btrfs volume (you do have to go directly to the drive to do that) - then see if it recovers the damaged file.

My tests with raid5 showed it did not. It REPORTED that it had - but the data remained corrupted.
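Something like this, roughly, is what I did (device name, offset, and mount point are made up; only try it on a scratch volume you can afford to lose):

Code:
# overwrite a chunk of one member device directly, behind btrfs's back
dd if=/dev/zero of=/dev/sdb bs=4096 seek=2000000 count=256 conv=notrunc
# drop the page cache so the next read really comes from disk
echo 3 > /proc/sys/vm/drop_caches
# read the files back and compare against checksums taken beforehand
md5sum -c /root/good-sums.txt
# a scrub should report the csum errors and, with redundancy, repair them
btrfs scrub start -Bd /mnt/test
dmesg | grep -i btrfs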

Quote:

Did you file a bug? I've used it quite extensively without problems up until recently when I switched to Samba shares, also without problems.
No - raid 5 was totally experimental, and I couldn't even get a dump of the hang. The system just stopped.

BTW, the mailing lists show it is still hanging sometimes even now. It doesn't hang consistently (at least not for me). It can run weeks before it happens - the only common element has been the use of btrfs. Remove that - no hangs.

Last edited by jpollard; 08-16-2016 at 03:55 PM.
 
Old 08-16-2016, 03:46 PM   #62
chrismurphy
LQ Newbie
 
Registered: Feb 2011
Posts: 18

Rep: Reputation: 1
Quote:
Originally Posted by Richard Cranium View Post
I'll point out that LVM allows you to create a snapshot volume that freezes the view of a logical volume in time. It is designed for creating backups; it is not designed for being a backup.
Conventional LVM thick-provisioned snapshots are tedious to set up and really slow. The new thin provisioning stuff is fast, and totally obviates the need to ever shrink a file system - just fstrim it.
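If you haven't looked at it yet, the setup is roughly this (VG name and sizes are made up):

Code:
# carve a thin pool out of the VG, then an over-provisioned thin LV inside it
lvcreate -L 100G -T vg0/pool0
lvcreate -V 500G -T vg0/pool0 -n data
mkfs.xfs /dev/vg0/data
mount /dev/vg0/data /srv
# instead of ever shrinking the filesystem, hand unused blocks back to the pool
fstrim -v /srv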

Quote:
Many of the other use cases mentioned in the thread would not be supported by LVM (RAID being an exception, but I have not personally tested that).
LVM has had front-end support for the kernel's md driver for a while now; it's the same code mdadm uses, just managed by lvm instead of mdadm. The neat thing is you can set up RAID levels per LV. There are some limitations combining LVM thin provisioning and RAID; it may be that raid 4,5,6 isn't supported, but I know you can set up mirrored LVs (thick provisioning) and then lvconvert them to thin, so they're both thin and mirrored.
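Roughly like this (names are made up; check lvmraid(7) and lvmthin(7) for the exact forms your version supports):

Code:
# a raid1 LV managed by lvm, using the same kernel md/dm-raid code
lvcreate --type raid1 -m 1 -L 50G -n mirrored vg0
# raid5 works too, striped across three or more PVs
lvcreate --type raid5 -i 2 -L 50G -n parity vg0
# convert an existing thick LV into a thin LV backed by a pool
lvconvert --type thin --thinpool vg0/pool0 vg0/mirrored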

Quote:
Migrating data from one hard drive to another while the system is operating is a breeze with LVM; I don't know if btrfs can help with that or not.
It's a lot easier and faster, and it's part of the normal allocation code path, so it's not like it's a special-case thing. You can either 'device add' then 'device remove', which causes block groups to migrate from one device to another, or use the consolidated command 'replace'. If you already have more than the minimum number of devices and enough free space, you can just do 'device remove'. It basically combines file system shrink, lvreduce, vgreduce, and pvmove all in one command.
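In concrete terms the whole migration is just this, with the filesystem mounted throughout (device names made up):

Code:
# add the new disk, then remove the old one; block groups migrate as part
# of the normal allocator, no unmount needed
btrfs device add /dev/sdc /mnt
btrfs device remove /dev/sdb /mnt
# or the consolidated one-shot form
btrfs replace start /dev/sdb /dev/sdc /mnt
btrfs replace status /mnt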
 
Old 08-16-2016, 03:51 PM   #63
chrismurphy
LQ Newbie
 
Registered: Feb 2011
Posts: 18

Rep: Reputation: 1
Quote:
Originally Posted by syg00 View Post
Like I said, my photos are on btrfs RAID5 - consistency is as necessary as backup for me.
Word of caution: there is a scrub bug where bad data gets fixed from good parity, and then the good parity is overwritten with wrong parity; also, the parity itself is not checksummed. And should you ever have to do a device replace, use 'dev add' then 'dev rem' rather than 'replace start'; there are some bugs with replace and raid56.
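In other words, on raid56 do a replacement like this for now (devices and mount point made up):

Code:
# avoid 'btrfs replace' on raid56 for now; add the new device, then remove
# the old one and let the block groups migrate
btrfs device add /dev/sdd /mnt/photos
btrfs device remove /dev/sdb /mnt/photos
# if the old disk is already gone, mount degraded and remove 'missing'
mount -o degraded /dev/sdc /mnt/photos
btrfs device remove missing /mnt/photos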

So yeah, backups.
 
Old 08-16-2016, 04:03 PM   #64
chrismurphy
LQ Newbie
 
Registered: Feb 2011
Posts: 18

Rep: Reputation: 1
Quote:
Originally Posted by jpollard View Post
As for bitrot, I believe that is handled by the md raid5, not as integrated with the filesystem though.
No. First, in normal use parity isn't even read in; only the data strips get read, and if the drive doesn't report a read error, md assumes the data is good, and so does the file system (depending on what's wrong, it might detect metadata corruption, which XFS now does by default). Second, during a scrub check or repair it's ambiguous which is wrong, the data strip or the parity strip; it's the same problem as a 2-copy raid1 with a mismatch. For raid6 it's possible to detect the corruption, locate it, and do a rebuild, but it's computationally expensive and I don't know anyone who's implemented it. It's not implemented by the md module.
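For reference, the md "scrub" I'm talking about is the check/repair sync_action (array name made up):

Code:
# 'check' reads every strip, parity included, and counts inconsistencies
echo check > /sys/block/md0/md/sync_action
cat /proc/mdstat                     # progress
cat /sys/block/md0/md/mismatch_cnt   # non-zero: data and parity disagree
# 'repair' recomputes parity from the data; it cannot tell which side was wrong
echo repair > /sys/block/md0/md/sync_action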
 
Old 08-16-2016, 04:16 PM   #65
chrismurphy
LQ Newbie
 
Registered: Feb 2011
Posts: 18

Rep: Reputation: 1
Quote:
Originally Posted by jpollard View Post
The problem is that the problems in btrfs CAUSE bitrot... You can't identify the failures without doing an extensive checksum and comparison.
The only corruption reproducer I've had for quite a long time now (months, maybe over a year) is the raid5 scrub causing wrong parity *if* a data strip element is already corrupt. The data strip corruption is detected and fixed by reconstruction from good parity, and the bad strip is replaced with that reconstruction. But somehow the good parity is then replaced with bad parity. I don't know the nature of the bad parity, but it's wrong to call it bitrot, because it's a lot different; it's not a bit flip.

If you were to do two scrubs in a row, the problem would fix itself. The bad parity would only become a factor if there were the loss of a sector (a bad sector, or the loss of a device), where Btrfs would rebuild from the bad parity and then detect that the resulting reconstruction doesn't match the csum, so you'd get a csum error along with an i/o error, since Btrfs won't hand over data it thinks is corrupt. If you tried to use btrfs rescue to force the extraction of the data despite the csum error, it would of course hand back garbage, since the parity is bad.
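If you want to see it, two back-to-back scrubs show the whole cycle (mount point made up):

Code:
# first scrub: detects the corrupt data strip and fixes it from parity,
# but may then overwrite the good parity with bad parity
btrfs scrub start -B /mnt/r5
btrfs scrub status /mnt/r5
# second scrub: recomputes parity from the now-good data, so it heals itself
btrfs scrub start -B /mnt/r5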

Anyway, you can't really call this Btrfs introducing bitrot; that's a mischaracterization. It's a bug, and an eye-opening one at that, but it requires several other things to go wrong first before Btrfs goes on to fix the original problem and then contribute a new one.

Last edited by chrismurphy; 08-16-2016 at 04:37 PM.
 
Old 08-16-2016, 04:32 PM   #66
chrismurphy
LQ Newbie
 
Registered: Feb 2011
Posts: 18

Rep: Reputation: 1
Quote:
Originally Posted by jpollard View Post
It didn't under my testing. But then, I already knew btrfs was experimental. A simple test is to zero some data blocks on a btrfs volume (you do have to go directly to the drive to do that) - then see if it recovers the damaged file.

My tests with raid5 showed it did not. It REPORTED that it had - but the data remained corrupted.
What kernel and btrfs-progs version? Because in the initial raid56 implementation it was known and documented that it would detect such corruption, reconstruct from parity, send the reconstructed data to the application layer, and then NOT write the fix back to disk. Meaning subsequent reads were always rebuilds. The only way to fix this with that implementation was to balance the entire volume. It was kernel 3.19 that brought the fixup write back to the bad sector(s) for normal reads as well as scrubs.
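On those older kernels the only way to make the repairs stick was to rewrite everything, i.e. roughly (mount point made up):

Code:
# pre-3.19: fixups were never written back, so a full balance (which rewrites
# every block group) was the only way to make the repair persistent
btrfs balance start /mnt/test
btrfs balance status /mnt/test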



Quote:
No - raid 5 was totally experimental, and I couldn't even get a dump of the hang. The system just stopped.
Ok well this sounds like a kernel from the Pleistocene in Btrfs terms. It's 2016 so maybe try testing a 4.4 kernel or newer before saying something doesn't work.


Quote:
BTW, the mailing lists show it is still hanging sometimes even now. It doesn't hang consistently (at least not for me). It can run weeks before it happens - the only common element has been the use of btrfs. Remove that - no hangs.
The reason this is not helpful is that it ignores the complexity of all file systems, and what all of them struggle with for their entire useful life: how to handle edge cases. Btrfs is sufficiently new that it has many more edge cases left to be found than other file systems. But the idea that even Btrfs raid5 just occasionally and inconsistently hangs is almost certainly wrong, because I use it and it doesn't hang for me. The difference in my wording is "for me", i.e. my workload. So your workload is doing something that Btrfs isn't handling well. If you want it to work better, then you have to learn how to gather the right information and file a bug; otherwise that particular edge case isn't going to get fixed unless someone else runs into it and reports it.

For hangs, my suggestion is: if you get the "blocked for more than 120 seconds" kernel message, use sysrq+w, then dump journalctl -bk to a file and attach that to the bug report and also to the mailing list. If you get a hang without messages, use sysrq+t and then dump the journal. As an alternative to journalctl, you can boot with log_buf_len=1M and just use dmesg. The thing is, sysrq+w and +t tend to fill up the kernel message buffer at its default size, so if you make it bigger, dmesg works fine.
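Concretely, something along these lines, assuming magic sysrq is enabled (the output file name is made up):

Code:
echo 1 > /proc/sys/kernel/sysrq   # make sure sysrq is enabled
echo w > /proc/sysrq-trigger      # dump tasks stuck in uninterruptible sleep
echo t > /proc/sysrq-trigger      # dump every task, if there were no messages
# capture the kernel messages from this boot and attach them to the report
journalctl -bk > btrfs-hang.txt
# or, with log_buf_len=1M on the kernel command line, plain dmesg is enough
dmesg > btrfs-hang.txt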

Last edited by chrismurphy; 08-16-2016 at 04:40 PM.
 
Old 08-16-2016, 05:13 PM   #67
jpollard
Senior Member
 
Registered: Dec 2012
Location: Washington DC area
Distribution: Fedora, CentOS, Slackware
Posts: 4,912

Rep: Reputation: 1513
Quote:
Originally Posted by chrismurphy View Post
What kernel and btrfs-progs version? Because in the initial raid56 implementation it was known and documented that it would detect such corruption, reconstruct from parity, send the reconstructed data to the application layer, and then NOT write the fix back to disk. Meaning subsequent reads were always rebuilds. The only way to fix this with that implementation was to balance the entire volume. It was kernel 3.19 that brought the fixup write back to the bad sector(s) for normal reads as well as scrubs.
So kernel 4.x should be good to go.
Quote:





Ok well this sounds like a kernel from the Pleistocene in Btrfs terms. It's 2016 so maybe try testing a 4.4 kernel or newer before saying something doesn't work.
From the mailing lists - it appears to still be happening occasionally (as of July).
Quote:


The reason this is not helpful is that it ignores the complexity of all file systems, and what all of them struggle with for their entire useful life: how to handle edge cases. Btrfs is sufficiently new that it has many more edge cases left to be found than other file systems. But the idea that even Btrfs raid5 just occasionally and inconsistently hangs is almost certainly wrong, because I use it and it doesn't hang for me. The difference in my wording is "for me", i.e. my workload. So your workload is doing something that Btrfs isn't handling well. If you want it to work better, then you have to learn how to gather the right information and file a bug; otherwise that particular edge case isn't going to get fixed unless someone else runs into it and reports it.
It didn't hang for me either until I exported the filesystem with NFS.
Quote:

For hangs, my suggestion is: if you get the "blocked for more than 120 seconds" kernel message, use sysrq+w, then dump journalctl -bk to a file and attach that to the bug report and also to the mailing list. If you get a hang without messages, use sysrq+t and then dump the journal. As an alternative to journalctl, you can boot with log_buf_len=1M and just use dmesg. The thing is, sysrq+w and +t tend to fill up the kernel message buffer at its default size, so if you make it bigger, dmesg works fine.
I wish. The hangs I got were just that the system stopped. No response, no keyboard, no network, no logs. Nothing. The only recovery was from the reset button. The current hangs appear to be less total - they do have a timeout error message, and the system isn't locked (based on the mailing list).

I was also unable to identify anything going on that could trigger it other than NFS.

Removing btrfs from use, and the problem vanished.

I do think btrfs has a LOT of promise. It just isn't quite ready yet. One feature that worked for me, even though it wasn't considered quite stable, was the conversion from ext4. I was really surprised that it worked. What did work was impressive. Error recovery ... not so impressive, but it has promise.

I'll try it again in a VM sometime in December, just out of curiosity. If it really works (and I have enough spare disks), I may try it out on real data.

Last edited by jpollard; 08-16-2016 at 05:20 PM.
 
Old 08-16-2016, 11:56 PM   #68
chrismurphy
LQ Newbie
 
Registered: Feb 2011
Posts: 18

Rep: Reputation: 1
Quote:
Originally Posted by jpollard View Post
So kernel 4.x should be good to go.
In terms of the passive-on-read and active-on-scrub repairs to stable media, yes. There is a side-note bug with scrub: if there's a bad strip, it will be fixed, but it looks like the parity strip is then wrongly overwritten with bad parity; I'm not certain if it affects raid6 or only raid5.

Quote:
From the mailing lists - it appears to still be happening occasionally (as of July).
I'm not sure what that would be, do you have a URL?

Quote:
It didn't hang for me either until I exported the filesystem with NFS.
I've exported it with NFS and didn't have this problem. So what kernel version did you experience this with, and did you try any older or newer kernels, and if so which ones? There's something additional happening in your case (hardware setup, workload, versions) that's contributing to this, or piles of people would be reporting it if it were something consistent.

Quote:
I wish. The hangs I got were just that the system stopped. No response, no keyboard, no network, no logs. Nothing. The only recovery was from the reset button. The current hangs appear to be less total - they do have a timeout error message, and the system isn't locked (based on the mailing list).
You appear to be broadly painting Btrfs with issues from a handful of messages on the Btrfs list. That is a self-selecting list where problems are reported, and they're most frequently edge cases that not everyone is experiencing.

Complete system halts are serious bugs that should be reported to your distribution, because they implicate other things even if Btrfs were the instigator. If you're not familiar with netconsole or kdump, you should at least file a distribution kernel bug with reproduce steps so someone else with more debug skill can look into it.
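netconsole in particular is just a module parameter; all the addresses and the interface below are placeholders for your own LAN:

Code:
# stream kernel messages over UDP to another box, so even a hard hang
# leaves a trace on the receiving machine
modprobe netconsole netconsole=6665@192.168.1.50/eth0,6666@192.168.1.10/00:11:22:33:44:55
# on the receiving machine (netcat syntax varies between variants):
nc -u -l 6666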


Quote:
I was also unable to identify anything going on that could trigger it other than NFS.

Removing btrfs from use, and the problem vanished.
Or NFS. And without a call trace it's inappropriate speculation to say the problem is sourced in NFS or Btrfs. Either could have a bug that instigates the problem. Things aren't always what they seem. As consistent and easy to reproduce as you're making these problems sound, it should be easy for you to write up sufficiently detailed reproduce steps for anyone to reproduce what you're experiencing.

Quote:
I do think btrfs has a LOT of promise. It just isn't quite ready yet. One feature that worked for me, even though it wasn't considered quite stable, was the conversion from ext4. I was really surprised that it worked. What did work was impressive. Error recovery ... not so impressive, but it has promise.
Saying error recovery isn't impressive is grossly nonspecific. If we're talking about bit rot, torn or misdirected writes, latent sector errors, Btrfs consistently detects those errors and if there's replicated metadata or data available it will self-heal and continue on. That's not true of any other Linux file system right now.

If we're talking about persistent read or write errors, yes Btrfs continues to spew noisy messages rather than eject the bad drive from the volume to silence the errors. That's not a lack of error recovery, it's a lack of faulty device support. Btrfs can sometimes become unstable while all of those messages are flooding, but most of the time that just leads to it going read only.

And if they're bugs, well those are unintended, and good bug reports are needed so developers can isolate and fix the problem.
 
Old 08-17-2016, 05:27 AM   #69
jpollard
Senior Member
 
Registered: Dec 2012
Location: Washington DC area
Distribution: Fedora, CentOS, Slackware
Posts: 4,912

Rep: Reputation: 1513
Quote:
Originally Posted by chrismurphy View Post
In terms of the passive-on-read and active-on-scrub repairs to stable media, yes. There is a side-note bug with scrub: if there's a bad strip, it will be fixed, but it looks like the parity strip is then wrongly overwritten with bad parity; I'm not certain if it affects raid6 or only raid5.



I'm not sure what that would be, do you have a URL?
Same ones provided - just search for NFS.
Quote:



I've exported it with NFS and didn't have this problem. So what kernel version did you experience this with, and did you try any older or newer kernels, and if so which ones? There's something additional happening in your case (hardware setup, workload, versions) that's contributing to this, or piles of people would be reporting it if it were something consistent.
I didn't do a detailed debug. I was just testing to see if it would work for me. The only thing relatively in common was the combination of NFS and btrfs. No btrfs, no hangs.
Quote:


You appear to be broadly painting Btrfs with issues from a handful of messages on the Btrfs list. That is a self-selecting list where problems are reported, and they're most frequently edge cases that not everyone is experiencing.

Complete system halts are serious bugs that should be reported to your distribution, because they implicate other things even if Btrfs were the instigator. If you're not familiar with netconsole or kdump, you should at least file a distribution kernel bug with reproduce steps so someone else with more debug skill can look into it.
As I said - it isn't consistent. When it takes a week or more before it happens (raid 5), it is very hard to reproduce. With raid 1, I only saw ONE hang after a month.
Quote:


Or NFS. And without a call trace it's inappropriate speculation to say the problem is sourced in NFS or Btrfs. Either could have a bug that instigates the problem. Things aren't always what they seem. As consistent and easy to reproduce as you're making these problems sound, it should be easy for you to write up sufficiently detailed reproduce steps for anyone to reproduce what you're experiencing.
As I said, the only thing in common was btrfs. No btrfs, no hang.
Quote:



Saying error recovery isn't impressive is grossly nonspecific. If we're talking about bit rot, torn or misdirected writes, latent sector errors, Btrfs consistently detects those errors and if there's replicated metadata or data available it will self-heal and continue on. That's not true of any other Linux file system right now.
Like I said - my testing was only to decide if it was stable enough for me to use. My testing was simple: if all the checksums (including data) were working, then it should recover if a group of blocks were zeroed by accident, as those blocks' checksums would have failed.
Quote:

If we're talking about persistent read or write errors, yes Btrfs continues to spew noisy messages rather than eject the bad drive from the volume to silence the errors. That's not a lack of error recovery, it's a lack of faulty device support. Btrfs can sometimes become unstable while all of those messages are flooding, but most of the time that just leads to it going read only.
Activity with good data, no messages. Just an unconditional hang.
Quote:

And if they're bugs, well those are unintended, and good bug reports are needed so developers can isolate and fix the problem.
Of course. But I wasn't trying to debug the filesystem. Just decide if it was stable enough for my use.
 
  

