XFS unmounts itself

rs232 · 01-12-2022, 05:06 AM

I have a raid5 volume formatted in XFS (3x 8TB disks) that heve been working just fine for 4 years now.

Since two days ago I get a notification via email (openmediavault) that the volume has been dismounted.

Code:

Status failed Service mountpoint_srv_dev-disk-by-label-DATA 

	Date:        Wed, 12 Jan 2022 08:08:30
	Action:      alert
	Host:        NAS
	Description: status failed (1) -- mountpoint: /srv/dev-disk-by-label-DATA: Input/output error

I tried what I know already, dismount all the other binding to the volume, run xfs_repair and remounted ok (even after a reboot)

The thing is.... it keeps disconnecting! This is what my syslog is telling me:

Code:

Jan 12 08:07:43 NAS kernel: [95441.856183] CPU: 3 PID: 1438 Comm: syncthing Not tainted 4.18.0-0.bpo.1-amd64 #1 Debian 4.18.6-1~bpo9+1
Jan 12 08:07:43 NAS kernel: [95441.856184] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
Jan 12 08:07:43 NAS kernel: [95441.856185] Call Trace:
Jan 12 08:07:43 NAS kernel: [95441.856191]  dump_stack+0x5c/0x7b
Jan 12 08:07:43 NAS kernel: [95441.856233]  xfs_trans_cancel+0x116/0x140 [xfs]
Jan 12 08:07:43 NAS kernel: [95441.856274]  xfs_create+0x41d/0x640 [xfs]
Jan 12 08:07:43 NAS kernel: [95441.856316]  xfs_generic_create+0x241/0x2e0 [xfs]
Jan 12 08:07:43 NAS kernel: [95441.856321]  path_openat+0x141c/0x14d0
Jan 12 08:07:43 NAS kernel: [95441.856325]  do_filp_open+0x99/0x110
Jan 12 08:07:43 NAS kernel: [95441.856329]  ? vfs_statx+0x73/0xe0
Jan 12 08:07:43 NAS kernel: [95441.856331]  ? vfs_statx+0x73/0xe0
Jan 12 08:07:43 NAS kernel: [95441.856333]  ? __check_object_size+0x98/0x1a0
Jan 12 08:07:43 NAS kernel: [95441.856335]  ? do_sys_open+0x12e/0x210
Jan 12 08:07:43 NAS kernel: [95441.856337]  do_sys_open+0x12e/0x210
Jan 12 08:07:43 NAS kernel: [95441.856340]  do_syscall_64+0x55/0x110
Jan 12 08:07:43 NAS kernel: [95441.856343]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jan 12 08:07:43 NAS kernel: [95441.856346] RIP: 0033:0x4b5c2a
Jan 12 08:07:43 NAS kernel: [95441.856346] Code: e8 fb 4f fb ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 4c 8b 54 24 28 4c 8b 44 24 30 4c 8b 4c 24 38 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 40 ff ff ff ff 48 c7 44 24 48
Jan 12 08:07:43 NAS kernel: [95441.856373] RSP: 002b:000000c0009ad250 EFLAGS: 00000206 ORIG_RAX: 0000000000000101
Jan 12 08:07:43 NAS kernel: [95441.856375] RAX: ffffffffffffffda RBX: 000000c000045800 RCX: 00000000004b5c2a
Jan 12 08:07:43 NAS kernel: [95441.856376] RDX: 00000000000800c2 RSI: 000000c00122a360 RDI: ffffffffffffff9c
Jan 12 08:07:43 NAS kernel: [95441.856377] RBP: 000000c0009ad2e0 R08: 0000000000000000 R09: 0000000000000000
Jan 12 08:07:43 NAS kernel: [95441.856378] R10: 00000000000001a4 R11: 0000000000000206 R12: 000000c00122a360
Jan 12 08:07:43 NAS kernel: [95441.856379] R13: 0000000000000001 R14: 000000c0002e8000 R15: ffffffffffffffff
Jan 12 08:07:43 NAS kernel: [95441.856382] XFS (sdc1): xfs_do_force_shutdown(0x8) called from line 1018 of file /build/linux-GVmoCH/linux-4.18.6/fs/xfs/xfs_trans.c.  Return address = 000000008dcb83c7

Once it gets dismounted it looks like this:

Code:

root@NAS:~# ls -la /mnt/
ls: cannot access '/mnt/raid5': Input/output error
total 24
drwxr-xr-x   4 root root   4096 Nov 20 12:24 .
drwxr-xr-x  24 root root   4096 Oct 24  2019 ..
drwxrwsrwx 184 ftp  users 16384 Nov  1 02:05 5TB
d?????????   ? ?    ?         ?            ? raid5
root@NAS:~# ls -la /mnt/raid5/
ls: cannot access '/mnt/raid5/': Input/output error

The funny thing is: xfs_repair doesn't find anything (any more)

Code:

root@NAS:~# xfs_repair /dev/sdc1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 3
        - agno = 0
        - agno = 2
        - agno = 4
        - agno = 1
        - agno = 6
        - agno = 5
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Note - quota info will be regenerated on next quota mount.
done
root@NAS:~#

An the raid controller tells me the volume is completely OK.

I'm not sure what to try next?

syg00 · 01-12-2022, 05:49 AM

Quote:

Originally Posted by rs232

This is what my syslog is telling me:

Without seeing the whole thing, looks like a kernel oops. And you neglected to mention you're running VMWare - do you think that could conceivably be not relevant ?.

Quote:

I'm not sure what to try next?

Restore the filesystem from your most current backup.

rs232 · 01-12-2022, 05:56 AM

Quote:

Originally Posted by syg00

Without seeing the whole thing, looks like a kernel oops. And you neglected to mention you're running VMWare - do you think that could conceivably be not relevant ?.Restore the filesystem from your most current backup.

It's indeed vmware and the volume pure passthrough, not sure it does have a role in this issue though. Even if it caused it I don't see why it can't be fixed because of it?

If restoring the filesystem would be "easy" I would have already formatted and restored, the truth is if I wipe this filesystem restoring 14TB of data over the Internet (since this the backup is remote) will take forever. I'm keeping this as my very last resort.

I was hoping for an advanced repair option... BTW I did cleared the log via xfs_repair -L /dev/sdc1 and it seems to have complete correctly. Still once the volume is used it unmounts itself.

Is there any official board where I can report this issue? Perhaps the XFS folks are interested in this case?

Thanks!

syg00 · 01-12-2022, 06:11 AM

I don't use xfs but having attended a few presentations by some of the devs, they seem confident in their product. I'm sure they'll happily accept a bugzilla - start at xfs.org.

uteck · 01-12-2022, 09:40 AM

Have you ran any of the Smart tests to make sure all the drives are healthy? Maybe one is starting to fail and causing the unmount?

Does the log show anything past the last line you posted

Quote:

fs_do_force_shutdown(0x8) called from line 1018

I think that will help narrow the problem down.

rs232 · 01-12-2022, 10:05 AM

Quote:

Originally Posted by uteck

Have you ran any of the Smart tests to make sure all the drives are healthy? Maybe one is starting to fail and causing the unmount?

Does the log show anything past the last line you posted I think that will help narrow the problem down.

Going back into the logs... these are the relevant ones plus an extra couple of lines:

Code:

Jan 12 11:41:39 NAS kernel: [ 3157.268337] XFS (sdc1): Quotacheck needed: Please wait.
Jan 12 11:41:53 NAS kernel: [ 3170.557181] XFS (sdc1): Quotacheck: Done.
Jan 12 11:41:58 NAS kernel: [ 3175.512269] XFS (sdc1): Unmounting Filesystem
Jan 12 11:41:59 NAS kernel: [ 3176.957550] XFS (sdc1): Mounting V4 Filesystem
Jan 12 11:41:59 NAS kernel: [ 3177.163211] XFS (sdc1): Ending clean mount
Jan 12 11:42:01 NAS kernel: [ 3179.130696] XFS (sdc1): Unmounting Filesystem
Jan 12 11:42:03 NAS kernel: [ 3180.798027] XFS (sdc1): Mounting V4 Filesystem
Jan 12 11:42:03 NAS kernel: [ 3180.921496] XFS (sdc1): Ending clean mount
Jan 12 11:47:22 NAS kernel: [ 3498.175610] CPU: 5 PID: 5404 Comm: rsync Not tainted 4.18.0-0.bpo.1-amd64 #1 Debian 4.18.6-1~bpo9+1
Jan 12 11:47:22 NAS kernel: [ 3498.175616] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
Jan 12 11:47:22 NAS kernel: [ 3498.175622] Call Trace:
Jan 12 11:47:22 NAS kernel: [ 3498.175629]  dump_stack+0x5c/0x7b
Jan 12 11:47:22 NAS kernel: [ 3498.175688]  xfs_trans_cancel+0x116/0x140 [xfs]
Jan 12 11:47:22 NAS kernel: [ 3498.175736]  xfs_create+0x41d/0x640 [xfs]
Jan 12 11:47:22 NAS kernel: [ 3498.175780]  xfs_generic_create+0x241/0x2e0 [xfs]
Jan 12 11:47:22 NAS kernel: [ 3498.175808]  ? d_splice_alias+0x139/0x3f0
Jan 12 11:47:22 NAS kernel: [ 3498.175812]  path_openat+0x141c/0x14d0
Jan 12 11:47:22 NAS kernel: [ 3498.175816]  do_filp_open+0x99/0x110
Jan 12 11:47:22 NAS kernel: [ 3498.175820]  ? __check_object_size+0x98/0x1a0
Jan 12 11:47:22 NAS kernel: [ 3498.175823]  ? do_sys_open+0x12e/0x210
Jan 12 11:47:22 NAS kernel: [ 3498.175825]  do_sys_open+0x12e/0x210
Jan 12 11:47:22 NAS kernel: [ 3498.175829]  do_syscall_64+0x55/0x110
Jan 12 11:47:22 NAS kernel: [ 3498.175832]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jan 12 11:47:22 NAS kernel: [ 3498.175836] RIP: 0033:0x7f6652f836f0
Jan 12 11:47:22 NAS kernel: [ 3498.175837] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d 19 30 2c 00 00 75 10 b8 02 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 fe 9d 01 00 48 89 04 24
Jan 12 11:47:22 NAS kernel: [ 3498.175875] RSP: 002b:00007ffc53860668 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
Jan 12 11:47:22 NAS kernel: [ 3498.175877] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f6652f836f0
Jan 12 11:47:22 NAS kernel: [ 3498.175879] RDX: 0000000000000180 RSI: 00000000000000c2 RDI: 00007ffc538628d0
Jan 12 11:47:22 NAS kernel: [ 3498.175881] RBP: 000000000003a2f8 R08: 000000000000ffff R09: 67756c702e707061
Jan 12 11:47:22 NAS kernel: [ 3498.175882] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffc53862942
Jan 12 11:47:22 NAS kernel: [ 3498.175884] R13: 8421084210842109 R14: 00000000000000c2 R15: 00007f6653011540
Jan 12 11:47:22 NAS kernel: [ 3498.175888] XFS (sdc1): xfs_do_force_shutdown(0x8) called from line 1018 of file /build/linux-GVmoCH/linux-4.18.6/fs/xfs/xfs_trans.c.  Return address = 00000000ddf97241
Jan 12 11:57:22 NAS kernel: [ 4096.545746] XFS (sdc1): Unmounting Filesystem
Jan 12 11:57:26 NAS kernel: [ 4100.114664] XFS (sdc1): Mounting V4 Filesystem
Jan 12 11:57:26 NAS kernel: [ 4100.253103] XFS (sdc1): Starting recovery (logdev: internal)
Jan 12 11:57:26 NAS kernel: [ 4100.514540] XFS (sdc1): Ending recovery (logdev: internal)
Jan 12 11:57:27 NAS kernel: [ 4101.650846] XFS (sdc1): Unmounting Filesystem
Jan 12 11:59:48 NAS kernel: [ 4242.068098] XFS (sdc1): Mounting V4 Filesystem
Jan 12 11:59:48 NAS kernel: [ 4242.220766] XFS (sdc1): Ending clean mount
Jan 12 11:59:48 NAS kernel: [ 4242.221275] XFS (sdc1): Quotacheck needed: Please wait.
Jan 12 12:00:01 NAS kernel: [ 4255.614280] XFS (sdc1): Quotacheck: Done.
Jan 12 12:14:02 NAS [703]: END SERVICE
Jan 12 12:14:02 NAS kernel: [ 5094.239877] XFS (sdc1): Unmounting Filesystem

Code:

Jan 12 12:14:21 NAS kernel: [    5.675213] vmxnet3 0000:03:00.0 ens160: NIC Link is Up 10000 Mbps
Jan 12 12:14:21 NAS VGAuthService[714]: Core dump limit set to -1
Jan 12 12:14:22 NAS kernel: [    5.911070] Bluetooth: Core ver 2.22
Jan 12 12:14:22 NAS kernel: [    5.911091] NET: Registered protocol family 31
Jan 12 12:14:22 NAS kernel: [    5.911093] Bluetooth: HCI device and connection manager initialized
Jan 12 12:14:22 NAS kernel: [    5.911097] Bluetooth: HCI socket layer initialized
Jan 12 12:14:22 NAS kernel: [    5.911099] Bluetooth: L2CAP socket layer initialized
Jan 12 12:14:22 NAS kernel: [    5.911111] Bluetooth: SCO socket layer initialized
Jan 12 12:14:22 NAS kernel: [    6.113483] vmxnet3 0000:03:00.0 ens160: intr type 3, mode 0, 9 vectors allocated
Jan 12 12:14:22 NAS kernel: [    6.114539] vmxnet3 0000:03:00.0 ens160: NIC Link is Up 10000 Mbps
Jan 12 12:14:22 NAS kernel: [    6.872807] CPU: 6 PID: 1112 Comm: Plex Media Serv Not tainted 4.18.0-0.bpo.1-amd64 #1 Debian 4.18.6-1~bpo9+1
Jan 12 12:14:22 NAS kernel: [    6.872808] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
Jan 12 12:14:22 NAS kernel: [    6.872809] Call Trace:
Jan 12 12:14:22 NAS kernel: [    6.872815]  dump_stack+0x5c/0x7b
Jan 12 12:14:22 NAS kernel: [    6.872856]  xfs_trans_cancel+0x116/0x140 [xfs]
Jan 12 12:14:22 NAS kernel: [    6.872896]  xfs_create+0x41d/0x640 [xfs]
Jan 12 12:14:22 NAS kernel: [    6.872936]  xfs_generic_create+0x241/0x2e0 [xfs]
Jan 12 12:14:22 NAS kernel: [    6.872939]  ? lookup_dcache+0x17/0x60
Jan 12 12:14:22 NAS kernel: [    6.872942]  vfs_mkdir+0x10c/0x1a0
Jan 12 12:14:22 NAS kernel: [    6.872944]  do_mkdirat+0xd3/0x110
Jan 12 12:14:22 NAS kernel: [    6.872948]  do_syscall_64+0x55/0x110
Jan 12 12:14:22 NAS kernel: [    6.872950]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jan 12 12:14:22 NAS kernel: [    6.872953] RIP: 0033:0x7fe5d99fbccf
Jan 12 12:14:22 NAS kernel: [    6.872954] Code: ff ff b9 00 01 00 00 e9 9c fc ff ff 48 89 f2 48 89 fe bf 9c ff ff ff b9 00 01 00 00 e9 c6 fd ff ff 89 f6 b8 53 00 00 00 0f 05 <9b> 48 89 c7 e9 c8 20 fd ff 48 63 ff 89 d2 b8 02 01 00 00 0f 05 9b
Jan 12 12:14:22 NAS kernel: [    6.872981] RSP: 002b:00007ffe8b068228 EFLAGS: 00000282 ORIG_RAX: 0000000000000053
Jan 12 12:14:22 NAS kernel: [    6.872983] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe5d99fbccf
Jan 12 12:14:22 NAS kernel: [    6.872984] RDX: 00007ffe8b068410 RSI: 00000000000001ff RDI: 00007fe5d9975f30
Jan 12 12:14:22 NAS kernel: [    6.872985] RBP: 00007ffe8b0682f0 R08: 00007fe5d80cd5f0 R09: 00007fe5d80cd5f0
Jan 12 12:14:22 NAS kernel: [    6.872985] R10: 0000000000000008 R11: 0000000000000282 R12: 00007ffe8b0683f8
Jan 12 12:14:22 NAS kernel: [    6.872986] R13: 00007ffe8b068480 R14: 00007ffe8b0683f8 R15: 00007ffe8b068410
Jan 12 12:14:22 NAS kernel: [    6.872990] XFS (sdc1): xfs_do_force_shutdown(0x8) called from line 1018 of file /build/linux-GVmoCH/linux-4.18.6/fs/xfs/xfs_trans.c.  Return address = (____ptrval____)
Jan 12 12:14:22 NAS kernel: [    6.882981] softdog: initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
Jan 12 12:15:12 NAS kernel: [   55.341695] random: crng init done
Jan 12 12:15:12 NAS kernel: [   55.341701] random: 7 urandom warning(s) missed due to ratelimiting
Jan 12 12:28:08 NAS kernel: [  830.370355] XFS (sdc1): Unmounting Filesystem
Jan 12 12:28:24 NAS kernel: [  845.947134] XFS (sdc1): Mounting V4 Filesystem
Jan 12 12:28:24 NAS kernel: [  846.028406] XFS (sdc1): Ending clean mount
Jan 12 12:39:14 NAS kernel: [ 1493.988758] XFS (sdc1): Unmounting Filesystem
Jan 12 13:19:33 NAS kernel: [ 3908.870122] XFS (sdc1): Mounting V4 Filesystem
Jan 12 13:19:34 NAS kernel: [ 3909.036072] XFS (sdc1): Ending clean mount
Jan 12 13:19:34 NAS kernel: [ 3909.036592] XFS (sdc1): Quotacheck needed: Please wait.
Jan 12 13:19:47 NAS kernel: [ 3922.451620] XFS (sdc1): Quotacheck: Done.

uteck · 01-12-2022, 10:14 AM

Well that was not as helpful as I was hopping for. Sometimes it will list a memory error or something specific that caused the unmount.
The other option is to look at that line of code and see what it refers to. Maybe that will give an idea of the cause.

rs232 · 01-12-2022, 10:18 AM

Quote:

Originally Posted by uteck

Well that was not as helpful as I was hopping for. Sometimes it will list a memory error or something specific that caused the unmount.
The other option is to look at that line of code and see what it refers to. Maybe that will give an idea of the cause.

The issue with looking at the code is to find the correct source file! I googles the file name 1018 (as per my error) is a quoted line.

Another version (from Torvard's github) has this as line 1018:

/*
* Commit the current transaction.
* If this commit failed, then it'd just unlock those items that
* are not marked ihold. That also means that a filesystem shutdown
* is in progress. The caller takes the responsibility to cancel
* the duplicate transaction that gets returned.
*/
error = __xfs_trans_commit(trans, true);
if (error)
return error;