We are having an issue with Red Hat Enterprise Linux Server release 6.5 (Santiago)

thoufic · 08-29-2016, 04:58 AM

Hello Guys,

On our Red Hat Enterprise Linux 6.5 database server, a filesystem suddenly disappeared.

The error below appeared in my PuTTY session when the issue occurred:

"kernel:journal commit I/O error"

Aug 29 09:42:12 localhost dhclient[42261]: Sending on Socket/fallback
Aug 29 09:42:12 localhost dhclient[42261]: DHCPDISCOVER on em2 to 255.255.255.255 port 67 interval 5 (xid=0x27fe43b6)
Aug 29 09:42:17 localhost dhclient[42261]: DHCPDISCOVER on em2 to 255.255.255.255 port 67 interval 9 (xid=0x27fe43b6)
Aug 29 09:42:26 localhost dhclient[42261]: DHCPDISCOVER on em2 to 255.255.255.255 port 67 interval 15 (xid=0x27fe43b6)
Aug 29 09:42:36 localhost kernel: rport-2:0-0: blocked FC remote port time out: removing target and saving binding
Aug 29 09:42:36 localhost kernel: sd 2:0:0:2: rejecting I/O to offline device
Aug 29 09:42:36 localhost kernel: sd 2:0:0:2: rejecting I/O to offline device
Aug 29 09:42:36 localhost kernel: sd 2:0:0:2: rejecting I/O to offline device
Aug 29 09:42:36 localhost kernel: sd 2:0:0:3: rejecting I/O to offline device
Aug 29 09:42:36 localhost kernel: sd 2:0:0:3: rejecting I/O to offline device
Aug 29 09:42:36 localhost kernel: sd 2:0:0:3: rejecting I/O to offline device
Aug 29 09:42:36 localhost kernel: Aborting journal on device sdf-8.
Aug 29 09:42:36 localhost kernel: sd 2:0:0:3: rejecting I/O to offline device
Aug 29 09:42:36 localhost kernel: EXT4-fs error (device sdf): ext4_journal_start_sb: Detected aborted journal
Aug 29 09:42:36 localhost kernel: EXT4-fs (sdf):
Aug 29 09:42:36 localhost kernel: rport-2:0-1: blocked FC remote port time out: removing target and saving binding
Aug 29 09:42:36 localhost kernel: Remounting filesystem read-only
Aug 29 09:42:36 localhost kernel: JBD2: Detected IO errors while flushing file data on sdg-8
Aug 29 09:42:36 localhost kernel:
Aug 29 09:42:36 localhost kernel: sd 2:0:0:2: rejecting I/O to offline device
Aug 29 09:42:36 localhost kernel: JBD2: I/O error detected when updating journal superblock for sdf-8.
Aug 29 09:42:36 localhost kernel: EXT4-fs (sdg): delayed block allocation failed for inode 5006 at logical offset 2285 with max blocks 1 with error -5
Aug 29 09:42:36 localhost kernel: Aborting journal on device sdg-8.
Aug 29 09:42:36 localhost kernel:
Aug 29 09:42:36 localhost kernel: This should not happen!! Data will be lost
Aug 29 09:42:36 localhost kernel: EXT4-fs error (device sdg) in ext4_da_writepages: IO failure
Aug 29 09:42:36 localhost kernel: EXT4-fs error (device sdg): ext4_journal_start_sb: Detected aborted journal
Aug 29 09:42:36 localhost kernel: EXT4-fs (sdg): Remounting filesystem read-only
Aug 29 09:42:36 localhost kernel: sd 2:0:0:3: rejecting I/O to offline device

sundialsvcs · 08-29-2016, 10:49 AM

It would appear that you have a damaged file system.

Since, as a Red Hat purchaser, you have access to technical support, I suggest that you contact them directly.

It is highly probable that this device is malfunctioning.

TB0ne · 08-29-2016, 10:49 AM

Quote:

Originally Posted by thoufic

Hello Guys,
On our Red Hat Enterprise Linux 6.5 database server, a filesystem suddenly disappeared. The error below appeared in my PuTTY session when the issue occurred:

"kernel:journal commit I/O error"

You don't tell us anything about your hardware, or where/how the disk(s) are connected, what you've done/tried, or when this error occurred. We can't guess. Is this a SAN? SATA? JBOD? RAID (what level/controller??)

Most importantly, since this is with RHEL 6, you should really call Red Hat support..you are PAYING FOR RHEL, aren't you????

unSpawn · 08-29-2016, 11:44 AM

Quote:

Originally Posted by thoufic

(..) Filesystem disappeared suddenly.

Code:

Aug 29 09:42:12 localhost dhclient[42261]: Sending on Socket/fallback

Your DHCP client sent a DHCPDISCOVER three times. Seems like network troubleshooting comes first?.. Is this a virtual machine or real hardware? Any previous network problems or recent changes? Any adjacent servers in the same network segment experiencing trouble too?
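To make the DHCP observation concrete, here is a rough Python sketch that counts the DHCPDISCOVER retries per interface and checks whether any reply was logged. The sample lines are copied from the original post; the function name and the assumption that a reply would show up as a dhclient DHCPOFFER or DHCPACK line are mine, so treat it as an illustration rather than a diagnostic tool.

```python
import re

# Sample lines copied from the log in the original post.
LOG = """\
Aug 29 09:42:12 localhost dhclient[42261]: Sending on Socket/fallback
Aug 29 09:42:12 localhost dhclient[42261]: DHCPDISCOVER on em2 to 255.255.255.255 port 67 interval 5 (xid=0x27fe43b6)
Aug 29 09:42:17 localhost dhclient[42261]: DHCPDISCOVER on em2 to 255.255.255.255 port 67 interval 9 (xid=0x27fe43b6)
Aug 29 09:42:26 localhost dhclient[42261]: DHCPDISCOVER on em2 to 255.255.255.255 port 67 interval 15 (xid=0x27fe43b6)
"""

# Captures the interface name and the retry interval in seconds.
DISCOVER = re.compile(
    r"dhclient\[\d+\]: DHCPDISCOVER on (\S+) to \S+ port \d+ interval (\d+)"
)
# Assumption: a server reply would be logged as DHCPOFFER or DHCPACK.
REPLY = re.compile(r"dhclient\[\d+\]: (DHCPOFFER|DHCPACK)\b")


def summarize_dhcp(log_text):
    """Return ({interface: [retry intervals]}, whether any reply was seen)."""
    intervals = {}
    answered = False
    for line in log_text.splitlines():
        match = DISCOVER.search(line)
        if match:
            intervals.setdefault(match.group(1), []).append(int(match.group(2)))
        elif REPLY.search(line):
            answered = True
    return intervals, answered


print(summarize_dhcp(LOG))  # ({'em2': [5, 9, 15]}, False)
```

Three unanswered discovers on em2 only say that this interface got no lease in the excerpt shown; whether that is related to the disk errors is a separate question.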

TB0ne · 08-30-2016, 07:30 AM

Quote:

Originally Posted by unSpawn

Your DHCP client sent a DHCPDISCOVER three times. Seems like network troubleshooting comes first?.. Is this a virtual machine or real hardware? Any previous network problems or recent changes? Any adjacent servers in the same network segment experiencing trouble too?

Hmm...that, coupled with a disk error (?). OP, are you using iSCSI by any chance??

Medievalist · 08-30-2016, 08:05 AM

Your logs indicate that your system is losing contact with the physical disk devices and can't write. This is almost certainly NOT an operating system problem!
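As a rough illustration of how that reading falls out of the log, the Python sketch below tags a few of the posted kernel lines by layer and prints the order in which each layer first appears. The category names and regular expressions are my own assumptions, and since every line carries the same timestamp, the ordering reflects position in the log only.

```python
import re

# A representative subset of the kernel lines from the original post.
LOG = """\
Aug 29 09:42:36 localhost kernel: rport-2:0-0: blocked FC remote port time out: removing target and saving binding
Aug 29 09:42:36 localhost kernel: sd 2:0:0:2: rejecting I/O to offline device
Aug 29 09:42:36 localhost kernel: Aborting journal on device sdf-8.
Aug 29 09:42:36 localhost kernel: EXT4-fs error (device sdf): ext4_journal_start_sb: Detected aborted journal
Aug 29 09:42:36 localhost kernel: EXT4-fs (sdg): Remounting filesystem read-only
"""

# Hypothetical labels; patterns are checked in this order.
PATTERNS = [
    ("transport", re.compile(r"blocked FC remote port time out")),
    ("device_offline", re.compile(r"rejecting I/O to offline device")),
    ("journal", re.compile(r"Aborting journal|JBD2:")),
    ("filesystem", re.compile(r"EXT4-fs|Remounting filesystem read-only")),
]


def classify(line):
    """Return the first matching layer label for a log line, or None."""
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return None


def first_seen_order(log_text):
    """List the layer labels in the order they first appear in the log."""
    order = []
    for line in log_text.splitlines():
        label = classify(line)
        if label and label not in order:
            order.append(label)
    return order


print(first_seen_order(LOG))
# ['transport', 'device_offline', 'journal', 'filesystem']
```

The transport message comes first and the filesystem complaints follow, which fits a lost path to the storage rather than a fault inside ext4 itself.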

Once writes to the devices fail, their on-disk structures may become corrupted. It depends on exactly when the writes start failing, but in such situations you should always assume that your filesystem is corrupt, and once you've fixed the underlying problem you should run fsck to repair the corruption. If you don't do this you'll strongly regret it.

One of the worst features - possibly THE worst feature - of the Linux distro & kernel you are using is that it will always try to remount a disk that has failed a write as read-only. So instead of the machine crashing and being obviously broken, it will pretend to still work, and end users will continue to try to write and things will spiral rapidly into a worse situation than if the machine had simply crashed.

Repair whatever communication path your disk devices rely on and this problem will go away.

jpollard · 08-30-2016, 08:57 AM

Quote:

Originally Posted by Medievalist

Your logs indicate that your system is losing contact with the physical disk devices and can't write. This is almost certainly NOT an operating system problem!

Once writes to the devices fail, their on-disk structures may become corrupted. It depends on exactly when the writes start failing, but in such situations you should always assume that your filesystem is corrupt, and once you've fixed the underlying problem you should run fsck to repair the corruption. If you don't do this you'll strongly regret it.

One of the worst features - possibly THE worst feature - of the Linux distro & kernel you are using is that it will always try to remount a disk that has failed a write as read-only. So instead of the machine crashing and being obviously broken, it will pretend to still work, and end users will continue to try to write and things will spiral rapidly into a worse situation than if the machine had simply crashed.

If it is mounted read-only, no further damage will occur, and the user will not be able to write after the first failure. Very little will continue to operate (perhaps some CPU-only operations), but there will be no writes to the failed disk.

Next, mounting read-only (if it succeeds) allows time to make an emergency backup to an alternate filesystem or other storage.
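A filesystem that has silently gone read-only is easy to miss, so here is a minimal Python sketch that scans text in the /proc/mounts format and reports ext3/ext4 filesystems carrying the ro option. The function name is made up and the mount points /u01 and /u02 in the sample are hypothetical; only the device names sdf and sdg come from the posted log.

```python
def read_only_mounts(mounts_text, fstypes=("ext3", "ext4")):
    """Return (device, mountpoint) pairs for read-only mounts of the given types.

    Each line is expected in /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        # Split on commas so that only the exact option "ro" matches.
        if fstype in fstypes and "ro" in options.split(","):
            result.append((device, mountpoint))
    return result


# Hypothetical sample; on a live system pass the contents of /proc/mounts.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime,data=ordered 0 0
/dev/sdf /u01 ext4 ro,relatime,data=ordered 0 0
/dev/sdg /u02 ext4 ro,relatime,data=ordered 0 0
proc /proc proc rw,relatime 0 0
"""

print(read_only_mounts(SAMPLE))
# [('/dev/sdf', '/u01'), ('/dev/sdg', '/u02')]
```

A check like this could be run from cron or a monitoring agent so that someone is alerted while there is still time for the emergency backup.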

Quote:

Repair whatever communication path your disk devices rely on and this problem will go away.

Agree with that. It also would help to use some redundancy (raid and multiple communication channels).

Medievalist · 08-30-2016, 09:23 AM

Quote:

If it is mounted read-only, no further damage will occur, and the user will not be able to write after the first failure. Very little will continue to operate (perhaps some CPU-only operations), but there will be no writes to the failed disk.

"No further damage will occur" to the disk volume, sure.

But when critically important operations, such as logging continuous data inputs from processes that cannot be reversed (like scientific experiments) or that require real-time responses in order to avoid loss of life (like reactor controls) can't function, it's best that the system either crash entirely and reboot or else start screaming its bloody head off. Making obscure entries in logs and remounting read-only (so that processes that READ still are running, and reacting as if old data were current, but processes that WRITE are not updating the old data) has always turned out to be a terrible idea in my experience. Especially in industrial process control!

As a sysadmin, it's best to turn that "feature" off. Don't read broken disks. As a programmer, do not assume that because you can read the disk, the data is up to date. Timestamp everything critical and be prepared for incoming data to suddenly cease.
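A minimal sketch of the timestamping idea in Python, assuming each critical record stores the time it was last written. The function name and the 30-second threshold are arbitrary choices for illustration, not anything from a real system.

```python
from datetime import datetime, timedelta


def is_stale(last_update, now, max_age=timedelta(seconds=30)):
    """Return True if the record is older than max_age and should not be trusted."""
    return now - last_update > max_age


# Hypothetical record last written at the moment the disks went offline.
last_write = datetime(2016, 8, 29, 9, 42, 36)

print(is_stale(last_write, datetime(2016, 8, 29, 9, 42, 50)))  # False (14 s old)
print(is_stale(last_write, datetime(2016, 8, 29, 9, 45, 0)))   # True (144 s old)
```

A reader that performs this check can refuse to act on old data, even when the read itself succeeds on a filesystem that has been remounted read-only.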

jpollard · 08-30-2016, 10:24 AM

Quote:

Originally Posted by Medievalist

"No further damage will occur" to the disk volume, sure.

But when critically important operations, such as logging continuous data inputs from processes that cannot be reversed (like scientific experiments) or that require real-time responses in order to avoid loss of life (like reactor controls) can't function, it's best that the system either crash entirely and reboot or else start screaming its bloody head off. Making obscure entries in logs and remounting read-only (so that processes that READ still are running, and reacting as if old data were current, but processes that WRITE are not updating the old data) has always turned out to be a terrible idea in my experience. Especially in industrial process control!

If you don't have redundancy in your filesystems, networks, and systems with automatic failover... you deserve the failure you get. With such "critical systems", incompetence is what you already have.

Quote:

As a sysadmin, it's best to turn that "feature" off. Don't read broken disks. As a programmer, do not assume that because you can read the disk, the data is up to date. Timestamp everything critical and be prepared for incoming data to suddenly cease.

It doesn't matter.

If a disk is failing then you DON'T want to write. If a disk is failing on a write you DON'T want to continue writing. Reading is inherently less sensitive. If the filesystem can't be remounted (which happens), your system is dead.

You are already SOL for "everything critical" in either case.