[SOLVED] Can HP StorageWorks 215 tape cause FC26 KDE GUI to be stuck?
Linux - HardwareThis forum is for Hardware issues.
Having trouble installing a piece of hardware? Want to know if that peripheral is compatible with Linux?
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
Can HP StorageWorks 215 tape cause FC26 KDE GUI to be stuck?
Hello!
For quite a few years I perform full backups using a HP StorageWorks 215 (LTO 1) tape,
about 18 times so far and each time it worked flawlessly in the following way:
I open a "konsole", become "root" and call a Perl wrapper which calculates timestamp
and calls "tar" by a command similar to:
Another "konsole" is opened (also as "root") in the directory of the log file,
where I open the log file with "nedit" and from time to time I move the cursor over
"nedit", get "file modified by another program Reload/Cancel" prompt, click "reload" and
check what files it's backing up now. I also see there the prompts to insert the next volume
(it's a multi volume backup), when I see one I press "eject" on the tape drive panel, replace
the cartrige and press
"Enter" in the "konsole" where the backup command is running.
Also in the log file "konsole" I run from time to time a Perl script which calculates the sum
of file sizes listed in the log:
Code:
<root localhost>.../log>du_f -G full_2018_10_15_2143
Total files size: 6 G.
Except the first time it's just "Arrow up" followed by "Enter".
The time it takes to fill one LTO 1 (100 GB) cartrige is about 6 hours (it's a PC built in 2009:
AMD Phenom X4, ASUS M3A79T deluxe mboard, 16 GB RAM, ATI FireGL V5600 video card,
currently running FC26+KDE),
the number of volumes is about 10 so it's a multi-day process: start, wait 6 hours (or go to work
or to sleep), check "insert next volume, press Enter when ready" prompt in "nedit", eject used tape,
insert next, press "Enter" in the "console" of the "tar" command.
But the last time I attempted to perform the same backup process I've got problems:
the first 2 volume changes went OK but on the 3-rd one I moved the mouse pointer over the
"nedit", got the usual "file changed by other program, reload/cancell" prompt
but when I clicked on "Reload" - nothing happened, the prompt remained as if I didn't
click on it. I clicked on the "konsole" of the log file directory - it changed focus to
front, as expected, I pressed "Arrow up" - got the last command "du_f -G <log name>"
but when I pressed "Enter" - again, nothing happened.
I clicked the "close" button of "nedit" but it ignored me. So I attempted to "blindly"
change the tape - pressed "Eject", waited for the tape rewind to end and tape to eject,
replaced the tape, clicked on the "konsole" where the "tar" command is running -
it changed focus to "front" but when I pressed "Enter" on it - again, it had been ignored,
nothing happened.
At that point I discovered that I can't close any "konsole" (only focus changes of already
opened windows), can't open any new "konsole" (in fact - anything). So I decided
to shutdown - pressed "Eject", took the tape out (thank God, at least this worked),
attempted RMB on the desktop to bring the shutdown menu - nothing,
attempted going to text mode by Ctrl+Alt+F3 to shutdown from there - nothing.
As a last resort I had to brutally power down the PC.
After waiting some time and powering up I attempted again but this time I didn't even got
to the first cartrige replace prompt -
when I've got back to the PC after few hours I found it in the same semi-responsive state.
Again had to power down the PC and after rebooting and checking the last write time
of the log file I saw that it stopped being updated about 2 hours after the backup start.
The fact that initially the problem appeared before the 3-rd cartride change and later
started appearing during the first cartrige also points in the direction of the
backup process, in particular - the tape. No error LEDs on the tape drive, only the "Ready"
blinking during the backup and is "ON" steady when I get back to a "semi-stuck" GUI.
REAL TIME UPDATE: while I was writing this post I was also running another backup attempt.
After about 1.5 hours suddenly I noticed in the log some I/O errors like:
Code:
/home/kaza/GRAPHICS/PICS/D/Video0208.3gp
/home/kaza/GRAPHICS/PICS/D/Photo0416.jpg
/home/kaza/GRAPHICS/PICS/D/Photo0542.jpg
/home/kaza/GRAPHICS/PICS/D/Video0025.3gp
/bin/tar: /home/kaza/GRAPHICS/PICS/D/Video0025.3gp: File shrank by 4145152 bytes; padding with zeros
/home/kaza/GRAPHICS/PICS/D/Photo0271.jpg
/bin/tar: /home/kaza/GRAPHICS/PICS/D/Photo0271.jpg: Read error at byte 0, while reading 2048 bytes: Input/output error
/home/kaza/GRAPHICS/PICS/D/Photo0179.jpg
/bin/tar: /home/kaza/GRAPHICS/PICS/D/Photo0179.jpg: Read error at byte 0, while reading 3072 bytes: Input/output error
/home/kaza/GRAPHICS/PICS/D/Video0097.3gp
/bin/tar: /home/kaza/GRAPHICS/PICS/D/Video0097.3gp: Read error at byte 0, while reading 512 bytes: Input/output error
after some time I heared a beep from the PC (can't tell if from the tape drive or something
else), the "Ready" LED stopped blinking and turned "ON" steady.
I hurriedly "copy-pasted" this post to a text file before noticing the familiar "semi-stuck"
GUI. But after about a minute or so I noticed the neither the "top" window not "gkrellm"
graphs (always opened at the side of my desktop to see what's going) are being updated
and the mouse pointer dsappeared. Now the GUI was "totally-stuck".
In the past I always started a backup from inserting a cleaning cartrige into the tape drive
but this time I've got "tape error" LED which is probably "cleanings count timeout".
I ordered a new one on e-bay but since it's expected to take a month to arrive I attempted
to backup anyway. Am I being punished for this? The tape drive hasn't been used for few months,
it might've gathered some dust.
But the most important question: why a problem with the backup tape drive causes the GUI to be stuck,
first partially and then completely?
My kernel version is:
Code:
<localhost>.../NOA>uname -a
Linux localhost.localdomain 4.15.6-200.spi_gpio.fc26.x86_64 #1 SMP Sat Mar 3 02:13:51 IST 2018 x86_64 x86_64 x86_64 GNU/Linux
Looking at the kernel name reminds me that I've made a privet kernel version with SPI and GPIO
support enabled to allow me programming through USB an SPI chip (CH341A). Can it somehow affect
KDE GUI being "stuck" when tape drive has errors?
TIA for any ideas (except the obvious "don't start backup before heads had been cleaned"),
kaza.
Next time it happens copy /proc/meminfo to a file and also get a grab of top - just the first 20 line is probably enough.
Post them here. Take a copy of meminfo just after boot for comparison as well.
When this problem occurs the KDE eventually freezes up
so I guess I'll need to be lucky to catch the tape errors
just as they occur and before the whole KDE freezes up.
When tape errors occur and the KDE gradually "loses its bearings"
I don't remember I saw anything hogging close to 100% CPU (core)
but I'll try to capture the "top".
I'll do another backup try and see if I'll be fast enough to save the info
before KDE freezes up.
As a wild guess it seems to me that I/O errors of the SCSI tape
somehow prevent normal disk access of the OS (KDE), despite the fact that
OS resides on SAS disks connected to a different controller than the
tape. Here's the output of "lsscsi":
Code:
<localhost>.../temp>lsscsi
[0:0:0:0] cd/dvd HL-DT-ST DVDRAM GH24NSD1 LG00 /dev/sr0
[6:0:0:0] disk ICP KA_RAID_0 V1.0 /dev/sda
[6:0:1:0] disk ICP KA_RAID_1 V1.0 /dev/sdb
[6:1:0:0] disk FUJITSU MAU3147RC 0104 -
[6:1:1:0] disk FUJITSU MBA3147RC HPF1 -
[6:1:2:0] disk TOSHIBA MK2001TRKB 0105 -
[6:1:3:0] disk TOSHIBA MK2001TRKB 0105 -
[7:0:0:0] disk FUJITSU MAS3367NP 0104 /dev/sdc
[7:0:3:0] tape HP Ultrium 1-SCSI N2CG /dev/st0
The OS is installen on the pair of MAU3147RC connected as RAID 0
(MAS3367NP is some old data).
Well, the first experiment was a failure: for the first 2 hours the tape drive worked OK
so I stopped watching it continuosly and started "sampling" it once in few minutes.
After over 3 hours (damn Murphy with his laws...) I found the KDE almost stuck and needless
to say, no chance of copying the /proc/meminfo. I couldn't move any window and "copy paste"
the "top" window contents but from what was accidentally not obscured on it I could manually
copy the following:
I also noticed that the tape drive makes a movement sound for about 1 sec in duration
once in 4-5 minutes. Together with the "74.2%" CPU wasted on wait it looks like
the tape drive is endlessly attempting to get again and again some data it didn't
get the first time and the rest of KDE can't do anything while it happens.
(That's just my guess, feel free to correct me.)
I'll make another experiment, this time I'll write a Perl script which will
copy /proc/meminfo into a file with timestame in name once in 10 minutes
and also a "top" in batch mode will be saved to another file, also
with a timestamp in name.
Meanwhile we can only theoretize.
Does anyone know (or can point me to a knowledge source) how the LINUX
system accesses different disks/tapes connected to different
SCSI controllers (main disk - SAS, tape drive - SCSI) from the viewpoint
of queueing: is there a single gueue for requests of all disks+tape
or many queues working in parallel? Is there any configuration of this
which can be changed (even if it means kernel compilation configuration)?
Next time just grab your phone and take a photo - saves all the typing ...
Thanks for the interrupt percentage - I was worried that it may be high; that runs at a priority above all else. No CPU is "wasted on wait" - that is merely an accounting convenience, a subset of "idle", but may (does in your case) indicate non-completed I/O.
Whilst you're collecting data, you might as well get vmstat.
Each device gets its own q - SSDs may get multi-queuing per device if you have a current kernel.
OK, it took some time (life, etc.) to repeat
the experiment with periodically saved data (/proc/meminfo, top -b -n 1, dmesg -c)
and I don't see anything "sticking out" in the saved reports.
Again, after over 2 hours the tape stopped the backup operation, a short (~1 sec) beep
was audible (can it be some m-board error code?) and the GUI became first "partially
responsive" and after insuccessfull attempt to close the FireFox window (from past experiences -
that happened after attempt to close any window) the GUI became frozen.
The data save period was 10 minutes and the last timestamp was: 2018_10_26_1404.
Looking at the time stamp of the tar command log I see that it last write was at time 14:12
so 8 minutes passes from the last dump of the info to the system being stuck.
I also looked at /var/log/messages, I only see messages related to network manager
until slightly after 14:05:30 (3.5 min after the last write of tar log file)
I see a binary data in the file, probably that's when I powered down the PC
(couldn't do anything to properly shutdown, Ctrl+Alt+F3 didn't help and neither
the Alt+SysRq(Prt Scr)+R-E-I-S-U-B):
Code:
Oct 26 13:39:07 localhost dhclient[1166]: DHCPREQUEST on enp4s0 to 10.0.0.138 port 67 (xid=0x385ead74)
Oct 26 13:39:08 localhost dhclient[1166]: DHCPACK from 10.0.0.138 (xid=0x385ead74)
Oct 26 13:39:08 localhost NetworkManager[961]: <info> [1540550348.0764] dhcp4 (enp4s0): address 10.0.0.1
Oct 26 13:39:08 localhost NetworkManager[961]: <info> [1540550348.0769] dhcp4 (enp4s0): plen 24 (255.255.255.0)
Oct 26 13:39:08 localhost dbus-daemon[793]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.9' (uid=0 pid=961 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")
Oct 26 13:39:08 localhost NetworkManager[961]: <info> [1540550348.0770] dhcp4 (enp4s0): gateway 10.0.0.138
Oct 26 13:39:08 localhost audit: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Oct 26 13:39:08 localhost dhclient[1166]: bound to 10.0.0.1 -- renewal in 1556 seconds.
Oct 26 13:39:08 localhost NetworkManager[961]: <info> [1540550348.0770] dhcp4 (enp4s0): lease time 3600
Oct 26 13:39:08 localhost dbus-daemon[793]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Oct 26 13:39:08 localhost NetworkManager[961]: <info> [1540550348.0770] dhcp4 (enp4s0): nameserver '10.0.0.138'
Oct 26 13:39:08 localhost nm-dispatcher: req:1 'dhcp4-change' [enp4s0]: new request (6 scripts)
Oct 26 13:39:08 localhost NetworkManager[961]: <info> [1540550348.0770] dhcp4 (enp4s0): domain name 'Home'
Oct 26 13:39:08 localhost nm-dispatcher: req:1 'dhcp4-change' [enp4s0]: start running ordered scripts...
Oct 26 13:39:08 localhost NetworkManager[961]: <info> [1540550348.0770] dhcp4 (enp4s0): state changed bound -> bound
Oct 26 13:39:08 localhost systemd: Starting Network Manager Script Dispatcher Service...
Oct 26 13:39:08 localhost systemd: Started Network Manager Script Dispatcher Service.
Oct 26 13:39:18 localhost audit: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Oct 26 13:53:10 localhost plasmashell: QXcbConnection: XCB error: 2 (BadValue), sequence: 37067, resource id: 98566161, major code: 141 (Unknown), minor code: 3
Oct 26 13:53:10 localhost plasmashell: QXcbConnection: XCB error: 2 (BadValue), sequence: 37133, resource id: 104857975, major code: 141 (Unknown), minor code: 3
Oct 26 14:01:27 localhost cupsd: REQUEST localhost - - "POST / HTTP/1.1" 200 183 Renew-Subscription successful-ok
Oct 26 14:01:59 localhost plasmashell: QXcbConnection: XCB error: 2 (BadValue), sequence: 40626, resource id: 98566161, major code: 141 (Unknown), minor code: 3
Oct 26 14:01:59 localhost plasmashell: QXcbConnection: XCB error: 2 (BadValue), sequence: 40644, resource id: 102760454, major code: 141 (Unknown), minor code: 3
Oct 26 14:01:59 localhost plasmashell: QXcbConnection: XCB error: 2 (BadValue), sequence: 40660, resource id: 104857975, major code: 141 (Unknown), minor code: 3
Oct 26 14:01:59 localhost plasmashell: QXcbConnection: XCB error: 2 (BadValue), sequence: 40680, resource id: 102760454, major code: 141 (Unknown), minor code: 3
Oct 26 14:02:00 localhost plasmashell: QXcbConnection: XCB error: 2 (BadValue), sequence: 40721, resource id: 104857975, major code: 141 (Unknown), minor code: 3
Oct 26 14:05:04 localhost dhclient[1166]: DHCPREQUEST on enp4s0 to 10.0.0.138 port 67 (xid=0x385ead74)
Oct 26 14:05:04 localhost dhclient[1166]: DHCPACK from 10.0.0.138 (xid=0x385ead74)
Oct 26 14:05:05 localhost NetworkManager[961]: <info> [1540551905.0447] dhcp4 (enp4s0): address 10.0.0.1
Oct 26 14:05:05 localhost NetworkManager[961]: <info> [1540551905.0451] dhcp4 (enp4s0): plen 24 (255.255.255.0)
Oct 26 14:05:05 localhost audit: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Oct 26 14:05:05 localhost dbus-daemon[793]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.9' (uid=0 pid=961 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")
Oct 26 14:05:05 localhost NetworkManager[961]: <info> [1540551905.0451] dhcp4 (enp4s0): gateway 10.0.0.138
Oct 26 14:05:05 localhost dhclient[1166]: bound to 10.0.0.1 -- renewal in 1392 seconds.
Oct 26 14:05:05 localhost NetworkManager[961]: <info> [1540551905.0451] dhcp4 (enp4s0): lease time 3600
Oct 26 14:05:05 localhost dbus-daemon[793]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Oct 26 14:05:05 localhost NetworkManager[961]: <info> [1540551905.0451] dhcp4 (enp4s0): nameserver '10.0.0.138'
Oct 26 14:05:05 localhost nm-dispatcher: req:1 'dhcp4-change' [enp4s0]: new request (6 scripts)
Oct 26 14:05:05 localhost NetworkManager[961]: <info> [1540551905.0451] dhcp4 (enp4s0): domain name 'Home'
Oct 26 14:05:05 localhost nm-dispatcher: req:1 'dhcp4-change' [enp4s0]: start running ordered scripts...
Oct 26 14:05:05 localhost NetworkManager[961]: <info> [1540551905.0452] dhcp4 (enp4s0): state changed bound -> bound
Oct 26 14:05:05 localhost systemd: Starting Network Manager Script Dispatcher Service...
Oct 26 14:05:05 localhost systemd: Started Network Manager Script Dispatcher Service.
Oct 26 14:05:15 localhost audit: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Oct 26 14:05:30 localhost chronyd[837]: Source 2001:3a0:0:2006::87:123 replaced with 2a02:2a50:6::123
<nul>...<nul> <-- Few hundreds of bytes of binary data
Oct 26 14:05:30 localhost chronyd[837]: Source 2001:3a0:0:2006::87:123 replaced with 2a02:2a50:6::123
Oct 26 14:20:31 localhost kernel: Linux version 4.15.6-200.spi_gpio.fc26.x86_64 (kaza@localhost.localdomain) (gcc version 7.3.1 20180130 (Red Hat 7.3.1-2) (GCC)) #1 SMP Sat Mar 3 02:13:51 IST 2018
Oct 26 14:20:31 localhost kernel: Command line: BOOT_IMAGE=/vmlinuz-4.15.6-200.spi_gpio.fc26.x86_64 root=/dev/mapper/VolGroup-lv_root ro rd.lvm.lv=VolGroup/lv_root rd.lvm.lv=VolGroup/lv_swap rhgb quiet LANG=en_US.UTF-8
Oct 26 14:20:31 localhost kernel: x86/fpu: x87 FPU will use FXSAVE
Oct 26 14:20:31 localhost kernel: e820: BIOS-provided physical RAM map:
Oct 26 14:20:31 localhost kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009bfff] usable
Oct 26 14:20:31 localhost kernel: BIOS-e820: [mem 0x000000000009c000-0x000000000009ffff] reserved
Oct 26 14:20:31 localhost kernel: BIOS-e820: [mem 0x00000000000e4000-0x00000000000fffff] reserved
Oct 26 14:20:31 localhost kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000cffaffff] usable
Oct 26 14:20:31 localhost kernel: BIOS-e820: [mem 0x00000000cffb0000-0x00000000cffbdfff] ACPI data
Oct 26 14:20:31 localhost kernel: BIOS-e820: [mem 0x00000000cffbe000-0x00000000cffdffff] ACPI NVS
Oct 26 14:20:31 localhost kernel: BIOS-e820: [mem 0x00000000cffe0000-0x00000000cfffffff] reserved
Oct 26 14:20:31 localhost kernel: BIOS-e820: [mem 0x00000000fff00000-0x00000000ffffffff] reserved
Oct 26 14:20:31 localhost kernel: BIOS-e820: [mem 0x0000000100000000-0x000000042fffffff] usable
Some more info: I made one more experiment, after installing "iotop" and adding its output
to the list of saved data every 10 minutes. I also changed the cartrige.
Again, after over 2 hours the tape stopped. This time I decided to do nothing for more
than 10 minutes hoping the periodic data save will make one more save and I'll see something in the
last saved group of files. After waiting for more than 10 minutes I decided, instead of like
previously dealing with "nedit" of the tar log file to bring to the top the konsole where
I called the "tar" command and pressed "Ctrl+C" - indeed, the tape made few movements and
after some time the prompt returned. I thought: "ok, maybe if the tar process no longer
requests I/O everything will be OK now". However, when I attempted "ls" in the same konsole
it hang and at the same time the windows of "top", "iotop" and "gkrellm" stopped being
updated (they were updating during the over 10 minutes period from the tape stopping).
The unobscured part of the "top" window again showed "wa" over 70%. In the "iotop" window
there was no activity. Also, since the tape stopped the line with the "tar" command didn't
appear in the "iotop" window.
Together with the frozen "top", "iotop" and "gkrellm" the whole GUI froze - again, not even
the Alt+SysRq+R-E-I-S-U-B could get it to reboot and again I had to power the PC down.
Afer rebooting I looked at the last saved files and saw that during the over 10 minues period
from tape stop to the GUI freeze the periodic files save wasn't performed: the last saved
file timestamp was earlier than the write time of the "tar" command log file.
So far the facts, now I'll speculate, correct me if I'm wrong.
It seems to me that at some point the tape makes some I/O request which can't be completed
(70+% of "wa"). Other processes can continue running as long as they don't need to access
the filesystem. That's why the screens of "top", "iotop", "gkrellm" continue to be updated.
But as soon as something needs to access the filesystem (like "ls") this new request is blocked
until the original tape I/O request completes (which never happens) and thus the whole GUI
freezes.
I wonder what would happen if instead of calling the "tar" command directly I'll call it
through "ionice -c 3" ?
It seems very strange to me that the queueing system of the I/O requests can hang if one device
I/O can't be completed, I would expect some "timeout" allowing later requests to be performed.
But I don't have enough knowledge of the LINUX disk I/O system
to answer that question.
Can anyone explain (or point to a knowledge source) how the LINUX disk I/O system queueing works?
I finally caught the error soon enough to save some relevant info and it doesn't look good.
I managed to run "dmesg -c>filename" just as I heared the beep but before
the GUI stuck and here's what I saw later after having to power down the PC and to reboot:
Code:
[ 4109.872050] sky2 0000:04:00.0 enp4s0: Link is down
[ 4127.824514] sky2 0000:04:00.0 enp4s0: Link is up at 100 Mbps, full duplex, flow control rx
[ 4139.334846] sky2 0000:04:00.0 enp4s0: Link is down
[ 4141.794760] sky2 0000:04:00.0 enp4s0: Link is up at 100 Mbps, full duplex, flow control rx
[ 8832.460042] sd 7:0:1:0: [sdb] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 8832.460053] sd 7:0:1:0: [sdb] tag#1 Sense Key : Hardware Error [current]
[ 8832.460057] sd 7:0:1:0: [sdb] tag#1 Add. Sense: Internal target failure
[ 8832.460061] sd 7:0:1:0: [sdb] tag#1 CDB: Read(16) 88 00 00 00 00 00 5c 58 3d e0 00 00 01 00 00 00
[ 8832.460065] print_req_error: critical target error, dev sdb, sector 1549286880
[ 8832.490713] sd 7:0:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 8832.490719] sd 7:0:1:0: [sdb] tag#0 Sense Key : Hardware Error [current]
[ 8832.490723] sd 7:0:1:0: [sdb] tag#0 Add. Sense: Internal target failure
[ 8832.490727] sd 7:0:1:0: [sdb] tag#0 CDB: Read(16) 88 00 00 00 00 00 5c 58 3e e0 00 00 01 00 00 00
[ 8832.490730] print_req_error: critical target error, dev sdb, sector 1549287136
[ 8854.477447] sd 7:0:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 8854.477455] sd 7:0:1:0: [sdb] tag#0 Sense Key : Hardware Error [current]
[ 8854.477459] sd 7:0:1:0: [sdb] tag#0 Add. Sense: Internal target failure
[ 8854.477463] sd 7:0:1:0: [sdb] tag#0 CDB: Read(16) 88 00 00 00 00 00 5c 58 3e 00 00 00 00 08 00 00
[ 8854.477466] print_req_error: critical target error, dev sdb, sector 1549286912
[ 8857.893105] blk_partition_remap: fail for partition 1
[ 8857.893113] blk_partition_remap: fail for partition 1
[ 8858.842183] blk_partition_remap: fail for partition 1
[ 8858.842192] blk_partition_remap: fail for partition 1
[ 8859.722773] scsi 7:1:3:0: Direct-Access TOSHIBA MK2001TRKB 0105 PQ: 0 ANSI: 5
[ 8859.724433] scsi 7:1:3:0: Attached scsi generic sg2 type 0
[ 8860.188272] blk_partition_remap: fail for partition 1
[ 8860.188280] blk_partition_remap: fail for partition 1
[ 8861.652640] blk_partition_remap: fail for partition 1
[ 8861.652648] blk_partition_remap: fail for partition 1
[ 8861.652889] blk_partition_remap: fail for partition 1
[ 8861.652894] blk_partition_remap: fail for partition 1
[ 8861.652898] blk_partition_remap: fail for partition 1
[ 8861.652902] blk_partition_remap: fail for partition 1
[ 8861.652905] blk_partition_remap: fail for partition 1
[ 8863.513851] blk_partition_remap: fail for partition 1
[ 8863.513857] blk_partition_remap: fail for partition 1
[ 8863.513859] blk_partition_remap: fail for partition 1
[ 8863.513860] blk_partition_remap: fail for partition 1
[ 8863.513864] Aborting journal on device dm-3-8.
[ 8863.513870] blk_partition_remap: fail for partition 1
[ 8863.513873] Buffer I/O error on dev dm-3, logical block 487096320, lost sync page write
[ 8863.513876] JBD2: Error -5 detected when updating journal superblock for dm-3-8.
[ 8870.622855] blk_partition_remap: fail for partition 1
[ 8870.622865] blk_partition_remap: fail for partition 1
[ 8870.622868] blk_partition_remap: fail for partition 1
[ 8870.622872] blk_partition_remap: fail for partition 1
[ 8870.622876] blk_partition_remap: fail for partition 1
[ 8870.622887] blk_partition_remap: fail for partition 1
[ 8870.622890] Buffer I/O error on dev dm-3, logical block 0, lost sync page write
[ 8870.622897] EXT4-fs error (device dm-3): ext4_journal_check_start:61: Detected aborted journal
[ 8870.622899] EXT4-fs (dm-3): Remounting filesystem read-only
[ 8870.622902] EXT4-fs (dm-3): previous I/O error to superblock detected
[ 8870.622904] blk_partition_remap: fail for partition 1
[ 8870.622905] Buffer I/O error on dev dm-3, logical block 0, lost sync page write
[ 8928.949870] blk_partition_remap: fail for partition 1
[ 8928.949878] blk_partition_remap: fail for partition 1
[ 8928.950020] blk_partition_remap: fail for partition 1
[ 8928.950026] blk_partition_remap: fail for partition 1
[ 8928.950030] blk_partition_remap: fail for partition 1
[ 8928.950033] blk_partition_remap: fail for partition 1
[ 8928.950037] blk_partition_remap: fail for partition 1
[ 8941.084894] blk_partition_remap: fail for partition 1
[ 8941.084914] blk_partition_remap: fail for partition 1
[ 8941.084918] blk_partition_remap: fail for partition 1
[ 8941.084922] blk_partition_remap: fail for partition 1
[ 8941.084926] blk_partition_remap: fail for partition 1
[ 8944.289808] scsi 7:0:1:0: Direct-Access ICP KA_RAID_1 V1.0 PQ: 0 ANSI: 2
[ 8944.290018] sd 7:0:1:0: [sdc] Very big device. Trying to use READ CAPACITY(16).
[ 8944.290036] sd 7:0:1:0: [sdc] 7801405440 512-byte logical blocks: (3.99 TB/3.63 TiB)
[ 8944.290047] sd 7:0:1:0: [sdc] Write Protect is off
[ 8944.290049] sd 7:0:1:0: [sdc] Mode Sense: 12 00 10 08
[ 8944.290067] sd 7:0:1:0: [sdc] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 8944.290203] sd 7:0:1:0: Attached scsi generic sg6 type 0
[ 8944.290450] sd 7:0:1:0: [sdc] Very big device. Trying to use READ CAPACITY(16).
[ 8944.325562] sdc: sdc1
[ 8944.325926] sd 7:0:1:0: [sdc] Very big device. Trying to use READ CAPACITY(16).
[ 8944.326030] sd 7:0:1:0: [sdc] Attached SCSI removable disk
[ 8944.618268] blk_partition_remap: fail for partition 1
[ 8944.618276] blk_partition_remap: fail for partition 1
[ 8944.618406] blk_partition_remap: fail for partition 1
[ 8944.618411] blk_partition_remap: fail for partition 1
[ 8944.618415] blk_partition_remap: fail for partition 1
[ 8944.618418] blk_partition_remap: fail for partition 1
[ 8944.618422] blk_partition_remap: fail for partition 1
[ 8946.304142] blk_partition_remap: fail for partition 1
[ 8946.304151] blk_partition_remap: fail for partition 1
[ 8946.304154] blk_partition_remap: fail for partition 1
[ 8946.304158] blk_partition_remap: fail for partition 1
[ 8946.304161] blk_partition_remap: fail for partition 1
[ 8946.304165] blk_partition_remap: fail for partition 1
[ 8946.304168] blk_partition_remap: fail for partition 1
[ 8946.304177] blk_partition_remap: fail for partition 1
[ 8946.304181] blk_partition_remap: fail for partition 1
[ 8947.897137] blk_partition_remap: fail for partition 1
[ 8947.897159] blk_partition_remap: fail for partition 1
[ 8947.897175] blk_partition_remap: fail for partition 1
[ 8947.897183] blk_partition_remap: fail for partition 1
[ 8947.897401] blk_partition_remap: fail for partition 1
So after searching for the source of the problem at the backup tape the real source of it in the HDD,
a hardware "RAID 0" array of two 2TB SAS disks.
I looked at the log of the "tar" command to see in which directories are the files
with problems, added few directories to the "tar" exclude list and re-run the backup tar command.
So far I'm at the 9-th cartrige and it seems the backup advances normally.
After I'll finish (and verify) the backup I'm planing to run the "fsck /dev/sdb" but I'm worried
with regard of: after I'll "umount" the partition of that disk, will I be able to mount it
again and will most of the files remain? I guess some file (where the damaged sectors reside)
will be damaged but will the rest be OK?
Well done for sticking with it - and getting that log. Wasn't making a lot of sense prior to that.
fsck fixes filesystems. Often the errors are "transient" in the sense that (temporary) files that were open at the time of failure didn't close, and need to be cleaned up. No harm done.
Sometimes you get cross-linked inodes (for example) that can mean truncated files after "fixing". And sometimes you get a total basket case when the whole lot goes down the toilet. So it can be a bit of a crap-shoot.
Fortunately it's usually benign - but given your history, you may not be that lucky. fsck (for ext4) will attempt to save file fragments in lost+found - which may help you or may just mislead you. In severe cases I just restore and eat any data loss if my backups aren't sufficiently recent.
As a general rule, if lost+found is empty, your data (all of it in that filesystem) is probably safe.
Yesterday I finished the backup and now I'm at the beginning of verify, I guess it'll take few days,
possible up to a week (life...). If the backup will be OK I'll run "fsck" "and may the SHWARTZ be with me" (where I
live there are lots of them). :-)
One thing isn't clear to me: when I'll run the "fsck" (after "umount") should I run it on the physical disk
at first, while verifying, on the 8-th (out of 12)
cartriges (each - about 4-5 hours) I've got I/O errors from the tape
(this time - the tape, not the disk). So I replaced the 8-th cartrige
and re-run the backup and the verify and finally, was able to "umount"
the partition in question and to run the "fsck". However, the "fsck"
output looks like it didn't fins any errors:
Code:
<root localhost>.../temp>lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 1 273.2G 0 disk
├─sda1 8:1 1 1G 0 part /boot
└─sda2 8:2 1 262.2G 0 part
├─VolGroup-lv_root 253:0 0 99.4G 0 lvm /
├─VolGroup-lv_swap 253:1 0 32G 0 lvm [SWAP]
└─VolGroup-lv_home 253:3 0 130.9G 0 lvm /home
sdb 8:16 1 3.6T 0 disk
└─sdb1 8:17 1 3.6T 0 part
└─VolGroup_graphics-lv_graphics 253:2 0 3.6T 0 lvm /home/user/GRAPHICS
sr0 11:0 1 1024M 0 rom
<root localhost>.../temp>umount /home/user/GRAPHICS
<root localhost>.../temp>lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 1 273.2G 0 disk
├─sda1 8:1 1 1G 0 part /boot
└─sda2 8:2 1 262.2G 0 part
├─VolGroup-lv_root 253:0 0 99.4G 0 lvm /
├─VolGroup-lv_swap 253:1 0 32G 0 lvm [SWAP]
└─VolGroup-lv_home 253:3 0 130.9G 0 lvm /home
sdb 8:16 1 3.6T 0 disk
└─sdb1 8:17 1 3.6T 0 part
└─VolGroup_graphics-lv_graphics 253:2 0 3.6T 0 lvm
sr0 11:0 1 1024M 0 rom
<root localhost>.../temp>lvdisplay
--- Logical volume ---
LV Path /dev/VolGroup_graphics/lv_graphics
LV Name lv_graphics
VG Name VolGroup_graphics
LV UUID eAorkr-Lo1l-qadG-b0nW-AjnY-RWSi-uPFdf5
LV Write Access read/write
LV Creation host, time localhost.localdomain, 2012-10-19 01:46:42 +0200
LV Status available
# open 0
LV Size 3.63 TiB
Current LE 952319
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:2
.
.
.
<root localhost>.../temp>fsck.ext4 /dev/VolGroup_graphics/lv_graphics
e2fsck 1.43.4 (31-Jan-2017)
/dev/VolGroup_graphics/lv_graphics: clean, 17126/243793920 files, 300877458/975174656 blocks
Why is that? I cleary remember (and have logs/dmesg )
that I had 3 disk errors stating that some sectors had a problem
so why the "fsck" doesn't indicate any problem found?
BTW, it took "fsck" about 1 Sec to check 4 TB partition, does this
short time reasonable?
OK, here's the end of "grep sdb /var/log/messages*", Oct 28 is around when I had the disk errors,
later dates are during the successful backup and the verify:
Code:
/var/log/messages-20181104:Oct 28 21:39:29 localhost kernel: print_req_error: critical target error, dev sdb, sector 7488278528
/var/log/messages-20181104:Oct 28 21:39:29 localhost kernel: sd 7:0:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
/var/log/messages-20181104:Oct 28 21:39:29 localhost kernel: sd 7:0:1:0: [sdb] tag#0 Sense Key : Hardware Error [current]
/var/log/messages-20181104:Oct 28 21:39:29 localhost kernel: sd 7:0:1:0: [sdb] tag#0 Add. Sense: Internal target failure
/var/log/messages-20181104:Oct 28 21:39:29 localhost kernel: sd 7:0:1:0: [sdb] tag#0 CDB: Read(16) 88 00 00 00 00 01 be 56 10 00 00 00 00 08 00 00
/var/log/messages-20181104:Oct 28 21:39:29 localhost kernel: print_req_error: critical target error, dev sdb, sector 7488278528
/var/log/messages-20181104:Nov 3 08:56:56 localhost kernel: sd 5:0:1:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
/var/log/messages-20181104:Nov 3 08:56:56 localhost kernel: sd 5:0:1:0: [sdb] 7801405440 512-byte logical blocks: (3.99 TB/3.63 TiB)
/var/log/messages-20181104:Nov 3 08:56:56 localhost kernel: sd 5:0:1:0: [sdb] Write Protect is off
/var/log/messages-20181104:Nov 3 08:56:56 localhost kernel: sd 5:0:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
/var/log/messages-20181104:Nov 3 08:56:56 localhost kernel: sd 5:0:1:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
/var/log/messages-20181104:Nov 3 08:56:56 localhost kernel: sdb: sdb1
/var/log/messages-20181104:Nov 3 08:56:56 localhost kernel: sd 5:0:1:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
/var/log/messages-20181104:Nov 3 08:56:56 localhost kernel: sd 5:0:1:0: [sdb] Attached SCSI removable disk
/var/log/messages-20181104:Nov 3 08:57:11 localhost dracut-initqueue: Scanning devices sda2 sdb1 for LVM logical volumes VolGroup/lv_root VolGroup/lv_swap
/var/log/messages-20181104:Nov 3 08:57:19 localhost smartd[777]: Device: /dev/sdb, opened
/var/log/messages-20181104:Nov 3 08:57:19 localhost smartd[777]: Device: /dev/sdb, [ICP KA_RAID_1 V1.0], S/N: 8435327C, 3.99 TB
/var/log/messages-20181104:Nov 3 08:57:19 localhost smartd[777]: Device: /dev/sdb, is SMART capable. Adding to "monitor" list.
/var/log/messages-20181104:Nov 3 09:25:10 localhost kernel: sd 6:0:1:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
/var/log/messages-20181104:Nov 3 09:25:10 localhost kernel: sd 6:0:1:0: [sdb] 7801405440 512-byte logical blocks: (3.99 TB/3.63 TiB)
/var/log/messages-20181104:Nov 3 09:25:10 localhost kernel: sd 6:0:1:0: [sdb] Write Protect is off
/var/log/messages-20181104:Nov 3 09:25:10 localhost kernel: sd 6:0:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
/var/log/messages-20181104:Nov 3 09:25:10 localhost kernel: sd 6:0:1:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
/var/log/messages-20181104:Nov 3 09:25:11 localhost kernel: sdb: sdb1
/var/log/messages-20181104:Nov 3 09:25:11 localhost kernel: sd 6:0:1:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
/var/log/messages-20181104:Nov 3 09:25:11 localhost kernel: sd 6:0:1:0: [sdb] Attached SCSI removable disk
/var/log/messages-20181104:Nov 3 09:25:25 localhost dracut-initqueue: Scanning devices sda2 sdb1 for LVM logical volumes VolGroup/lv_root VolGroup/lv_swap
/var/log/messages-20181104:Nov 3 09:25:36 localhost smartd[809]: Device: /dev/sdb, opened
/var/log/messages-20181104:Nov 3 09:25:36 localhost smartd[809]: Device: /dev/sdb, [ICP KA_RAID_1 V1.0], S/N: 8435327C, 3.99 TB
/var/log/messages-20181104:Nov 3 09:25:36 localhost smartd[809]: Device: /dev/sdb, is SMART capable. Adding to "monitor" list.
[code]
<root localhost.localdomain>.../root>pvdisplay
--- Physical volume ---
PV Name /dev/sdb1
VG Name VolGroup_graphics
PV Size 3.63 TiB / not usable 2.00 MiB
Allocatable yes (but full)
PE Size 4.00 MiB
Total PE 952319
Free PE 0
Allocated PE 952319
PV UUID W71uV5-m17D-qugv-1RRf-ESAn-wexC-O1ZHhE
--- Physical volume ---
PV Name /dev/sda2
VG Name VolGroup
PV Size 262.24 GiB / not usable 23.00 MiB
Allocatable yes (but full)
PE Size 32.00 MiB
Total PE 8391
Free PE 0
Allocated PE 8391
PV UUID bWsE7m-zeBE-5wCi-BoL2-dL1i-0tFg-ClCH0z
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.