[SOLVED] Can HP StorageWorks 215 tape cause FC26 KDE GUI to be stuck?

kaza · 10-15-2018, 04:31 PM

Hello!

For quite a few years I perform full backups using a HP StorageWorks 215 (LTO 1) tape,
about 18 times so far and each time it worked flawlessly in the following way:

I open a "konsole", become "root" and call a Perl wrapper which calculates timestamp
and calls "tar" by a command similar to:

Code:

<root localhost>.../backup>backup_tar -full
Running system command: "/bin/tar -X /root/backup/exclude.conf --exclude=/root/backup/log/full_2018_10_15_2143 --xattrs --numeric-owner --exclude-caches -cvpMf /dev/st0 / >& /root/backup/log/full_2018_10_15_2143"...

Another "konsole" is opened (also as "root") in the directory of the log file,
where I open the log file with "nedit" and from time to time I move the cursor over
"nedit", get "file modified by another program Reload/Cancel" prompt, click "reload" and
check what files it's backing up now. I also see there the prompts to insert the next volume
(it's a multi volume backup), when I see one I press "eject" on the tape drive panel, replace
the cartrige and press
"Enter" in the "konsole" where the backup command is running.
Also in the log file "konsole" I run from time to time a Perl script which calculates the sum
of file sizes listed in the log:

Code:

<root localhost>.../log>du_f -G full_2018_10_15_2143
Total files size: 6 G.

Except the first time it's just "Arrow up" followed by "Enter".

The time it takes to fill one LTO 1 (100 GB) cartrige is about 6 hours (it's a PC built in 2009:
AMD Phenom X4, ASUS M3A79T deluxe mboard, 16 GB RAM, ATI FireGL V5600 video card,
currently running FC26+KDE),
the number of volumes is about 10 so it's a multi-day process: start, wait 6 hours (or go to work
or to sleep), check "insert next volume, press Enter when ready" prompt in "nedit", eject used tape,
insert next, press "Enter" in the "console" of the "tar" command.

But the last time I attempted to perform the same backup process I've got problems:
the first 2 volume changes went OK but on the 3-rd one I moved the mouse pointer over the
"nedit", got the usual "file changed by other program, reload/cancell" prompt
but when I clicked on "Reload" - nothing happened, the prompt remained as if I didn't
click on it. I clicked on the "konsole" of the log file directory - it changed focus to
front, as expected, I pressed "Arrow up" - got the last command "du_f -G <log name>"
but when I pressed "Enter" - again, nothing happened.
I clicked the "close" button of "nedit" but it ignored me. So I attempted to "blindly"
change the tape - pressed "Eject", waited for the tape rewind to end and tape to eject,
replaced the tape, clicked on the "konsole" where the "tar" command is running -
it changed focus to "front" but when I pressed "Enter" on it - again, it had been ignored,
nothing happened.

At that point I discovered that I can't close any "konsole" (only focus changes of already
opened windows), can't open any new "konsole" (in fact - anything). So I decided
to shutdown - pressed "Eject", took the tape out (thank God, at least this worked),
attempted RMB on the desktop to bring the shutdown menu - nothing,
attempted going to text mode by Ctrl+Alt+F3 to shutdown from there - nothing.
As a last resort I had to brutally power down the PC.

After waiting some time and powering up I attempted again but this time I didn't even got
to the first cartrige replace prompt -
when I've got back to the PC after few hours I found it in the same semi-responsive state.
Again had to power down the PC and after rebooting and checking the last write time
of the log file I saw that it stopped being updated about 2 hours after the backup start.

The fact that initially the problem appeared before the 3-rd cartride change and later
started appearing during the first cartrige also points in the direction of the
backup process, in particular - the tape. No error LEDs on the tape drive, only the "Ready"
blinking during the backup and is "ON" steady when I get back to a "semi-stuck" GUI.

REAL TIME UPDATE: while I was writing this post I was also running another backup attempt.
After about 1.5 hours suddenly I noticed in the log some I/O errors like:

Code:

/home/kaza/GRAPHICS/PICS/D/Video0208.3gp
/home/kaza/GRAPHICS/PICS/D/Photo0416.jpg
/home/kaza/GRAPHICS/PICS/D/Photo0542.jpg
/home/kaza/GRAPHICS/PICS/D/Video0025.3gp
/bin/tar: /home/kaza/GRAPHICS/PICS/D/Video0025.3gp: File shrank by 4145152 bytes; padding with zeros
/home/kaza/GRAPHICS/PICS/D/Photo0271.jpg
/bin/tar: /home/kaza/GRAPHICS/PICS/D/Photo0271.jpg: Read error at byte 0, while reading 2048 bytes: Input/output error
/home/kaza/GRAPHICS/PICS/D/Photo0179.jpg
/bin/tar: /home/kaza/GRAPHICS/PICS/D/Photo0179.jpg: Read error at byte 0, while reading 3072 bytes: Input/output error
/home/kaza/GRAPHICS/PICS/D/Video0097.3gp
/bin/tar: /home/kaza/GRAPHICS/PICS/D/Video0097.3gp: Read error at byte 0, while reading 512 bytes: Input/output error

after some time I heared a beep from the PC (can't tell if from the tape drive or something
else), the "Ready" LED stopped blinking and turned "ON" steady.
I hurriedly "copy-pasted" this post to a text file before noticing the familiar "semi-stuck"
GUI. But after about a minute or so I noticed the neither the "top" window not "gkrellm"
graphs (always opened at the side of my desktop to see what's going) are being updated
and the mouse pointer dsappeared. Now the GUI was "totally-stuck".

In the past I always started a backup from inserting a cleaning cartrige into the tape drive
but this time I've got "tape error" LED which is probably "cleanings count timeout".
I ordered a new one on e-bay but since it's expected to take a month to arrive I attempted
to backup anyway. Am I being punished for this? The tape drive hasn't been used for few months,
it might've gathered some dust.

But the most important question: why a problem with the backup tape drive causes the GUI to be stuck,
first partially and then completely?

My kernel version is:

Code:

<localhost>.../NOA>uname -a
Linux localhost.localdomain 4.15.6-200.spi_gpio.fc26.x86_64 #1 SMP Sat Mar 3 02:13:51 IST 2018 x86_64 x86_64 x86_64 GNU/Linux

Looking at the kernel name reminds me that I've made a privet kernel version with SPI and GPIO
support enabled to allow me programming through USB an SPI chip (CH341A). Can it somehow affect
KDE GUI being "stuck" when tape drive has errors?

TIA for any ideas (except the obvious "don't start backup before heads had been cleaned"),
kaza.

sevendogsbsd · 10-17-2018, 04:22 PM

Duplicate post of this, sort of: https://www.linuxquestions.org/quest...ck-4175640596/

syg00 · 10-17-2018, 06:41 PM

Next time it happens copy /proc/meminfo to a file and also get a grab of top - just the first 20 line is probably enough.
Post them here. Take a copy of meminfo just after boot for comparison as well.

kaza · 10-18-2018, 02:31 PM

Thanks for the reply,

When this problem occurs the KDE eventually freezes up
so I guess I'll need to be lucky to catch the tape errors
just as they occur and before the whole KDE freezes up.

When tape errors occur and the KDE gradually "loses its bearings"
I don't remember I saw anything hogging close to 100% CPU (core)
but I'll try to capture the "top".

I'll do another backup try and see if I'll be fast enough to save the info
before KDE freezes up.

As a wild guess it seems to me that I/O errors of the SCSI tape
somehow prevent normal disk access of the OS (KDE), despite the fact that
OS resides on SAS disks connected to a different controller than the
tape. Here's the output of "lsscsi":

Code:

<localhost>.../temp>lsscsi
[0:0:0:0]    cd/dvd  HL-DT-ST DVDRAM GH24NSD1  LG00  /dev/sr0 
[6:0:0:0]    disk    ICP      KA_RAID_0        V1.0  /dev/sda 
[6:0:1:0]    disk    ICP      KA_RAID_1        V1.0  /dev/sdb 
[6:1:0:0]    disk    FUJITSU  MAU3147RC        0104  -        
[6:1:1:0]    disk    FUJITSU  MBA3147RC        HPF1  -        
[6:1:2:0]    disk    TOSHIBA  MK2001TRKB       0105  -        
[6:1:3:0]    disk    TOSHIBA  MK2001TRKB       0105  -        
[7:0:0:0]    disk    FUJITSU  MAS3367NP        0104  /dev/sdc 
[7:0:3:0]    tape    HP       Ultrium 1-SCSI   N2CG  /dev/st0

The OS is installen on the pair of MAU3147RC connected as RAID 0
(MAS3367NP is some old data).

As I wrote, it's just a wild guess.

OK, logging off and performing the experiment...

kaza · 10-20-2018, 02:31 PM

Well, the first experiment was a failure: for the first 2 hours the tape drive worked OK
so I stopped watching it continuosly and started "sampling" it once in few minutes.
After over 3 hours (damn Murphy with his laws...) I found the KDE almost stuck and needless
to say, no chance of copying the /proc/meminfo. I couldn't move any window and "copy paste"
the "top" window contents but from what was accidentally not obscured on it I could manually
copy the following:

Code:

up: 3:18
Tasks: 249 total, 2 running, 182 sleeping, 0 stopped, 0 zombie
%cpu: 0.6 us, 0.2 sy, 0.0 ni, 24.7 id, 74.2 wd, 0.1 hi, 0.0 si, 0.0 st
KiB Mem:  16424780 total, 172160 free, 1526880 used, 14725680 buff/cache
KiB swap: 33554428 total, 33554428 free, 0 used, 14524544 avail Mem

From the processes % cpu list:

kwin           2.0%
Xorg           1.3%
konsole        0.7%
kdeconnectd    0.3%
xembedsniproxy 0.3%
korgac         0.3%
akonadi_contr+ 0.3%
akonadi_birth+ 0.3%
akonadi_follo+ 0.3%
kworker/2:2H   0.3%
systemd        0.0%

I also noticed that the tape drive makes a movement sound for about 1 sec in duration
once in 4-5 minutes. Together with the "74.2%" CPU wasted on wait it looks like
the tape drive is endlessly attempting to get again and again some data it didn't
get the first time and the rest of KDE can't do anything while it happens.
(That's just my guess, feel free to correct me.)

I'll make another experiment, this time I'll write a Perl script which will
copy /proc/meminfo into a file with timestame in name once in 10 minutes
and also a "top" in batch mode will be saved to another file, also
with a timestamp in name.

Meanwhile we can only theoretize.

Does anyone know (or can point me to a knowledge source) how the LINUX
system accesses different disks/tapes connected to different
SCSI controllers (main disk - SAS, tape drive - SCSI) from the viewpoint
of queueing: is there a single gueue for requests of all disks+tape
or many queues working in parallel? Is there any configuration of this
which can be changed (even if it means kernel compilation configuration)?

TIA for ideas,
kaza.

syg00 · 10-20-2018, 06:01 PM

Next time just grab your phone and take a photo - saves all the typing ...

Thanks for the interrupt percentage - I was worried that it may be high; that runs at a priority above all else. No CPU is "wasted on wait" - that is merely an accounting convenience, a subset of "idle", but may (does in your case) indicate non-completed I/O.

Whilst you're collecting data, you might as well get vmstat.

Each device gets its own q - SSDs may get multi-queuing per device if you have a current kernel.

kaza · 10-26-2018, 07:13 AM

OK, it took some time (life, etc.) to repeat
the experiment with periodically saved data (/proc/meminfo, top -b -n 1, dmesg -c)
and I don't see anything "sticking out" in the saved reports.

Again, after over 2 hours the tape stopped the backup operation, a short (~1 sec) beep
was audible (can it be some m-board error code?) and the GUI became first "partially
responsive" and after insuccessfull attempt to close the FireFox window (from past experiences -
that happened after attempt to close any window) the GUI became frozen.

The data save period was 10 minutes and the last timestamp was: 2018_10_26_1404.
Looking at the time stamp of the tar command log I see that it last write was at time 14:12
so 8 minutes passes from the last dump of the info to the system being stuck.

The last /proc/meminfo:

Code:

MemTotal:       16424720 kB
MemFree:          153424 kB
MemAvailable:   14569136 kB
Buffers:          143932 kB
Cached:         14157476 kB
SwapCached:            0 kB
Active:          1405552 kB
Inactive:       14333024 kB
Active(anon):     755192 kB
Inactive(anon):   487188 kB
Active(file):     650360 kB
Inactive(file): 13845836 kB
Unevictable:          48 kB
Mlocked:              48 kB
SwapTotal:      33554428 kB
SwapFree:       33554428 kB
Dirty:              2560 kB
Writeback:             0 kB
AnonPages:       1437372 kB
Mapped:           530472 kB
Shmem:            119444 kB
Slab:             320276 kB
SReclaimable:     256700 kB
SUnreclaim:        63576 kB
KernelStack:       10912 kB
PageTables:        64224 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    41766788 kB
Committed_AS:    5906472 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      161472 kB
DirectMap2M:     5081088 kB
DirectMap1G:    11534336 kB

The last top:

Code:

top - 14:04:30 up  2:59,  4 users,  load average: 2.02, 1.72, 1.49
Tasks: 240 total,   2 running, 171 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.7 us,  0.8 sy,  0.0 ni, 93.4 id,  0.9 wa,  0.1 hi,  0.1 si,  0.0 st
KiB Mem : 16424720 total,   151184 free,  1651008 used, 14622528 buff/cache
KiB Swap: 33554428 total, 33554428 free,        0 used. 14567948 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5440 kaza      20   0 9312356 584156 168004 S  52.9  3.6   4:23.97 firefox
 5509 kaza      20   0 2651412 460716 182280 S  17.6  2.8   4:18.17 Web Content
 1048 root      20   0  435780 114984  86776 S  11.8  0.7   2:45.53 Xorg
 3268 root      20   0  128808   3412   2644 D   5.9  0.0   1:22.58 tar
 6283 root      20   0  164072   4356   3692 R   5.9  0.0   0:00.01 top
    1 root      20   0  223924  10420   6720 S   0.0  0.1   0:01.65 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.01 kthreadd
    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq
    7 root      20   0       0      0      0 S   0.0  0.0   0:00.05 ksoftirqd/0
    8 root      20   0       0      0      0 I   0.0  0.0   0:00.91 rcu_sched
    9 root      20   0       0      0      0 I   0.0  0.0   0:00.00 rcu_bh
   10 root      rt   0       0      0      0 S   0.0  0.0   0:00.00 migration/0
   11 root      rt   0       0      0      0 S   0.0  0.0   0:00.00 watchdog/0
   12 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/0
   13 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/1
   14 root      rt   0       0      0      0 S   0.0  0.0   0:00.00 watchdog/1
   15 root      rt   0       0      0      0 S   0.0  0.0   0:00.00 migration/1
   16 root      20   0       0      0      0 S   0.0  0.0   0:00.02 ksoftirqd/1
   19 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/2
   20 root      rt   0       0      0      0 S   0.0  0.0   0:00.00 watchdog/2
   21 root      rt   0       0      0      0 S   0.0  0.0   0:00.00 migration/2
   22 root      20   0       0      0      0 S   0.0  0.0   0:00.09 ksoftirqd/2
   25 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/3
   26 root      rt   0       0      0      0 S   0.0  0.0   0:00.00 watchdog/3
   27 root      rt   0       0      0      0 S   0.0  0.0   0:00.00 migration/3
   28 root      20   0       0      0      0 S   0.0  0.0   0:00.04 ksoftirqd/3
   31 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kdevtmpfs
   32 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 netns
   33 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_tasks_kthre
   34 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kauditd
   36 root      20   0       0      0      0 I   0.0  0.0   0:00.40 kworker/1:1
   37 root      20   0       0      0      0 S   0.0  0.0   0:00.00 oom_reaper
   38 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 writeback
   39 root      20   0       0      0      0 S   0.0  0.0   0:01.87 kcompactd0
   40 root      25   5       0      0      0 S   0.0  0.0   0:00.00 ksmd
   41 root      39  19       0      0      0 S   0.0  0.0   0:00.00 khugepaged
   42 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 crypto
   43 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kintegrityd
   44 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kblockd
   46 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 ata_sff
   47 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 md
   48 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 edac-poller
   49 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 devfreq_wq
   50 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 watchdogd
   52 root      20   0       0      0      0 S   0.0  0.0   0:06.16 kswapd0
  101 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kthrotld
  103 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 acpi_thermal_pm
  104 root      20   0       0      0      0 S   0.0  0.0   0:00.00 scsi_eh_0
  105 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 scsi_tmf_0
  106 root      20   0       0      0      0 S   0.0  0.0   0:00.00 scsi_eh_1
  107 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 scsi_tmf_1
  108 root      20   0       0      0      0 S   0.0  0.0   0:00.00 scsi_eh_2
  109 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 scsi_tmf_2
  110 root      20   0       0      0      0 S   0.0  0.0   0:00.00 scsi_eh_3
  111 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 scsi_tmf_3
  117 root      20   0       0      0      0 I   0.0  0.0   0:00.00 kworker/3:2
  118 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 dm_bufio_cache
  119 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 ipv6_addrconf
  125 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kstrp
  338 root      20   0       0      0      0 S   0.0  0.0   0:00.01 scsi_eh_4
  340 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 scsi_tmf_4
  342 root      20   0       0      0      0 S   0.0  0.0   0:00.01 scsi_eh_5
  343 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 scsi_tmf_5
  344 root      20   0       0      0      0 S   0.0  0.0   0:00.00 scsi_eh_6
  345 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 scsi_tmf_6
  346 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 firewire
  347 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 firewire_ohci
  355 root      20   0       0      0      0 S   0.0  0.0   0:00.00 scsi_eh_7
  357 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 scsi_tmf_7
  358 root      20   0       0      0      0 S   0.0  0.0   0:00.02 scsi_eh_8
  360 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 scsi_tmf_8
  361 root      20   0       0      0      0 S   0.0  0.0   0:00.01 scsi_eh_9
  362 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 scsi_tmf_9
  366 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 ttm_swap
  368 root      20   0       0      0      0 S   0.0  0.0   0:00.02 aacraid
  451 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kdmflush
  452 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 bioset
  461 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kdmflush
  462 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 bioset
  480 root      20   0       0      0      0 S   0.0  0.0   0:00.10 jbd2/dm-0-8
  481 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 ext4-rsv-conver
  564 root      20   0  156516  39052  38320 S   0.0  0.2   0:00.77 systemd-journal
  585 root      20   0   58032   7752   3696 S   0.0  0.0   0:00.78 systemd-udevd
  609 root       0 -20       0      0      0 I   0.0  0.0   0:00.71 kworker/1:1H
  660 root      20   0       0      0      0 S   0.0  0.0   0:00.00 saa7134[0]
  709 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kdmflush
  710 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 bioset
  716 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kdmflush
  717 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 bioset
  733 root      20   0       0      0      0 S   0.0  0.0   0:00.33 jbd2/dm-2-8
  734 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 ext4-rsv-conver
  736 root      20   0       0      0      0 S   0.0  0.0   0:00.02 jbd2/dm-3-8
  737 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 ext4-rsv-conver
  741 root      20   0       0      0      0 S   0.0  0.0   0:00.00 jbd2/sda1-8
  742 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 ext4-rsv-conver
  764 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rpciod
  765 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 xprtiod
  768 root      16  -4   64048   1744   1328 S   0.0  0.0   0:00.01 auditd
  789 root      20   0   25448   4188   3480 S   0.0  0.0   0:00.02 smartd
  791 root      20   0  278072   8652   7640 S   0.0  0.1   0:00.03 sssd
  792 root      20   0  723736  21832  21104 S   0.0  0.1   0:00.35 rsyslogd
  793 dbus      20   0   80016   5964   4256 S   0.0  0.0   0:00.60 dbus-daemon
  799 root      20   0  204764   3376   2924 S   0.0  0.0   0:00.01 gssproxy
  805 root      20   0  284340   9048   7680 S   0.0  0.1   0:00.25 sssd_be
  807 root      20   0  353296  32496  13124 S   0.0  0.2   0:00.91 firewalld
  810 rtkit     21   1  186040   3340   3012 S   0.0  0.0   0:00.06 rtkit-daemon
  811 root      20   0   99928   3096   2788 S   0.0  0.0   0:00.51 irqbalance
  812 root      39  19   17140   1564   1420 S   0.0  0.0   0:00.00 alsactl
  815 avahi     20   0   72228   3640   3284 S   0.0  0.0   0:00.02 avahi-daemon
  816 root      20   0  415128   8540   6844 S   0.0  0.1   0:00.05 ModemManager
  827 root      20   0  463960   9344   7980 S   0.0  0.1   0:00.03 abrtd
  828 root      20   0  277656  10368   9124 S   0.0  0.1   0:00.25 sssd_nss
  837 chrony    20   0  122508   3632   3176 S   0.0  0.0   0:02.13 chronyd
  842 avahi     20   0   72228    356      0 S   0.0  0.0   0:00.00 avahi-daemon
  846 root      20   0   71360   7476   5480 S   0.0  0.0   0:00.11 systemd-logind
  847 root      20   0  417704   7680   6492 S   0.0  0.0   0:00.20 accounts-daemon
  889 polkitd   20   0  559128  16472  10548 S   0.0  0.1   0:00.41 polkitd
  897 root      20   0  762516  16092  14940 S   0.0  0.1   0:00.07 abrt-dump-journ
  898 root      20   0  768956   9724   8452 S   0.0  0.1   0:00.04 abrt-dump-journ
  899 root      20   0  768880  11020   9792 S   0.0  0.1   0:00.04 abrt-dump-journ
  961 root      20   0  793264  16692  13892 S   0.0  0.1   0:00.42 NetworkManager
  979 root      20   0  231732  10316   7940 S   0.0  0.1   0:00.07 cupsd
  991 root      20   0   42588   2380   2176 S   0.0  0.0   0:00.00 atd
  992 root      20   0  143312   3392   2772 S   0.0  0.0   0:00.32 crond
  995 root      20   0  314032  14664  13336 S   0.0  0.1   0:00.04 sddm
 1004 colord    20   0  428356  13264   8520 S   0.0  0.1   0:01.36 colord
 1166 root      20   0   77096   8564   6660 S   0.0  0.1   0:00.03 dhclient
 1238 root      20   0  477180   9348   7424 S   0.0  0.1   0:00.16 udisksd
 1244 root      20   0  424228   8856   7632 S   0.0  0.1   0:00.12 upowerd
 1268 root      20   0  324420  15384  13840 S   0.0  0.1   0:00.38 sddm-helper
 1272 kaza      20   0   82900   8040   6756 S   0.0  0.0   0:00.17 systemd
 1274 kaza      20   0  133028   3824     20 S   0.0  0.0   0:00.00 (sd-pam)
 1288 kaza      20   0  431380  34852  30456 S   0.0  0.2   0:00.34 kwalletd
 1289 kaza      20   0  140564   3088   2748 S   0.0  0.0   0:00.06 tcsh
 1309 kaza      20   0   73752   6100   4104 S   0.0  0.0   0:01.04 dbus-daemon
 1315 kaza      20   0   70160    788      0 S   0.0  0.0   0:00.01 ssh-agent
 1329 kaza      20   0  120032   3152   2876 S   0.0  0.0   0:00.00 startkde
 1356 kaza      20   0   13388    164      0 S   0.0  0.0   0:00.00 start_kdeinit
 1357 kaza      20   0  263452   8312   6152 S   0.0  0.1   0:00.05 kdeinit5
 1358 kaza      20   0  627632  35740  31484 S   0.0  0.2   0:00.68 klauncher
 1361 kaza      20   0 1641360  71792  58416 S   0.0  0.4   0:02.15 kded5
 1370 kaza      20   0  650680  33796  28936 S   0.0  0.2   0:00.59 kaccess
 1384 kaza      20   0  190216   5588   4548 S   0.0  0.0   0:00.16 dconf-service
 1386 kaza      20   0   88636   5368   4824 S   0.0  0.0   0:00.00 kwrapper5
 1387 kaza      20   0  754148  43704  38012 S   0.0  0.3   0:01.07 ksmserver
 1391 kaza      20   0  630112  35464  30036 S   0.0  0.2   0:00.66 kglobalaccel5
 1404 kaza      20   0  401252  11596  10064 S   0.0  0.1   0:00.07 mission-control
 1417 kaza      20   0  472428  18832  16808 S   0.0  0.1   0:00.42 kscreen_backend
 1441 kaza      20   0 3321220  92032  71092 S   0.0  0.6   1:28.86 kwin_x11
 1459 kaza      20   0  729336  44600  37776 S   0.0  0.3   0:01.32 kdeconnectd
 1460 kaza      20   0  963324  85892  65972 S   0.0  0.5   0:02.11 krunner
 1462 kaza      20   0 3992484 207080  99152 S   0.0  1.3   0:32.86 plasmashell
 1463 kaza      20   0  805160  39308  34600 S   0.0  0.2   0:00.56 polkit-kde-auth
 1464 kaza      20   0  638296  33984  29832 S   0.0  0.2   0:00.64 xembedsniproxy
 1478 kaza      20   0  351352  21428  18108 S   0.0  0.1   0:16.42 gkrellm
 1496 kaza       9 -11 1620020  15780  12184 S   0.0  0.1   0:12.25 pulseaudio
 1539 kaza      20   0   22624   2772   2612 S   0.0  0.0   0:00.00 gam_server
 1542 kaza      20   0 1021752  42008  34924 S   0.0  0.3   0:00.91 kactivitymanage
 1583 kaza      20   0  838068  73940  61980 S   0.0  0.5   0:12.54 konsole
 1584 kaza      20   0  690960  40776  33592 S   0.0  0.2   0:00.17 abrt-applet
 1586 kaza      20   0 1201092  90368  75448 S   0.0  0.6   0:01.24 korgac
 1587 kaza      20   0  814888  39964  35136 S   0.0  0.2   0:00.72 org_kde_powerde
 1619 kaza      20   0  346932   5832   5328 S   0.0  0.0   0:00.01 at-spi-bus-laun
 1628 kaza      20   0   72352   4564   3952 S   0.0  0.0   0:00.15 dbus-daemon
 1631 kaza      20   0  220596   5496   4880 S   0.0  0.0   0:00.58 at-spi2-registr
 1642 kaza      20   0  134576   3636   2840 S   0.0  0.0   0:00.03 tcsh
 1659 kaza      20   0  398016   6376   5664 S   0.0  0.0   0:00.01 gvfsd
 1661 kaza      20   0  164072   4316   3616 R   0.0  0.0   0:23.97 top
 1666 root      20   0  360536   9776   8456 S   0.0  0.1   0:00.06 abrt-dbus
 1716 kaza      20   0  637196  35604  31140 S   0.0  0.2   0:01.26 akonadi_control
 1720 kaza      20   0 2649920  32280  27976 S   0.0  0.2   0:00.32 akonadiserver
 1724 kaza      20   0  636264  67756  17464 S   0.0  0.4   0:03.66 mysqld
 1767 kaza      20   0  732156  41780  36956 S   0.0  0.3   0:00.69 akonadi_akonote
 1769 kaza      20   0 1234240  90760  75972 S   0.0  0.6   0:01.17 akonadi_archive
 1771 kaza      20   0  745184  43252  38228 S   0.0  0.3   0:00.76 akonadi_birthda
 1773 kaza      20   0  823712  44768  39644 S   0.0  0.3   0:01.37 akonadi_followu
 1775 kaza      20   0  841468  45404  40144 S   0.0  0.3   0:01.37 akonadi_ical_re
 1777 kaza      39  19  761776  45936  39804 S   0.0  0.3   0:00.81 akonadi_indexin
 1781 kaza      20   0  732148  41768  36948 S   0.0  0.3   0:00.70 akonadi_maildir
 1782 kaza      20   0  811204  42336  37352 S   0.0  0.3   0:01.28 akonadi_maildis
 1783 kaza      20   0 1281160  93500  78284 S   0.0  0.6   0:01.30 akonadi_mailfil
 1791 kaza      20   0  725196  40768  36040 S   0.0  0.2   0:00.66 akonadi_migrati
 1798 kaza      20   0 1072820  82396  68312 S   0.0  0.5   0:00.95 akonadi_newmail
 1800 kaza      20   0 1223656  88572  73884 S   0.0  0.5   0:01.74 akonadi_sendlat
 1807 kaza      20   0  836684  44488  39548 S   0.0  0.3   0:01.37 akonadi_vcard_r
 1810 kaza      20   0  762228  44004  38876 S   0.0  0.3   0:00.76 akonadi_vcard_r
 1825 kaza      20   0  636776  37704  33384 S   0.0  0.2   0:00.59 kuiserver5
 1904 root      20   0  554108  19752  16408 S   0.0  0.1   0:00.14 packagekitd
 2048 kaza      20   0  564532   9120   8188 S   0.0  0.1   0:00.07 xdg-desktop-por
 2053 kaza      20   0  715380  12240  10412 S   0.0  0.1   0:00.05 xdg-document-po
 2056 kaza      20   0  491448   9944   8256 S   0.0  0.1   0:00.02 xdg-permission-
 2073 kaza      20   0  603108  21548  18288 S   0.0  0.1   0:00.08 xdg-desktop-por
 2078 kaza      20   0 1445556  37496  30616 S   0.0  0.2   0:00.24 evolution-sourc
 2085 kaza      20   0  858616  29880  24892 S   0.0  0.2   0:00.09 goa-daemon
 2093 kaza      20   0  526420  13120  11312 S   0.0  0.1   0:00.29 goa-identity-se
 2098 kaza      20   0 1078472  49176  39676 S   0.0  0.3   0:00.38 evolution-alarm
 2103 kaza      20   0  854588  80820  67652 S   0.0  0.5   0:05.62 konsole
 2107 kaza      20   0 1053008  42196  33204 S   0.0  0.3   0:00.26 evolution-calen
 2115 kaza      20   0  135152   4268   2888 S   0.0  0.0   0:00.17 tcsh
 2145 kaza      20   0 1212464  38708  30372 S   0.0  0.2   0:00.20 evolution-calen
 2155 kaza      20   0 1037380  37336  30796 S   0.0  0.2   0:00.22 evolution-addre
 2158 kaza      20   0 1408884  40588  30544 S   0.0  0.2   0:00.25 evolution-calen
 2186 kaza      20   0 1205792  37676  31280 S   0.0  0.2   0:00.17 evolution-addre
 2203 kaza      20   0  474268   8816   7776 S   0.0  0.1   0:00.07 gnome-keyring-d
 2688 root      20   0  268924   9604   7704 S   0.0  0.1   0:00.01 su
 2697 root      20   0  143348   4172   3004 S   0.0  0.0   0:00.10 tcsh
 2977 kaza      20   0  839744  76176  63852 S   0.0  0.5   0:03.29 konsole
 2982 kaza      20   0  134580   3532   2712 S   0.0  0.0   0:00.04 tcsh
 3009 root      20   0  268924   9648   7756 S   0.0  0.1   0:00.02 su
 3018 root      20   0  143348   4216   3048 S   0.0  0.0   0:00.08 tcsh
 3068 root      20   0       0      0      0 I   0.0  0.0   0:00.00 kworker/1:2
 3106 kaza      20   0  839372  75728  63624 S   0.0  0.5   0:03.39 konsole
 3111 kaza      20   0  134580   3576   2756 S   0.0  0.0   0:00.04 tcsh
 3138 root      20   0  268924   9568   7672 S   0.0  0.1   0:00.01 su
 3147 root      20   0  142952   3796   2988 S   0.0  0.0   0:00.04 tcsh
 3215 root      20   0  136868   4608   4172 S   0.0  0.0   0:00.01 periodic_status
 3263 root      20   0  136996   4644   4072 S   0.0  0.0   0:00.00 backup_tar
 3267 root      20   0  120032   2580   2404 S   0.0  0.0   0:00.00 sh
 3276 root      20   0  212352  29588   9764 S   0.0  0.2   0:04.52 nedit
 3642 root      20   0       0      0      0 I   0.0  0.0   0:00.35 kworker/3:0
 4283 kaza      20   0  186040   5792   5260 S   0.0  0.0   0:00.02 gconfd-2
 4623 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/1:2H
 4681 root       0 -20       0      0      0 I   0.0  0.0   0:00.29 kworker/3:1H
 4889 root       0 -20       0      0      0 I   0.0  0.0   0:00.21 kworker/2:1H
 4960 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/3:2H
 5171 root       0 -20       0      0      0 I   0.0  0.0   0:00.10 kworker/0:2H
 5338 root      20   0       0      0      0 I   0.0  0.0   0:00.15 kworker/2:0
 5437 root      20   0       0      0      0 I   0.0  0.0   0:00.24 kworker/0:1
 5694 kaza      20   0 2001556  85788  57868 S   0.0  0.5   0:00.72 Web Content
 5809 root      20   0       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0
 5863 root      20   0       0      0      0 I   0.0  0.0   0:00.05 kworker/u8:0
 6076 root      20   0       0      0      0 I   0.0  0.0   0:00.01 kworker/u8:2
 6114 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H
 6115 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/2:2H
 6128 root      20   0       0      0      0 I   0.0  0.0   0:00.00 kworker/2:2

The "dmesg -c" output remained empty.

I also looked at /var/log/messages, I only see messages related to network manager
until slightly after 14:05:30 (3.5 min after the last write of tar log file)
I see a binary data in the file, probably that's when I powered down the PC
(couldn't do anything to properly shutdown, Ctrl+Alt+F3 didn't help and neither
the Alt+SysRq(Prt Scr)+R-E-I-S-U-B):

Code:

Oct 26 13:39:07 localhost dhclient[1166]: DHCPREQUEST on enp4s0 to 10.0.0.138 port 67 (xid=0x385ead74)
Oct 26 13:39:08 localhost dhclient[1166]: DHCPACK from 10.0.0.138 (xid=0x385ead74)
Oct 26 13:39:08 localhost NetworkManager[961]: <info>  [1540550348.0764] dhcp4 (enp4s0):   address 10.0.0.1
Oct 26 13:39:08 localhost NetworkManager[961]: <info>  [1540550348.0769] dhcp4 (enp4s0):   plen 24 (255.255.255.0)
Oct 26 13:39:08 localhost dbus-daemon[793]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.9' (uid=0 pid=961 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")
Oct 26 13:39:08 localhost NetworkManager[961]: <info>  [1540550348.0770] dhcp4 (enp4s0):   gateway 10.0.0.138
Oct 26 13:39:08 localhost audit: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Oct 26 13:39:08 localhost dhclient[1166]: bound to 10.0.0.1 -- renewal in 1556 seconds.
Oct 26 13:39:08 localhost NetworkManager[961]: <info>  [1540550348.0770] dhcp4 (enp4s0):   lease time 3600
Oct 26 13:39:08 localhost dbus-daemon[793]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Oct 26 13:39:08 localhost NetworkManager[961]: <info>  [1540550348.0770] dhcp4 (enp4s0):   nameserver '10.0.0.138'
Oct 26 13:39:08 localhost nm-dispatcher: req:1 'dhcp4-change' [enp4s0]: new request (6 scripts)
Oct 26 13:39:08 localhost NetworkManager[961]: <info>  [1540550348.0770] dhcp4 (enp4s0):   domain name 'Home'
Oct 26 13:39:08 localhost nm-dispatcher: req:1 'dhcp4-change' [enp4s0]: start running ordered scripts...
Oct 26 13:39:08 localhost NetworkManager[961]: <info>  [1540550348.0770] dhcp4 (enp4s0): state changed bound -> bound
Oct 26 13:39:08 localhost systemd: Starting Network Manager Script Dispatcher Service...
Oct 26 13:39:08 localhost systemd: Started Network Manager Script Dispatcher Service.
Oct 26 13:39:18 localhost audit: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Oct 26 13:53:10 localhost plasmashell: QXcbConnection: XCB error: 2 (BadValue), sequence: 37067, resource id: 98566161, major code: 141 (Unknown), minor code: 3
Oct 26 13:53:10 localhost plasmashell: QXcbConnection: XCB error: 2 (BadValue), sequence: 37133, resource id: 104857975, major code: 141 (Unknown), minor code: 3
Oct 26 14:01:27 localhost cupsd: REQUEST localhost - - "POST / HTTP/1.1" 200 183 Renew-Subscription successful-ok
Oct 26 14:01:59 localhost plasmashell: QXcbConnection: XCB error: 2 (BadValue), sequence: 40626, resource id: 98566161, major code: 141 (Unknown), minor code: 3
Oct 26 14:01:59 localhost plasmashell: QXcbConnection: XCB error: 2 (BadValue), sequence: 40644, resource id: 102760454, major code: 141 (Unknown), minor code: 3
Oct 26 14:01:59 localhost plasmashell: QXcbConnection: XCB error: 2 (BadValue), sequence: 40660, resource id: 104857975, major code: 141 (Unknown), minor code: 3
Oct 26 14:01:59 localhost plasmashell: QXcbConnection: XCB error: 2 (BadValue), sequence: 40680, resource id: 102760454, major code: 141 (Unknown), minor code: 3
Oct 26 14:02:00 localhost plasmashell: QXcbConnection: XCB error: 2 (BadValue), sequence: 40721, resource id: 104857975, major code: 141 (Unknown), minor code: 3
Oct 26 14:05:04 localhost dhclient[1166]: DHCPREQUEST on enp4s0 to 10.0.0.138 port 67 (xid=0x385ead74)
Oct 26 14:05:04 localhost dhclient[1166]: DHCPACK from 10.0.0.138 (xid=0x385ead74)
Oct 26 14:05:05 localhost NetworkManager[961]: <info>  [1540551905.0447] dhcp4 (enp4s0):   address 10.0.0.1
Oct 26 14:05:05 localhost NetworkManager[961]: <info>  [1540551905.0451] dhcp4 (enp4s0):   plen 24 (255.255.255.0)
Oct 26 14:05:05 localhost audit: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Oct 26 14:05:05 localhost dbus-daemon[793]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.9' (uid=0 pid=961 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")
Oct 26 14:05:05 localhost NetworkManager[961]: <info>  [1540551905.0451] dhcp4 (enp4s0):   gateway 10.0.0.138
Oct 26 14:05:05 localhost dhclient[1166]: bound to 10.0.0.1 -- renewal in 1392 seconds.
Oct 26 14:05:05 localhost NetworkManager[961]: <info>  [1540551905.0451] dhcp4 (enp4s0):   lease time 3600
Oct 26 14:05:05 localhost dbus-daemon[793]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Oct 26 14:05:05 localhost NetworkManager[961]: <info>  [1540551905.0451] dhcp4 (enp4s0):   nameserver '10.0.0.138'
Oct 26 14:05:05 localhost nm-dispatcher: req:1 'dhcp4-change' [enp4s0]: new request (6 scripts)
Oct 26 14:05:05 localhost NetworkManager[961]: <info>  [1540551905.0451] dhcp4 (enp4s0):   domain name 'Home'
Oct 26 14:05:05 localhost nm-dispatcher: req:1 'dhcp4-change' [enp4s0]: start running ordered scripts...
Oct 26 14:05:05 localhost NetworkManager[961]: <info>  [1540551905.0452] dhcp4 (enp4s0): state changed bound -> bound
Oct 26 14:05:05 localhost systemd: Starting Network Manager Script Dispatcher Service...
Oct 26 14:05:05 localhost systemd: Started Network Manager Script Dispatcher Service.
Oct 26 14:05:15 localhost audit: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Oct 26 14:05:30 localhost chronyd[837]: Source 2001:3a0:0:2006::87:123 replaced with 2a02:2a50:6::123

<nul>...<nul> <-- Few hundreds of bytes of binary data

Oct 26 14:05:30 localhost chronyd[837]: Source 2001:3a0:0:2006::87:123 replaced with 2a02:2a50:6::123
Oct 26 14:20:31 localhost kernel: Linux version 4.15.6-200.spi_gpio.fc26.x86_64 (kaza@localhost.localdomain) (gcc version 7.3.1 20180130 (Red Hat 7.3.1-2) (GCC)) #1 SMP Sat Mar 3 02:13:51 IST 2018
Oct 26 14:20:31 localhost kernel: Command line: BOOT_IMAGE=/vmlinuz-4.15.6-200.spi_gpio.fc26.x86_64 root=/dev/mapper/VolGroup-lv_root ro rd.lvm.lv=VolGroup/lv_root rd.lvm.lv=VolGroup/lv_swap rhgb quiet LANG=en_US.UTF-8
Oct 26 14:20:31 localhost kernel: x86/fpu: x87 FPU will use FXSAVE
Oct 26 14:20:31 localhost kernel: e820: BIOS-provided physical RAM map:
Oct 26 14:20:31 localhost kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009bfff] usable
Oct 26 14:20:31 localhost kernel: BIOS-e820: [mem 0x000000000009c000-0x000000000009ffff] reserved
Oct 26 14:20:31 localhost kernel: BIOS-e820: [mem 0x00000000000e4000-0x00000000000fffff] reserved
Oct 26 14:20:31 localhost kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000cffaffff] usable
Oct 26 14:20:31 localhost kernel: BIOS-e820: [mem 0x00000000cffb0000-0x00000000cffbdfff] ACPI data
Oct 26 14:20:31 localhost kernel: BIOS-e820: [mem 0x00000000cffbe000-0x00000000cffdffff] ACPI NVS
Oct 26 14:20:31 localhost kernel: BIOS-e820: [mem 0x00000000cffe0000-0x00000000cfffffff] reserved
Oct 26 14:20:31 localhost kernel: BIOS-e820: [mem 0x00000000fff00000-0x00000000ffffffff] reserved
Oct 26 14:20:31 localhost kernel: BIOS-e820: [mem 0x0000000100000000-0x000000042fffffff] usable

What can you see in the logs that I can't?

TIA,
kaza.

kaza · 10-26-2018, 01:55 PM

Some more info: I made one more experiment, after installing "iotop" and adding its output
to the list of saved data every 10 minutes. I also changed the cartrige.

Again, after over 2 hours the tape stopped. This time I decided to do nothing for more
than 10 minutes hoping the periodic data save will make one more save and I'll see something in the
last saved group of files. After waiting for more than 10 minutes I decided, instead of like
previously dealing with "nedit" of the tar log file to bring to the top the konsole where
I called the "tar" command and pressed "Ctrl+C" - indeed, the tape made few movements and
after some time the prompt returned. I thought: "ok, maybe if the tar process no longer
requests I/O everything will be OK now". However, when I attempted "ls" in the same konsole
it hang and at the same time the windows of "top", "iotop" and "gkrellm" stopped being
updated (they were updating during the over 10 minutes period from the tape stopping).
The unobscured part of the "top" window again showed "wa" over 70%. In the "iotop" window
there was no activity. Also, since the tape stopped the line with the "tar" command didn't
appear in the "iotop" window.

Together with the frozen "top", "iotop" and "gkrellm" the whole GUI froze - again, not even
the Alt+SysRq+R-E-I-S-U-B could get it to reboot and again I had to power the PC down.

Afer rebooting I looked at the last saved files and saw that during the over 10 minues period
from tape stop to the GUI freeze the periodic files save wasn't performed: the last saved
file timestamp was earlier than the write time of the "tar" command log file.

So far the facts, now I'll speculate, correct me if I'm wrong.
It seems to me that at some point the tape makes some I/O request which can't be completed
(70+% of "wa"). Other processes can continue running as long as they don't need to access
the filesystem. That's why the screens of "top", "iotop", "gkrellm" continue to be updated.
But as soon as something needs to access the filesystem (like "ls") this new request is blocked
until the original tape I/O request completes (which never happens) and thus the whole GUI
freezes.

I wonder what would happen if instead of calling the "tar" command directly I'll call it
through "ionice -c 3" ?

It seems very strange to me that the queueing system of the I/O requests can hang if one device
I/O can't be completed, I would expect some "timeout" allowing later requests to be performed.
But I don't have enough knowledge of the LINUX disk I/O system
to answer that question.

Can anyone explain (or point to a knowledge source) how the LINUX disk I/O system queueing works?

TIA,
kaza.

kaza · 10-30-2018, 06:22 AM

I finally caught the error soon enough to save some relevant info and it doesn't look good.
I managed to run "dmesg -c>filename" just as I heared the beep but before
the GUI stuck and here's what I saw later after having to power down the PC and to reboot:

Code:

[ 4109.872050] sky2 0000:04:00.0 enp4s0: Link is down
[ 4127.824514] sky2 0000:04:00.0 enp4s0: Link is up at 100 Mbps, full duplex, flow control rx
[ 4139.334846] sky2 0000:04:00.0 enp4s0: Link is down
[ 4141.794760] sky2 0000:04:00.0 enp4s0: Link is up at 100 Mbps, full duplex, flow control rx
[ 8832.460042] sd 7:0:1:0: [sdb] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 8832.460053] sd 7:0:1:0: [sdb] tag#1 Sense Key : Hardware Error [current] 
[ 8832.460057] sd 7:0:1:0: [sdb] tag#1 Add. Sense: Internal target failure
[ 8832.460061] sd 7:0:1:0: [sdb] tag#1 CDB: Read(16) 88 00 00 00 00 00 5c 58 3d e0 00 00 01 00 00 00
[ 8832.460065] print_req_error: critical target error, dev sdb, sector 1549286880
[ 8832.490713] sd 7:0:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 8832.490719] sd 7:0:1:0: [sdb] tag#0 Sense Key : Hardware Error [current] 
[ 8832.490723] sd 7:0:1:0: [sdb] tag#0 Add. Sense: Internal target failure
[ 8832.490727] sd 7:0:1:0: [sdb] tag#0 CDB: Read(16) 88 00 00 00 00 00 5c 58 3e e0 00 00 01 00 00 00
[ 8832.490730] print_req_error: critical target error, dev sdb, sector 1549287136
[ 8854.477447] sd 7:0:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 8854.477455] sd 7:0:1:0: [sdb] tag#0 Sense Key : Hardware Error [current] 
[ 8854.477459] sd 7:0:1:0: [sdb] tag#0 Add. Sense: Internal target failure
[ 8854.477463] sd 7:0:1:0: [sdb] tag#0 CDB: Read(16) 88 00 00 00 00 00 5c 58 3e 00 00 00 00 08 00 00
[ 8854.477466] print_req_error: critical target error, dev sdb, sector 1549286912
[ 8857.893105] blk_partition_remap: fail for partition 1
[ 8857.893113] blk_partition_remap: fail for partition 1
[ 8858.842183] blk_partition_remap: fail for partition 1
[ 8858.842192] blk_partition_remap: fail for partition 1
[ 8859.722773] scsi 7:1:3:0: Direct-Access     TOSHIBA  MK2001TRKB       0105 PQ: 0 ANSI: 5
[ 8859.724433] scsi 7:1:3:0: Attached scsi generic sg2 type 0
[ 8860.188272] blk_partition_remap: fail for partition 1
[ 8860.188280] blk_partition_remap: fail for partition 1
[ 8861.652640] blk_partition_remap: fail for partition 1
[ 8861.652648] blk_partition_remap: fail for partition 1
[ 8861.652889] blk_partition_remap: fail for partition 1
[ 8861.652894] blk_partition_remap: fail for partition 1
[ 8861.652898] blk_partition_remap: fail for partition 1
[ 8861.652902] blk_partition_remap: fail for partition 1
[ 8861.652905] blk_partition_remap: fail for partition 1
[ 8863.513851] blk_partition_remap: fail for partition 1
[ 8863.513857] blk_partition_remap: fail for partition 1
[ 8863.513859] blk_partition_remap: fail for partition 1
[ 8863.513860] blk_partition_remap: fail for partition 1
[ 8863.513864] Aborting journal on device dm-3-8.
[ 8863.513870] blk_partition_remap: fail for partition 1
[ 8863.513873] Buffer I/O error on dev dm-3, logical block 487096320, lost sync page write
[ 8863.513876] JBD2: Error -5 detected when updating journal superblock for dm-3-8.
[ 8870.622855] blk_partition_remap: fail for partition 1
[ 8870.622865] blk_partition_remap: fail for partition 1
[ 8870.622868] blk_partition_remap: fail for partition 1
[ 8870.622872] blk_partition_remap: fail for partition 1
[ 8870.622876] blk_partition_remap: fail for partition 1
[ 8870.622887] blk_partition_remap: fail for partition 1
[ 8870.622890] Buffer I/O error on dev dm-3, logical block 0, lost sync page write
[ 8870.622897] EXT4-fs error (device dm-3): ext4_journal_check_start:61: Detected aborted journal
[ 8870.622899] EXT4-fs (dm-3): Remounting filesystem read-only
[ 8870.622902] EXT4-fs (dm-3): previous I/O error to superblock detected
[ 8870.622904] blk_partition_remap: fail for partition 1
[ 8870.622905] Buffer I/O error on dev dm-3, logical block 0, lost sync page write
[ 8928.949870] blk_partition_remap: fail for partition 1
[ 8928.949878] blk_partition_remap: fail for partition 1
[ 8928.950020] blk_partition_remap: fail for partition 1
[ 8928.950026] blk_partition_remap: fail for partition 1
[ 8928.950030] blk_partition_remap: fail for partition 1
[ 8928.950033] blk_partition_remap: fail for partition 1
[ 8928.950037] blk_partition_remap: fail for partition 1
[ 8941.084894] blk_partition_remap: fail for partition 1
[ 8941.084914] blk_partition_remap: fail for partition 1
[ 8941.084918] blk_partition_remap: fail for partition 1
[ 8941.084922] blk_partition_remap: fail for partition 1
[ 8941.084926] blk_partition_remap: fail for partition 1
[ 8944.289808] scsi 7:0:1:0: Direct-Access     ICP      KA_RAID_1        V1.0 PQ: 0 ANSI: 2
[ 8944.290018] sd 7:0:1:0: [sdc] Very big device. Trying to use READ CAPACITY(16).
[ 8944.290036] sd 7:0:1:0: [sdc] 7801405440 512-byte logical blocks: (3.99 TB/3.63 TiB)
[ 8944.290047] sd 7:0:1:0: [sdc] Write Protect is off
[ 8944.290049] sd 7:0:1:0: [sdc] Mode Sense: 12 00 10 08
[ 8944.290067] sd 7:0:1:0: [sdc] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 8944.290203] sd 7:0:1:0: Attached scsi generic sg6 type 0
[ 8944.290450] sd 7:0:1:0: [sdc] Very big device. Trying to use READ CAPACITY(16).
[ 8944.325562]  sdc: sdc1
[ 8944.325926] sd 7:0:1:0: [sdc] Very big device. Trying to use READ CAPACITY(16).
[ 8944.326030] sd 7:0:1:0: [sdc] Attached SCSI removable disk
[ 8944.618268] blk_partition_remap: fail for partition 1
[ 8944.618276] blk_partition_remap: fail for partition 1
[ 8944.618406] blk_partition_remap: fail for partition 1
[ 8944.618411] blk_partition_remap: fail for partition 1
[ 8944.618415] blk_partition_remap: fail for partition 1
[ 8944.618418] blk_partition_remap: fail for partition 1
[ 8944.618422] blk_partition_remap: fail for partition 1
[ 8946.304142] blk_partition_remap: fail for partition 1
[ 8946.304151] blk_partition_remap: fail for partition 1
[ 8946.304154] blk_partition_remap: fail for partition 1
[ 8946.304158] blk_partition_remap: fail for partition 1
[ 8946.304161] blk_partition_remap: fail for partition 1
[ 8946.304165] blk_partition_remap: fail for partition 1
[ 8946.304168] blk_partition_remap: fail for partition 1
[ 8946.304177] blk_partition_remap: fail for partition 1
[ 8946.304181] blk_partition_remap: fail for partition 1
[ 8947.897137] blk_partition_remap: fail for partition 1
[ 8947.897159] blk_partition_remap: fail for partition 1
[ 8947.897175] blk_partition_remap: fail for partition 1
[ 8947.897183] blk_partition_remap: fail for partition 1
[ 8947.897401] blk_partition_remap: fail for partition 1

So after searching for the source of the problem at the backup tape the real source of it in the HDD,
a hardware "RAID 0" array of two 2TB SAS disks.
I looked at the log of the "tar" command to see in which directories are the files
with problems, added few directories to the "tar" exclude list and re-run the backup tar command.
So far I'm at the 9-th cartrige and it seems the backup advances normally.
After I'll finish (and verify) the backup I'm planing to run the "fsck /dev/sdb" but I'm worried
with regard of: after I'll "umount" the partition of that disk, will I be able to mount it
again and will most of the files remain? I guess some file (where the damaged sectors reside)
will be damaged but will the rest be OK?

TIA,
kaza.

syg00 · 10-30-2018, 07:39 AM

Well done for sticking with it - and getting that log. Wasn't making a lot of sense prior to that.

fsck fixes filesystems. Often the errors are "transient" in the sense that (temporary) files that were open at the time of failure didn't close, and need to be cleaned up. No harm done.
Sometimes you get cross-linked inodes (for example) that can mean truncated files after "fixing". And sometimes you get a total basket case when the whole lot goes down the toilet. So it can be a bit of a crap-shoot.
Fortunately it's usually benign - but given your history, you may not be that lucky. fsck (for ext4) will attempt to save file fragments in lost+found - which may help you or may just mislead you. In severe cases I just restore and eat any data loss if my backups aren't sufficiently recent.

As a general rule, if lost+found is empty, your data (all of it in that filesystem) is probably safe.

kaza · 10-31-2018, 02:30 PM

Hello syg00, thanks for the reply.

Yesterday I finished the backup and now I'm at the beginning of verify, I guess it'll take few days,
possible up to a week (life...). If the backup will be OK I'll run "fsck" "and may the SHWARTZ be with me" (where I
live there are lots of them). :-)

One thing isn't clear to me: when I'll run the "fsck" (after "umount") should I run it on the physical disk

/dev/sdb

or on it's LVM name:

/dev/mapper/VolGroup_graphics-lv_graphics

?

TIA,
kaza.

syg00 · 10-31-2018, 07:28 PM

Where-ever the filesystem exists - fsck is a filesystem command. So, in this case the lv - /dev/mapper/VolGroup_graphics-lv_graphics

kaza · 11-15-2018, 03:05 PM

Well, it took longer than I thought:

at first, while verifying, on the 8-th (out of 12)
cartriges (each - about 4-5 hours) I've got I/O errors from the tape
(this time - the tape, not the disk). So I replaced the 8-th cartrige
and re-run the backup and the verify and finally, was able to "umount"
the partition in question and to run the "fsck". However, the "fsck"
output looks like it didn't fins any errors:

Code:

<root localhost>.../temp>lsblk
NAME                              MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                 8:0    1 273.2G  0 disk 
├─sda1                              8:1    1     1G  0 part /boot
└─sda2                              8:2    1 262.2G  0 part 
  ├─VolGroup-lv_root              253:0    0  99.4G  0 lvm  /
  ├─VolGroup-lv_swap              253:1    0    32G  0 lvm  [SWAP]
  └─VolGroup-lv_home              253:3    0 130.9G  0 lvm  /home
sdb                                 8:16   1   3.6T  0 disk 
└─sdb1                              8:17   1   3.6T  0 part 
  └─VolGroup_graphics-lv_graphics 253:2    0   3.6T  0 lvm  /home/user/GRAPHICS
sr0                                11:0    1  1024M  0 rom  
<root localhost>.../temp>umount /home/user/GRAPHICS
<root localhost>.../temp>lsblk
NAME                              MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                 8:0    1 273.2G  0 disk 
├─sda1                              8:1    1     1G  0 part /boot
└─sda2                              8:2    1 262.2G  0 part 
  ├─VolGroup-lv_root              253:0    0  99.4G  0 lvm  /
  ├─VolGroup-lv_swap              253:1    0    32G  0 lvm  [SWAP]
  └─VolGroup-lv_home              253:3    0 130.9G  0 lvm  /home
sdb                                 8:16   1   3.6T  0 disk 
└─sdb1                              8:17   1   3.6T  0 part 
  └─VolGroup_graphics-lv_graphics 253:2    0   3.6T  0 lvm  
sr0                                11:0    1  1024M  0 rom  


<root localhost>.../temp>lvdisplay
  --- Logical volume ---
  LV Path                /dev/VolGroup_graphics/lv_graphics
  LV Name                lv_graphics
  VG Name                VolGroup_graphics
  LV UUID                eAorkr-Lo1l-qadG-b0nW-AjnY-RWSi-uPFdf5
  LV Write Access        read/write
  LV Creation host, time localhost.localdomain, 2012-10-19 01:46:42 +0200
  LV Status              available
  # open                 0
  LV Size                3.63 TiB
  Current LE             952319
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2
.
.
.

<root localhost>.../temp>fsck.ext4 /dev/VolGroup_graphics/lv_graphics
e2fsck 1.43.4 (31-Jan-2017)
/dev/VolGroup_graphics/lv_graphics: clean, 17126/243793920 files, 300877458/975174656 blocks

Why is that? I cleary remember (and have logs/dmesg )
that I had 3 disk errors stating that some sectors had a problem
so why the "fsck" doesn't indicate any problem found?

BTW, it took "fsck" about 1 Sec to check 4 TB partition, does this
short time reasonable?

TIA,
kaza.

dc.901 · 11-16-2018, 07:14 AM

If you grep for sdb in you logs; you probably have more hits, so something like:

Code:

grep sdb /var/log/messages*

Is sdb in a software RAID? Can you paste output of "cat /proc/mdstat" and also of "pvdisplay"

kaza · 11-17-2018, 02:59 PM

OK, here's the end of "grep sdb /var/log/messages*", Oct 28 is around when I had the disk errors,
later dates are during the successful backup and the verify:

Code:

/var/log/messages-20181104:Oct 28 21:39:29 localhost kernel: print_req_error: critical target error, dev sdb, sector 7488278528
/var/log/messages-20181104:Oct 28 21:39:29 localhost kernel: sd 7:0:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
/var/log/messages-20181104:Oct 28 21:39:29 localhost kernel: sd 7:0:1:0: [sdb] tag#0 Sense Key : Hardware Error [current] 
/var/log/messages-20181104:Oct 28 21:39:29 localhost kernel: sd 7:0:1:0: [sdb] tag#0 Add. Sense: Internal target failure
/var/log/messages-20181104:Oct 28 21:39:29 localhost kernel: sd 7:0:1:0: [sdb] tag#0 CDB: Read(16) 88 00 00 00 00 01 be 56 10 00 00 00 00 08 00 00
/var/log/messages-20181104:Oct 28 21:39:29 localhost kernel: print_req_error: critical target error, dev sdb, sector 7488278528
/var/log/messages-20181104:Nov  3 08:56:56 localhost kernel: sd 5:0:1:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
/var/log/messages-20181104:Nov  3 08:56:56 localhost kernel: sd 5:0:1:0: [sdb] 7801405440 512-byte logical blocks: (3.99 TB/3.63 TiB)
/var/log/messages-20181104:Nov  3 08:56:56 localhost kernel: sd 5:0:1:0: [sdb] Write Protect is off
/var/log/messages-20181104:Nov  3 08:56:56 localhost kernel: sd 5:0:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
/var/log/messages-20181104:Nov  3 08:56:56 localhost kernel: sd 5:0:1:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
/var/log/messages-20181104:Nov  3 08:56:56 localhost kernel: sdb: sdb1
/var/log/messages-20181104:Nov  3 08:56:56 localhost kernel: sd 5:0:1:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
/var/log/messages-20181104:Nov  3 08:56:56 localhost kernel: sd 5:0:1:0: [sdb] Attached SCSI removable disk
/var/log/messages-20181104:Nov  3 08:57:11 localhost dracut-initqueue: Scanning devices sda2 sdb1  for LVM logical volumes VolGroup/lv_root VolGroup/lv_swap
/var/log/messages-20181104:Nov  3 08:57:19 localhost smartd[777]: Device: /dev/sdb, opened
/var/log/messages-20181104:Nov  3 08:57:19 localhost smartd[777]: Device: /dev/sdb, [ICP      KA_RAID_1        V1.0], S/N: 8435327C, 3.99 TB
/var/log/messages-20181104:Nov  3 08:57:19 localhost smartd[777]: Device: /dev/sdb, is SMART capable. Adding to "monitor" list.
/var/log/messages-20181104:Nov  3 09:25:10 localhost kernel: sd 6:0:1:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
/var/log/messages-20181104:Nov  3 09:25:10 localhost kernel: sd 6:0:1:0: [sdb] 7801405440 512-byte logical blocks: (3.99 TB/3.63 TiB)
/var/log/messages-20181104:Nov  3 09:25:10 localhost kernel: sd 6:0:1:0: [sdb] Write Protect is off
/var/log/messages-20181104:Nov  3 09:25:10 localhost kernel: sd 6:0:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
/var/log/messages-20181104:Nov  3 09:25:10 localhost kernel: sd 6:0:1:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
/var/log/messages-20181104:Nov  3 09:25:11 localhost kernel: sdb: sdb1
/var/log/messages-20181104:Nov  3 09:25:11 localhost kernel: sd 6:0:1:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
/var/log/messages-20181104:Nov  3 09:25:11 localhost kernel: sd 6:0:1:0: [sdb] Attached SCSI removable disk
/var/log/messages-20181104:Nov  3 09:25:25 localhost dracut-initqueue: Scanning devices sda2 sdb1  for LVM logical volumes VolGroup/lv_root VolGroup/lv_swap
/var/log/messages-20181104:Nov  3 09:25:36 localhost smartd[809]: Device: /dev/sdb, opened
/var/log/messages-20181104:Nov  3 09:25:36 localhost smartd[809]: Device: /dev/sdb, [ICP      KA_RAID_1        V1.0], S/N: 8435327C, 3.99 TB
/var/log/messages-20181104:Nov  3 09:25:36 localhost smartd[809]: Device: /dev/sdb, is SMART capable. Adding to "monitor" list.

Code:

<root localhost.localdomain>.../root>cat /proc/mdstat
Personalities : 
unused devices: <none>

[code]
<root localhost.localdomain>.../root>pvdisplay
--- Physical volume ---
PV Name /dev/sdb1
VG Name VolGroup_graphics
PV Size 3.63 TiB / not usable 2.00 MiB
Allocatable yes (but full)
PE Size 4.00 MiB
Total PE 952319
Free PE 0
Allocated PE 952319
PV UUID W71uV5-m17D-qugv-1RRf-ESAn-wexC-O1ZHhE

--- Physical volume ---
PV Name /dev/sda2
VG Name VolGroup
PV Size 262.24 GiB / not usable 23.00 MiB
Allocatable yes (but full)
PE Size 32.00 MiB
Total PE 8391
Free PE 0
Allocated PE 8391
PV UUID bWsE7m-zeBE-5wCi-BoL2-dL1i-0tFg-ClCH0z

[code]

Is there anything suspicious?

TIA,
kaza.