Problem with boot, ends in kernel panic

hazel · 08-25-2019, 11:57 AM

Quote:

Originally Posted by jsbjsb001

If their drive is faulty, then it would have happened anyway - systemd or not. Another idea is that they might be able to look in the /var/log/ folder on the drive in question (if it's still there), and see if there are any logs that could help, like /var/log/messages. Then they could look for any messages from libata in that log.

We already looked for the systemd journal. There wasn't one. In a systemd distro, the other logs get written by plugging syslogd into a socket provided by journald. But it looks as if the panic occurred before anything could be written to disk at all.
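For anyone repeating that check: with journald's default setting (Storage=auto), the journal is only kept on disk if /var/log/journal exists on the installed system; otherwise it lives under /run and is lost at reboot. Here is a minimal sketch of the check from a live system. The function name has_persistent_journal and the mount point /mnt are assumptions for illustration only.

```shell
# Hypothetical helper: succeed only if the installed system mounted at
# ROOT (for example /mnt) has at least one journal file stored on disk.
has_persistent_journal() {
    root=$1
    [ -d "$root/var/log/journal" ] &&
        [ -n "$(find "$root/var/log/journal" -type f -name '*.journal' 2>/dev/null | head -n 1)" ]
}

# Usage, assuming the installed root partition is mounted on /mnt:
#   has_persistent_journal /mnt && journalctl -D /mnt/var/log/journal -e
```

If the function fails, there is simply no on-disk journal to read, whatever the cause of the panic was.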

There is perhaps one thing that could still be done: chroot from the live disc, and use apt to remove the corrupt systemd and install a good one. The running init process will be the systemd from the image, so removing the other one shouldn't cause any problems. What do you think of that?

But if the ultimate problem is a bad drive, then your idea of using smartctl to check it takes precedence over any attempt to fix corrupt files.

jsbjsb001 · 08-25-2019, 12:12 PM

There's still /var/log/boot.log they can look at for any messages from libata - I'm using a "systemd distro" and that file exists on my system. And even when I was using CentOS 7, /var/log/messages as well as /var/log/boot.log still existed.

hazel · 08-25-2019, 12:26 PM

OK, Jake, let's try jsbjsb's suggestion. Boot from your live image and look at the tail end of /var/log/boot.log. I still don't think the messages file will help, but you can try.

@jsbjsb: of course the traditional log files still exist in most (all?) systemd distros. The point I was making is that they now get their data from a journald socket, so if there's nothing in the journal, there'll be nothing in the other files either.

jsbjsb001 · 08-25-2019, 02:27 PM

Quote:

Originally Posted by hazel

...
@jsbjsb: of course the traditional log files still exist in most (all?) systemd distros. The point I was making is that they now get their data from a journald socket, so if there's nothing in the journal, there'll be nothing in the other files either.

They could also grep dmesg on their live system for any messages from libata:

Code:

dmesg | grep -i ata
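
Note that a bare "ata" pattern also matches words such as "data" and "errata", so the output will contain unrelated lines. A narrower filter could look like the sketch below; the function name filter_ata is made up for this example, and the pattern is only a guess at what is useful here.

```shell
# Hypothetical helper: keep only lines that mention libata, the AHCI
# driver, or a numbered ATA port such as ata1 or ata2.00.
filter_ata() {
    grep -iE '(libata|ahci|(^|[^[:alpha:]])ata[0-9]+)'
}

# Usage on a live system:
#   dmesg | filter_ata
```

This trades completeness for less noise: a line that names the drive without a port number would be dropped, so fall back to the broader grep if the result looks too thin.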

JakeJake · 08-26-2019, 02:05 PM

Thanks to you jsbjsb001 and hazel.

Sorry for the late reply.

Quote:

Perhaps it's worth looking at its SMART status with smartctl. Use a "live system" running from a USB stick or similar, then run the following command and post the results using CODE tags:

I ran the command smartctl -a /dev/sdb5, but the answer was "command not found".

Quote:

Another idea is that they might be able to look in the /var/log/ folder on the drive in question (if it's still there), and see if there are any logs that could help, like /var/log/messages. Then they could look for any messages from libata in that log.

I didn't find anything from "libata". Could the following message help to explain the problem?

Code:

ACPI Warning: SystemIO range 0x0000000000000428-0x000000000000042F conflicts with OpRegion 0x0000000000000400-0x000000000000047F (\PMIO) (20160831/utaddress-247)

Or, here are some of the latest messages:

Code:

ACPI Warning: SystemIO range 0x0000000000000540-0x000000000000054F conflicts with OpRegion 0x0000000000000500-0x0000000000000549 (\SBGP) (20160831/utaddress-247)
ACPI Warning: SystemIO range 0x0000000000000540-0x000000000000054F conflicts with OpRegion 0x0000000000000500-0x0000000000000563 (\GPIO) (20160831/utaddress-247)

gnome-session-binary[1105]: GLib-GObject-CRITICAL: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
gnome-session-binary[1105]: GLib-GObject-CRITICAL: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
gnome-session-binary[1105]: GLib-GObject-CRITICAL: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
gnome-session-binary[1105]: GLib-GObject-CRITICAL: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
gnome-session-binary[1105]: GLib-GObject-CRITICAL: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
gnome-session-binary[1105]: GLib-GObject-CRITICAL: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
gnome-session-binary[1105]: GLib-GObject-CRITICAL: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
gnome-session-binary[1105]: GLib-GObject-CRITICAL: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
gnome-session-binary[1105]: GLib-GObject-CRITICAL: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
 /usr/lib/gdm3/gdm-x-session[1095]: (**) Option "fd" "23"
 /usr/lib/gdm3/gdm-x-session[1095]: (**) Option "fd" "26"
 /usr/lib/gdm3/gdm-x-session[1095]: (**) Option "fd" "27"
 /usr/lib/gdm3/gdm-x-session[1095]: (**) Option "fd" "28"
 /usr/lib/gdm3/gdm-x-session[1095]: (**) Option "fd" "29"
 /usr/lib/gdm3/gdm-x-session[1095]: (**) Option "fd" "30"
 /usr/lib/gdm3/gdm-x-session[1095]: (**) Option "fd" "31"
 /usr/lib/gdm3/gdm-x-session[1095]: (**) Option "fd" "32"
 /usr/lib/gdm3/gdm-x-session[1095]: (II) UnloadModule: "libinput"
 /usr/lib/gdm3/gdm-x-session[1095]: (II) systemd-logind: releasing fd for 13:65
 org.a11y.atspi.Registry[905]: XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":1024"
 org.a11y.atspi.Registry[905]:       after 21 requests (21 known processed) with 0 events remaining.
 kernel: [ 6243.247299] traps: gnome-shell[806] trap int3 ip:7f523cf2d261 sp:7ffccac3c1e0 error:0
 kernel: [ 6243.247306]  in libglib-2.0.so.0.5000.3[7f523cedd000+112000]

Quote:

Boot from your live image and look at the tail end of /var/log/boot.log

I didn't find the file boot.log in the directory /var/log/.

Quote:

They could also grep dmesg on their live system for any messages from libata

The output of the command:

Code:

dmesg | grep -i ata
[    0.000000] BIOS-e820: [mem 0x00000000aaffd000-0x00000000aaffffff] ACPI data
[    0.049517] ACPI: SSDT 0x00000000AAF5FA98 0002EF (v01 SataRe SataTabl 00001000 INTL 20091112)
[    0.049613] NODE_DATA(0) allocated [mem 0x24fdf7000-0x24fdfbfff]
[    0.263652] Memory: 8035720K/8297164K available (10252K kernel code, 1243K rwdata, 3184K rodata, 1580K init, 2296K bss, 261444K reserved, 0K cma-reserved)
[    0.311172] core: PEBS disabled due to CPU errata, please upgrade microcode
[    0.317400] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[    0.323770] ACPI: \: GPE=0x1b, EC_CMD/EC_SC=0x66, EC_DATA=0x62
[    0.523743] ACPI: \_SB_.PCI0.LPCB.EC0_: GPE=0x1b, EC_CMD/EC_SC=0x66, EC_DATA=0x62
[    1.811056] Write protecting the kernel read-only data: 16384k
[    2.155205] libata version 3.00 loaded.
[    2.179794] ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 6 ports 6 Gbps 0x3 impl SATA mode
[    2.196462] ata1: SATA max UDMA/133 abar m2048@0xdf006000 port 0xdf006100 irq 31
[    2.196464] ata2: SATA max UDMA/133 abar m2048@0xdf006000 port 0xdf006180 irq 31
[    2.196465] ata3: DUMMY
[    2.196466] ata4: DUMMY
[    2.196467] ata5: DUMMY
[    2.196468] ata6: DUMMY
[    2.509261] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    2.512275] ata2.00: ATAPI: SlimtypeDVD A  DS8A5SH, XAA2, max UDMA/100
[    2.513749] ata2.00: configured for UDMA/100
[    4.607109] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    4.611898] ata1.00: ACPI cmd ef/90:06:00:00:00:00 (SET FEATURES) succeeded
[    4.611906] ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
[    4.611911] ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
[    4.616349] ata1.00: ATA-8: WDC WD7500BPVT-80HXZT3, 01.01A01, max UDMA/133
[    4.616355] ata1.00: 1465149168 sectors, multi 16: LBA48 NCQ (depth 32), AA
[    4.621397] ata1.00: ACPI cmd ef/90:06:00:00:00:00 (SET FEATURES) succeeded
[    4.621404] ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
[    4.621409] ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
[    4.626806] ata1.00: configured for UDMA/133
[    4.627474] scsi 0:0:0:0: Direct-Access     ATA      WDC WD7500BPVT-8 1A01 PQ: 0 ANSI: 5
[ 2330.619972] EXT4-fs (sdb5): mounted filesystem with ordered data mode. Opts: (null)

hazel · 08-26-2019, 02:16 PM

What I like about you is that when you're asked to do something, you do it and post the results. I wish more newbies were like you!

JakeJake · 08-26-2019, 04:33 PM

Quote:

What I like about you is that when you're asked to do something, you do it and post the results. I wish more newbies were like you!

I appreciate your help very much.

Do you think that now is the time to consider reinstalling the operating system?
If the answer is yes, do you have any hints on how to do that in this situation? Is there anything special that I have to take care of?

jsbjsb001 · 08-27-2019, 01:00 AM

Quote:

Originally Posted by JakeJake

Thanks to you jsbjsb001 and hazel.

Sorry for the late reply.

I can't see anything from libata that indicates any problem with your drive.

Quote:

I ran the command smartctl -a /dev/sdb5, but the answer was "command not found".
...

The smartmontools package isn't installed by default in some distributions. But given the situation, the easiest way to run smartctl would be to download a live system that includes that package by default. So have a look at the SystemRescueCd.

Also, you don't need to specify the partition as well to smartctl - you can just specify the node for the drive as a whole, eg. just "sdb", rather than "sdb5".
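
As a rough illustration of going from a partition node to the whole-drive node, here is a small sketch. The function name whole_disk is hypothetical, and it only does string handling on common naming schemes (sdX, nvme, mmcblk); it does not ask the kernel anything.

```shell
# Hypothetical helper: strip the partition suffix from a device node,
# e.g. /dev/sdb5 -> /dev/sdb, /dev/nvme0n1p2 -> /dev/nvme0n1.
whole_disk() {
    case "$1" in
        /dev/nvme*p[0-9]*|/dev/mmcblk*p[0-9]*)
            # These names put a "p" between the disk and the partition number.
            printf '%s\n' "${1%p[0-9]*}" ;;
        /dev/nvme*|/dev/mmcblk*)
            # Already a whole-disk node; its trailing digits are part of the name.
            printf '%s\n' "$1" ;;
        *)
            # sdX-style names: drop any trailing partition digits.
            printf '%s\n' "$1" | sed 's/[0-9]*$//' ;;
    esac
}

# Usage:
#   smartctl -a "$(whole_disk /dev/sdb5)"
```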

hazel · 08-27-2019, 02:58 AM

Quote:

Originally Posted by JakeJake

Do you think that now is the time to consider reinstalling the operating system? If the answer is yes, do you have any hints on how to do that in this situation? Is there anything special that I have to take care of?

Before you do that, I would consider just reinstalling systemd, since that seems to be the program that's crashing. You can't do that in a live system, but you could do it in chroot when booted from your installation image. You would then be using the kernel and systemd from the image, not the ones on your hard drive.
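
Roughly, the chroot route would go like the sketch below. It only prints the commands instead of running them, so nothing is changed until you have checked each line. The defaults /dev/sdb5 and /mnt are assumptions (the root partition may have a different name on your live system), the function name print_reinstall_plan is made up, and apt inside the chroot will also need either a network connection or a cached copy of the package.

```shell
# Hypothetical helper: print (do not run) the steps for reinstalling
# systemd inside a chroot of the installed system.
print_reinstall_plan() {
    dev=${1:-/dev/sdb5}   # assumed root partition of the installed system
    mnt=${2:-/mnt}        # assumed mount point on the live system
    cat <<EOF
mount $dev $mnt
mount --bind /dev $mnt/dev
mount --bind /proc $mnt/proc
mount --bind /sys $mnt/sys
chroot $mnt apt-get install --reinstall systemd
umount $mnt/sys $mnt/proc $mnt/dev $mnt
EOF
}

# Usage: review the output, then run the lines one by one as root.
#   print_reinstall_plan /dev/sdb5 /mnt
```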

But first you can try using smartmontools to test out your drive. That certainly can't do any harm. If the drive really is failing, reinstalling the software won't help.

Reinstalling is easy. You just do the same as when you installed it the first time, then do a major update and restore the saved files from your home partition. But it's kind of like admitting defeat, so let's see if you can find another solution.

JakeJake · 08-27-2019, 03:38 PM

I created the USB stick with SystemRescueCd, and the command "smartctl -a /dev/sda5" gives me this output:

Code:

[root@sysresccd ~]# smartctl -a /dev/sda5
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.19.34-1-lts] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Scorpio Blue Serial ATA (AF)
Device Model:     WDC WD7500BPVT-80HXZT3
Serial Number:    WD-WXL1E61LVCS6
LU WWN Device Id: 5 0014ee 656e1bae3
Firmware Version: 01.01A01
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Tue Aug 27 21:52:24 2019 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (15900) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.

jsbjsb001 · 08-27-2019, 11:34 PM

Well, the good news is that smartctl hasn't seen, and isn't reporting, any failed SMART attributes. But I was hoping that it would give some more information about the drive's attributes, which would give a clearer picture of exactly how healthy the drive is.

While we haven't seen any indication that the drive is failing in any way, I'd want to be as sure as I could be before reinstalling any system on it. You can try running a long test on it with the command below - it should tell you how long you need to wait until it's finished. It might take a little while depending on the drive. Post the results when the test has finished.

Code:

smartctl -t long /dev/sdX

Again, replace "sdX" with the correct node for the drive in question. Then run smartctl -a /dev/sdX (replace "sdX" with correct node once again) to view the results of that long test when it's finished testing the drive.
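
If you only want the verdict of the most recent test rather than the whole report, something like this sketch could be used. It assumes the self-test log layout that smartctl prints with -l selftest, where entry "# 1" is the newest test; the function name latest_selftest_passed is made up for the example.

```shell
# Hypothetical helper: read a smartctl self-test log on stdin and succeed
# only if the newest entry ("# 1") reports "Completed without error".
latest_selftest_passed() {
    grep -E '^# *1 ' | grep -q 'Completed without error'
}

# Usage once the long test has finished (replace sdX as before):
#   smartctl -l selftest /dev/sdX | latest_selftest_passed && echo "last test OK"
```

A failure here can also just mean that the test is still running or was aborted, so look at the full log before drawing conclusions.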

If that doesn't show anything of concern, then I'd agree with Hazel that it would be time to reinstall. And therefore, assuming the drive's OK, it was just some filesystem corruption. But I'd just do a full reinstall of the whole system rather than chrooting into the current install and just reinstalling systemd. But make sure you back up anything that's not part of a default install and that you wish to keep beforehand. The reason I say to just do a full reinstall is that you'll get a clean system, so this will likely avoid having to deal with any complications that might occur with just reinstalling systemd.

Just so you know what I mean about smartctl displaying the SMART attributes, and starting a long test:

Code:

[root@jamespc] ~> smartctl -a /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.18.3-desktop-1omv4000] (OpenMandriva Lx 7.0-1)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue
Device Model:     WDC WD20EZRZ-00Z5HB0
Serial Number:    WD-WCC4M1LY00KA
LU WWN Device Id: 5 0014ee 20ff29446
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Aug 28 13:48:04 2019 ACST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (27720) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 280) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x7035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   169   167   021    Pre-fail  Always       -       4541
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       705
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       4519
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       705
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       2
193 Load_Cycle_Count        0x0032   196   196   000    Old_age   Always       -       14800
194 Temperature_Celsius     0x0022   122   105   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       968         -
# 2  Extended offline    Aborted by host               70%       847         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@jamespc] ~> smartctl -t long /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.18.3-desktop-1omv4000] (OpenMandriva Lx 7.0-1)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 280 minutes for test to complete.
Test will complete after Wed Aug 28 18:33:32 2019

Use smartctl -X to abort test.
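
In the attribute table above, the rows people usually look at first for a failing drive are 5 (Reallocated_Sector_Ct), 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable); a raw value of 0 on all three is what you would hope to see. The sketch below pulls just those rows out of the attribute table. The function name key_smart_attributes is made up, and the choice of these three IDs is a common rule of thumb rather than anything official.

```shell
# Hypothetical helper: read smartctl attribute output on stdin and print
# NAME=RAW_VALUE for attributes 5, 197 and 198 only.
key_smart_attributes() {
    # Attribute rows have 10 columns; the NF check skips other tables
    # (such as the selective self-test spans) that also start with a number.
    awk 'NF >= 10 && ($1 == 5 || $1 == 197 || $1 == 198) { print $2 "=" $NF }'
}

# Usage (replace sdX with the correct node):
#   smartctl -A /dev/sdX | key_smart_attributes
```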

hazel · 08-28-2019, 06:41 AM

Quote:

Originally Posted by jsbjsb001

If that doesn't show anything of concern, then I'd agree with Hazel that it would be time to reinstall. And therefore assuming the drive's ok, it was just some filesystem corruption. But I'd just do a full reinstall of the whole system rather than chrooting into the current install and just reinstalling systemd... The reason I say to just do a full reinstall is because you'll get a clean system, therefore this will likely avoid having to deal with any complications that might occur with just reinstalling systemd.

Yeah, I always like to do things the hard way! Not reinstalling has become a bit of a fetish with me. It seems somehow like giving up, although it's often much quicker than the things I end up doing.

jsbjsb001 · 08-28-2019, 07:14 AM

Quote:

Originally Posted by hazel

Yeah, I always like to do things the hard way! Not reinstalling has become a bit of a fetish with me. It seems somehow like giving up, although it's often much quicker than the things I end up doing.

It can seem like the easy way out, and I guess it probably is a lot of the time. Believe it or not, I tend to "fix" things with my own system before opting for a complete reinstall (or ignore it if it isn't a big deal anyway).

The main reason I suggest it in this case is: who knows what else got corrupted? Particularly if the OP is new to Linux, we probably should make it easy on them.

zeebra · 08-28-2019, 07:56 AM

Personally at this point I'd look into compiling a new Kernel and checking the bootloader setup and verify fdisk -l and fstab etc.

Maybe a bit of an overkill, but at least an option. Or, better said (easiest solution first):

1. I'd try to boot into single user/rescue mode
2. If successful, read logs
3. Check fdisk -l and blkid against fstab and the bootloader configuration
4. Boot another ready kernel if available
5. Compile Kernel from source for testing and rebuilding system
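
For step 3, the comparison can be sketched as below: it lists any UUID that fstab refers to but that blkid does not report. The function name missing_fstab_uuids is made up, and it takes two files (a copy of fstab and saved blkid output) so it can be tried without touching the real system. It only looks at UUID= entries in the first fstab column, not LABEL= or PARTUUID=.

```shell
# Hypothetical helper: print every UUID used in the first column of
# FSTAB_FILE that does not appear as UUID="..." in BLKID_FILE.
missing_fstab_uuids() {
    fstab_file=$1
    blkid_file=$2
    awk '$1 ~ /^UUID=/ { sub(/^UUID=/, "", $1); print $1 }' "$fstab_file" |
    while IFS= read -r uuid; do
        # The leading space keeps PARTUUID="..." from counting as a match.
        grep -qF -- " UUID=\"$uuid\"" "$blkid_file" || printf '%s\n' "$uuid"
    done
}

# Usage from a live system, assuming the installed root is mounted on /mnt:
#   blkid > /tmp/blkid.out
#   missing_fstab_uuids /mnt/etc/fstab /tmp/blkid.out
```

No output means every UUID in fstab matches an existing filesystem.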

I happen to have several available Kernels I can choose from in most cases, so it would be a rather dramatic situation if I'd compile one from source to try that.

hazel · 08-28-2019, 08:00 AM

Quote:

Originally Posted by zeebra

Personally at this point I'd look into compiling a new Kernel and checking the bootloader setup and verify fdisk -l and fstab etc.

You wouldn't normally compile a kernel on Debian. Reinstall one perhaps, but a complete reinstall would include that. The other steps sound reasonable and would only take a couple of minutes.