[SOLVED] I am having segfaults in hwinfo. I know it worked before.

camorri · 03-19-2020, 06:00 AM

A DuckDuckGo search on 'visorbus' turns up several kernel patches dealing with this function. I don't know whether it's related to this issue or not; most of what is there is over my head.

Since this fails on 14.2 and does not fail on current (at least on my system), I'm thinking this is in fact a kernel bug that is fixed in later kernel releases.

hazel · 03-19-2020, 06:10 AM

Now that's useful. I've just been going through yesterday's /var/log/messages looking for kernel complaints and found these:

Code:

Mar 18 17:47:01 bigboy root: 138 jobs running
Mar 18 17:48:19 bigboy kernel: [27862.622289] hwinfo[6507]: segfault at 4000191 ip 00007f242f3c613e sp 00007ffc9705ee10 error 4 in libhd.so.21.61[7f242f39e000+a1000]
Mar 18 18:05:02 bigboy -- MARK --
Mar 18 18:10:04 bigboy kernel: [29169.809154] hwinfo[7550]: segfault at 6000191 ip 00007f822c90d13e sp 00007ffe732759e0 error 4 in libhd.so.21.59[7f822c8e5000+a1000]
Mar 18 18:25:02 bigboy -- MARK --
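
For anyone who wants to squeeze a bit more out of those two kernel lines, here is a minimal sketch in Python. The function name parse_segfault is made up for illustration and is not part of hwinfo or any kernel tool. It assumes the usual x86 log layout shown above: it subtracts the library's load address (the first number in the square brackets) from the instruction pointer to get the offset of the faulting instruction inside libhd, and it decodes the page-fault error code (bit 0 = page was present, bit 1 = write access, bit 2 = user mode).

```python
import re

# Matches the x86 kernel segfault report as it appears in /var/log/messages.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return details of a kernel segfault line, or None if it is not one."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "program": m["prog"],
        "object": m["obj"],
        "fault_addr": int(m["addr"], 16),
        # Offset of the faulting instruction within the mapped library.
        "offset": ip - base,
        # x86 page-fault error code bits.
        "present": bool(err & 1),
        "write": bool(err & 2),
        "user": bool(err & 4),
    }


if __name__ == "__main__":
    log_lines = [
        "Mar 18 17:48:19 bigboy kernel: [27862.622289] hwinfo[6507]: segfault at 4000191 ip 00007f242f3c613e sp 00007ffc9705ee10 error 4 in libhd.so.21.61[7f242f39e000+a1000]",
        "Mar 18 18:10:04 bigboy kernel: [29169.809154] hwinfo[7550]: segfault at 6000191 ip 00007f822c90d13e sp 00007ffe732759e0 error 4 in libhd.so.21.59[7f822c8e5000+a1000]",
    ]
    for line in log_lines:
        info = parse_segfault(line)
        if info is not None:
            print(info["object"], hex(info["offset"]), hex(info["fault_addr"]), info)
```

On the two lines above, both crashes come out at offset 0x2813e into the library, and error 4 decodes as a user-mode read of a page that is not mapped. If the library was built with debug symbols, that offset could probably be handed to addr2line with -e pointing at the libhd.so file to get a source line, though I have not tried that on these particular builds.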

hazel · 03-19-2020, 07:30 AM

Bingo! I just built a new version of hwinfo-21.59 with that one instruction removed and it works! Now I just have to do the same with hwinfo-21.67.

Ha, ha! Version 21.67 has a further new read function. It's called hd_read_mdio. And when this is left in, the program still segfaults. But when both these functions are prevented from running, the program behaves itself.

I think the next step will be to install the kernel from current, and see if the program behaves better with that.

hazel · 03-20-2020, 05:49 AM

So I installed the Series 5 kernel from Slackware Current and booted from it this morning. I was somewhat disappointed not to see any penguins during the boot. What happened to them?

I have just tested the unpatched hwinfo-21.67 and guess what? Segmentation fault!

I think this really has to be a bug in hwinfo and the kernel is not to blame. I would like to report it but there doesn't seem to be any option on their GitHub page for doing that.

PS: I found a maintainer name and address inside the source tarball, so I've emailed him. I wonder what will happen now.

business_kid · 03-21-2020, 06:10 AM

Yeah - email him.

Diff 21.67 orig with 21.67 modified. It sounds like you don't have huge confidence in your mods, but more that they killed your segfault rather than improved program function. I'd also do some diagnostics, like logging which address(es) it is reading, and whether there is actually any memory there on your box. Reading an address is pretty harmless, and shouldn't segfault; reading a nonexistent or unauthorised address is an attack on the pc's privates, and you could expect trouble.

When you retire for the night, leave memtest running, in case you have an issue somewhere. A segfault these days is any memory error, isn't it?

hazel · 03-21-2020, 06:23 AM

I'm in correspondence with him now and it's very interesting. He gave me some tests to try out and from that, he has found the immediate cause of the crash: the pointer to a crucial structure called hd_data has got changed somehow to another value which doesn't point to anything. Hence the segfault. We still don't know why it happens.

What I don't understand is that, according to the gdb backtrace, the problem starts in a different part of the program from the one I commented out.

I've found a nice gdb manual and I'm studying it. Hopefully I can find a way of using it to narrow down the problem.

PS: I'm attaching a diff between versions 21.58 and 21.59 with a note on the two lines I removed. That's just a call to the new function which follows; obviously it would have been better to fix the function itself.

business_kid · 03-21-2020, 01:09 PM

It's usually the same on these things…

By the time you're getting your head into this and able to say something meaningful, you're elsewhere in the space/time continuum and you lose us ordinary mortals. If you don't need the latest version, why not just use 21.58? Or are you trying to throw the maintainer a bone?

hazel · 03-21-2020, 01:42 PM

Quote:

Originally Posted by business_kid

It's usually the same on these things…

By the time you're getting your head into this and able to say something meaningful, you're elsewhere in the space/time continuum and you lose us ordinary mortals.

I find your hardware-oriented comments equally impenetrable! That's why I haven't marked them as helpful. I know that you are trying to help but I just can't understand that stuff.

Quote:

If you don't need the latest version, why not just use 21.58? Or are you trying to throw the maintainer a bone?

Yes, I suppose so. If I can help clear a bug, I think it's my duty to do so. Isn't that how the Linux community is supposed to work? The maintainer hasn't been able to reproduce my problem on a virtual machine using the hardware diagnostics I sent him, so it looks like I am the only one who can do this. And also it interests me.

business_kid · 03-22-2020, 05:34 AM

Quote:

Originally Posted by hazel

I find your hardware-oriented comments equally impenetrable! That's why I haven't marked them as helpful. I know that you are trying to help but I just can't understand that stuff.

Yes, they probably are. I have had that exact reaction from customers who found themselves paying for something they didn't understand. I would point out that that was why they had hired me, after factory maintenance and electricians had failed. In the end, most of them developed a 'don't want to know' attitude. A few of them could follow it. I found holding something in my hand reduced it to "This" but design issues were a nightmare.

Quote:

[On reverting back to 21.58] Yes, I suppose so. If I can help clear a bug, I think it's my duty to do so. Isn't that how the Linux community is supposed to work?

Indeed.

Quote:

The maintainer hasn't been able to reproduce my problem on a virtual machine using the hardware diagnostics I sent him, so it looks like I am the only one who can do this. And also it interests me.

Well, go for it, then. You could ask him for a patch to dump salient registers to syslog, and that might help. Then he could see what's going on at different times. Because something is changing a value when he doesn't expect it - I don't get the particulars, but I do get that he thinks the program is actually doing one thing, but it's doing another; so the problem is going to be a surprise to both of you.

hazel · 03-22-2020, 06:18 AM

I took your advice and ran memtest overnight. 48 cycles, no errors. So I don't think this is a memory problem. I'm pretty sure it's a bug.

hazel · 03-23-2020, 05:56 AM

I ran a gdb session with a watch on hd_data and sent Steffen the results. I got an email overnight which says:

Quote:

Thanks! I've now an idea what happens. Could you also send me the compiled static hwinfo you used for that debug session?

So I have. The trouble is that when he has found out what went wrong, he probably won't be able to explain it to me in terms that I can understand.

I'm beginning to regret doing that intensive memtest run. It seems to have done something nasty to my machine, because now it won't reboot. It only starts from cold.

business_kid · 03-24-2020, 05:31 AM

Quote:

Originally Posted by hazel

I ran a gdb session with a watch on hd_data and sent Steffen the results. I got an email overnight which says:

So I have. The trouble is that when he has found out what went wrong, he probably won't be able to explain it to me in terms that I can understand.

I'm beginning to regret doing that intensive memtest run. It seems to have done something nasty to my machine, because now it won't reboot. It only starts from cold.

When you don't understand his explanation, I'll have a try.

I don't know what to suggest on the reboot, except to point out you've bigger fish to fry at the moment. If you look at /etc/inittab, you'll see the runlevels laid out, the file is in /etc/rc.d/rc.6 in Slackware, IIRC. If you've seen My Issues, not getting a reboot is small beer. I can't get booted up, and I'm replying on a little rugrat of a RasPi 4 running Raspbian.

hazel · 03-24-2020, 05:43 AM

Quote:

Originally Posted by business_kid

If you look at /etc/inittab, you'll see the runlevels laid out, the file is in /etc/rc.d/rc.6 in Slackware, IIRC.

I wasn't explicit enough. This has nothing to do with Slackware. Slack closes down normally and gives the "Rebooting" message. Then the machine tries to start up again but I don't even get to the bootloader any more. It just freezes at the point where a cold boot does the POST.

If I switch off at the main, then switch on again, it boots. Tiresome, but as you say, we all have more serious problems right now.

business_kid · 03-24-2020, 07:41 AM

That sounds like the lack of a proper reset on the hd or chipset. There is a significant difference between reboot and poweroff, as the physical reset button or poweroff does a full reset via the BIOS, but the reboot skips some of the way in and does a software reset (=goto this address). I can only provide highly speculative hardware guesses, so I imagine it's software. Have you reinstalled grub (the mbr), or tried hibernate? Mind you, I'm in enough trouble myself.

hazel · 03-24-2020, 07:59 AM

Now that I understood! Not that it's very important.

Regarding the hwinfo bug, Steffen found out where it was. See the bug report at https://bugzilla.opensuse.org/show_bug.cgi?id=1167561.