[SOLVED] I am having segfaults in hwinfo. I know it worked before.
A DuckDuckGo search on 'visorbus' turns up several kernel patches dealing with this function. I don't know if it's related to this issue or not; most of what is there is over my head.
Since this fails on 14.2 and does not fail on current (at least on my system), I'm thinking this is in fact a kernel bug that is fixed in later kernel releases.
Now that's useful. I've just been going through yesterday's /var/log/messages looking for kernel complaints and found these:
Code:
Mar 18 17:47:01 bigboy root: 138 jobs running
Mar 18 17:48:19 bigboy kernel: [27862.622289] hwinfo[6507]: segfault at 4000191 ip 00007f242f3c613e sp 00007ffc9705ee10 error 4 in libhd.so.21.61[7f242f39e000+a1000]
Mar 18 18:05:02 bigboy -- MARK --
Mar 18 18:10:04 bigboy kernel: [29169.809154] hwinfo[7550]: segfault at 6000191 ip 00007f822c90d13e sp 00007ffe732759e0 error 4 in libhd.so.21.59[7f822c8e5000+a1000]
Mar 18 18:25:02 bigboy -- MARK --
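For anyone who wants to pull these kernel lines apart, here is a minimal sketch in Python. The function names (parse_segfault, describe_error) are made up for illustration, and the regular expression only assumes the line layout seen in the log above. It subtracts the library's load address from the instruction pointer, which gives an offset into libhd that does not change between runs, and decodes the x86 page-fault error bits (1 = protection violation rather than missing page, 2 = write, 4 = user mode).

```python
import re

# Matches the kernel segfault line in the layout seen in the log above.
SEGFAULT_RE = re.compile(
    r"(?P<prog>[^\s\[]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def describe_error(code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "present": bool(code & 1),  # False: the page was not mapped at all
        "write": bool(code & 2),    # False: the access was a read
        "user": bool(code & 4),     # True: the fault happened in user mode
    }


def parse_segfault(line):
    """Return the interesting fields of a segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "program": m.group("prog"),
        "library": m.group("lib"),
        "fault_address": int(m.group("addr"), 16),
        "offset": ip - base,  # position of the faulting instruction in the library
        "error": describe_error(int(m.group("err"), 16)),
    }


line = ("Mar 18 17:48:19 bigboy kernel: [27862.622289] hwinfo[6507]: "
        "segfault at 4000191 ip 00007f242f3c613e sp 00007ffc9705ee10 "
        "error 4 in libhd.so.21.61[7f242f39e000+a1000]")
info = parse_segfault(line)
print(info["library"], hex(info["offset"]), info["error"])
```

Both lines in the log come out at offset 0x2813e, and "error 4" decodes as a user-mode read of a page that is not mapped. That offset can be handed to a tool such as addr2line or gdb against a libhd built with debug symbols to find the source line, though I haven't tried that here.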
Bingo! I just built a new version of hwinfo-21.59 with that one instruction removed and it works! Now I just have to do the same with hwinfo-21.67.
Ha, ha! Version 21.67 has a further new read function. It's called hd_read_mdio. And when this is left in, the program still segfaults. But when both these functions are prevented from running, the program behaves itself.
I think the next step will be to install the kernel from current, and see if the program behaves better with that.
Last edited by hazel; 03-19-2020 at 10:31 AM.
Reason: Reported result for 21.67
So I installed the Series 5 kernel from Slackware Current and booted from it this morning. I was somewhat disappointed not to see any penguins during the boot. What happened to them?
I have just tested the unpatched hwinfo-21.67 and guess what? Segmentation fault!
I think this really has to be a bug in hwinfo and the kernel is not to blame. I would like to report it but there doesn't seem to be any option on their github page for doing that.
PS: I found a maintainer name and address inside the source tarball, so I've emailed him. I wonder what will happen now.
Last edited by hazel; 03-20-2020 at 06:26 AM.
Reason: Added postscript
Diff 21.67 orig with 21.67 modified. It sounds like you don't have huge confidence in your mods, but more that they killed your segfault rather than improved program function. I'd also do some diagnostics, like logging which address(es) it is reading, and whether there is actually any memory there on your box. Reading an address is pretty harmless, and shouldn't segfault; reading a nonexistent or unauthorised address is an attack on the pc's privates, and you could expect trouble.
When you retire for the night, leave memtest running, in case you have an issue somewhere. A segfault these days is any memory error, isn't it?
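As a rough illustration of the "is there actually any memory there" check, here is a small Python sketch that looks an address up in text laid out like /proc/<pid>/maps. The helper names (parse_maps, find_region) are invented for this example. The sample mapping is hypothetical: its address range is taken from the libhd line in the kernel log earlier in the thread, but the permissions, device, inode and path are assumptions.

```python
def parse_maps(text):
    """Parse /proc/<pid>/maps-style text into (start, end, perms, path) tuples."""
    regions = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start_s, end_s = fields[0].split("-")
        path = fields[5].strip() if len(fields) > 5 else ""
        regions.append((int(start_s, 16), int(end_s, 16), fields[1], path))
    return regions


def find_region(regions, addr):
    """Return the region containing addr, or None if the address is unmapped."""
    for region in regions:
        start, end = region[0], region[1]
        if start <= addr < end:
            return region
    return None


# Hypothetical mapping; only the address range comes from the kernel log.
SAMPLE_MAPS = (
    "7f242f39e000-7f242f43f000 r-xp 00000000 08:01 1234 "
    "/usr/lib64/libhd.so.21.61\n"
)

regions = parse_maps(SAMPLE_MAPS)
print(find_region(regions, 0x7f242f3c613e))  # the ip from the log: inside libhd
print(find_region(regions, 0x4000191))       # the faulting address: None here
```

On a live process the same text can be read from /proc/<pid>/maps, and gdb's "info proc mappings" prints similar information for the program being debugged.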
I'm in correspondence with him now and it's very interesting. He gave me some tests to try out and from that, he has found the immediate cause of the crash: the pointer to a crucial structure called hd_data has got changed somehow to another value which doesn't point to anything. Hence the segfault. We still don't know why it happens.
What I don't understand is that, according to the gdb backtrace, the problem starts in a different part of the program from the one I commented out.
I've found a nice gdb manual and I'm studying it. Hopefully I can find a way of using it to narrow down the problem.
PS: I'm attaching a diff between versions 21.58 and 21.59 with a note on the two lines I removed. That's just a call to the new function which follows; obviously it would have been better to fix the function itself.
Last edited by hazel; 03-21-2020 at 06:51 AM.
Reason: PS added
By the time you're getting your head into this and able to say something meaningful, you're elsewhere in the space/time continuum and you lose us ordinary mortals. If you don't need the latest version, why not just use 21.58? Or are you trying to throw the maintainer a bone?
Quote:
By the time you're getting your head into this and able to say something meaningful, you're elsewhere in the space/time continuum and you lose us ordinary mortals.
I find your hardware-oriented comments equally impenetrable! That's why I haven't marked them as helpful. I know that you are trying to help but I just can't understand that stuff.
Quote:
If you don't need the latest version, why not just use 21.58? Or are you trying to throw the maintainer a bone?
Yes, I suppose so. If I can help clear a bug, I think it's my duty to do so. Isn't that how the Linux community is supposed to work? The maintainer hasn't been able to reproduce my problem on a virtual machine using the hardware diagnostics I sent him, so it looks like I am the only one who can do this. And also it interests me.
Quote:
I find your hardware-oriented comments equally impenetrable! That's why I haven't marked them as helpful. I know that you are trying to help but I just can't understand that stuff.
Yes, they probably are. I have had that exact reaction from customers who found themselves paying for something they didn't understand. I would point out that that was why they had hired me, after factory maintenance and electricians had failed. In the end, most of them developed a 'don't want to know' attitude. A few of them could follow it. I found holding something in my hand reduced it to "This" but design issues were a nightmare.
Quote:
[On reverting back to 21.58] Yes, I suppose so. If I can help clear a bug, I think it's my duty to do so. Isn't that how the Linux community is supposed to work?
Indeed.
Quote:
The maintainer hasn't been able to reproduce my problem on a virtual machine using the hardware diagnostics I sent him, so it looks like I am the only one who can do this. And also it interests me.
Well, go for it, then. You could ask him for a patch to dump salient registers to syslog, and that might help. Then he could see what's going on at different times. Because something is changing a value when he doesn't expect it - I don't get the particulars, but I do get that he thinks the program is actually doing one thing, but it's doing another; so the problem is going to be a surprise to both of you.
I ran a gdb session with a watch on hd_data and sent Steffen the results. I got an email overnight which says:
Quote:
Thanks! I've now an idea what happens. Could you also send me the compiled static hwinfo you used for that debug session?
So I have. The trouble is that when he has found out what went wrong, he probably won't be able to explain it to me in terms that I can understand.
I'm beginning to regret doing that intensive memtest run. It seems to have done something nasty to my machine, because now it won't reboot. It only starts from cold.
When you don't understand his explanation, I'll have a try. I don't know what to suggest on the reboot, except to point out you've bigger fish to fry at the moment. If you look at /etc/inittab, you'll see the runlevels laid out, the file is in /etc/rc.d/rc.6 in Slackware, IIRC. If you've seen My Issues, not getting a reboot is small beer. I can't get booted up, and I'm replying on a little rugrat of a RasPi 4 running Raspbian.
Quote:
If you look at /etc/inittab, you'll see the runlevels laid out, the file is in /etc/rc.d/rc.6 in Slackware, IIRC.
I wasn't explicit enough. This has nothing to do with Slackware. Slack closes down normally and gives the "Rebooting" message. Then the machine tries to start up again but I don't even get to the bootloader any more. It just freezes at the point where a cold boot does the POST.
If I switch off at the main, then switch on again, it boots. Tiresome, but as you say, we all have more serious problems right now.
That sounds like the lack of a proper reset on the hd or chipset. There is a significant difference between reboot and poweroff, as the physical reset button or poweroff does a full reset via the BIOS, but the reboot skips some of the way in and does a software reset (=goto this address). I can only provide highly speculative hardware guesses, so I imagine it's software. Have you reinstalled grub (the mbr), or tried hibernate? Mind you, I'm in enough trouble myself.