[SOLVED] Segfaults in multiple programmes

Cág · 02-15-2017, 07:22 AM

I'm running NetBSD but had the same problems on Alpine Linux. I have already posted to several mailing lists but received no answer.

All my GTK+2 apps segfault on keyboard input. Take lxappearance, for example: when looking for a theme you can start pressing keys and it will search, but in my case it dumps core with /usr/lib/libpthread.so.1, /usr/lib/libc.so.12 and /usr/pkg/lib/libXcursor.so.1. The same thing happens when typing something into a GTK+2 text editor such as leafpad, or when searching in the Ctrl+O window in firefox, gimp or any other programme. gimp can't even run inside gdb because of:

Code:

Program received signal SIGTRAP, Trace/breakpoint trap.
0x00007f7fea49f6aa in ___lwp_park60 () from /usr/lib/libc.so.12
(gdb) bt
#0  0x00007f7fea49f6aa in ___lwp_park60 () from /usr/lib/libc.so.12
#1  0x00007f7fec808f2b in pthread_cond_timedwait () from /usr/lib/libpthread.so.1
#2  0x00007f7feb880b80 in g_cond_wait () from /usr/pkg/lib/libglib-2.0.so.0
#3  0x00007f7feb81d7cd in g_async_queue_pop_intern_unlocked () from /usr/pkg/lib/libglib-2.0.so.0
#4  0x00007f7feb86742f in g_thread_pool_thread_proxy () from /usr/pkg/lib/libglib-2.0.so.0
#5  0x00007f7feb866a7d in g_thread_proxy () from /usr/pkg/lib/libglib-2.0.so.0
#6  0x00007f7fec80a9cc in ?? () from /usr/lib/libpthread.so.1
#7  0x00007f7fea483de0 in ?? () from /usr/lib/libc.so.12
#8  0x0000000000000000 in ?? ()

Firefox also has problems in libc.so.12 and libpthread.so.1 but its backtrace doesn't mention ___lwp_park60. It also can't run inside gdb.
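
One rough way to see which libraries a backtrace like the gimp one implicates is to tally the frames by the object file gdb prints after "from". A minimal Python sketch, assuming frame lines shaped like the gdb output in this post; the name frames_by_library and the "(no library reported)" label are made up for illustration:

```python
import re
from collections import Counter

# Matches gdb frame lines such as:
#   #1  0x00007f7fec808f2b in pthread_cond_timedwait () from /usr/lib/libpthread.so.1
FRAME = re.compile(r"^#\d+\s+(?:0x[0-9a-f]+ in )?\S+ \(\)(?: from (\S+))?")

def frames_by_library(backtrace):
    """Count backtrace frames per shared object named after 'from'."""
    counts = Counter()
    for line in backtrace.splitlines():
        match = FRAME.match(line.strip())
        if match:
            # Frames without a 'from' part belong to the executable or are unresolved.
            counts[match.group(1) or "(no library reported)"] += 1
    return counts

# A few frames copied from the gimp backtrace above.
SAMPLE = """\
#0  0x00007f7fea49f6aa in ___lwp_park60 () from /usr/lib/libc.so.12
#1  0x00007f7fec808f2b in pthread_cond_timedwait () from /usr/lib/libpthread.so.1
#2  0x00007f7feb880b80 in g_cond_wait () from /usr/pkg/lib/libglib-2.0.so.0
#8  0x0000000000000000 in ?? ()
"""

for library, count in frames_by_library(SAMPLE).most_common():
    print(count, library)
```

Keep in mind that such a tally only shows where the crash surfaced, not necessarily where the bug is.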

lxappearance also dumps core when clicking Apply after changing something (themes, cursor or icon themes, fonts etc.), with a different backtrace:

Code:

#0  0x00007f7fefcb27ba in ?? () from /usr/lib/libc.so.12
#1  0x00007f7fefcb2bc7 in malloc () from /usr/lib/libc.so.12
#2  0x00007f7ff1849782 in g_malloc () from /usr/pkg/lib/libglib-2.0.so.0
#3  0x00007f7ff185ef1c in g_memdup () from /usr/pkg/lib/libglib-2.0.so.0
#4  0x00007f7ff18356b8 in g_hash_table_insert_node () from /usr/pkg/lib/libglib-2.0.so.0
#5  0x00007f7ff1835823 in g_hash_table_insert_internal () from /usr/pkg/lib/libglib-2.0.so.0
#6  0x00007f7ff183ccb1 in g_key_file_flush_parse_buffer () from /usr/pkg/lib/libglib-2.0.so.0
#7  0x00007f7ff183cf62 in g_key_file_parse_data () from /usr/pkg/lib/libglib-2.0.so.0
#8  0x00007f7ff183d0e1 in g_key_file_load_from_fd () from /usr/pkg/lib/libglib-2.0.so.0
#9  0x00007f7ff183d99e in g_key_file_load_from_file () from /usr/pkg/lib/libglib-2.0.so.0
#10 0x0000000000405532 in _start ()

Apart from these programmes, I receive SIGILL in mplayer when trying to play videos. The backtrace doesn't show anything useful.

sxiv, an image viewer, segfaults with this:

Code:

#0  0x00007f7ff64b209f in ?? () from /usr/lib/libc.so.12
#1  0x00007f7ff64b3983 in free () from /usr/lib/libc.so.12
#2  0x000000000040729c in remove_file ()
#3  0x0000000000409a92 in main ()

Previously it worked if built from a local pkgsrc tree, but now it has stopped working altogether.

mpg321 dumps core and says "Memory fault" with this backtrace:

Code:

#0  0x00007f7ff78068b1 in sem_post () from /usr/lib/libpthread.so.1
#1  0x000000000040afe0 in ?? ()
#2  0x0000000000403695 in ?? ()
#3  0x00007f7ff7ffa000 in ?? ()
#4  0x0000000000000002 in ?? ()
#5  0x00007f7ffffffdb0 in ?? ()
#6  0x00007f7ffffffdb7 in ?? ()
#7  0x0000000000000000 in ?? ()

I ran memtest twice, once for four hours (two passes) and once for eight hours (eight passes). I also ran Dell's ePSA tests (a diagnostic utility accessed from the BIOS), which has its own memtest; all of them returned no errors. I rebuilt gtk2 with debug symbols but it changed nothing.

Thanks everyone for any kind of help.

business_kid · 02-15-2017, 08:21 AM

A real segmentation fault is a memory addressing fault. It dates back to the 8086, which had 20 address lines but only 16-bit registers. How do you put data in the top 4 address bits? The answer was segment registers. From the 80286's protected mode on, addressing outside a valid segment got you a segmentation error. Nowadays, the term covers any invalid memory access.

On 2 separate Operating Systems, we can eliminate software. You're left with (in rough order):
1. Memory errors.
2. Disk errors.
3. Some weird motherboard error. The big ASICs you get today can cause errors that are next to impossible to trace. Heat can also bring them on.
4. Power supply problems.

If you're overclocking, stop. Check all cooling. Run memtest86 overnight. Check the disks with smartmontools as well as the filesystem utilities. Borrow someone's power supply to rule yours out. That eliminates everything except the motherboard.

Cág · 02-15-2017, 11:26 AM

Quote:

Originally Posted by business_kid

On 2 separate Operating Systems, we can eliminate software.

I've just tried Ubuntu with glibc and Void with musl, both from live USB. On Ubuntu all those things work fine. On Void I tried only sxiv and it worked.

Dell's ePSA includes all kinds of tests: keyboard, hard drive, memory, fans, CPU. It didn't return any errors. I may try longer memtests, but I am not yet convinced that these are hardware problems.

pan64 · 02-15-2017, 12:28 PM

If it's not hardware related, then you probably have incompatible libraries.

business_kid · 02-16-2017, 05:57 AM

Ubuntu with glibc and Void with musl seem to eliminate memory, and most of the motherboard. That still leaves the disks, and PSU errors become less likely.

Check the disks. Install something properly. You should not have segfaults.

cynwulf · 02-16-2017, 08:47 AM

It looks like your ports tree is out of sync with the base system (hence the libc related core dumps). As you've provided next to no info about the release version of NetBSD you're running, it's hard to say for sure.

As you've installed binary packages and have also been building via pkgsrc that could also be part of the issue. What repository do you have defined in: /usr/pkg/etc/pkgin/repositories.conf ?
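
Whether a binary repository matches the base system can be checked mechanically. A rough Python sketch, assuming the repository URL ends in .../NetBSD/<arch>/<release>/All and that comparing the major.minor numbers is a good enough test (both are assumptions, and the function names are made up); the base release string would come from uname -r:

```python
import re

def major_minor(release):
    """Extract (major, minor) from a release string such as '7.0.2', or None."""
    match = re.match(r"(\d+)\.(\d+)", release)
    return (int(match.group(1)), int(match.group(2))) if match else None

def repo_release(url):
    """Return the release path component of a repository URL, or None.

    Assumes the layout .../NetBSD/<arch>/<release>/All.
    """
    match = re.search(r"/NetBSD/[^/]+/([^/]+)/All/?$", url)
    return match.group(1) if match else None

def same_release_branch(base_release, repo_url):
    """True when base system and repository agree on major.minor."""
    repo = repo_release(repo_url)
    if repo is None:
        return False
    base_version = major_minor(base_release)
    return base_version is not None and base_version == major_minor(repo)

# Example values: a 7.0.2 base system against a 7.0 repository.
url = "http://cdn.netbsd.org/pub/pkgsrc/packages/NetBSD/$arch/7.0/All"
print(same_release_branch("7.0.2", url))
```

This only compares version strings; it says nothing about packages built locally from a pkgsrc tree of a different vintage.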

sundialsvcs · 02-16-2017, 09:37 AM

Does BSD have anything comparable to Linux's /sbin/ldconfig command? A loader name-cache that must be updated (by re-running this privileged command ...) whenever libraries change?

Basically, it does sound like "incompatible libraries." If one library attempts to call another, and it doesn't actually know what the parameter-list should be, how parameters are to be passed and so-forth, then basically "all hell breaks loose real quick."

For instance, a parameter got added to a function in version 3.x of the library. The function-call pushes three items onto the stack: the called function pops-off four. Not only is its "fourth parameter" garbage, but, "say goodbye to your stack!" You're headed for a hard fall, and probably a totally-useless stack trace.

"Basically, 'the stack got munged.'" And in this case, the information found in a traceback probably is neither meaningful nor correct – because the content and thus the expected structure of the stack was hosed.
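
The three-pushed, four-popped scenario can be mimicked with a toy model. This Python sketch is only a simulation with a list standing in for the stack; the names are hypothetical, and real calling conventions differ (on amd64, for instance, the first integer arguments travel in registers, not on the stack):

```python
def simulate_call(caller_data, args, callee_arity):
    """Toy model of a callee that pops its own arguments off the stack.

    Returns what the callee received and how far the stack depth is off
    from what the caller expects after the call (0 means balanced).
    """
    stack = list(caller_data)      # data the caller already keeps on the stack
    depth_before = len(stack)
    stack.extend(args)             # caller pushes the arguments it knows about
    # The callee pops as many values as *it* believes it was given.
    received = [stack.pop() for _ in range(callee_arity)]
    imbalance = len(stack) - depth_before
    return received, imbalance

# The caller was built against the old three-argument version,
# the library now expects four.
received, imbalance = simulate_call(["saved_register", "local_x"], [1, 2, 3], 4)
print(received)    # the last value popped is the caller's own data, not an argument
print(imbalance)   # non-zero: the caller's view of the stack is now wrong
```

With matching arity the imbalance is zero and nothing of the caller's is consumed, which is the whole point of keeping libraries and their callers in sync.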

cynwulf · 02-16-2017, 10:48 AM

Yes, ldconfig(8) has been around in various BSDs since the early days.

However, NetBSD in particular seems to be moving away from it: https://www.netbsd.org/docs/elf.html

jggimi · 02-16-2017, 12:54 PM

Cág did post a dmesg earlier today at daemonforums, looking for NetBSD-specific help. It's NetBSD 7.0.2, amd64, on an Ivy Bridge CPU.

sundialsvcs · 02-16-2017, 02:25 PM

Quote:

Originally Posted by cynwulf

Yes, ldconfig(8) has been around in various BSDs since the early days.

However, NetBSD in particular seems to be moving away from it: https://www.netbsd.org/docs/elf.html

Thought so.

However, my "gut take" on this particular situation is probably that it is something much more basic, such as quite-literal "library incompatibility," or something that didn't get re-compiled, and so on.

It's "just not nice™" when pieces of computer software can't play well together . . .

Cág · 02-18-2017, 03:33 AM

I ran testdisk and two things caught my attention:

Code:

Warning: Bad starting head (CHS and LBA don't match)
Warning: the current number of heads per cylinder is 16 but the correct value may be 255

Errors from the system log:

Code:

PCH transcoder FIFO underrun /* it has always been on this machine */
ACPI Error: [\_SB_.PCI0.GFX0.DD02._BCL] Namespace lookup failure, AE_NOT_FOUND (20131218/psargs-393)
ACPI Error: Method parse/execution failed [\_SB_.PCI0.PEG0.PEGP.DD02._BCL] (Node 0xfffffe81dd1d2408), AE_NOT_FOUND (20131218/psparse-553)
acpiout1: failed to evaluate \_SB_.PCI0.PEG0.PEGP.DD02._BCL: AE_NOT_FOUND
ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [\_S1_] (20131218/hwxface-646)
ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [\_S2_] (20131218/hwxface-646)
i915drmkms0: interrupting at ioapic0 pin 16 (i915)
drm: GMBUS [i915 gmbus vga] timed out, falling back to bit banging on pin 2
intelfb0 at i915drmkms0
i915drmkms0: info: registered panic notifier
DRM error in radeon_get_bios: Unable to locate a BIOS ROM
error: Fatal error during GPU init
radeon0: unable to attach drm: 22

Graphics work fine, but acpi(4) and apm(8) commands don't exist and I suppose I need a custom kernel (which I will certainly build after solving these issues).

A note about mplayer: as I said on DF, it receives SIGILL when run on its own and SIGSEGV when run inside gdb, and in different spots:

Code:

Program terminated with signal SIGILL, Illegal instruction.
#0  0x00007f7fee102d28 in ?? ()
(gdb) bt
#0  0x00007f7fee102d28 in ?? ()
#1  0x0000000000000000 in ?? ()

and

Code:

Program received signal SIGSEGV, Segmentation fault.
0x00007f7ff2407c9e in ?? ()
(gdb) bt
#0  0x00007f7ff2407c9e in ?? ()
#1  0x0000000000000000 in ?? ()

To make it clear: I am running NetBSD 7.0.2 with the latest stable pkgsrc tree, which is 2016Q4; in both /usr/pkg/etc/pkgin/repositories.conf and PKG_PATH I have

Code:

http://cdn.netbsd.org/pub/pkgsrc/packages/NetBSD/$arch/7.0/All

GLib, GTK+2, Firefox, MPlayer, GIMP and almost all of their dependencies are built locally from the tree.

ldconfig(8) is disabled as advised.

business_kid · 02-19-2017, 04:29 AM

Quote:

Warning: Bad starting head (CHS and LBA don't match)
Warning: the current number of heads per cylinder is 16 but the correct value may be 255

On this: Back in history, MS-DOS was coded to deal with hard disks at a time when a big clunky disk had a tiny capacity; 10MB was good back then. Disks were to have no more than 16 heads, 1024 tracks, and some equally ridiculous number of sectors per track. Since then systems have been lying. There were various maximum limits for disk sizes (512MB, 2 Gig, etc.). Sectors/track and total tracks have expanded beyond all expectations, and various sets of lies were told at various stages to correct this, CHS (Cylinders, Heads, Sectors) and LBA (Logical Block Addressing) being 2 of them.

You should almost certainly be set to 255 heads. There are really only 2, but heads was about the one number with room in it. Be aware, though, that altering the heads setting may break up data on the partition.
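
To see why the same disk can "have" 16 or 255 heads, it helps to look at the standard CHS/LBA conversion, LBA = (C * heads + H) * sectors_per_track + (S - 1). A small Python sketch; the 63 sectors per track and the sample block number are assumptions picked for illustration, not values read from this disk:

```python
def chs_to_lba(cylinder, head, sector, heads, sectors_per_track=63):
    """Convert a CHS address to a logical block address (sectors count from 1)."""
    return (cylinder * heads + head) * sectors_per_track + (sector - 1)

def lba_to_chs(lba, heads, sectors_per_track=63):
    """Convert a logical block address to CHS under a given fake geometry."""
    cylinder, rest = divmod(lba, heads * sectors_per_track)
    head, sector_index = divmod(rest, sectors_per_track)
    return cylinder, head, sector_index + 1

block = 1000000  # arbitrary example block
print(lba_to_chs(block, heads=16))    # (992, 1, 2)
print(lba_to_chs(block, heads=255))   # (62, 63, 2)
```

The block is the same either way; only the made-up geometry used to label it differs. A partition table whose CHS fields were written under one geometry and read under another would show the kind of CHS/LBA mismatch testdisk warns about.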

Graphics is another area where software lies have affected hardware designs, as nobody thought about VGA when the PC was designed. The competition was from CP/M minicomputers on 80x25 consoles, and mainframes on something similar. Consoles were for printing ASCII and were incapable of graphics, although that, like everything else, changed.

My suggestion: Ignore the graphics errors(my system says most of that); Back up, change to 255 heads and see what breaks; there's usually an autodetect in the bios these days; You have an acpi problem. I'd delete and reinstall that. Then see what shows.

Cág · 02-20-2017, 04:59 AM

Quote:

Originally Posted by business_kid

My suggestion: Ignore the graphics errors(my system says most of that); Back up, change to 255 heads and see what breaks; there's usually an autodetect in the bios these days.

Setting 255 heads in testdisk and then writing the MBR doesn't stick: it switches back to 16 after reboot. According to the docs, this is because it is the only operating system on the disk.

Quote:

You have an acpi problem. I'd delete and reinstall that. Then see what shows.

Reinstall what? I tried disabling ACPI but it doesn't change anything.

cynwulf · 02-20-2017, 05:49 AM

I'm not familiar with "testdisk", so I'm not sure whether that warning about the drive geometry is relevant. Can you post your fdisk and disklabel outputs? For disklabel you will need to specify the device node.

jggimi · 02-20-2017, 08:26 AM

Drive geometry should have nothing to do with segfaults. Unless I completely misunderstand.