ICE sockets accumulate then cause failure, possible Slackware fix is proposed

selfprogrammed · 11-02-2023, 07:10 PM

Testing: CloudFlare keeps blocking my attempts to post a new thread.
Got it here, now to put content in by edit. (### CloudFlare blocked the edit)
Going to put little paragraphs in at a time, to get around CloudFlare interference.

---------------------

After a long night of reading ICE library code, I have found the cause of a persistant ICE problem, that has been reported for Ubuntu, OpenBSD, and others, in many help sites, some dated back to 2005. None of the previous postings had found a cause.

*** The problem
Running startx will fail, due to ICE socket failure.
Repeating startx will succeed.
It does not do this every time, but is intermittent.

The message which seems to be relevant is the one about ICE:
"_IceTransSocketUNIXCreateListener: ...SocketCreateListener() failed".

Reference: "https://www.linuxquestions.org/questions/showthread.php?p=6457275#post6457275"

*** The cause
The ICE directory is full of old ICE sockets. They are not being removed upon shutdown.
This may be due to their process being killed by the shutdown procedure.

*** Explanation
I can now see why this fails the way it does, and why repeating the startx works.
The old sockets are NOT BEING REMOVED from "/tmp/.ICE-unix". I see sockets created 6 months ago.

A user logging after a fresh boot will likely have a PID that is the same, day after day.

Starting X (startx) will start Xorg, which will call the ICE library, which will create an ICE socket using the PID of the process as the name. This will be passed to bind to create the socket.
When startx is invoked after a fresh boot, it will have a PID consistently in the same narrow range. This will consistently create sockets in the ICE directory with the same name.

If this hits a stale socket owned by the same user, it probably succeeds (man page does not say).
For users who boot and startx every day, they probably often use a socket with the same PID, every day.

If there have ever been multiple users having used ICE then there can be stale sockets in the ICE directory still owned by other users. If the startx PID happens to hit one of these sockets (named by PID) owned by a different user, then startx will fail (with ICE messages).

This explains why it works when you repeat the startx. That is a new task, which will get the next PID. YOUR PID HAS CHANGED. The new socket name will be different by 1, and will miss the old socket that it hit on the first attempt.

This also explains why the one user, if he did not startx right away, but did some other things first, said that allowed him to startx without errors. By doing that he is getting a different PID from the usual, and thus not hitting the old socket that is in his ICE directory.

*** Who fixes this

The distribution, X, xfce4, all have the opportunity to fix this problem.
** X could make sure to remove the old sockets upon shutdown.
** xfce4 could clear its socket on shutdown, it knows the socket and the PID.
** The distribution could clear this directory upon shutdown, and upon startup, as no sockets will carry over across reboot.

*** Slackware
I have here an attempt to clear the stale sockets from the ICE directory.
This is a first attempt, which can be improved.
It may do better to execute it at boot instead of at shutdown.
I may not have put it in the right place.

Thread is to discuss possible fixes to Slackware, as that can be modified by the people affected most easily.

I have modified my Slackware script /etc/rc.d/rc.K

Code:

--- orig-sw15.0/rc.K	2022-06-18 12:02:59.000000000 -0500
+++ rc.K	2023-10-31 23:46:00.526635586 -0500
@@ -119,8 +119,13 @@
   echo -n "."
 done
 echo

+# Clean up old sockets:
+if [ -x /etc/rc.d/rc.clean_sockets ]; then # quit
+  /etc/rc.d/rc.clean_sockets
+fi
+
 # Now go to the single user level
 echo "Going to single user mode..."
 /sbin/telinit -t 1 1

New rc file, /etc/rc.d/rc.clean_sockets

Code:

#!/bin/sh
#
# /etc/rc.d/rc.clean_sockets:  System socket cleanup.

# The problem is that old sockets accumulate in /tmp/.ICE-unix.
# After boot, a user logging in will be using a low PID and can
# reliably hit a old socket, thus causing startx to fail.

# Remove sockets from /tmp/.ICE-unix for any PID that does not exist.

cwd=$(pwd)
cd /tmp/.ICE-unix
for sock in * ; do
  sn="/tmp/.ICE-unix/$sock"
  # If any process still uses the socket.
  if fuser -s "$sn" ; then
#     echo "$sock fuser IN-USE"
     continue
  fi

  # If the process that created the socket still exists.
  if [ -e "/proc/$sock" ] ; then
#     echo "$sock process still exists"
     continue
  fi

  # Remove the socket
  # THIS IS WHERE CLOUDFLARE INTERFERES, IT WILL NOT ALLOW THIS REMOVE FILE LINE TO BE POSTED.
  # Take the dot out of the "rm" command.
  r.m   $sn

done

cd "$cwd"

selfprogrammed · 11-02-2023, 07:32 PM

CLOUDFLARE INTERFERENCE:
I broke the above posting into 3 to 5 line segments which I edited into the posting individually.
Any posting that included the "rm" command was blocked by CloudFlare.
I got it down to only that one line left, and it blocked every attempt to post that line.

It is obvious that CloudFlare is reading the postings and blocking what it does not like (CENSORING LINUX COMMANDS).
This is a site where scripts and other commands may be posted. How is that script above to be displayed without the command that removes the stale sockets?

astrogeek · 11-02-2023, 07:45 PM

Quote:

Originally Posted by selfprogrammed

CLOUDFLARE INTERFERENCE:
I broke the above posting into 3 to 5 line segments which I edited into the posting individually.
Any posting that included the "rm" command was blocked by CloudFlare.

A definite annoyance...

You may find a work around here, Methods of avoiding Cloudflare blocks when posting code.

For your rm example, try this:

Code:

r[i][/i]m $sn

... which produces...

Code:

rm $sn

... without introducing any extraneous visible or copyable characters.

If you figure out something not already posted in that thread, please add it to the discussion!

rkelsen · 11-02-2023, 07:56 PM

Quote:

Originally Posted by selfprogrammed

*** The cause
The ICE directory is full of old ICE sockets. They are not being removed upon shutdown.
This may be due to their process being killed by the shutdown procedure.

*** Explanation
I can now see why this fails the way it does, and why repeating the startx works.
The old sockets are NOT BEING REMOVED from "/tmp/.ICE-unix". I see sockets created 6 months ago.

A user logging after a fresh boot will likely have a PID that is the same, day after day.

man tmpfs

selfprogrammed · 11-02-2023, 08:27 PM

One possible alternative is to modify startx.
If the DISPLAY indicates that this is the first Xorg, then there should not be any Xorg processes running yet.
It should be safe to remove all sockets in "/tmp/.ICE-unix" at that point.
I suspect that this will work for 90% of the users, but is it safe overall.

Are there any other users of ICE that would be using ICE before Xorg starts ???

This has the advantage of putting the solution to this problem with the Xorg code, which is the entity causing the problem.
Better than burying a solution in the general system startup and shutdown.

--- tmpfs?
The ICE directory is on the hard drive, tmpfs is not involved.
I have no intent of using my precious system memory just to avoid having X clean up after itself.
The solution is to shut those sockets down, or have X clean up after itself.

This discussion is to fix Slackware so this does not get other users. After-the-fact changes that ONLY the unlucky would have to initiate, do not help.
This is very hard to diagnose, and the trigger of having a second user is not obvious, so the unlucky do not know what the problem is, and being so intermittent makes it worse.

---- Report
The script left so many stale sockets that I just went through and removed all that were owned by users other than the one I was using that day.
I can report that my ICE errors have ceased after I cleaned up the ICE directory.

The script needs to be improved, as it leaves many stale sockets. Makes me wonder if the script is even getting run. Or should find a better place to clean that directory.

rkelsen · 11-02-2023, 10:10 PM

If you're the type of user who shuts down their computer every day, then common experience shows that it is best practice to clear out /tmp at shutdown. This will prevent many problems, not only the one you've stumbled upon.

There is one method described in the Slackware docs, here: https://docs.slackware.com/howtos:ge...ree_your_space

Two alternatives are: tmpwatch, or, as previously mentioned: setting up tmpfs (discussed on the same page of the Slackware docs linked to above).

selfprogrammed · 11-03-2023, 04:17 PM

As I said at the beginning, this discussion is about how to fix the Slackware distribution so that this does not occur, for anybody, under any usage.
It has been happening on many distributions, at least since 2005, and needs to be fixed.

Comments that start with "you're the type", really get me mad, so this has allot of angry in it. Most of that is because of the consistent stream of complaint by responders that you have done something different, again. I consistently heard how I am to blame because of using a custom kernel, or not installing every package in the entire Slackware distribution (like why do I need RAID, or disk quotas, or PIM). I have re-read the comment several times, and tried to put different spins upon it.

I have tried to eliminate anything in the following that is not directly argument as to why Slackware needs to be fixed.
I seem to have to repeat the same material every third thread that I create.

Everybody uses their computer differently, and all possible combinations of usage are possible. "Common experience" is a dubious concept.
Slackware, is not limited to just those users who use their computer in one particular way. It is not supposed to break, no matter what order, or when you use it.
Many users of Slackware have chose it because it allows the user to make changes to the normal installation, or have multiple users, or install unusual additional packages that are not limited by a master site, or a host of other things that are different than 80% of the rest. The system should not break for them, and it should never be an excuse that they were different and the system support is only going to be for the "NORMAL" 80%.

It is actually simpler to fix the actual problem, than it is to mess with tmpfs, or other special modifications.

I have managed a corporate computing system, back in the 1970's (VAX VMS). I have enough experience to know how to make my own decisions about tmp and when it needs managing.
If someone actually needed to clear tmp, perhaps to save disk space, that is a separate and distinct problem.
It is not done automatically by Slackware for good reasons, and they can state their own reasoning on that.
It would be a one liner in the startup scripts, and it is not in there, because it is not a safe thing to do for all users. Some other systems do it. Possibly it could be a Slackware option to clear /tmp at shutdown. It would have to be an option that the manager of that system chose to invoke.

Enabling clearing of /tmp (or using tmpfs) would hide problems like this. That is not a good solution for the problem here. It just makes it harder to discover latent bugs, and in no way is a substitute for properly fixing the problem. It is that kind of coverup that actually makes these latent bugs harder to track down. Those who had to disable clearing of /tmp would suddenly experience all the latent bugs that it was covering up.

This problem actually has nothing to do with cleaning tmp.
It has nothing to do with shutting down the computer. It is caused by having multiple accounts, and having them startup a desktop, and X not cleaning up after itself.
The stale sockets would still accumulate, even if the system was left running for months.

The only way to avoid that would be to REQUIRE users to leave their desktop up and logged in all the time. Not all of us can be that security indifferent. I need to secure this computer. It has my personal and company confidential information on it and leaving it running all the time is a security risk. What are you proposing for laptop users to do.

This needs to be fixed in Slackware, startup, shutdown, or both, or in the X-startup, or somewhere else.
The fix I have already proposed is far simpler than messing with startup to get /tmp using tmpfs. That impacts the startup and shutdown at nearly the same places, but tmpfs would be far more intrusive to normal expectations. Then someone else would be blaming those users for being different because they were using tmpfs instead of the normal Slackware install. I have already had enough experience with that myself.

In my personal case, I have a enormous amount of temp data that accumulates in /tmp and it is better for me to leave it alone.
Clearing all of /tmp would force tasks to recreate that data. I wish those programs would have chosen another directory, but that is the one that system uses, and it cannot be changed from one place. It is not a config option.

This fix should have NOTHING to do with my personal case. This discussion is to discover options on how to fix this for everyone who has experienced this bug, however they use their system, or however many times or when they shut it down. We have no idea how many users out there have just gotten used to having to repeat startx several times, because experiencing Linux to be half-broken is the way Linux is, for them.

Windu · 11-03-2023, 04:32 PM

For Slackware-current at least:

Code:

Fri Nov  3 18:38:03 UTC 2023
a/sysvinit-scripts-15.1-noarch-8.txz:  Rebuilt.
  rc.S: clear /tmp/{.ICE-unix,.X11-unix}. Thanks to selfprogrammed.

the3dfxdude · 11-03-2023, 05:56 PM

Quote:

Originally Posted by selfprogrammed

We have no idea how many users out there have just gotten used to having to repeat startx several times, because experiencing Linux to be half-broken is the way Linux is, for them.

Thank you selfprogrammed! Indeed my boys just run startx multiple times. I saw the sockets as well but never made the connection.

selfprogrammed · 11-03-2023, 06:16 PM

For anyone who wishes to debug, test, or reproduce the problem:
It is easier if the system is rebooted, as the bug manifests much more often if a user login, and the start of the desktop, occurs right after boot, because that concentrates the PID collision to a much narrower range of values. Right after boot the bug can be consistently triggered.

There is nothing special about the setup. This only requires two accounts on the system, with both having login and using the desktop soon after a boot.
Later Login and using the desktop is not safe against encountering the bug, but it is harder to encounter a collision with a stale socket when the PID range gets more spread out.
It is possible to get a collision anytime on any system with two accounts that have both used the desktop. It only requires that both users start their desktop with some consistency, so they end up trying to use the same PID value that the other user had used on a different day.

1. Make another user account (user2), with another UID.

2. On any existing system, the ICE directory is likely already populated with user1 stale sockets, with low PID.
In that case you may skip the next two steps.

3. If you want the bug to manifest easily, then reboot the system, so that the PID numbers start at 1 again.
It is not necessary, but it does promote the socket PID collision more reliably.

4. Have user1 start the desktop (startx). Once the socket is created, it is not needed anymore. Leaving it running will not have any effect.
Using startx is not necessary, as any method of starting the desktop will cause ICE sockets to be created by Xorg.

5. Repeating step 2 several times leaves more stale sockets, which makes the bug more likely to happen.
Because the stale sockets persist, this setup will persist for months and does not need to be repeated.
I tried copying a stale socket of user2 to other PID, but that did not work.

6. Reboot the system. This starts the PID back at the state of step 3.

7. Have user2 start the desktop. If done without any optional intervening processes, then this will reliably collide with the stale socket of user1, and startx will fail with an ICE error.

8. If not done perfectly, then try startx several times. Hitting a stale socket depends upon the current PID used for new processes and how many stale sockets have accumulated.

9. If the system is creating PID ("ps -A") that are higher than any stale socket in "/tmp/.ICE-unix", then collisions are no longer possible. You will have to reboot, and possibly login user1 a few more times to populate the ICE directory with stale sockets in the right range.

10. Once you have a stale socket at the "usual" PID value used every day, it can be difficult to miss it, and you will be able to hit it consistently.

selfprogrammed · 11-03-2023, 06:40 PM

Thank you Windu:
Unfortunately I do not actually understand the post, it might be a patch.

The clear of the two directories proposed by Windu probably works too.
I doubt that there would be any active ICE sockets or X11 locks that would exist and need to be preserved, at the time rc.S is run.
I worry about the ability to take the system to multi-user and back down to single-user, and which scripts that might run.
If the clear of tmp should occur any time there was a desktop up, it may crash it when the sockets get removed.

It would take someone more familiar with those rc.S startup and rc.K shutdown scripts than I am, to confirm the safety of exactly where the clear is stuck-in.
I was hoping to involve someone like that with this discussion, and it would save me from having to reverse engineer those scripts.

That is why my clean_sockets script is so involved. It could be run at anytime, such as by a cron job on a weekly basis.
For those who are leaving their system up for days, weeks, an eternity, that may be useful. Such long run-time system may not have many startx problems due to their usually being out into large PID values most of the time. But, they still will populate the ICE directory with stale sockets. Don't know how bad that is, but they could really have a huge number of them, and most all of them would be stale.
However, my script is not removing enough stale sockets, as most of those PID have some kind of process that exists, but has nothing to do with the stale socket itself.
I have not discovered a good script test that would confirm ICE relationships. An active socket does not seem to be open by the owning process that created it.

I am also glad to hear that someone had gotten some benefit from this.
I have patched my system too, but am waiting to see if it is effective enough.

rkelsen · 11-03-2023, 07:01 PM

Quote:

Originally Posted by selfprogrammed

Comments that start with "you're the type", really get me mad, so this has allot of angry in it.

Those are not the words that the comment started with.

Perhaps you should re-read it, because there is a two letter word at the very beginning of the sentence which changes the meaning significantly.

You should not attribute malice where none is intended. We're strangers on the internet. There is nothing personal here.

Quote:

Originally Posted by selfprogrammed

In my personal case, I have a enormous amount of temp data that accumulates in /tmp and it is better for me to leave it alone.
Clearing all of /tmp would force tasks to recreate that data.

According to the FHS: https://refspecs.linuxfoundation.org...s/ch03s18.html

"Programs must not assume that any files or directories in /tmp are preserved between invocations of the program."

Therefore, any software which relies upon the pre-existing contents of /tmp is broken.

I stand by my comment that it is best practice to clear /tmp on shutdown IF you're the type of person who shuts down their computer every day, because the subject of this thread is not the only problem it will prevent.

Quote:

Originally Posted by selfprogrammed

It is not done automatically by Slackware for good reasons, and they can state their own reasoning on that.

As mentioned aboveby Windu, Patrick has now added the automatic deletion of everything under /tmp/.ICE-unix and /tmp/.X11-unix on startup to the scripts in the -current branch.

Windu · 11-04-2023, 04:43 AM

Quote:

Originally Posted by selfprogrammed

Thank you Windu:
Unfortunately I do not actually understand the post, it might be a patch.

I quoted part of the Slackware-current ChangeLog.txt. The meaning of this post was to inform you that as of November 3rd, Slackware-current removes any and all files it finds in /tmp/.ICE-unixand /tmp/.X11-unix every time it boots up.

If you run Slackware 15.0 you'd have to apply that as a patch to your own rc.S script, but I guess you have taken care of things already on your personal system.
The patch: https://git.slackware.nl/current/pat...19969732644584

henca · 11-04-2023, 06:30 AM

Quote:

Originally Posted by Windu

For Slackware-current at least:

Code:

Fri Nov  3 18:38:03 UTC 2023
a/sysvinit-scripts-15.1-noarch-8.txz:  Rebuilt.
  rc.S: clear /tmp/{.ICE-unix,.X11-unix}. Thanks to selfprogrammed.

It is nice to see some kind of fix to this problem, but IMHO it only attempts to hide the real problem in a way which still might fail.

The real problem is that X does not clean up its sockets at exit. Doing a cleanup at boot might help, but a system which is not rebooted very often might wrap around the PIDs several times (if I remember right PID wraps back to 1 when 32768 is reached). As such, left sockets or files owned by another user might block the creation or opening of a file or socket with the same name.

This problem is probably not Slackware specific and should be reported as a bug upstream. So where is upstream? When looking at my system who has the socket opened it seems to be xfce related processes, but also some other things might be related like polkit.

regards Henrik

henca · 11-04-2023, 06:41 AM

Quote:

Originally Posted by selfprogrammed

*** Who fixes this

The distribution, X, xfce4, all have the opportunity to fix this problem.
** X could make sure to remove the old sockets upon shutdown.
** xfce4 could clear its socket on shutdown, it knows the socket and the PID.
** The distribution could clear this directory upon shutdown, and upon startup, as no sockets will carry over across reboot.

I would prefer if this was fixed upstream, but until it has been fixed a workaround in Slackware might be the best choice.

Quote:

Originally Posted by selfprogrammed

This is a first attempt, which can be improved.
It may do better to execute it at boot instead of at shutdown.

Yes, calling the script at boot is better than shutdown as it also will handle left sockets from an unclean shutdown. However, IMHO, the script should be called not only at boot but also from a script in /etc/cron.hourly. Also systems with along uptime might suffer from this problem when the PID number wrap around.

The best solution might be Pats updated rc.S combined with your script placed in /etc/cron.hourly.

regards Henrik