LinuxQuestions.org - [SOLVED] Why do pids jump in a container?

Why do pids jump in a container?

Entering a container (e.g. with `docker run` or `docker exec`) makes the PID of the next process created jump ahead. Why is that?

For example, in the image below the second process in the container (`ps`) is assigned PID 10, not PID 2:
Attachment 33283

According to this answer, Linux allocates PIDs sequentially. Is that not the case?

Let's get this out of the way first:

You're aware that each Docker container creates its own namespace for pids, right?

Quote:

Originally Posted by dugan (Post 6127578)

Let's get this out of the way first:

You're aware that each Docker container creates its own namespace for pids, right?

Yes, I'm aware of namespaces, that's not my question. I'm asking why, when a process joins the container's pid namespace through `docker run` or `docker exec`, the PID of the next process jumps ahead (as shown in the image attached to the question).

If I just use nsenter to enter the container's PID namespace, this jump doesn't appear to happen:
Attachment 33288

I know that in the process of joining a container, runC forks a few times, but I thought that most of that happens in the host PID namespace, and thus these processes shouldn't count in the container's PID namespace. Also, the jump seems to vary from 5 to 9 PIDs ahead, and I'm pretty sure runC is consistent in the number of times it forks to enter a container.

Additionally, what's weird is that it's not the entered process PID that jumps ahead, but the PID of the next process in the container.

If you have any ideas on why this behaviour occurs, please share. Thanks

This is interesting; I've never seen it myself either. But it doesn't seem to be container-related, actually. The exact same thing also happens on the host.
Have you tried it yourself?
What I did was run `ls -l /proc/self` in a loop to see how high the PID goes, and after 32767 it starts again at 300 (in my case).

Code:

while true; do ls -l /proc/self; done
[..]
lrwxrwxrwx 1 root root 0 May 26 12:19 /proc/self -> 32765
lrwxrwxrwx 1 root root 0 May 26 12:19 /proc/self -> 32766
lrwxrwxrwx 1 root root 0 May 26 12:19 /proc/self -> 32767
lrwxrwxrwx 1 root root 0 May 26 12:19 /proc/self -> 300
lrwxrwxrwx 1 root root 0 May 26 12:19 /proc/self -> 301

So the second conclusion that I'd draw is that this is also not related to the pid namespace.

Quote:

Originally Posted by vincix (Post 6127945)

This is interesting, I've never seen it myself before either. But this doesn't seem to be container related, actually. The exact same thing happens also on a host.
So the second conclusion that I'd draw is that this is also not related to the pid namespace.

I think what you're seeing is PIDs wrapping around once they reach pid_max. If you run `$ cat /proc/sys/kernel/pid_max`, you'll probably see 32768 (the file holds one more than the highest PID, which matches the 32767 in your output). The documentation specifies that this will happen.
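To make the wrap-around concrete, here is a rough sketch of it as a toy model. The `next_pid` function is hypothetical and only does the arithmetic; the restart point of 300 is an assumption taken from the loop output above, and a real kernel also skips IDs that are still in use.

```shell
#!/bin/sh
# Toy model of sequential PID allocation with wrap-around.
# next_pid LAST PID_MAX FLOOR prints the ID that would follow LAST,
# ignoring IDs that are still in use (a real kernel skips those).
next_pid() {
    last=$1
    pid_max=$2
    floor=$3   # assumed restart point after a wrap; 300 matches the loop output above
    next=$((last + 1))
    if [ "$next" -ge "$pid_max" ]; then
        next=$floor
    fi
    echo "$next"
}

# pid_max is one greater than the highest PID the kernel will hand out.
if [ -r /proc/sys/kernel/pid_max ]; then
    echo "pid_max on this machine: $(cat /proc/sys/kernel/pid_max)"
fi

next_pid 32766 32768 300   # prints 32767
next_pid 32767 32768 300   # prints 300
```

On a machine with a larger pid_max the same model applies; only the point at which the counter wraps changes.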

I believe the kernel also spawns processes in the initial PID namespace from time to time, so your loop will also occasionally see a jump of a few PIDs. Edit: The host also has a lot of services running that might spawn processes as well. I do think what I'm seeing is related to containers, and specifically to how runC enters a container's PID namespace, but I may be wrong.
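If anyone wants to measure the jumps on a host rather than eyeball them, something like the following might help. It is only a sketch: `sample_pids` and `pid_gaps` are names I made up, and the gaps you see will depend entirely on what else is running on the machine at the time.

```shell
#!/bin/sh
# sample_pids N: start N short-lived children one after another and print
# the PID each one was given.
sample_pids() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        sh -c 'echo $$'
        i=$((i + 1))
    done
}

# pid_gaps: read one PID per line on stdin and print the difference between
# each PID and the one before it. A gap of 1 means nothing else was
# allocated an ID in between; a negative gap means the counter wrapped.
pid_gaps() {
    awk 'NR > 1 { print $1 - prev } { prev = $1 }'
}

sample_pids 10 | pid_gaps
```

On a quiet machine most gaps should be 1, with an occasional larger value when some other process or thread was created in between.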

I honestly can't see any difference, but if there is, I'd love it if someone explained it to me:
Again, directly on the host:

Code:

[root@macroscian ~]# ps aux | grep "ps aux"
root    15060  0.0  0.0 155476  1820 pts/0    R+  21:47  0:00 ps aux
root    15061  0.0  0.0 112812  940 pts/0    S+  21:47  0:00 grep --color=auto ps aux
[root@macroscian ~]# ps aux | grep "ps aux"
root    15068  0.0  0.0 155476  1812 pts/0    R+  21:47  0:00 ps aux
root    15069  0.0  0.0 112812  936 pts/0    S+  21:47  0:00 grep --color=auto ps aux
[root@macroscian ~]# ps aux | grep "ps aux"
root    15070  0.0  0.0 155476  1816 pts/0    R+  21:47  0:00 ps aux
root    15071  0.0  0.0 112812  940 pts/0    S+  21:47  0:00 grep --color=auto ps aux
[root@macroscian ~]# ps aux | grep "ps aux"
root    15072  0.0  0.0 155476  1816 pts/0    R+  21:47  0:00 ps aux
root    15073  0.0  0.0 112812  940 pts/0    S+  21:47  0:00 grep --color=auto ps aux
[root@macroscian ~]# ls -l /proc/self
lrwxrwxrwx 1 root root 0 May 16 11:41 /proc/self -> 15124
[root@macroscian ~]# ps aux | grep "ps aux"
root    15126  0.0  0.0 155476  1816 pts/0    R+  21:48  0:00 ps aux
root    15127  0.0  0.0 112812  940 pts/0    S+  21:48  0:00 grep --color=auto ps aux
[root@macroscian ~]# ls -l /proc/self
lrwxrwxrwx 1 root root 0 May 16 11:41 /proc/self -> 15128
[root@macroscian ~]# ps aux | grep "ps aux"
root    15134  0.0  0.0 155476  1816 pts/0    R+  21:48  0:00 ps aux
root    15135  0.0  0.0 112812  936 pts/0    S+  21:48  0:00 grep --color=auto ps aux
[root@macroscian ~]# ps aux | grep "ps aux"
root    15137  0.0  0.0 155476  1816 pts/0    R+  21:48  0:00 ps aux
root    15138  0.0  0.0 112812  936 pts/0    S+  21:48  0:00 grep --color=auto ps aux
[root@macroscian ~]# ps aux | grep "ps aux"
root    15139  0.0  0.0 155476  1816 pts/0    R+  21:48  0:00 ps aux
root    15140  0.0  0.0 112812  936 pts/0    S+  21:48  0:00 grep --color=auto ps aux
[root@macroscian ~]# ps aux | grep "ps aux"
root    15141  0.0  0.0 155476  1816 pts/0    R+  21:48  0:00 ps aux
root    15142  0.0  0.0 112812  940 pts/0    S+  21:48  0:00 grep --color=auto ps aux
[root@macroscian ~]# ps aux | grep "ps aux"
root    15144  0.0  0.0 155476  1816 pts/0    R+  21:48  0:00 ps aux
root    15145  0.0  0.0 112812  936 pts/0    S+  21:48  0:00 grep --color=auto ps aux
[root@macroscian ~]# ls -l /proc/self
lrwxrwxrwx 1 root root 0 May 16 11:41 /proc/self -> 15147

How is this behaving differently from your container example?

OK, so you mean the difference is that the PID jumps ahead by several numbers at once, not merely that it keeps increasing as commands are run one after another.
You're right, it's more complicated than it initially seems to be :)

Talked with one of runc's maintainers, Aleksa Sarai, and he explained why this is happening.

By design, the golang runtime spawns several threads to manage a process. runc is written in golang, and when building or execing into the container, there is a short time during which the runc process is running inside the container (before execing the user-requested executable, e.g. bash in `docker exec bash`). In Linux, threads and processes are both identified with IDs from the same pool, so the Go runtime threads are counted in the container's pid namespace, leading to the pid jump I described.
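For anyone who wants to see the shared ID pool for themselves, here is a small sketch. The helper names `list_tids` and `ids_skipped` are my own; the first relies on the Linux `/proc/PID/task` directory, where each entry is a thread ID, and the second just does the arithmetic for two PIDs observed back to back.

```shell
#!/bin/sh
# list_tids PID: print the thread IDs of a process, one per line.
# On Linux each entry under /proc/PID/task is one thread, and the main
# thread's ID equals the PID. Prints nothing if the process is gone.
list_tids() {
    ls "/proc/$1/task" 2>/dev/null || true
}

# ids_skipped PREV NEXT: how many IDs were handed out between two PIDs
# observed back to back (assumes no wrap-around in between).
ids_skipped() {
    echo $(( $2 - $1 - 1 ))
}

# This shell is normally single-threaded, so it shows one ID: its own PID.
list_tids "$$"

# The example from the question: PID 1, then `ps` as PID 10.
ids_skipped 1 10   # prints 8
```

Pointing `list_tids` at the PID of a multi-threaded program (a Go binary, for instance) should list several IDs, none of which can then be handed to a new process while those threads are alive.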

Simply treat all pids ... in every context ... as being "opaque handles." (A common term is "nonce.") Their values are unpredictable and don't mean anything. Neither do they "point to" anything. Take the value that you are given but don't look closely at it. You have no idea what the next one might be. Use it only for its intended purpose – as a "primary key." The value is entirely arbitrary and contains no embedded information. The entire notion of "n+1" is entirely meaningless.
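In shell terms, that advice might look like the sketch below: store the PID you are handed in `$!` and use that stored value, instead of inferring it from an earlier PID. The `sleep 30` worker and the `worker_pid` variable are placeholders for illustration.

```shell
#!/bin/sh
# Start a background job and keep the PID the shell hands back in $!,
# rather than guessing it from the PID of an earlier process.
sleep 30 &
worker_pid=$!

# kill -0 sends no signal; it only checks that the PID can be signalled.
if kill -0 "$worker_pid" 2>/dev/null; then
    echo "worker is running as PID $worker_pid"
fi

# Clean up using the same stored value.
kill "$worker_pid"
wait "$worker_pid" 2>/dev/null || true
```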

P.S.: These days, many handle-values are now purposely "unpredictable," specifically so that rogue software has a much more difficult time exploiting them.