How do you stop a background process?
While I was working on Redirectory last year, I was often running a Conan package server and tcpdump simultaneously while I reverse engineered the server API. (The API is undocumented, and I felt this would be faster than trying to investigate the code.) To help me, I wrote a script that would run them as background processes and then drop me into a shell to run interactive Conan commands. After the subshell exited, it would stop the background processes before continuing. Or rather, it would try to stop them.
The pattern that I expected to work was to send the child processes SIGTERM and then wait for their exit. I would hit a series of wrinkles in this plan.
Killing a child
Wrinkle the first: killing a child process might not kill its children.
The problem child was npm. I would run Redirectory via npm start. npm was my child process, and it would start a grandchild sh process that would start a great-grandchild node process that was the server. Sending SIGTERM to the npm process would terminate it and its child, but not the great-grandchild, which would hold onto the port.
npm would completely ignore SIGINT. I could stop it with SIGKILL, but that would orphan its child.
I suspect that well-behaved programs have a general responsibility to propagate terminating signals to their children, and that npm was misbehaving in this regard, but I do not know for sure. Either way, npm used to propagate these signals, but it no longer did. Comments on that issue suggest that newer versions of npm have been fixed to restore the old behavior, but I have not checked.
The fix I arrived at was to kill the process group of the npm process.
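From a shell, signalling a whole process group means passing the negated group ID to kill. What follows is a minimal sketch of that idea, assuming bash; kill_group is a hypothetical helper name, not something taken from the original script. It refuses to signal the shell's own group, since that would take the caller down too.

```bash
#!/usr/bin/env bash
# Sketch: send a signal to every process in the process group of a given PID.
# kill_group is a hypothetical helper name used only for this illustration.

kill_group() {
  local pid=$1 signal=${2:-TERM}
  local pgid own_pgid

  # Look up the process group ID (PGID) of the target and of this shell.
  pgid=$(ps -o pgid= -p "$pid" 2>/dev/null | tr -d ' ')
  own_pgid=$(ps -o pgid= -p "$$" 2>/dev/null | tr -d ' ')

  # The target may have exited already.
  [ -n "$pgid" ] || return 1

  # Refuse to signal our own group; that would stop this shell as well.
  if [ "$pgid" = "$own_pgid" ]; then
    echo "kill_group: $pid shares this shell's process group" >&2
    return 1
  fi

  # A negative PID argument addresses the whole group;
  # "--" stops kill from parsing it as an option.
  kill -s "$signal" -- "-$pgid"
}
```

After starting npm start as a background job in a shell with job control, calling kill_group with the job's PID should deliver SIGTERM to npm, its sh child, and the node great-grandchild together, provided they all stayed in the group that npm was started in.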
Killing its family
Wrinkle the second: the process tree rooted by a child process might not share exactly one process group.
The problem child this time was sudo, as in sudo tcpdump.
To be fair, it wasn't a problem in practice. sudo would launch its command in a different process group, but it would also relay SIGINT and SIGTERM signals to its children, as I expected a reasonable program to do, so that killing the group of the root process effectively killed its entire tree.
But it could have been a problem in theory. What if it ignored signals like npm? I was searching for a 100% reliable, fool-proof method for terminating all processes spawned directly and indirectly by a command, regardless of the specific command.
I guess I could have looked for a way to find the process groups of every process in the tree rooted by a child process, but for the moment I just stuck with killing the group of the child process.
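For the curious, here is one way that lookup might be sketched, assuming bash and a ps that accepts the POSIX -A and -o options. tree_pgids is a hypothetical helper name. It takes a single snapshot of the process table and prints each distinct process group found at or below a given PID.

```bash
#!/usr/bin/env bash
# Sketch: print the distinct process group IDs in the process tree rooted at a PID.
# tree_pgids is a hypothetical helper name used only for this illustration.

tree_pgids() {
  local root=$1
  # One snapshot of every process: PID, parent PID, process group ID.
  ps -A -o pid= -o ppid= -o pgid= |
    awk -v root="$root" '
      { ppid[$1] = $2; pgid[$1] = $3 }
      END {
        for (pid in ppid) {
          # Walk up the parent chain until we reach the root
          # or run out of known ancestors.
          p = pid
          steps = 0
          while (p != root && (p in ppid) && steps < 1000) {
            p = ppid[p]
            steps++
          }
          if (p == root && (root in pgid)) groups[pgid[pid]] = 1
        }
        for (g in groups) print g
      }' |
    sort -n
}
```

This is still not fool-proof. The snapshot can miss processes that fork right after it is taken, and a descendant whose parent has already exited is reparented, so it no longer appears under the root at all.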
Killing its parents
Wrinkle the third: the child process might be in the same process group as the shell. When I killed a child process group in a terminal, it worked fine, but once I incorporated the technique into a script, it would kill the whole script.
This is how I learned about the monitor (-m) shell option. monitor enables job control. This not only includes the jobs and wait built-ins, but it also puts each child process into its own process group. Interactive shells have monitor enabled by default, but non-interactive shells, like the ones that execute scripts, do not. The fix here was to set the monitor option at the start of the script:
set -o monitor
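The effect is easy to observe by comparing the process group of a background job started with and without the option. This is a small sketch, assuming bash; pgid_of is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: compare the process group of a background job with and without
# the monitor option. pgid_of is a hypothetical helper name.

pgid_of() {
  ps -o pgid= -p "$1" | tr -d ' '
}

shell_pgid=$(pgid_of "$$")

# Without monitor (the default for scripts), the job inherits the shell's group.
set +o monitor
sleep 30 &
plain_job=$!
plain_pgid=$(pgid_of "$plain_job")

# With monitor, the job leads a new group whose ID equals its own PID.
set -o monitor
sleep 30 &
monitored_job=$!
monitored_pgid=$(pgid_of "$monitored_job")

echo "shell:         pid=$$ pgid=$shell_pgid"
echo "plain job:     pid=$plain_job pgid=$plain_pgid"
echo "monitored job: pid=$monitored_job pgid=$monitored_pgid"

# Clean up both jobs individually, by PID rather than by group.
kill "$plain_job" "$monitored_job" || true
wait "$plain_job" "$monitored_job" 2>/dev/null || true
```

The plain job should report the same group as the shell itself, which is why killing that group also kills the script. The monitored job should report a group equal to its own PID.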
Waiting for its children to die
Wrinkle the fourth: a shell cannot wait for its grandchildren, only its children. The wait built-in, when passed an ID for a process that is not a child, will assume that it has already exited unsuccessfully:
If one or more pid operands are specified that represent unknown process IDs, wait shall treat them as if they were known process IDs that exited with exit status 127.
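A short sketch of that behavior, assuming bash: the first wait targets a direct child, and the second targets a grandchild that was started by an intermediate sh which has already exited.

```bash
#!/usr/bin/env bash
# Sketch: wait returns a child's real exit status, but reports 127 for a
# process that is not a child of this shell.

# A direct child: wait blocks until it exits and returns its status.
sh -c 'exit 3' &
child_status=0
wait "$!" || child_status=$?

# A grandchild: the inner sh starts sleep in the background, prints its PID,
# and exits. The redirections keep sleep from holding the pipe open.
grandchild=$(sh -c 'sleep 5 >/dev/null 2>&1 & echo "$!"')
grandchild_status=0
wait "$grandchild" 2>/dev/null || grandchild_status=$?

echo "child status:      $child_status"
echo "grandchild status: $grandchild_status"

# The grandchild is still running at this point; stop it explicitly.
kill "$grandchild" 2>/dev/null || true
```

The second wait returns immediately with 127 even though the sleep process is still alive, so it is no help for blocking until a grandchild exits.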
I came across some ideas to get around this limitation, but I did not investigate any:
- Get a file descriptor for the process with pidfd_open(), and wait for its exit with poll() (waitid() only works on child processes).
- Periodically poll for the existence of the process with ps.
- inotifywait a file descriptor opened by the process, e.g. stdout.
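The second idea is the simplest to sketch in shell. The version below is untested against the original setup and assumes bash and a ps that supports -o stat=. wait_for_exit is a hypothetical helper name, and the one-second interval and default timeout are arbitrary choices.

```bash
#!/usr/bin/env bash
# Sketch: wait for an arbitrary (non-child) process to exit by polling ps.
# wait_for_exit is a hypothetical helper name used only for this illustration.

wait_for_exit() {
  local pid=$1 timeout=${2:-10} waited=0 state

  while :; do
    # Empty output means no process with that PID exists any more.
    state=$(ps -o stat= -p "$pid" 2>/dev/null | tr -d ' ')
    case $state in
      '' | Z*) return 0 ;;  # gone, or a zombie that has already exited
    esac

    # Give up once the timeout (in seconds) has elapsed.
    [ "$waited" -lt "$timeout" ] || return 1

    sleep 1
    waited=$((waited + 1))
  done
}
```

Polling has its own weaknesses: it reacts up to a second late, and if the PID is reused by an unrelated process between polls, the loop will keep waiting on the wrong process.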
Moving on
At the end of the day, the best method I got was to send SIGTERM to the group of the child process, and then wait on the child process, but I'm not confident this will work reliably in all future cases.
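Put together, the pattern might look like the sketch below. It assumes bash, and start_job and stop_jobs are hypothetical helper names rather than the original script. It relies on monitor so that each job's process group ID equals the job's PID.

```bash
#!/usr/bin/env bash
# Sketch of the overall pattern: start background jobs in their own process
# groups, then SIGTERM each group and wait for each job.
# start_job and stop_jobs are hypothetical helper names.

set -o monitor   # each background job gets its own process group

jobs_started=()

start_job() {
  # Run the given command in the background and remember its PID.
  "$@" &
  jobs_started+=("$!")
}

stop_jobs() {
  local pid

  # With monitor on, each job's PGID equals its PID, so the negated PID
  # addresses the job's whole group.
  for pid in "${jobs_started[@]}"; do
    kill -s TERM -- "-$pid" 2>/dev/null || true
  done

  # wait only covers the direct children; anything deeper is not waited for.
  for pid in "${jobs_started[@]}"; do
    wait "$pid" 2>/dev/null || true
  done

  jobs_started=()
}
```

A script would call start_job once per background command, run the interactive subshell, and then call stop_jobs. It inherits every limitation described above: descendants that moved to another process group are not signalled, and only the direct children are waited for.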
How do you do it?
How are we supposed to do it?