WaitPidDaemonThread() stealing return status and asserting
Hello NSPR devs,
TL;DR: If a program spawns a sub-process with a non-NSPR API (e.g.
system(3)), then WaitPidDaemonThread() will sometimes steal the return code
of the program. This sometimes results in a PR_ASSERT() being triggered.
We've noticed a problem with the way that WaitPidDaemonThread()
(mozilla/nsprpub/pr/src/md/unix/uxproces.c) interacts with other parts of
the program that might bypass PR_CreateProcess() for creating a
sub-process. Since the thread is running and regularly calls 'waitpid(-1,
&status, WNOHANG)', it can sometimes steal the return-code of another
thread who may be calling system(3), pclose(3), waitpid(2), etc.
Furthermore, on a system that spawns a lot of PID's and in a program that
spawns a lot of sub-processes, you can sometimes get a PR_ASSERT()
triggered if WaitPidDaemonThread() ever see's the same PID twice for a
non-NSPR sub-process. (Inside ProcessReapedChildInternal()). The first
time, the non-NSPR sub-process is added to pidTable with state
_PR_PID_REAPED. Nobody every removes it from pidTable. The second time,
since it's found to already be in the table the code asserts because the
state is already _PR_PID_REAPED.
I'm pretty sure this is the cause of the following debian icedove bugs:
In addition, we've recently discovered this bug in our code.
An example test-case that demonstrates this is pasted below.
I'm mulling over potential solutions, including:
1. For each process that NSPR spawns, spawn a corresponding thread that
will call waitpid() with the actual pid as the argument (in lieu of -1).
2. It looks like NSPR is installing a SIGCHLD handler. Using sigaction(3)
we can probably get the PID of the child process... and queue that to the
WaitPidDaemonThread(). It can then check to see if it's one of the
processes that it cares about before calling waitpid() on that process.
3. Force everyone to use NSPR for creating sub-processes (I wouldn't hold
my breath on that one...)
4. Have the WaitPidDaemonThread() iterate over all the process that it
knows about and explicitly call waitpid(pid, &status, WNOHANG) on each of
5. in uxproces.c, it looks like using _PR_NATIVE_THREADS and
_PR_SHARE_CLONES are out-of-date. The code suggests that these
implementations were added for LinuxThreads support in 2005, since
kernel-based threads were unavailable. Since this is no longer true,
(However, this by iteself doesn't resolve the problem because that thread
still calls waitpid(-1, ...). But the code is so old and forgotten, I
figured I would mention it.)
I'm currently working on the last proposal... and plan to upstream the fix.
Thus, let me know what kind of solution you would prefer.