Tuesday, April 24, 2007

$SIG{CHLD} REAPER, system() or backticks concurrency

Why does a system() or backticks command complete successfully but return -1 and "No child processes"?

Typically system() or backticks will wait() for their process to finish, and return the status directly or via $?.

Since our perlipc REAPER() function performs a non-blocking waitpid() on every finished child, it is possible for some other child process AND the system() or backtick process to finish at about the same time, while the reaper is in its "while (($child = waitpid(-1,WNOHANG)) > 0)" loop. The reaper will then reap the PID of the other finished child as well as that of the finished system() or backtick operation. The result is no longer available to the system() or backtick call in its usual fashion, so the call reports -1, and $! reports "No child processes".
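One way to spot this condition in a running program is to look for the -1 / "No child processes" pair right after the call. The following is only a minimal sketch: run_and_classify and its return labels are hypothetical names, and ECHILD is the errno constant from the standard Errno module.

```perl
use strict;
use warnings;
use Errno qw(ECHILD);

# Hypothetical helper: run a command and say whether its exit status
# was lost because something else had already waited on the child.
sub run_and_classify {
    my @cmd   = @_;
    my $rc    = system(@cmd);
    my $errno = $! + 0;          # copy errno before anything can reset it

    if ($rc == -1) {
        # ECHILD ("No child processes") means the wait itself failed;
        # any other errno means the command could not be started.
        return $errno == ECHILD ? 'status-lost' : 'could-not-run';
    }
    return $rc == 0 ? 'ok' : 'failed';
}

print run_and_classify($^X, '-e', 'exit 0'), "\n";
```

A 'status-lost' result says nothing about whether the command itself succeeded, only that its status went to another waiter.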

This should not occur on a single-processor machine, since there is no possibility of two processes actually finishing at the same instant.

The solution: turn off $SIG{CHLD} handling, and with it the REAPER() function, for the block of code around your system() or backtick call by setting $SIG{CHLD}='DEFAULT'. However, DON'T use local() to do it. See my other post.
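A minimal sketch of that approach, assuming a perlipc-style REAPER is installed. The helper name system_without_reaper is hypothetical, and calling the saved handler once afterwards is my own addition, meant to pick up children that exited while the handler was off.

```perl
use strict;
use warnings;
use POSIX ":sys_wait_h";

my %Kid_Status;    # PID => wait status, as in the perlipc example

sub REAPER {
    while ((my $child = waitpid(-1, WNOHANG)) > 0) {
        $Kid_Status{$child} = $?;
    }
    $SIG{CHLD} = \&REAPER;
}
$SIG{CHLD} = \&REAPER;

# Hypothetical helper: run a command with the reaper switched off,
# using plain assignments to $SIG{CHLD} instead of local().
sub system_without_reaper {
    my @cmd   = @_;
    my $saved = $SIG{CHLD};                  # remember the current handler
    $SIG{CHLD} = 'DEFAULT';                  # the reaper cannot run now
    my $rc     = system(@cmd);
    my $status = $?;                         # copy before any other wait changes $?
    $SIG{CHLD} = defined $saved ? $saved : 'DEFAULT';   # put the handler back
    $saved->('CHLD') if ref $saved eq 'CODE';           # catch up on missed children
    return ($rc, $status);
}

my ($rc, $status) = system_without_reaper($^X, '-e', 'exit 3');
printf "system() returned %d, exit code %d\n", $rc, $status >> 8;
```

The explicit save and restore does by hand what local() would do, but the handler slot is never left to Perl's scope-exit machinery.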

Monday, April 23, 2007

Multi-core and signal handling: Unable to create sub named "" at

This little snippet of code will always fail on a multi-core machine, but not on a single-core one:

perl -e 'use strict;$SIG{CHLD}=sub {1};while (! system ("true")) {local $SIG{CHLD}="DEFAULT"}' & while kill -n 17 $! ;do true;done

Single-core behavior can be simulated on my Core Duo with the taskset(1) command, pinning both processes to processor '0':
taskset -c 0 perl -e 'use strict;$SIG{CHLD}=sub {1};while (! system ("true")) {local $SIG{CHLD}="DEFAULT"}' & taskset -c 0 bash -c "while kill -n 17 $! ;do true;done"

When the second taskset(1) command is instead pinned to a second "real" processor with "taskset -c 1", it typically fails within 30 seconds with
'Unable to create sub named "" at -e line 1.'

The fix? Don't local()ize $SIG{CHLD} inside the loop or inner scope. Chances are that when Perl restores $SIG{CHLD} on leaving the scope, it momentarily leaves it unset before returning it to the original, pre-local() global value.

Verified with perl5.8.0 and perl5.8.8-i386-linux-thread-multi

Multi-core and REAPER function

Race conditions dealing with multi-core perl programming and signals

I use Perl to spawn lots of processes with open3 and collect the data with 4-argument select().

When using a 'REAPER' function like the one described within perlipc:
use POSIX ":sys_wait_h";

sub REAPER {
    my $child;
    # If a second child dies while in the signal handler caused by the
    # first death, we won't get another signal. So must loop here else
    # we will leave the unreaped child as a zombie. And the next time
    # two children die we get another zombie. And so on.
    while (($child = waitpid(-1,WNOHANG)) > 0) {
        $Kid_Status{$child} = $?;
    }
    $SIG{CHLD} = \&REAPER; # still loathe sysV
}
$SIG{CHLD} = \&REAPER;
# do something that forks...

What happens when your signal handler gets interrupted by another SIGCHLD signal? It calls itself again, which is typically fine. However, if the interruption lands during your waitpid() call, you'll get something nasty in your %Kid_Status (like a return code of -1). If you're checking result codes and expecting '0', you'll bomb out. It's a nasty little gotcha that has shown itself infrequently, but often enough to give me headaches.

Consider disabling the SIGCHLD handler within the handler itself, and recording the $? return code only when it holds a real value (0 or greater); otherwise ignore it.

use POSIX ":sys_wait_h";   # for WNOHANG

my %Kid_Status;            # PID => wait status of each reaped child

sub REAPER {
    $SIG{CHLD}='DEFAULT'; # Make sure we don't re-trigger
    my $child;
    # If a second child dies while in the signal handler caused by the
    # first death, we won't get another signal. So must loop here else
    # we will leave the unreaped child as a zombie. And the next time
    # two children die we get another zombie. And so on.
    while (($child = waitpid(-1,WNOHANG)) > 0) {
        my $status=$?;   # copy right away, before anything else changes $?
        if ($status>=0) {
            $Kid_Status{$child} = $status;
            print STDERR "reaped child $child, return: ".$Kid_Status{$child}."\n";
        } else {
            # A negative status is not a real exit code, so don't record it
            print STDERR "WARNING: waitpid returned child status: $child: $status: $!\n";
        }
    }
    $SIG{CHLD} = \&REAPER; # still loathe sysV
}


Also, if you have a long-running program and you're checking exists() on %Kid_Status entries to see when your child has wrapped up, make sure to delete() the hash element once you're done with it. Linux wraps PIDs at the limit set in /proc/sys/kernel/pid_max (historically 32768 by default), and if you end up re-using the same PID for another run of your child, your exists() check will immediately return the result from an old process, instead of waiting until the newest one has finished.
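A minimal sketch of that check-then-delete step, assuming %Kid_Status is the hash filled in by the REAPER handler. The helper name take_status is hypothetical.

```perl
use strict;
use warnings;

my %Kid_Status;    # PID => wait status, filled in by the REAPER handler

# Hypothetical helper: return the recorded status for $pid and forget it,
# or return undef if that child has not been reaped yet.
sub take_status {
    my ($pid) = @_;
    return undef unless exists $Kid_Status{$pid};
    return delete $Kid_Status{$pid};    # delete() returns the removed value
}

# Simulate the reaper having recorded a status for PID 1234.
$Kid_Status{1234} = 256;
my $first  = take_status(1234);    # the recorded status
my $second = take_status(1234);    # undef: the entry is already gone
```

Because the entry is removed as soon as it is read, a later child that happens to get the same PID starts with a clean slate.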