Monday, April 23, 2007

Multi-core and REAPER function

Race conditions when dealing with multi-core Perl programming and signals

I use Perl to spawn lots of processes with open3 and collect their output with the four-argument select().
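For readers who haven't used this combination, here is a minimal sketch of the general pattern, not my production code. The child command (a one-line Perl script run via $^X) and the variable names are made up for illustration, and it handles only one child with stdout and stderr merged.

```perl
use strict;
use warnings;
use IPC::Open3;

# Hypothetical child: the running perl binary printing a single line.
# Passing undef as the third argument merges the child's stderr into $out.
my ($in, $out);
my $pid = open3($in, $out, undef, $^X, '-e', 'print "hello from child\n"');
close $in;    # we have nothing to send to the child

# Build the read bitmask for the four-argument select().
my $rin = '';
vec($rin, fileno($out), 1) = 1;

my $data = '';
while (1) {
    my $rout   = $rin;    # select() modifies its arguments, so use a copy
    my $nfound = select($rout, undef, undef, 1.0);
    if ($nfound < 0) {
        next if $!{EINTR};    # interrupted by a signal such as SIGCHLD
        die "select failed: $!";
    }
    next if $nfound == 0;     # timeout, nothing readable yet

    my $n = sysread($out, my $buf, 4096);
    last unless $n;           # 0 means EOF, undef means a read error
    $data .= $buf;
}

waitpid($pid, 0);
my $exit_code = $? >> 8;
print "child $pid said: $data";
```

With many children you would set one bit per filehandle in $rin and check each with vec($rout, fileno($fh), 1) after select() returns.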

Consider a 'REAPER' function like the one described in perlipc:
use POSIX ":sys_wait_h";

sub REAPER {
    my $child;
    # If a second child dies while in the signal handler caused by the
    # first death, we won't get another signal. So must loop here else
    # we will leave the unreaped child as a zombie. And the next time
    # two children die we get another zombie. And so on.
    while (($child = waitpid(-1,WNOHANG)) > 0) {
        $Kid_Status{$child} = $?;
    }
    $SIG{CHLD} = \&REAPER; # still loathe sysV
}
$SIG{CHLD} = \&REAPER;
# do something that forks...

What happens when your signal handler is itself interrupted by another SIGCHLD? The handler calls itself again, which is typically fine. However, if the signal interrupts your waitpid() call, you can end up with something nasty in %Kid_Status, such as a return code of -1. If you check result codes and expect 0, you'll bomb out. It's a nasty little gotcha that has shown up infrequently, but often enough to give me headaches.
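To show where a -1 can come from, here is a small sketch, assuming a script that has not forked anything. When waitpid() has no child to report on, it returns -1 and Perl sets $? to -1 as well, so a handler that stores $? without checking it can record that bogus value.

```perl
use strict;
use warnings;
use POSIX ":sys_wait_h";

# No children have been forked here, so there is nothing to reap.
my $pid    = waitpid(-1, WNOHANG);
my $status = $?;

# Both values should be -1: not a real PID and not a real exit status.
print "waitpid returned $pid, \$? is $status\n";
```

This is not the exact race described above, only the failure value to watch for.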

Consider resetting the SIGCHLD handler to its default while inside the handler, and recording $? only when it holds a real value (zero or greater); otherwise ignore it.

sub REAPER {
    $SIG{CHLD} = 'DEFAULT'; # Make sure we don't re-trigger
    my $child;
    # If a second child dies while in the signal handler caused by the
    # first death, we won't get another signal. So must loop here else
    # we will leave the unreaped child as a zombie. And the next time
    # two children die we get another zombie. And so on.
    while (($child = waitpid(-1,WNOHANG)) > 0) {
        my $status = $?;
        if ($status >= 0) {
            $Kid_Status{$child} = $status;
            print STDERR "reaped child $child, return: ".$Kid_Status{$child}."\n";
        } else {
            print STDERR "WARNING: waitpid returned child status: $child: $status: $!\n";
        }
    }
    $SIG{CHLD} = \&REAPER; # still loathe sysV
}


Also, if you have a long-running program and you use exists() on %Kid_Status to see when a child has finished, make sure to delete() the hash element once you're done with it. Linux wraps PIDs (at 32768 by default; see /proc/sys/kernel/pid_max), and if the same PID gets reused for a later run of your child, your exists() check will immediately return the result from the old process instead of waiting for the newest one to finish.
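One way to make that cleanup hard to forget is to read and delete in a single step. The sketch below is only an illustration: take_status() is a hypothetical helper name, and the child simply exits with code 3 so there is something to collect.

```perl
use strict;
use warnings;
use POSIX ":sys_wait_h";

my %Kid_Status;    # pid => raw $? value, filled in by the handler

sub REAPER {
    local ($!, $?);    # don't clobber the main program's $! and $?
    while ((my $child = waitpid(-1, WNOHANG)) > 0) {
        $Kid_Status{$child} = $?;
    }
}
$SIG{CHLD} = \&REAPER;

# Hypothetical helper: hand back the status once, then forget the PID,
# so a recycled PID can never match a stale entry.
sub take_status {
    my ($pid) = @_;
    return undef unless exists $Kid_Status{$pid};
    return delete $Kid_Status{$pid};
}

my $pid = fork();
die "fork failed: $!" unless defined $pid;
exit 3 if $pid == 0;    # child: exit immediately with code 3

# Parent: poll until the handler has recorded the child's status.
until (exists $Kid_Status{$pid}) {
    select(undef, undef, undef, 0.05);    # short sleep
}
my $status = take_status($pid);
printf "child %d exited with code %d\n", $pid, $status >> 8;
```

After take_status() returns, exists $Kid_Status{$pid} is false again, so the next child that happens to get the same PID starts with a clean slate.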
