It's quite easy right now to have planner bugs slip through the cracks, since we
don't distinguish certain kinds of crashes from regular cases of "failing to
find a solution". Here's some info pasted from an email:
========================================================================
> I see a lot of error files containing "line blah:
> <number> Killed". Is that a result of the run being killed by a time
> / memory cutoff?
With that message, I think the cause should always be the time limit. There are two
pieces of evidence to diagnose the cause of an "expected" crash:
1) The file "driver.log" contains periodic checkpointing info for
wall-clock time, memory and CPU time usage. If the planner is running
out of time, the last checkpoint should be close to 1800 seconds of CPU
time.
2) If it's running out of memory, there should be something about
throwing an instance of std::bad_alloc in the run.err file. There will
also often be evidence of "close to running out of memory" situations in
the driver.log checkpointing info in cases like this, but that's less
reliable since memory usage can grow very quickly in a very short time
in certain parts of the code.
Checking that there are no unexpected crashes is obviously something
that the analysis scripts should do for you, and if my to-do list
weren't in the triple digits, it'd be something I would have worked on
by now...
========================================================================
Adding to the info above:
We should check the *.err and driver.log files for unusual conditions such as
any errors other than the "expected" ones. I can think of at least the following
possibilities:
* solved problem and terminated without error
* proved unsolvable and terminated without error
* bailed out because of bad command line
* bailed out because of unsupported feature (e.g. conditional effects used
together with certain heuristics)
* planner ran out of CPU time
* planner ran out of memory
* driver script ran out of wall-clock time (this indicates an error in the
experimental conditions and should always be reported!)
* segmentation fault
* other kind of crash of planner
* crash/unexpected error output/missing expected output of driver script
We should classify and report these different outcomes.
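As a starting point, the classification could be sketched roughly as below. This is only a sketch under assumptions: the names Outcome, EXPECTED and classify_failure are made up for illustration, the exit status is assumed to follow the Python subprocess convention (negative value = killed by that signal), and the 30-second slack around the 1800-second CPU limit is an arbitrary guess. It only covers the outcomes for which we have evidence described above.

```python
import signal
from enum import Enum


class Outcome(Enum):
    SOLVED = "solved"
    UNSOLVABLE = "proved unsolvable"
    BAD_COMMAND_LINE = "bad command line"
    UNSUPPORTED_FEATURE = "unsupported feature"
    OUT_OF_TIME = "planner out of CPU time"
    OUT_OF_MEMORY = "planner out of memory"
    DRIVER_TIMEOUT = "driver out of wall-clock time"
    SEGFAULT = "segmentation fault"
    OTHER_CRASH = "other planner crash"
    DRIVER_ERROR = "driver script problem"


# Assumption: these are the outcomes we consider part of a normal
# experiment. Everything else should be reported prominently.
EXPECTED = frozenset({
    Outcome.SOLVED,
    Outcome.UNSOLVABLE,
    Outcome.OUT_OF_TIME,
    Outcome.OUT_OF_MEMORY,
})


def classify_failure(returncode, err_text, last_cpu_time,
                     cpu_limit=1800.0, slack=30.0):
    """Classify a run whose planner did not exit with status 0.

    Assumptions: returncode follows the subprocess convention (-N means
    "killed by signal N"), err_text is the content of run.err, and
    last_cpu_time is the CPU time in seconds at the last driver.log
    checkpoint.
    """
    if "std::bad_alloc" in err_text:
        return Outcome.OUT_OF_MEMORY
    if returncode == -signal.SIGSEGV:
        return Outcome.SEGFAULT
    if last_cpu_time >= cpu_limit - slack:
        return Outcome.OUT_OF_TIME
    return Outcome.OTHER_CRASH
```

Anything that ends up as OTHER_CRASH, or outside EXPECTED in general, is what the analysis scripts should flag.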
We should also create warnings in certain cases that might be OK, but look
suspicious, e.g. when wall-clock time in driver.log exceeds CPU time by a
significant amount or when CPU time exceeds wall-clock time by more than a tiny
amount (the former may be indicative of overloaded machines or swapping; the
latter should not happen at all, but that doesn't mean that we haven't seen it
happen).
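The timing checks could look something like this. The function name and all thresholds (20% or 5 seconds for "a significant amount", 1 second for "a tiny amount") are assumptions that would need tuning against real driver.log data.

```python
def timing_warnings(wall_time, cpu_time,
                    wall_slack_rel=0.2, wall_slack_abs=5.0,
                    cpu_slack_abs=1.0):
    """Return warning strings for suspicious wall-clock/CPU time pairs.

    Both times are in seconds, as read from the last driver.log
    checkpoint. All thresholds are placeholder values.
    """
    warnings = []
    # Wall-clock time far above CPU time: overloaded machine or swapping?
    if wall_time - cpu_time > max(wall_slack_abs, wall_slack_rel * cpu_time):
        warnings.append(
            "wall-clock time %.1fs exceeds CPU time %.1fs significantly "
            "(overloaded machine or swapping?)" % (wall_time, cpu_time))
    # CPU time above wall-clock time: should not happen at all.
    if cpu_time - wall_time > cpu_slack_abs:
        warnings.append(
            "CPU time %.1fs exceeds wall-clock time %.1fs"
            % (cpu_time, wall_time))
    return warnings
```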
We should also test invariants such as "if the planner exited cleanly, there
should be some output stating that the problem is unsolvable *or* there should
be a plan".
In cases of crashes, we should warn/signal an error in result analysis even if a
plan was found (think of anytime configurations, which can find a plan and then
crash while searching for a better one). So "a plan was found" and "an error
happened" are not mutually exclusive.
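One way to keep these properties independent is to record them as separate fields and check the invariant on top, roughly as follows. The RunResult record and check_run function are hypothetical; how the three flags are extracted from the logs is left open here.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    # Hypothetical record of what the analysis extracted for one run.
    exited_cleanly: bool
    plan_found: bool
    proved_unsolvable: bool


def check_run(result):
    """Return a list of problems to report for this run (empty if none)."""
    problems = []
    if result.exited_cleanly:
        # Invariant: a clean exit must come with a plan or with output
        # stating that the problem is unsolvable.
        if not (result.plan_found or result.proved_unsolvable):
            problems.append(
                "clean exit, but neither a plan nor an unsolvability claim")
    else:
        # An error is reported even if a plan was found.
        if result.plan_found:
            problems.append("planner did not exit cleanly (plan was found)")
        else:
            problems.append("planner did not exit cleanly")
    return problems
```

With this, an anytime run that found a plan and then crashed still shows up with a non-empty problem list, while its plan can still be counted.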
Also, any error output should be written to stderr (in some cases it's OK to
*additionally* write something to stdout, but there should always be something
on stderr), and we should *always* look into any unexpected output on stderr (of
either the planner or driver).
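A simple filter for this could look like the sketch below. The list of "expected" patterns is an assumption based on the two messages mentioned in the email above (the shell's "Killed" line and std::bad_alloc); the exact regular expressions would need to be checked against real run.err files.

```python
import re

# Assumed patterns for stderr lines we expect to see in normal experiments.
EXPECTED_STDERR_PATTERNS = [
    re.compile(r"line \d+:\s*\d+\s+Killed"),
    re.compile(r"std::bad_alloc"),
]


def unexpected_stderr_lines(err_text):
    """Return the non-empty stderr lines that match no expected pattern."""
    unexpected = []
    for line in err_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if any(p.search(stripped) for p in EXPECTED_STDERR_PATTERNS):
            continue
        unexpected.append(stripped)
    return unexpected
```

Any run for which this returns a non-empty list should be looked at by hand.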
Finally, we should check that our signal trapping code does not mask errors.
It's quite possible we're currently attempting to catch segmentation faults and
other things that are clearly errors, and I'm not sure if that's a good idea
since it might mean that the signal info is lost to the calling code, and we
cannot rely on ourselves reporting it properly. (Who can guarantee that the
output machinery isn't hosed after a segfault?) One way to deal with this is by
having the planner *always* write a certain string upon a clean exit, but not on
an exit caused by a signal, and then checking for the absence of that string in
the planner output.
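On the analysis side, the check for the clean-exit string could be as simple as this. The marker text is a made-up placeholder; the planner does not currently print it.

```python
# Hypothetical marker that the planner would print as its very last
# action on a clean exit, and never from a signal handler.
CLEAN_EXIT_MARKER = "Planner exited cleanly."


def exited_cleanly(planner_output):
    """Return True if the clean-exit marker appears on a line of its own."""
    return any(line.strip() == CLEAN_EXIT_MARKER
               for line in planner_output.splitlines())
```

Requiring the marker on a line of its own avoids false positives from other output that happens to contain the same words.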