Issue 315: scripts: resultfetcher should classify the causes of crashes - Fast Downward issue tracker

Title	scripts: resultfetcher should classify the causes of crashes
Priority	wish	Status	resolved
Superseder		Nosy List	jendrik, malte
Assigned To	jendrik	Keywords
Optional summary

Created on 2012-01-13.21:08:28 by malte, last changed by jendrik.

Messages
msg2450 (view)	Author: jendrik	Date: 2013-05-17.15:37:55
This has been fixed in lab, which can now make use of the new exit codes from issue338.
msg2031 (view)	Author: malte	Date: 2012-01-13.21:08:28
It's quite easy right now to have planner bugs slip through the cracks, since we don't distinguish certain kinds of crashes from regular cases of "failing to find a solution". Here's some info pasted from an email:

========================================================================
> > I see a lot of error files containing "line blah:
> > <number> Killed". Is that a result of the run being killed by a time
> > / memory cutoff?

With that message, I think it should always be runtime. There are two pieces of evidence to diagnose the cause of an "expected" crash:

1) The file "driver.log" contains periodic checkpointing info for wall-clock time, memory and CPU time usage. If the planner is running out of time, the last checkpoint should be close to 1800 seconds of CPU time.

2) If it's running out of memory, there should be something about throwing an instance of std::bad_alloc in the run.err file. There will also often be evidence of "close to running out of memory" situations in the driver.log checkpointing info in cases like this, but that's less reliable since memory usage can grow very quickly in a very short time in certain parts of the code.

Checking that there are no unexpected crashes is obviously something that the analysis scripts should do for you, and if my to-do list weren't in the triple digits, it'd be something I would have worked on by now...
========================================================================

Adding to the info above: We should check the .err and driver.log files for unusual conditions such as any errors other than the "expected" ones. I can think of at least the following possibilities:

* solved problem and terminated without error
* proved unsolvable and terminated without error
* bailed out because of bad command-line
* bailed out because of unsupported feature (e.g. conditional effects used together with certain heuristics)
* planner ran out of CPU time
* planner ran out of memory
* driver script ran out of wall-clock time (this indicates an error in the experimental conditions and should always be reported!)
* segmentation fault
* other kind of crash of planner
* crash/unexpected error output/missing expected output of driver script

We should classify and report these different outcomes. We should also create warnings in certain cases that might be OK, but look suspicious, e.g. when wall-clock time in driver.log exceeds CPU time by a significant amount or when CPU time exceeds wall-clock time by more than a tiny amount (the former may be indicative of overloaded machines or swapping; the latter should not happen at all, but that doesn't mean that we haven't seen it happen).

We should also test invariants such as "if the planner exited cleanly, there should be some output stating that the problem is unsolvable or there should be a plan". In cases of crashes, we should warn/signal an error in result analysis even if a plan was found (think of anytime configuration). So "a plan was found" and "an error happened" are not mutually exclusive.

Also, any error output should be written to stderr (in some cases it's OK to additionally write something to stdout, but there should always be something on stderr), and we should always look into any unexpected output on stderr (of either the planner or driver).

Finally, we should check that our signal trapping code does not mask errors. It's quite possible we're currently attempting to catch segmentation faults and other things that are clearly errors, and I'm not sure if that's a good idea since it might mean that the signal info is lost to the calling code, and we cannot rely on ourselves reporting it properly. (Who can guarantee that the output machinery isn't hosed after a segfault?) One way to deal with this is by having the planner always write a certain string upon a clean exit, but not on an exit caused by a signal, and then checking for the absence of that string in the planner output.
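
To make the classification above concrete, here is a minimal sketch in Python. It is not the resultfetcher or lab code. The `Run` record, the outcome labels, the `CLEAN_EXIT_MARKER` string and the thresholds are all hypothetical, and the values are assumed to have been parsed from run.err, driver.log and the planner output beforehand, since the exact file formats are not given here. Only the 1800-second limit, the "Killed" message and the std::bad_alloc hint come from the text above.

```python
from dataclasses import dataclass

CPU_LIMIT = 1800.0                 # seconds of CPU time, as mentioned in the email
CLEAN_EXIT_MARKER = "clean exit"   # hypothetical string the planner would print


@dataclass
class Run:
    """Hypothetical record of one planner run, parsed elsewhere."""
    returncode: int        # negative means "killed by that signal" (subprocess convention)
    stderr: str            # contents of run.err
    stdout: str            # planner output
    cpu_time: float        # last CPU-time checkpoint from driver.log
    wall_time: float       # last wall-clock checkpoint from driver.log
    plan_found: bool
    proved_unsolvable: bool = False


def classify(run, slack=0.05):
    """Return one outcome label; plan_found stays a separate, independent fact."""
    if "std::bad_alloc" in run.stderr:
        return "out-of-memory"
    if run.returncode == -11:      # 11 is SIGSEGV on Linux
        return "segfault"
    if "Killed" in run.stderr and run.cpu_time >= CPU_LIMIT * (1 - slack):
        return "out-of-time"
    clean = run.returncode == 0 and CLEAN_EXIT_MARKER in run.stdout
    if clean and run.plan_found:
        return "solved"
    if clean and run.proved_unsolvable:
        return "unsolvable"
    if clean:
        # Invariant: a clean exit must come with a plan or an unsolvability proof.
        return "invariant-violated"
    return "unexpected-crash"


def suspicious(run, ratio=1.2, tiny=1.0):
    """Return warnings for runs that may be fine but look odd."""
    msgs = []
    if run.wall_time > run.cpu_time * ratio:
        msgs.append("wall-clock time far exceeds CPU time")
    if run.cpu_time > run.wall_time + tiny:
        msgs.append("CPU time exceeds wall-clock time")
    if run.stderr.strip() and classify(run) in ("solved", "unsolvable"):
        msgs.append("unexpected stderr output")
    return msgs


if __name__ == "__main__":
    killed = Run(returncode=-9, stderr="line 3: 1234 Killed", stdout="",
                 cpu_time=1799.0, wall_time=1805.0, plan_found=True)
    # An anytime run can have found a plan and still have ended abnormally.
    print(classify(killed), killed.plan_found)
```

Keeping `plan_found` out of the outcome label reflects the point that "a plan was found" and "an error happened" are not mutually exclusive. The clean-exit check relies on the absence of the marker rather than on anything the planner reports after a signal.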

History
Date	User	Action	Args
2013-05-17 15:37:55	jendrik	set	status: chatting -> resolved messages: + msg2450
2013-05-15 19:35:59	jendrik	set	assignedto: jendrik
2012-01-13 21:08:28	malte	create