Created on 2025-01-15.07:35:59 by masataro, last changed by malte.
| msg12050 (view) |
Author: malte |
Date: 2026-03-17.18:41:15 |
|
I think you're making different assumptions about what is desirable/undesirable than the ones we made. So the goals are different, and with different goals, different implementation plans are optimal.
For us, at least as we defined the goals when we implemented this, the current behaviour is not undesirable. On OOM the C++ code terminates while printing very little information, but it does so in a reliable way. The information we're chiefly interested in under out-of-memory conditions (how long the planner ran, and that it ran out of memory) can be easily detected by the calling process, even without any parsing (from exit codes and child process runtime statistics).
So basically p(Z) for us is 1 -- our output parsing scripts expect exactly the current behaviour and thus work 100% of the time. p(Y), irreproducible output due to triggering undefined behaviour in the language, is 0. (Double output was one symptom -- it could also be segmentation faults, for example.)
More generally, we put a very high negative value on triggering undefined behaviour, so this was one of the considerations that triggered the current design.
|
| msg12049 (view) |
Author: masataro |
Date: 2026-03-13.03:42:53 |
|
> is exactly what we used to do before the current version with the reentrant code that has no ambitions beyond terminating.
> It did cause issues, which is why we changed it. I remember really weird behaviour like output lines being printed twice because the internal data structures of std::cout got out of whack when the ostream library code itself ran out of memory while trying to do something.
Hmm, I don't understand the rationale for this change.
Let X be an event that it causes an OOM out of the entire experimental run, whose probability is p(X). It can be large or small.
Let Y be an event for the double output line.
Let Z be an event where the output is parsed correctly by some experimental script.
In the old code,
p(Z) = p(Z|X)p(X) + p(Z|not X)p(not X) = p(Z|X)p(X) + 1 * p(not X)
p(Z|X) = p(Z,Y|X) + p(Z,not Y|X) = p(Z|Y,X)p(Y|X) + p(Z|not Y,X)p(not Y|X) = 0 * p(Y|X) + 1 * p(not Y|X)
therefore p(Z) = p(not Y|X)p(X) + p(not X)
If you remove the code,
p(Z) = p(Z|X)p(X) + p(Z|not X)p(not X) = 0 * p(X) + 1 * p(not X)
therefore p(Z) = p(not X), which is smaller than p(not Y|X)p(X) + p(not X) (strictly so whenever p(X) > 0 and p(not Y|X) > 0).
|
| msg12048 (view) |
Author: malte |
Date: 2026-03-12.23:06:48 |
|
To me it looks like the new code, i.e., using memory padding, releasing it when catching bad_alloc, and then attempting to exit gracefully, is exactly what we used to do before the current version with the reentrant code that has no ambitions beyond terminating.
It did cause issues, which is why we changed it. I remember really weird behaviour like output lines being printed twice because the internal data structures of std::cout got out of whack when the ostream library code itself ran out of memory while trying to do something.
Right now I can find some allusions to difficulties we had in some issue discussions, but nothing that looks directly related to when and why we changed from catching bad_alloc and using padding to just terminating. But I did find some discussion that was at least somewhat related, in issue469.
|
| msg12047 (view) |
Author: masataro |
Date: 2026-03-12.21:16:12 |
|
In threaded code with a separate stack and no handler for ExitException, the thread would simply call std::terminate, which calls std::abort, and the program terminates unconditionally. This does not cross the red line.
The Linux kernel on a killing spree is not an issue either, because SIGKILL is uncatchable anyway. It does not cross the red line.
|
| msg12046 (view) |
Author: masataro |
Date: 2026-03-12.21:01:23 |
|
> I'm skeptical about investing a lot of work into making a program report on its own out-of-memory conditions in general.
> We tried a lot over the years in various parts of the code, and it was never robust. For example, the Python code of Fast Downward uses memory padding and catches MemoryError, and it's never been reliable. Instead Python crashes internally in uncatchable ways.
I agree on minimizing the effort on this, hence my proposed changes are simple and minimal.
The goal of the change is to increase the robustness of error reporting just a bit, not to make it 100%.
It is completely fine if it crashes in uncatchable ways in some corner cases. It is still infinitely better than printing no information.
Our only red line should be to prevent it from continuing any processing after failing to allocate the memory.
In my PR, the overall behavior does not change; it just unwinds the stack, prints the stats and quits.
The only assumption it makes is that the only code that catches ExitException is main(), which should be the case because ExitException is defined in FD.
And maybe that no other code overrides the handler with set_new_handler(), but this is already assumed in the current code.
In contrast, `bad_alloc` may be internally handled by an external LP solver itself, as you mentioned.
Even if the stat reporting is only partially achieved, it is useful to see the stats and *unwind the stack*.
This is because it is possible for *other classes* to print *other statistics* in their destructors --- preferably in a reentrant way, but thanks to the memory padding, more flexibly with ostream and strings.
I am already using the proposed code in my MCTS fork and empirically it is working just fine.
|
| msg12045 (view) |
Author: malte |
Date: 2026-03-06.11:02:30 |
|
bad_alloc was the mechanism we used in the early days of Fast Downward, and it was much less robust than the current mechanism; there is a reason why we ended up with the current strict handling with reentrant functions and terminating quickly.
I'm skeptical about investing a lot of work into making a program report on its own out-of-memory conditions in general. We tried a lot over the years in various parts of the code, and it was never robust. For example, the Python code of Fast Downward uses memory padding and catches MemoryError, and it's never been reliable. Instead Python crashes internally in uncatchable ways.
Also with C++ code, there are ways that cannot be controlled at all where programs are killed externally by uncatchable signals. I've seen this happen a lot these days with multi-process code where the Linux kernel overcommits because it predicts too aggressively how much memory forked processes will share.
When we tried something like this in the past, it was for example impossible to get the external libraries (= MIP solvers) to play nicely with it.
That doesn't mean we shouldn't think about what we can do to improve logging for OOM scenarios. But I'd prefer something more robust, such as periodic logging, that also gives us something when we're killed by signal 9 or when the LP libraries cause an abort on out-of-memory.
|
| msg12044 (view) |
Author: masataro |
Date: 2026-03-05.20:36:07 |
|
Implemented this with memory padding.
https://github.com/aibasel/downward/pull/286
|
| msg11734 (view) |
Author: masataro |
Date: 2025-01-15.07:38:59 |
|
Although I said "feels like a much better option", I already have the code that uses bad_alloc + destructors, and it has been working nicely with my compute cluster.
|
| msg11733 (view) |
Author: masataro |
Date: 2025-01-15.07:35:59 |
|
Issue : FD does not print any statistics when it was killed by the OS, scheduler, etc via signals.
It is currently done by print_statistics() in various parts of the search engines, which gets skipped when the search is terminated externally.
More broadly, other classes (e.g. heuristics, evaluators, pruning) can report other useful statistics, if one wishes.
Pros: I do not think destructors are *always* called, but they at least have the better language-level guarantee of being actually called for stack-allocated objects.
It also feels like the right place to put the clean-up code that runs at the end of a search engine object's lifetime.
Cons: Consing (allocation) must be discouraged in a destructor, and its code requires more care.
Related: out-of-memory handler
I am mostly aware of the issue about signal handling and reentrancy (https://issues.fast-downward.org/issue479). The C++20 standard (and earlier standards) mandates the following requirements for signal handlers and out-of-memory handlers, as mentioned for example at https://en.cppreference.com/w/cpp/memory/new/set_new_handler . Throwing bad_alloc() and catching it in main() feels like a much better option than the current implementation, since it unwinds the stack and gracefully calls the destructors, which would have the aforementioned good side effect if combined with statistics reporting.
ISO/IEC JTC1 SC22 WG21 N 486, 17.6.3.3 Type new_handler [new.handler]:
    using new_handler = void (*)();
1 The type of a handler function to be called by operator new() or operator new[]() (17.6.2) when they cannot satisfy a request for additional storage.
2 Required behavior: A new_handler shall perform one of the following:
(2.1) — make more storage available for allocation and then return;
(2.2) — throw an exception of type bad_alloc or a class derived from bad_alloc;
(2.3) — terminate execution of the program without returning to the caller.
|
|
| Date |
User |
Action |
Args |
| 2026-03-17 18:41:15 | malte | set | messages:
+ msg12050 |
| 2026-03-13 03:42:53 | masataro | set | messages:
+ msg12049 |
| 2026-03-12 23:06:48 | malte | set | messages:
+ msg12048 |
| 2026-03-12 21:16:12 | masataro | set | messages:
+ msg12047 |
| 2026-03-12 21:01:23 | masataro | set | messages:
+ msg12046 |
| 2026-03-12 21:01:14 | masataro | set | summary: (text of msg12046) -> https://github.com/aibasel/downward/pull/286 |
| 2026-03-12 21:00:23 | masataro | set | summary: (set to the text of msg12046) |
| 2026-03-06 11:02:30 | malte | set | messages:
+ msg12045 |
| 2026-03-05 20:42:55 | masataro | set | title: Make destructors responsible for statistic reporting -> Ensure statistic reporting on OOM |
| 2026-03-05 20:36:19 | masataro | set | priority: wish -> feature |
| 2026-03-05 20:36:07 | masataro | set | messages:
+ msg12044 |
| 2025-01-15 09:55:42 | clemens | set | nosy:
+ malte, jendrik, clemens |
| 2025-01-15 07:38:59 | masataro | set | status: unread -> chatting messages:
+ msg11734 |
| 2025-01-15 07:35:59 | masataro | create | |
|