main_loop() contains this code:

    TALLOC_FREE(rec->nodemap);
    ret = ctdb_ctrl_getnodemap(ctdb, CONTROL_TIMEOUT(), pnn, rec, &rec->nodemap);

The 2nd line contains a nested event loop that waits for the reply to the control. This event loop can invoke message handlers that do not expect rec->nodemap to be NULL.

One example is lost_reclock_handler(), which causes rec->nodemap to be unconditionally dereferenced in list_of_nodes() via this call chain:

    list_of_nodes()
    list_of_active_nodes()
    set_recovery_mode()
    force_election()
    lost_reclock_handler()

This causes the CTDB recovery daemon to crash sometimes when the recovery lock is lost.

There are also other handlers that unconditionally reference rec->nodemap.
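The ordering problem described above can be reproduced in isolation. The following is only a minimal sketch: it uses plain malloc()/free() instead of talloc, and every name in it (struct node_map, struct recoverd, fake_handler(), fake_getnodemap(), refresh_unsafe(), refresh_safe(), count_null_sightings()) is a hypothetical stand-in, not the real CTDB code. It also shows one possible way to avoid the window, fetching into a temporary and swapping only after the reply arrives; this is an assumption for illustration and not necessarily what the attached patch does.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the CTDB structures; not the real definitions. */
struct node_map { unsigned int num; };
struct recoverd { struct node_map *nodemap; int handler_saw_null; };

/* Stand-in for a message handler such as lost_reclock_handler().
 * The real handlers dereference rec->nodemap; here we only count
 * how often the pointer was NULL when the handler ran. */
static void fake_handler(struct recoverd *rec)
{
    if (rec->nodemap == NULL) {
        rec->handler_saw_null++;
    }
}

/* Stand-in for ctdb_ctrl_getnodemap(): the "nested event loop" is
 * modelled by running one handler before the reply is delivered. */
static int fake_getnodemap(struct recoverd *rec, struct node_map **out)
{
    fake_handler(rec);
    *out = malloc(sizeof(**out));
    if (*out == NULL) {
        return -1;
    }
    (*out)->num = 3;
    return 0;
}

/* Ordering from the report: free first, then wait for the reply. */
static int refresh_unsafe(struct recoverd *rec)
{
    free(rec->nodemap);
    rec->nodemap = NULL;    /* TALLOC_FREE() also resets the pointer */
    return fake_getnodemap(rec, &rec->nodemap);
}

/* Possible alternative: keep the old map valid until the new one exists. */
static int refresh_safe(struct recoverd *rec)
{
    struct node_map *fresh = NULL;
    int ret = fake_getnodemap(rec, &fresh);
    if (ret != 0) {
        return ret;         /* old map is left in place on failure */
    }
    free(rec->nodemap);
    rec->nodemap = fresh;
    return 0;
}

/* Run one refresh and report how often the handler saw a NULL nodemap. */
static int count_null_sightings(int (*refresh)(struct recoverd *))
{
    struct recoverd rec = { malloc(sizeof(struct node_map)), 0 };
    assert(rec.nodemap != NULL);
    rec.nodemap->num = 3;
    int ret = refresh(&rec);
    assert(ret == 0);
    free(rec.nodemap);
    return rec.handler_saw_null;
}
```

Calling count_null_sightings(refresh_unsafe) reports one handler invocation with a NULL map, while count_null_sightings(refresh_safe) reports none. In the sketch the handler only counts; in the recovery daemon the equivalent point is where list_of_nodes() dereferences rec->nodemap.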
Created attachment 15872
Patch for 4.12, 4.11

Cherry picks cleanly from master into both branches.
Hi Karolin, This is ready for v4-11 and v4-12. Thanks.
(In reply to Amitay Isaacs from comment #2) Pushed to autobuild-v4-{12,11}-test.
Pushed to both branches. Closing out bug report. Thanks!