Bug 10366 - CTDB 2.5.1 crashes when adding or removing nodes to the cluster
Summary: CTDB 2.5.1 crashes when adding or removing nodes to the cluster
Alias: None
Product: CTDB 2.5.x or older
Classification: Unclassified
Component: ctdb (show other bugs)
Version: unspecified
Hardware: x86 Linux
: P5 critical
Target Milestone: ---
Assignee: Amitay Isaacs
QA Contact: Samba QA Contact
Depends on:
Reported: 2014-01-09 17:39 UTC by Kevin Osbron
Modified: 2015-02-26 04:14 UTC (History)
1 user (show)

See Also:

Attachment output from ctdb_diagnostics (92.37 KB, application/octet-stream)
2014-01-09 18:23 UTC, Kevin Osbron
no flags Details
Contains the log.ctdb files along with stack dumps from crashes (301.03 KB, application/octet-stream)
2014-01-09 18:26 UTC, Kevin Osbron
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Kevin Osbron 2014-01-09 17:39:39 UTC
CTDB crashes some of the nodes currently in the cluster when adding a new node. We were chasing this problem when we were using CTDB version 2.2 and upgrading to 2.5.1 decreased the frequency of the crash. But we can get CTDB to crash on all nodes currently in the cluster when adding a node if we first run "ctdb setdebug INFO". The crashes happen in a few places but every one that I have looked at point to an invalid tcparray data structure. I have seen crashes at the following locations:
File server/ctdb_takeover.c
ctdb_control_send_arp(): DEBUG(DEBUG_INFO,("sending tcp tickle ack for %u->%s:%u\n", ...
ctdb_control_get_tcp_tickle_list(): memcpy(&list->tickles.connections[0], tcparray->connections, 
  -- sometimes see the CTDB_NO_MEMORY() output
ctdb_announce_vnn_iface(): arp->tcparray = talloc_steal(arp, tcparray);
  -- this results in a failure here: talloc_abort (reason=0x48a218 "Bad talloc magic value - unknown value") at lib/talloc/talloc.c:338

Steps to reproduce:
1. form cluster
2. run "ctdb setdebug INFO" on all nodes. (The crash is very infrequent unless you make this setting.)
3. start simple data load on each of the existing nodes using SMB, something like a large xcopy or some other load generator
4. Add node to cluster
5. notice that some of the clients will timeout during the write due to CTDB crashes
6. do "grep INTERNAL /var/log/log.ctdb" or check the /var/log/dumps directory for cores on each node to see if there are any crashes

Base OS is SLES 11 SP1 
Samba build: 3.6.22
CTDB version: 2.5.1
3 node cluster + 1 node addition then + 1 node addition
16 GB of RAM per node

Attached files:
The attached files show failures from adding a 4th node and then adding a 5th node. I had added some debug to the ctdb_control_get_tcp_tickle_list function in the second set of tests so the crash point is lightly different, but due to the same incorrect data in the tcparray data structure.
I have attached one tar with all the ctdb log files and back traces from all the core files, and the ctdb_diagnostics output. Note that I had to do some modifications to the ctdb_diagnostics script so that it would be compatible with our tool set. 

Let me know if you would like any other information or if you need access to the core files themselves.

NOTE: I left the CTDB version unspecified in bugzilla "version" field because 2.5.1 was not listed as an option (nor was 2.5).
Comment 1 Kevin Osbron 2014-01-09 18:23:14 UTC
Created attachment 9562 [details]
Attachment output from ctdb_diagnostics
Comment 2 Kevin Osbron 2014-01-09 18:26:47 UTC
Created attachment 9563 [details]
Contains the log.ctdb files along with stack dumps from crashes
Comment 3 Kevin Osbron 2014-01-17 00:30:25 UTC
We believe that we have found a fix for this problem. Apply the following patch to the 2.5.1 ctdb source code:

--- ctdb-2.5.1/server/ctdb_takeover.c.orig      2014-01-16 09:24:59.000000000 -0800
+++ ctdb-2.5.1/server/ctdb_takeover.c   2014-01-16 09:26:13.000000000 -0800
@@ -3051,11 +3051,9 @@
        /* If this is the first tickle */
        if (tcparray == NULL) {
-               tcparray = talloc_size(ctdb->nodes, 
-                       offsetof(struct ctdb_tcp_array, connections) +
-                       sizeof(struct ctdb_tcp_connection) * 1);
+               tcparray = talloc(ctdb->nodes, struct ctdb_tcp_array);
                CTDB_NO_MEMORY(ctdb, tcparray);
-               vnn->tcp_array = tcparray;
+               vnn->tcp_array = talloc_steal(vnn, tcparray);
                tcparray->num = 0;
                tcparray->connections = talloc_size(tcparray, sizeof(struct ctdb_tcp_connection));

It seems like the original code is making ctdb->nodes the owner of the memory talloc'd for the tcparray. When reload nodes is called as part of the add nodes process the old ctdb->nodes list is eventually talloc_free()'d, which then frees the tcparray that it "owns". Then later on the tcp_array pointer is referenced but the memory has been recycled and the values inside are no longer valid. This is when the crash occurs.

Please let me know if the patch suggested is good, or if you have any other suggestions for how this problem should be addressed. For instance, should we use
tcparray = talloc(ctdb->vnn, struct ctdb_tcp_array); 
and then skip the talloc_steal() two lines later? We suggested the method above because it seemed to be the process used other places in the code.

I hope this helps!

-Kevin Osborn
Comment 4 Amitay Isaacs 2014-01-22 09:41:23 UTC
Hi Kevin,

Good catch. It does not make sense to talloc memory off ctdb->nodes for tcparray.  It should be off vnn corresponding to the IP address for which the tickles are tracked.
Comment 5 Amitay Isaacs 2014-02-28 01:08:27 UTC
The patch to fix this defect is now upstream and backported to 2.5 branch.  It will be in 2.5.3.