CTDB version: 1.0.112-4
Samba: 3.3.10-40.el5

smb.conf:
    vfs objects = gpfs, fileid, shadow_copy2, syncops, tsmsm, xattr_tdb
    map read only = no
    map system = no
    map archive = no
    map hidden = no
    hide dot files = yes

I have upgraded my CTDB/Samba cluster and now I have these errors in times of load:

2010/02/18 13:48:35.111837 [10148]: server/ctdb_persistent.c:183 ERROR: trans2_commit retry: client-db_id[0x00000000] != db_id[0x3dd7daa7](client_id[0x003e4547])
2010/02/18 13:48:36.119771 [10148]: server/ctdb_persistent.c:183 ERROR: trans2_commit retry: client-db_id[0x00000000] != db_id[0x3dd7daa7] (client_id[0x003e4547])
2010/02/18 13:48:37.267648 [10148]: server/ctdb_persistent.c:183 ERROR: trans2_commit retry: client-db_id[0x00000000] != db_id[0x3dd7daa7] (client_id[0x003e4547])
2010/02/18 13:48:38.268972 [10148]: server/ctdb_persistent.c:183 ERROR: trans2_commit retry: client-db_id[0x00000000] != db_id[0x3dd7daa7] (client_id[0x003e4547])
2010/02/18 13:48:39.269653 [10148]: server/ctdb_persistent.c:697 ctdb_control_trans2_error: Unknown database 0x00000000
2010/02/18 14:12:15.694893 [10192]: ctdb_persistent_callback failed with status 1 ((null))
2010/02/18 14:12:16.725353 [10192]: ctdb_persistent_callback failed with status 1 ((null))
2010/02/18 14:12:17.757288 [10192]: ctdb_persistent_callback failed with status 1 ((null))

The corresponding tdb 0x3dd7daa7 is the xattr.tdb:

    ctdb getdbstatus xattr.tdb
    dbid: 0x3dd7daa7
    name: xattr.tdb
    path: /var/ctdb/persistent/xattr.tdb.1
    PERSISTENT: yes
    HEALTH: OK
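For anyone seeing the same messages, a quick way to map the db_id from the log to a database is to list the attached databases first (a minimal sketch using the stock ctdb CLI, as above):

    # List all databases ctdb has attached, with dbid, name and path.
    ctdb getdbmap

    # Then query the one whose dbid matches the value from the log
    # (here 0x3dd7daa7 resolves to xattr.tdb, as shown above).
    ctdb getdbstatus xattr.tdb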
Thanks for your bug report.

Hmmm, the older versions of the transaction implementation were buggy. This has (hopefully) been fixed for good recently. The latest changes are in ctdb-1.0.112, but Samba 3.3 is too old. Could you try out the code base of the clustered Samba repository?

repo:   git://git.samba.org/obnox/samba-ctdb.git
branch: v3-4-ctdb

I don't think any of the released upstream versions have sufficient fixes (since they were added too recently).

But do the errors occur always or just sporadically? (I'd expect them to occur in a contended situation where several processes are trying to access the tdb simultaneously...)

Michael
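In case it helps, roughly how a build from that branch could look. This is only a sketch: the configure switches shown are assumptions and depend on your environment, so check ./configure --help first.

    # Clone the clustered Samba tree and switch to the suggested branch.
    git clone git://git.samba.org/obnox/samba-ctdb.git
    cd samba-ctdb
    git checkout v3-4-ctdb

    # Samba 3.4-era builds are done from source3/; the cluster-related
    # configure options below are an assumption, verify them locally.
    cd source3
    ./autogen.sh
    ./configure --with-cluster-support --with-ctdb=/usr/include
    make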
Hi, any input on my questions / hints?
Sorry for the slow reply ...

I don't know why this occurs. It seems to happen all the time, and much more frequently when the cluster is under high load. This persistent db seems to see a lot of write traffic, and I think the automatic vacuuming is not able to keep up with it.

Now the cluster has crashed, and this db is ~200MB and not recoverable. We deleted the db and the cluster is running again. The underlying filesystem is GPFS 3.2, if that matters.
(In reply to comment #3)
> Sorry for the slow reply ...

No problem. :-)

You said that these errors started occurring after you updated CTDB/Samba.
What were the versions before the update?

> I don't know why this occurs. It seems to happen all the time, and much
> more frequently when the cluster is under high load. This persistent db
> seems to see a lot of write traffic.

Well, this depends on how Samba is used. These messages should only occur when the db is written to, which is the case when some process wants to store extended attributes. Are you using "store dos attributes = yes"? I did not see it in the config portion you posted. Could you please post your complete Samba configuration? And also your CTDB configuration, please.

Anyhow, as already said: the error messages you are seeing are a manifestation of bugs (race conditions) in the implementation of transactions in ctdb and in the ctdb client code. The transactions have been fixed. The new code is in the Samba master branch, and I hope to get it into 3.5.1. It is also in the v3-4-ctdb branch of the samba-ctdb repository here:

git://git.samba.org/obnox/samba-ctdb.git

I think you need an updated version of Samba. Will you be able to build a new set of Samba packages based on these different sources, or do you need some assistance?

> I think the automatic vacuuming is not able to keep up with it.

Hmm, vacuuming actually has nothing to do with transaction crashes. It cleans up records marked for deletion. But for persistent DBs like the xattr tdb it does not usually do that (in newer versions of ctdb), since these should be in sync across the nodes anyway. Or am I misunderstanding your thought here?

> Now the cluster has crashed, and this db is ~200MB and not recoverable.

Uh, to what extent does the cluster crash? That sounds bad. I understand that in an extreme case the buggy transactions could lead to a corrupted tdb, but crash the whole cluster?...

Cheers - Michael

> We deleted the db and the cluster is running again.
>
> The underlying filesystem is GPFS 3.2, if that matters.
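By the way, a quick way to see which of the relevant options are actually in effect on a node is testparm; the grep pattern here is just an illustration:

    # Dump the effective configuration without prompting and pick out the
    # options that route DOS attribute writes into xattr.tdb.
    testparm -s 2>/dev/null | grep -Ei 'store dos attributes|vfs objects|ea support'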
> You said that these errors started occurring after you updated
> CTDB/Samba. What were the versions before the update?

Before the update:
ctdb 1.0.96_6
samba 3.2.7-ctdb.54.2

> Could you please post your complete Samba configuration?
> And also your CTDB configuration, please.

------ /etc/samba/smb.conf --------------------
[global]
        netbios name = DKFZFSG
        workgroup = AD
        realm = AD.DKFZ-HEIDELBERG.DE
        server string = "dkfzfsg SAMBA Cluster"
        security = ads
        password server = ad1, ad2, ad3, *
        use kerberos keytab=true
        clustering = yes
        preferred master = no
        local master = no
        wins server = 193.174.xx.xxx 193.174.yy.yyy
        cluster addresses = 172.21.zz.1 172.21.zz.2
        interfaces = 172.21.zz.1 172.21.zz.2
        name resolve order = wins bcast
        load printers = no
        auth methods = guest, sam, winbind
        obey pam restrictions = yes
        encrypt passwords = yes
        map to guest = Bad User
        log file = /var/log/samba/log.%m
        log level = 1 passdb:1 auth:1 winbind:3 acls:3 idmap:3 vfs:3 smb:1 quota:1
        private dir = /var/lib/samba/private
        pid directory = /var/lib/samba/var/locks
        socket options = TCP_NODELAY SO_RCVBUF=16384 SO_SNDBUF=16384
        use sendfile = no
        kernel oplocks = yes
        notify:inotify = no
        shadow:snapdir = .snapshots
        shadow:fixinodes = yes
        smbd:backgroundqueue = false
        strict locking = yes
        posix locking = yes
        host msdfs = no
        vfs objects = gpfs, fileid, shadow_copy2, syncops, tsmsm, xattr_tdb
        gpfs:sharemodes = no
        fileid:mapping = fsname
        force unknown acl user = yes
        nfs4: mode = special
        nfs4: chown = yes
        nfs4: acedup = merge
        store dos attributes = yes
        inherit permissions = Yes
        inherit acls = Yes
        map read only = no
        map system = no
        map archive = no
        map hidden = no
        hide dot files = yes
        username map = /etc/samba/smbusers
        winbind separator = +
        winbind use default domain = no
        winbind nested groups = yes
        winbind cache time = 3600
        template shell = /bin/false
        passdb backend = tdbsam
        groupdb:backend = tdb
        idmap backend = tdb2
        idmap uid = 2000000-3000000
        idmap gid = 2000000-3000000
        idmap config ad:default = yes
        idmap config ad:range = 1000000-2000000
        idmap config ad:backend = rid
        winbind enum users = yes
        winbind enum groups = yes
        ea support = yes
        unix extensions = no
        client ntlmv2 auth = yes

[AAA]
        comment = Bla Bla
        path = /gpfs/AAA
        read only = No

...
-------------------------------------

------ /etc/sysconfig/ctdb -------------------------
CTDB_RECOVERY_LOCK="/gpfs/xxxx/ctdb/.RECOVERY-LOCK"
CTDB_PUBLIC_INTERFACE=eth0
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
CTDB_MANAGES_SAMBA=yes
CTDB_NFS_SKIP_SHARE_CHECK=yes
CTDB_MANAGES_WINBIND=yes
CTDB_MANAGES_NFS=yes
ulimit -n 10000
CTDB_NOTIFY_SCRIPT=/etc/ctdb/notify.sh
CTDB_DEBUGLEVEL=WARNING
CTDB_SET_MAXQUEUEDROPMSG=100000
------------------------------------------------------

> The transactions have been fixed. The new code is in
> the Samba master branch, and I hope to get it into
> 3.5.1. It is also in the v3-4-ctdb branch of the
> samba-ctdb repository here:
> git://git.samba.org/obnox/samba-ctdb.git
> I think you need an updated version of Samba.
> Will you be able to build a new set of Samba
> packages based on these different sources, or
> do you need some assistance?

OK, but I can test this in 3 weeks at the earliest.

> Hmm, vacuuming actually has nothing to do with transaction crashes.
> It cleans up records marked for deletion.
> But for persistent DBs like the xattr tdb it does not
> usually do that (in newer versions of ctdb), since these
> should be in sync across the nodes anyway.

OK, but a manual vacuum (run about 10-20 times) still reports many records to delete.
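For reference, the kind of manual run I mean, as a sketch (stock ctdb CLI; the record limit value is arbitrary):

    # Trigger a vacuuming run; the record limit is arbitrary.
    ctdb vacuum 100000

    # Repack the database files afterwards to reclaim the freed space.
    ctdb repack

    # Watch how large the persistent xattr tdb grows on disk
    # (path as shown by getdbstatus; the numeric suffix differs per node).
    ls -lh /var/ctdb/persistent/xattr.tdb.1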
It's probably a load-dependent problem...

>> Now the cluster has crashed, and this db is ~200MB and not recoverable.
> Uh, to what extent does the cluster crash?
> That sounds bad. I understand that in an extreme
> case the buggy transactions could lead to a corrupted
> tdb, but crash the whole cluster?...

Why CTDB crashed I don't know, but the restart didn't work because of the unrecoverable tdb (on both nodes).
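For anyone hitting the same dead end, a sketch of the kind of cleanup we did; the tdbbackup verification step is just an extra check, and the path is the one from the getdbstatus output above (the suffix differs per node):

    # On every node, with ctdb stopped:
    service ctdb stop

    # Optionally run tdbbackup in verify mode to confirm the corruption.
    tdbbackup -v /var/ctdb/persistent/xattr.tdb.1

    # Move the broken persistent db out of the way instead of deleting it
    # outright, so it can still be inspected later.
    mv /var/ctdb/persistent/xattr.tdb.1 /var/ctdb/persistent/xattr.tdb.1.broken

    # Restart ctdb; the database file is recreated on the next attach.
    service ctdb start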
No feedback ... I assume the problem is solved. If the problem persists with the latest versions, please reopen the bug. Thanks!