Bug 7153 - Error in Logs with xattr.tdb (Unknown database 0x00000000)
Status: RESOLVED FIXED
Alias: None
Product: CTDB 2.5.x or older
Classification: Unclassified
Component: ctdb
Version: unspecified
Hardware: x64 Linux
Importance: P3 normal
Target Milestone: ---
Assignee: Michael Adam
QA Contact: Samba QA Contact
Reported: 2010-02-18 07:14 UTC by Thomas Sesselmann
Modified: 2012-09-06 12:42 UTC

Description Thomas Sesselmann 2010-02-18 07:14:37 UTC
CTDB version: 1.0.112-4
Samba: 3.3.10-40.el5

smb.conf:
 vfs objects = gpfs, fileid, shadow_copy2, syncops, tsmsm, xattr_tdb
 map read only = no
 map system    = no
 map archive   = no
 map hidden    = no
 hide dot files = yes

I upgraded my CTDB/Samba cluster and now see these errors under load:
 2010/02/18 13:48:35.111837 [10148]: server/ctdb_persistent.c:183 ERROR:  trans2_commit retry: client-db_id[0x00000000] != db_id[0x3dd7daa7](client_id[0x003e4547])
 2010/02/18 13:48:36.119771 [10148]: server/ctdb_persistent.c:183 ERROR: trans2_commit retry: client-db_id[0x00000000] != db_id[0x3dd7daa7] (client_id[0x003e4547])
 2010/02/18 13:48:37.267648 [10148]: server/ctdb_persistent.c:183 ERROR: trans2_commit retry: client-db_id[0x00000000] != db_id[0x3dd7daa7] (client_id[0x003e4547])
 2010/02/18 13:48:38.268972 [10148]: server/ctdb_persistent.c:183 ERROR: trans2_commit retry: client-db_id[0x00000000] != db_id[0x3dd7daa7] (client_id[0x003e4547])
 2010/02/18 13:48:39.269653 [10148]: server/ctdb_persistent.c:697 ctdb_control_trans2_error: Unknown database 0x00000000
 2010/02/18 14:12:15.694893 [10192]: ctdb_persistent_callback failed with status 1 ((null))
 2010/02/18 14:12:16.725353 [10192]: ctdb_persistent_callback failed with status 1 ((null))
 2010/02/18 14:12:17.757288 [10192]: ctdb_persistent_callback failed with status 1 ((null))

The corresponding tdb 0x3dd7daa7 is the xattr.tdb:
ctdb getdbstatus xattr.tdb
 dbid: 0x3dd7daa7
 name: xattr.tdb
 path: /var/ctdb/persistent/xattr.tdb.1
 PERSISTENT: yes
 HEALTH: OK
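
To see how often the mismatch shows up and which database it concerns, the retry lines can be tallied with a short script. This is only a sketch: the regular expression is derived from the log lines quoted above, and the function name and sample list are made up for illustration, not part of ctdb.

```python
import re
from collections import Counter

# Matches the "trans2_commit retry" lines quoted in this report.
# The optional whitespace before "(client_id" covers both spellings seen above.
RETRY_RE = re.compile(
    r"trans2_commit retry:\s*client-db_id\[(0x[0-9a-fA-F]+)\]\s*!=\s*"
    r"db_id\[(0x[0-9a-fA-F]+)\]\s*\(client_id\[(0x[0-9a-fA-F]+)\]\)"
)


def count_retry_mismatches(lines):
    """Count retry lines per (client-db_id, db_id) pair (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            client_db_id, db_id, _client_id = match.groups()
            counts[(int(client_db_id, 16), int(db_id, 16))] += 1
    return counts


# Three lines copied from the log excerpt above; the last one is not a retry line.
SAMPLE_LOG = [
    "2010/02/18 13:48:35.111837 [10148]: server/ctdb_persistent.c:183 ERROR:  trans2_commit retry: client-db_id[0x00000000] != db_id[0x3dd7daa7](client_id[0x003e4547])",
    "2010/02/18 13:48:36.119771 [10148]: server/ctdb_persistent.c:183 ERROR: trans2_commit retry: client-db_id[0x00000000] != db_id[0x3dd7daa7] (client_id[0x003e4547])",
    "2010/02/18 13:48:39.269653 [10148]: server/ctdb_persistent.c:697 ctdb_control_trans2_error: Unknown database 0x00000000",
]

if __name__ == "__main__":
    for (client_db, db), n in count_retry_mismatches(SAMPLE_LOG).items():
        print(f"client-db_id=0x{client_db:08x} db_id=0x{db:08x} retries={n}")
```

The db_id it reports can then be compared with the dbid printed by `ctdb getdbstatus xattr.tdb`.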
Comment 1 Michael Adam 2010-02-18 09:52:08 UTC
Thanks for your bug report.

Hmmm, the older versions of the transaction implementation
were buggy. This has (hopefully) been fixed for good recently.
The latest changes are in ctdb-1.0.112, but samba 3.3 is too
old. Could you try out the code base of the clustered samba
repository:

repo: git://git.samba.org/obnox/samba-ctdb.git
branch: v3-4-ctdb

I don't think any of the released upstream versions has sufficient
fixes (since they were added too recently).

But does the error occur always or just sporadically?
(I'd expect it to occur in a contended situation where
several processes are trying to access the tdb simultaneously...)

Michael
Comment 2 Michael Adam 2010-02-23 04:18:50 UTC
Hi, any input for my questions / hints?
Comment 3 Thomas Sesselmann 2010-02-23 16:56:09 UTC
sorry for slow reply ...

I don't know why this occurs. It seems to happen always, and very frequently when the cluster is under high load. It seems this persistent db has a high load.
I think the automatic vacuum is not able to handle this.
Now the cluster crashes and this db is ~200MB and not recoverable. We deleted this db and now the cluster is running again.

The underlying filesystem is GPFS 3.2 if this matters.
Comment 4 Michael Adam 2010-02-24 02:12:38 UTC
(In reply to comment #3)
> sorry for slow reply ...

No problem. :-)

You said that these errors started occurring after you updated
CTDB/Samba. What were the versions before the update?

> I don't know why this occurs. It seems to happen always, and
> very frequently when the cluster is under high load.
> It seems this persistent db has a high load.

Well, this depends on how samba is used. Actually these
messages should only occur when the db is written to,
and this is the case when some process wants to store
extended attributes.

Are you using "store dos attributes = yes"? I did not see
it in the config portion you posted.
Could you please post your complete Samba configuration?
And also your CTDB configuration, please.

Anyhow, as already said: the error messages you are seeing
are a manifestation of bugs (race conditions) in the
implementation of transactions in ctdb and the ctdb client
implementation.

The transactions have been fixed. The new code is in
the samba master branch, and I hope to get it into
3.5.1. It is also in the v3-4-ctdb branch of the
samba-ctdb repository here:
git://git.samba.org/obnox/samba-ctdb.git

I think you need an updated version of samba.
Will you be able to build a new set of samba
packages based on these different sources or
do you need some assistance?

> I think the automatic vacuum is not able to handle this.

Hmm, vacuuming has nothing to do with transaction crashes
actually. It cleans up records marked for deletion.
But for persistent DBs like the xattr tdb, it does not
do that usually (in newer versions of ctdb), since these
should be in sync across the nodes anyway.

Or am I misunderstanding your thought here?

> Now the cluster crashes and this db is ~200MB and not recoverable.

Uh, to what extent does the cluster crash?
That sounds bad. I understand that in an extreme
case the buggy transactions could lead to a corrupted
tdb, but crash the whole cluster?...

Cheers - Michael

> We deleted this db and now the cluster is running again.
> 
> The underlying filesystem is GPFS 3.2 if this matters.
> 

Comment 5 Thomas Sesselmann 2010-02-25 16:19:30 UTC
> You said that these errors started occurring after you updated
> CTDB/Samba. What were the versions before the update?

Before update:
 ctdb 1.0.96_6
 samba 3.2.7-ctdb.54.2


> Could you please post your complete Samba configuration?
> And also your CTDB configuration, please.

------ /etc/samba/smb.conf --------------------
[global]
  netbios name = DKFZFSG
  workgroup = AD
  realm = AD.DKFZ-HEIDELBERG.DE
  server string = "dkfzfsg SAMBA Cluster"
  security = ads
  password server = ad1, ad2, ad3, *
  use kerberos keytab=true
  clustering = yes
  preferred master = no
  local master = no
  wins server = 193.174.xx.xxx 193.174.yy.yyy
  cluster addresses = 172.21.zz.1 172.21.zz.2
  interfaces = 172.21.zz.1 172.21.zz.2
  name resolve order = wins bcast
  load printers = no
  auth methods = guest, sam, winbind
  obey pam restrictions = yes
  encrypt passwords = yes
  map to guest = Bad User
  log file = /var/log/samba/log.%m
  log level = 1 passdb:1 auth:1 winbind:3 acls:3 idmap:3 vfs:3 smb:1 quota:1
  private dir = /var/lib/samba/private
  pid directory = /var/lib/samba/var/locks
  socket options = TCP_NODELAY SO_RCVBUF=16384 SO_SNDBUF=16384
  use sendfile = no
  kernel oplocks = yes
  notify:inotify = no
  shadow:snapdir = .snapshots
  shadow:fixinodes = yes
  smbd:backgroundqueue = false
  strict locking = yes
  posix locking = yes
  host msdfs = no
  vfs objects = gpfs, fileid, shadow_copy2, syncops, tsmsm, xattr_tdb
  gpfs:sharemodes = no
  fileid:mapping = fsname
  force unknown acl user = yes
  nfs4: mode = special
  nfs4: chown = yes
  nfs4: acedup = merge
  store dos attributes = yes
  inherit permissions = Yes
  inherit acls = Yes
  map read only = no
  map system    = no
  map archive   = no
  map hidden    = no
  hide dot files = yes
  username map = /etc/samba/smbusers
  winbind separator = +
  winbind use default domain = no
  winbind nested groups = yes
  winbind cache time = 3600
  template shell = /bin/false
  passdb backend = tdbsam
  groupdb:backend = tdb
  idmap backend = tdb2
  idmap uid = 2000000-3000000
  idmap gid = 2000000-3000000
  idmap config ad:default = yes
  idmap config ad:range = 1000000-2000000
  idmap config ad:backend = rid
  winbind enum users = yes
  winbind enum groups = yes
  ea support = yes
  unix extensions = no
  client ntlmv2 auth = yes

[AAA]
        comment = Bla Bla
        path = /gpfs/AAA
        read only = No
...

-------------------------------------
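
Since writes to xattr.tdb are expected only when extended attributes are stored, it can help to check the relevant settings mechanically. The sketch below pulls the three parameters discussed in this thread out of smb.conf-style text. The helper names are made up for illustration, the parser is deliberately simplistic (no includes, no line continuations), and treating exactly these three parameters as the relevant ones is an assumption drawn from the discussion above.

```python
import re


def read_smb_conf_section(text, section="global"):
    """Return {parameter: value} for one section of smb.conf-style text."""
    params = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip().lower()
            continue
        if current == section and "=" in line:
            key, value = line.split("=", 1)
            # Normalise case and internal whitespace so lookups are predictable.
            params[" ".join(key.lower().split())] = value.strip()
    return params


def is_yes(value):
    """Treat yes/true/1 as enabled (assumed boolean spellings)."""
    return value.strip().lower() in {"yes", "true", "1"}


def xattr_write_settings(params):
    """Report the settings from this thread that lead to xattr.tdb writes."""
    vfs = [v for v in re.split(r"[,\s]+", params.get("vfs objects", "")) if v]
    return {
        "xattr_tdb": "xattr_tdb" in vfs,
        "store_dos_attributes": is_yes(params.get("store dos attributes", "no")),
        "ea_support": is_yes(params.get("ea support", "no")),
    }


# Lines copied from the [global] section posted above.
SAMPLE_CONF = """
[global]
  vfs objects = gpfs, fileid, shadow_copy2, syncops, tsmsm, xattr_tdb
  store dos attributes = yes
  ea support = yes
"""

if __name__ == "__main__":
    print(xattr_write_settings(read_smb_conf_section(SAMPLE_CONF)))
```

With the configuration posted here, all three come back enabled, which fits the observation that the persistent db is written to frequently.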

------ /etc/sysconfig/ctdb -------------------------
CTDB_RECOVERY_LOCK="/gpfs/xxxx/ctdb/.RECOVERY-LOCK"
CTDB_PUBLIC_INTERFACE=eth0
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
CTDB_MANAGES_SAMBA=yes
CTDB_NFS_SKIP_SHARE_CHECK=yes
CTDB_MANAGES_WINBIND=yes
CTDB_MANAGES_NFS=yes
ulimit -n 10000
CTDB_NOTIFY_SCRIPT=/etc/ctdb/notify.sh
CTDB_DEBUGLEVEL=WARNING
CTDB_SET_MAXQUEUEDROPMSG=100000
------------------------------------------------------


> The transactions have been fixed. The new code is in
> the samba master branch, and I hope to get it into
> 3.5.1. It is also in the v3-4-ctdb branch of the
> samba-ctdb repository here:
> git://git.samba.org/obnox/samba-ctdb.git

> I think you need an updated version of samba.
> Will you be able to build a new set of samba
> packages based on these different sources or
> do you need some assistance?

OK, but I can test this in 3 weeks at the earliest.

> Hmm, vacuuming has nothing to do with transaction crashes
> actually. It cleans up records marked for deletion.
> But for persistent DBs like the xattr tdb, it does not
> do that usually (in newer versions of ctdb), since these
> should be in sync across the nodes anyway.

OK, but running the vacuum manually (about 10-20 times) reports
many records to delete. It's probably a related problem...

>> Now the cluster crashes and this db is ~200MB and not recoverable.
> Uh, to what extent does the cluster crash?
> That sounds bad. I understand that in an extreme
> case the buggy transactions could lead to a corrupted
> tdb, but crash the whole cluster?...

I don't know why CTDB crashed, but the restart didn't work because of the unrecoverable tdb (on both nodes).

Comment 6 Björn Jacke 2012-09-06 12:42:31 UTC
No feedback ... I assume the problem is solved. If the problem persists with the latest versions, please reopen the bug. Thanks!