Executed an id -a <user>, where the user is a member of a large number of groups, and there are 350+ groups being pulled down from AD. Winbindd core dumps. Unfortunately, I don't have much information at the moment, other than a log level = 100 log. I'll compile with debug to get a better idea and report later, if it is still needed. Here's the complete tail end of the log.winbindd after the crash. ldap_get_id_from_sid: Searching for "(&(objectClass=sambaIdmapEntry) (sambaSID=S-1-5-21-462329909-250626402 7-1013402721-1102))" [2004/01/22 21:52:01, 11] lib/smbldap.c:smbldap_open(822) smbldap_open: already connected to the LDAP server [2004/01/22 21:52:01, 3] lib/smbldap.c:smbldap_retry_open(884) Sleeping for 0 milliseconds before reconnecting [2004/01/22 21:52:01, 11] lib/smbldap.c:smbldap_open(822) smbldap_open: already connected to the LDAP server [2004/01/22 21:52:01, 3] lib/smbldap.c:smbldap_retry_open(884) Sleeping for 0 milliseconds before reconnecting [2004/01/22 21:52:01, 11] lib/smbldap.c:smbldap_open(822) smbldap_open: already connected to the LDAP server [2004/01/22 21:52:01, 3] sam/idmap_ldap.c:ldap_get_id_from_sid(593) [2004/01/22 21:52:01, 3] lib/smbldap.c:smbldap_retry_open(884) ldap_get_id_from_sid: Failure looking up idmap entry (Local error) Sleeping for 8 milliseconds before reconnecting [2004/01/22 21:52:01, 1] nsswitch/winbindd_group.c:winbindd_getgrent(689) could not look up gid for group FSTAT_FIRE_UNIT_STATUS_UPDATE_GROUP [2004/01/22 21:52:01, 10] nsswitch/winbindd_group.c:winbindd_getgrent(646) entry_index = 261, num_entries = 353 [2004/01/22 21:52:01, 10] sam/idmap_util.c:idmap_sid_to_gid(179) sid_to_gid: sid = [S-1-5-21-462329909-2506264027-1013402721-8672] [2004/01/22 21:52:01, 10] sam/idmap_tdb.c:db_get_id_from_sid(315) db_get_id_from_sid [2004/01/22 21:52:01, 10] sam/idmap_tdb.c:internal_get_id_from_sid(221) internal_get_id_from_sid: fetching record S-1-5-21-462329909-2506264027- 1013402721-8672 of type 0x2 [2004/01/22 21:52:01, 10] sam/idmap_tdb.c:internal_get_id_from_sid(225) internal_get_id_from_sid: record S-1-5-21-462329909-2506264027-1013402721- 8672 not found [2004/01/22 21:52:01, 8] sam/idmap_ldap.c:ldap_get_id_from_sid(572) ldap_get_id_from_sid: S-1-5-21-462329909-2506264027-1013402721-8672 (group) [2004/01/22 21:52:01, 10] sam/idmap_ldap.c:ldap_get_id_from_sid(583) ldap_get_id_from_sid: Searching for "(&(objectClass=sambaIdmapEntry) (sambaSID=S-1-5-21-462329909-250626402 7-1013402721-8672))" [2004/01/22 21:52:01, 11] lib/smbldap.c:smbldap_open(822) smbldap_open: already connected to the LDAP server Assertion failed: ber->ber_buf == NULL, file io.c, line 508 [2004/01/22 21:52:01, 11] lib/smbldap.c:smbldap_open(822) smbldap_open: already connected to the LDAP server Assertion failed: 0, file io.c, line 701
I have a bit more info. There seems to be two parts needed to recreate this. The first is the idmap mappings in our OpenLDAP were modified so that a large majority of the SIDs map to a single unix gid. The second is that after this is done, the tdb caches are restarted, and samba is restarted, you have to let it sit idle from the last command you issued for about 10 minutes. Then, when you issue a command like "id -a <globaluser>", the winbind log shows the mapping starts, but it then fails trying to reconnect to the ldap. 1) run "getent group" and "getent passwd" to populate the ldap 2) run a script to change about 200 of the SID=>gid mappings to a SID=>19999 mapping (or some such fixed gid). 3) stop the daemons 4) rm /samba/locks/winbind*.tdb files 5) restart the daemons 6) Do "su" commands, "id", "wbinfo", whatever you want, and it maps appropriatly 7) Wait over 10 minutes without any winbind access. 8) "id -a <globaluser>", winbind core dumps. The core is within the OpenLDAP libraries, where it is trying to access an LDAP* internal buffer (ber_buf) that is not initialized. The stack trace is as follows: #0 0xff01f930 in _libc_kill () from /lib/libc.so.1 #1 0xfefb5964 in abort () from /lib/libc.so.1 #2 0xfefb5c08 in _assert () from /lib/libc.so.1 #3 0xff096724 in ber_get_next (sb=0x328588, len=0xffbed814, ber=0x3d03c0) at io.c:701 #4 0xff0c8afc in try_read1msg (ld=0x33e440, msgid=10, all=1, sb=0x328588, lc=0x33e3d8, result=0xffbee6a4) at result.c:445 #5 0xff0c8874 in wait4msg (ld=0x33e440, msgid=10, all=1, timeout=0x33e3d8, result=0xffbee6a4) at result.c:354 #6 0xff0cb030 in ldap_search_s (ld=0x33e440, base=0xa <Address 0xa out of bounds>, scope=1, filter=0x33e3d8 "", attrs=0xffbee6a4, attrsonly=144, res=0xffbee6a4) at search.c:354 #7 0x1e6c68 in smbldap_search (ldap_state=0x328498, base=0x3d03f8 "ou=idmap,dc=aoc,dc=afbase,dc=af,dc=net", scope=2, filter=0xffbedea0 "(&(objectClass=sambaIdmapEntry)(sambaSID=S-1-5-21- 462329909-2506264027-1013402721-8435))", attrs=0x3dc0c8, attrsonly=0, res=0xffbee6a4) at lib/smbldap.c:921 #8 0x206298 in ldap_get_id_from_sid (id=0xffbee7a0, id_type=0xffbee804, sid=0xffbee878) at sam/idmap_ldap.c:588 #9 0x202678 in idmap_get_id_from_sid (id=0xffbee7a0, id_type=0xffbee804, sid=0xffbee878) at sam/idmap.c:211 #10 0x203158 in idmap_sid_to_gid (sid=0xffbee878, gid=0xffbee8d0, flags=2) at sam/idmap_util.c:183 #11 0x490b0 in winbindd_getgrent (state=0xffbeeb60) at nsswitch/winbindd_group.c:686 #12 0x413b4 in process_request (state=0xffbeeb60) at nsswitch/winbindd.c:308 #13 0x41948 in winbind_process_packet (state=0xffbeeb60) at nsswitch/winbindd.c:432 #14 0x65ecc in do_dual_daemon () at nsswitch/winbindd_dual.c:204 #15 0x43320 in main (argc=2, argv=0xffbefca4) at nsswitch/winbindd.c:918 (gdb) up 7 #7 0x1e6c68 in smbldap_search (ldap_state=0x328498, base=0x3d03f8 "ou=idmap,dc=aoc,dc=afbase,dc=af,dc=net", scope=2, filter=0xffbedea0 "(&(objectClass=sambaIdmapEntry)(sambaSID=S-1-5-21- 462329909-2506264027-1013402721-8435))", attrs=0x3dc0c8, attrsonly=0, res=0xffbee6a4) at lib/smbldap.c:921 ------------- And the value of relavent vars is: ------------- (gdb) p *ldap_state->ldap_struct $1 = {ld_sb = 0x328588, ld_options = {ldo_valid = 2, ldo_debug = 0, ldo_tm_api = 0x0, ldo_tm_net = 0x0, ldo_version = 3, ldo_deref = 0, ldo_timelimit = 0, ldo_sizelimit = 0, ldo_defludp = 0x33e730, ldo_defport = 389, ldo_defbase = 0x328538 "dc=aoc,dc=afbase,dc=af,dc=net", ldo_defbinddn = 0x0, ldo_refhoplimit = 5, ldo_sctrls = 0x0, ldo_cctrls = 0x0, ldo_rebind_proc = 0x1e5c64 <rebindproc_connect_with_state>, ldo_rebind_params = 0x328498, ldo_booleans = 1}, ld_lberoptions = 1, ld_errno = 0, ld_error = 0x0, ld_matched = 0x0, ld_referrals = 0x0, ld_msgid = 10, ld_requests = 0x40bf40, ld_responses = 0x0, ld_abandoned = 0x0, ld_cache = 0x0, ld_defconn = 0x33e3d8, ld_conns = 0x33e3d8, ld_selectinfo = 0x33e4c8} (gdb) p *ldap_state->ldap_struct->ld_defconn $2 = {lconn_sb = 0x328588, lconn_refcnt = 3, lconn_lastused = 1074885177, lconn_rebind_inprogress = 0, lconn_rebind_queue = 0x0, lconn_status = 3, lconn_server = 0x3400e0, lconn_ber = 0x3d03c0, lconn_next = 0x0} (gdb) p *ldap_state->ldap_struct->ld_defconn->lconn_ber $3 = {ber_opts = {lbo_valid = 2, lbo_options = 1, lbo_debug = 0, lbo_meminuse = 0}, ber_tag = 129, ber_len = 3859045747, ber_usertag = 1634558464, ber_buf = 0x0, ber_ptr = 0x3d03d0 "æ\004]samb", ber_end = 0x0, ber_sos = 0x0, ber_rwptr = 0x3d03d7 ""} ---------------- OpenLDAP checks ber_len and ber_ptr to ensure they are non-zero. If so, it believes there is memory available and, in this case, continues into a core dump. So, this appears to be an OpenLDAP problem. However, I can't see why changing the data values of an OpenLDAP database would cause OpenLDAP to core. The database should care less what the data is. So, I'm still delving into samba to see what could be causing the problem on that side. ---------------- Following is the end log tail from another failed winbind. I find it interesting that samba is trying to reopen the ldap connection only to say "already connected to the LDAP server". I feel this may be where the problem manifests. ---------------- [2004/01/23 19:12:57, 10] sam/idmap_ldap.c:ldap_get_id_from_sid(583) ldap_get_id_from_sid: Searching for "(&(objectClass=sambaIdmapEntry) (sambaSID=S-1-5-21-462329909-2506264027-1013402721-8434))" [2004/01/23 19:12:57, 11] lib/smbldap.c:smbldap_open(822) smbldap_open: already connected to the LDAP server [2004/01/23 19:12:57, 3] lib/smbldap.c:smbldap_retry_open(884) Sleeping for 0 milliseconds before reconnecting [2004/01/23 19:12:57, 11] lib/smbldap.c:smbldap_open(822) smbldap_open: already connected to the LDAP server [2004/01/23 19:12:57, 3] sam/idmap_ldap.c:ldap_get_id_from_sid(593) [2004/01/23 19:12:57, 3] lib/smbldap.c:smbldap_retry_open(884) ldap_get_id_from_sid: Failure looking up idmap entry (Decoding error) Sleeping for 8 milliseconds before reconnecting [2004/01/23 19:12:57, 1] nsswitch/winbindd_group.c:winbindd_getgrent(689) could not look up gid for group ABMSetDataRole [2004/01/23 19:12:57, 10] nsswitch/winbindd_group.c:winbindd_getgrent(646) entry_index = 24, num_entries = 354 [2004/01/23 19:12:57, 10] sam/idmap_util.c:idmap_sid_to_gid(179) sid_to_gid: sid = [S-1-5-21-462329909-2506264027-1013402721-8435] [2004/01/23 19:12:57, 10] sam/idmap_tdb.c:db_get_id_from_sid(315) db_get_id_from_sid [2004/01/23 19:12:57, 10] sam/idmap_tdb.c:internal_get_id_from_sid(221) internal_get_id_from_sid: fetching record S-1-5-21-462329909-2506264027- 1013402721-8435 of type 0x2 [2004/01/23 19:12:57, 10] sam/idmap_tdb.c:internal_get_id_from_sid(225) internal_get_id_from_sid: record S-1-5-21-462329909-2506264027-1013402721- 8435 not found [2004/01/23 19:12:57, 8] sam/idmap_ldap.c:ldap_get_id_from_sid(572) ldap_get_id_from_sid: S-1-5-21-462329909-2506264027-1013402721-8435 (group) [2004/01/23 19:12:57, 10] sam/idmap_ldap.c:ldap_get_id_from_sid(583) ldap_get_id_from_sid: Searching for "(&(objectClass=sambaIdmapEntry) (sambaSID=S-1-5-21-462329909-2506264027-1013402721-8435))" [2004/01/23 19:12:57, 11] lib/smbldap.c:smbldap_open(822) smbldap_open: already connected to the LDAP server [2004/01/23 19:12:57, 3] lib/smbldap.c:smbldap_retry_open(884) Sleeping for 0 milliseconds before reconnecting [2004/01/23 19:12:57, 11] lib/smbldap.c:smbldap_open(822) smbldap_open: already connected to the LDAP server Assertion failed: 0, file io.c, line 701 [2004/01/23 19:12:57, 11] lib/smbldap.c:smbldap_open(822) smbldap_open: already connected to the LDAP server Assertion failed: 0, file io.c, line 701
Although we are not pursuing this method any longer, I thought I'd do a little more testing using the newer ldap version 2.1.25. Following the same process as before, I setup the environment with multiple SIDs mapping to the same GID, truss'd winbindd, performed an "id -a <globaluser>", waited over 10 minutes, and did another "id -a <globaluser>". Running a few times came up with two different behaviours. The first result was identical to the previous attempts, where, on the second id attempt, winbindd core dumped on an Assert in openLDAP. The logs matched what is shown in the previous comments. The second result was a hang in winbind. The following is the tail of a level 100 debug up to when the stall occurred. The last "already connected" line shown is the last output to the log when the process stalled. [2004/03/12 16:00:23, 8] sam/idmap_ldap.c:ldap_get_id_from_sid(572) ldap_get_id_from_sid: S-1-5-21-1482476501-746137067-682003330-512 (group) [2004/03/12 16:00:23, 10] sam/idmap_tdb.c:internal_get_id_from_sid(225) internal_get_id_from_sid: record S-1-5-21-1482476501-746137067-682003330-1365 not found [2004/03/12 16:00:23, 10] sam/idmap_ldap.c:ldap_get_id_from_sid(583) ldap_get_id_from_sid: Searching for "(&(objectClass=sambaIdmapEntry)(sambaSID=S-1-5-21-1482476501-746137067-682003330-512))" [2004/03/12 16:00:23, 8] sam/idmap_ldap.c:ldap_get_id_from_sid(572) ldap_get_id_from_sid: S-1-5-21-1482476501-746137067-682003330-1365 (group) [2004/03/12 16:00:23, 11] lib/smbldap.c:smbldap_open(822) [2004/03/12 16:00:23, 10] sam/idmap_ldap.c:ldap_get_id_from_sid(583) smbldap_open: already connected to the LDAP server ldap_get_id_from_sid: Searching for "(&(objectClass=sambaIdmapEntry)(sambaSID=S-1-5-21-1482476501-746137067-682003330-1365))" [2004/03/12 16:00:23, 11] lib/smbldap.c:smbldap_open(822) smbldap_open: already connected to the LDAP server Here is the tail of the truss when the stall occurred: 3876: write(3, " [ 2 0 0 4 / 0 3 / 1 2 ".., 58) = 58 3876: getuid() = 0 [0] 3876: write(3, " s m b l d a p _ o p".., 53) = 53 3876: time() = 1079108128 3876: write(7, " 081C9020202B9 c81C204 &".., 204) = 204 3878: poll(0xFFBECE60, 1, -1) = 1 3878: read(7, " 081EB020202B9 d", 8) = 8 3878: read(7, "81E404 \ s a m b a S I D".., 230) = 230 3878: poll(0xFFBECE60, 1, -1) = 1 3878: read(7, " 0\r020202B9 e07", 8) = 8 3878: read(7, "\n01\004\004\0", 7) = 7 3876: poll(0xFFBED8C8, 1, -1) = 1 3876: read(7, 0x0029053F, 8) (sleeping...) 3878: poll(0xFFBECE60, 1, -1) (sleeping...) Doing a "pkill winbindd" using the default SIGTERM awakened winbindd, and it continued processing the id request. Unfortunately, I formally killed it with a SIGKILL before I noticed the SIGTERM had revived it, so I don't know if it truly recovered. I performed another run that stalled at the same place. This time I did a pkill -HUP winbindd, and winbindd began running again, only to die at another assert. The following are the log tails for this final run. [2004/03/12 17:32:26, 10] nsswitch/winbindd_group.c:winbindd_getgrent(648) entry_index = 30, num_entries = 382 [2004/03/12 17:32:26, 10] sam/idmap_util.c:idmap_sid_to_gid(179) sid_to_gid: sid = [S-1-5-21-1482476501-746137067-682003330-1136] [2004/03/12 17:32:26, 10] sam/idmap_tdb.c:db_get_id_from_sid(315) db_get_id_from_sid [2004/03/12 17:32:26, 10] sam/idmap_tdb.c:internal_get_id_from_sid(221) internal_get_id_from_sid: fetching record S-1-5-21-1482476501-746137067-682003330-1136 of type 0x2 [2004/03/12 17:32:26, 10] sam/idmap_tdb.c:internal_get_id_from_sid(225) internal_get_id_from_sid: record S-1-5-21-1482476501-746137067-682003330-1136 not found [2004/03/12 17:32:26, 8] sam/idmap_ldap.c:ldap_get_id_from_sid(572) ldap_get_id_from_sid: S-1-5-21-1482476501-746137067-682003330-1136 (group) [2004/03/12 17:32:26, 10] sam/idmap_ldap.c:ldap_get_id_from_sid(583) ldap_get_id_from_sid: Searching for "(&(objectClass=sambaIdmapEntry)(sambaSID=S-1-5-21-1482476501-746137067-682003330-1136))" [2004/03/12 17:32:26, 11] lib/smbldap.c:smbldap_open(822) smbldap_open: already connected to the LDAP server Assertion failed: ber->ber_buf == NULL, file io.c, line 508 4033: write(3, " [ 2 0 0 4 / 0 3 / 1 2 ".., 58) = 58 4033: getuid() = 0 [0] 4033: write(3, " s m b l d a p _ o p".., 53) = 53 4033: time() = 1079112746 4033: write(7, " 081C80201 # c81C204 & o".., 203) = 203 4033: poll(0xFFBECE60, 1, -1) = 1 4033: write(2, " A s s e r t i o n f a".., 60) = 60 4033: sigaction(SIGABRT, 0x00000000, 0xFFBED288) = 0 4033: llseek(0, 0, SEEK_CUR) = 0 4033: sigaction(SIGABRT, 0xFFBED158, 0xFFBED1D8) = 0 4033: sigprocmask(SIG_UNBLOCK, 0xFFBED208, 0x00000000) = 0 4033: getpid() = 4033 [4031] 4033: kill(4033, SIGABRT) = 0 4033: Received signal #6, SIGABRT [default] 4033: siginfo: SIGABRT pid=4033 uid=0 4033: *** process killed *** 4031: *** process killed ***
should be fixed in 3.0.8
sorry for the same, cleaning up the database to prevent unecessary reopens of bugs.