Bug 997 - Winbind core dump enumerating large group list
Winbind core dump enumerating large group list
Status: CLOSED FIXED
Product: Samba 3.0
Classification: Unclassified
Component: winbind
3.0.1
All Solaris
: P1 critical
: none
Assigned To: Samba Bugzilla Account
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2004-01-22 14:04 UTC by John Klinger
Modified: 2005-08-24 10:16 UTC (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description John Klinger 2004-01-22 14:04:10 UTC
Executed an id -a <user>, where the user is a member of a large number of 
groups, and there are 350+ groups being pulled down from AD. Winbindd core 
dumps. Unfortunately, I don't have much information at the moment, other than
a log level = 100 log. I'll compile with debug to get a better idea and report 
later, if it is still needed.

Here's the complete tail end of the log.winbindd after the crash.

  ldap_get_id_from_sid: Searching for "(&(objectClass=sambaIdmapEntry)
(sambaSID=S-1-5-21-462329909-250626402
7-1013402721-1102))"
[2004/01/22 21:52:01, 11] lib/smbldap.c:smbldap_open(822)
  smbldap_open: already connected to the LDAP server
[2004/01/22 21:52:01, 3] lib/smbldap.c:smbldap_retry_open(884)
  Sleeping for 0 milliseconds before reconnecting
[2004/01/22 21:52:01, 11] lib/smbldap.c:smbldap_open(822)
  smbldap_open: already connected to the LDAP server
[2004/01/22 21:52:01, 3] lib/smbldap.c:smbldap_retry_open(884)
  Sleeping for 0 milliseconds before reconnecting
[2004/01/22 21:52:01, 11] lib/smbldap.c:smbldap_open(822)
  smbldap_open: already connected to the LDAP server
[2004/01/22 21:52:01, 3] sam/idmap_ldap.c:ldap_get_id_from_sid(593)
[2004/01/22 21:52:01, 3] lib/smbldap.c:smbldap_retry_open(884)
  ldap_get_id_from_sid: Failure looking up idmap entry (Local error)
  Sleeping for 8 milliseconds before reconnecting
[2004/01/22 21:52:01, 1] nsswitch/winbindd_group.c:winbindd_getgrent(689)
  could not look up gid for group FSTAT_FIRE_UNIT_STATUS_UPDATE_GROUP
[2004/01/22 21:52:01, 10] nsswitch/winbindd_group.c:winbindd_getgrent(646)
  entry_index = 261, num_entries = 353
[2004/01/22 21:52:01, 10] sam/idmap_util.c:idmap_sid_to_gid(179)
  sid_to_gid: sid = [S-1-5-21-462329909-2506264027-1013402721-8672]
[2004/01/22 21:52:01, 10] sam/idmap_tdb.c:db_get_id_from_sid(315)
  db_get_id_from_sid
[2004/01/22 21:52:01, 10] sam/idmap_tdb.c:internal_get_id_from_sid(221)
  internal_get_id_from_sid: fetching record S-1-5-21-462329909-2506264027-
1013402721-8672 of type 0x2
[2004/01/22 21:52:01, 10] sam/idmap_tdb.c:internal_get_id_from_sid(225)
  internal_get_id_from_sid: record S-1-5-21-462329909-2506264027-1013402721-
8672 not found
[2004/01/22 21:52:01, 8] sam/idmap_ldap.c:ldap_get_id_from_sid(572)
  ldap_get_id_from_sid: S-1-5-21-462329909-2506264027-1013402721-8672 (group)
[2004/01/22 21:52:01, 10] sam/idmap_ldap.c:ldap_get_id_from_sid(583)
  ldap_get_id_from_sid: Searching for "(&(objectClass=sambaIdmapEntry)
(sambaSID=S-1-5-21-462329909-250626402
7-1013402721-8672))"
[2004/01/22 21:52:01, 11] lib/smbldap.c:smbldap_open(822)
  smbldap_open: already connected to the LDAP server
Assertion failed: ber->ber_buf == NULL, file io.c, line 508
[2004/01/22 21:52:01, 11] lib/smbldap.c:smbldap_open(822)
  smbldap_open: already connected to the LDAP server
Assertion failed: 0, file io.c, line 701
Comment 1 John Klinger 2004-01-23 14:37:15 UTC
I have a bit more info. There seems to be two parts needed to recreate this. 
The first is the idmap mappings in our OpenLDAP were modified so that a large 
majority of the SIDs map to a single unix gid. The second is that after this is 
done, the tdb caches are restarted, and samba is restarted, you have to let it 
sit idle from the last command you issued for about 10 minutes. Then, when you 
issue a command like "id -a <globaluser>", the winbind log shows the mapping 
starts, but it then fails trying to reconnect to the ldap.

1) run "getent group" and "getent passwd" to populate the ldap
2) run a script to change about 200 of the SID=>gid mappings to a SID=>19999 
mapping (or some such fixed gid).
3) stop the daemons
4) rm /samba/locks/winbind*.tdb files
5) restart the daemons
6) Do "su" commands, "id", "wbinfo", whatever you want, and it maps appropriatly
7) Wait over 10 minutes without any winbind access.
8) "id -a <globaluser>", winbind core dumps.

The core is within the OpenLDAP libraries, where it is trying to access an 
LDAP* internal buffer (ber_buf) that is not initialized.

The stack trace is as follows:

#0  0xff01f930 in _libc_kill () from /lib/libc.so.1
#1  0xfefb5964 in abort () from /lib/libc.so.1
#2  0xfefb5c08 in _assert () from /lib/libc.so.1
#3  0xff096724 in ber_get_next (sb=0x328588, len=0xffbed814, ber=0x3d03c0)
    at io.c:701
#4  0xff0c8afc in try_read1msg (ld=0x33e440, msgid=10, all=1, sb=0x328588,
    lc=0x33e3d8, result=0xffbee6a4) at result.c:445
#5  0xff0c8874 in wait4msg (ld=0x33e440, msgid=10, all=1, timeout=0x33e3d8,
    result=0xffbee6a4) at result.c:354
#6  0xff0cb030 in ldap_search_s (ld=0x33e440,
    base=0xa <Address 0xa out of bounds>, scope=1, filter=0x33e3d8 "",
    attrs=0xffbee6a4, attrsonly=144, res=0xffbee6a4) at search.c:354
#7  0x1e6c68 in smbldap_search (ldap_state=0x328498,
    base=0x3d03f8 "ou=idmap,dc=aoc,dc=afbase,dc=af,dc=net", scope=2,
    filter=0xffbedea0 "(&(objectClass=sambaIdmapEntry)(sambaSID=S-1-5-21-
462329909-2506264027-1013402721-8435))", attrs=0x3dc0c8, attrsonly=0, 
res=0xffbee6a4)
    at lib/smbldap.c:921
#8  0x206298 in ldap_get_id_from_sid (id=0xffbee7a0, id_type=0xffbee804,
    sid=0xffbee878) at sam/idmap_ldap.c:588
#9  0x202678 in idmap_get_id_from_sid (id=0xffbee7a0, id_type=0xffbee804,
    sid=0xffbee878) at sam/idmap.c:211
#10 0x203158 in idmap_sid_to_gid (sid=0xffbee878, gid=0xffbee8d0, flags=2)
    at sam/idmap_util.c:183
#11 0x490b0 in winbindd_getgrent (state=0xffbeeb60)
    at nsswitch/winbindd_group.c:686
#12 0x413b4 in process_request (state=0xffbeeb60) at nsswitch/winbindd.c:308
#13 0x41948 in winbind_process_packet (state=0xffbeeb60)
    at nsswitch/winbindd.c:432
#14 0x65ecc in do_dual_daemon () at nsswitch/winbindd_dual.c:204
#15 0x43320 in main (argc=2, argv=0xffbefca4) at nsswitch/winbindd.c:918
(gdb) up 7
#7  0x1e6c68 in smbldap_search (ldap_state=0x328498,
    base=0x3d03f8 "ou=idmap,dc=aoc,dc=afbase,dc=af,dc=net", scope=2,
    filter=0xffbedea0 "(&(objectClass=sambaIdmapEntry)(sambaSID=S-1-5-21-
462329909-2506264027-1013402721-8435))", attrs=0x3dc0c8, attrsonly=0, 
res=0xffbee6a4)
    at lib/smbldap.c:921

-------------

And the value of relavent vars is:

-------------    
  
(gdb) p *ldap_state->ldap_struct
$1 = {ld_sb = 0x328588, ld_options = {ldo_valid = 2, ldo_debug = 0,
    ldo_tm_api = 0x0, ldo_tm_net = 0x0, ldo_version = 3, ldo_deref = 0, 
    ldo_timelimit = 0, ldo_sizelimit = 0, ldo_defludp = 0x33e730, 
    ldo_defport = 389, ldo_defbase = 0x328538 "dc=aoc,dc=afbase,dc=af,dc=net", 
    ldo_defbinddn = 0x0, ldo_refhoplimit = 5, ldo_sctrls = 0x0, 
    ldo_cctrls = 0x0, 
    ldo_rebind_proc = 0x1e5c64 <rebindproc_connect_with_state>, 
    ldo_rebind_params = 0x328498, ldo_booleans = 1}, ld_lberoptions = 1, 
  ld_errno = 0, ld_error = 0x0, ld_matched = 0x0, ld_referrals = 0x0, 
  ld_msgid = 10, ld_requests = 0x40bf40, ld_responses = 0x0, 
  ld_abandoned = 0x0, ld_cache = 0x0, ld_defconn = 0x33e3d8, 
  ld_conns = 0x33e3d8, ld_selectinfo = 0x33e4c8}


(gdb) p *ldap_state->ldap_struct->ld_defconn
$2 = {lconn_sb = 0x328588, lconn_refcnt = 3, lconn_lastused = 1074885177, 
  lconn_rebind_inprogress = 0, lconn_rebind_queue = 0x0, lconn_status = 3, 
  lconn_server = 0x3400e0, lconn_ber = 0x3d03c0, lconn_next = 0x0}


(gdb) p *ldap_state->ldap_struct->ld_defconn->lconn_ber
$3 = {ber_opts = {lbo_valid = 2, lbo_options = 1, lbo_debug = 0, 
    lbo_meminuse = 0}, ber_tag = 129, ber_len = 3859045747, 
  ber_usertag = 1634558464, ber_buf = 0x0, ber_ptr = 0x3d03d0 "æ\004]samb", 
  ber_end = 0x0, ber_sos = 0x0, ber_rwptr = 0x3d03d7 ""}

----------------

OpenLDAP checks ber_len and ber_ptr to ensure they are non-zero. If so, it 
believes there is memory available and, in this case, continues into a core 
dump.

So, this appears to be an OpenLDAP problem. However, I can't see why changing 
the data values of an OpenLDAP database would cause OpenLDAP to core. The 
database should care less what the data is. So, I'm still delving into samba to 
see what could be causing the problem on that side.

----------------

Following is the end log tail from another failed winbind. I find it 
interesting that samba is trying to reopen the ldap connection only to 
say "already connected to the LDAP server". I feel this may be where the 
problem manifests.

----------------

[2004/01/23 19:12:57, 10] sam/idmap_ldap.c:ldap_get_id_from_sid(583)
  ldap_get_id_from_sid: Searching for "(&(objectClass=sambaIdmapEntry)
(sambaSID=S-1-5-21-462329909-2506264027-1013402721-8434))"
[2004/01/23 19:12:57, 11] lib/smbldap.c:smbldap_open(822)
  smbldap_open: already connected to the LDAP server
[2004/01/23 19:12:57, 3] lib/smbldap.c:smbldap_retry_open(884)
  Sleeping for 0 milliseconds before reconnecting
[2004/01/23 19:12:57, 11] lib/smbldap.c:smbldap_open(822)
  smbldap_open: already connected to the LDAP server
[2004/01/23 19:12:57, 3] sam/idmap_ldap.c:ldap_get_id_from_sid(593)
[2004/01/23 19:12:57, 3] lib/smbldap.c:smbldap_retry_open(884)
  ldap_get_id_from_sid: Failure looking up idmap entry (Decoding error)
  Sleeping for 8 milliseconds before reconnecting
[2004/01/23 19:12:57, 1] nsswitch/winbindd_group.c:winbindd_getgrent(689)
  could not look up gid for group ABMSetDataRole
[2004/01/23 19:12:57, 10] nsswitch/winbindd_group.c:winbindd_getgrent(646)
  entry_index = 24, num_entries = 354
[2004/01/23 19:12:57, 10] sam/idmap_util.c:idmap_sid_to_gid(179)
  sid_to_gid: sid = [S-1-5-21-462329909-2506264027-1013402721-8435]
[2004/01/23 19:12:57, 10] sam/idmap_tdb.c:db_get_id_from_sid(315)
  db_get_id_from_sid
[2004/01/23 19:12:57, 10] sam/idmap_tdb.c:internal_get_id_from_sid(221)
  internal_get_id_from_sid: fetching record S-1-5-21-462329909-2506264027-
1013402721-8435 of type 0x2
[2004/01/23 19:12:57, 10] sam/idmap_tdb.c:internal_get_id_from_sid(225)
  internal_get_id_from_sid: record S-1-5-21-462329909-2506264027-1013402721-
8435 not found
[2004/01/23 19:12:57, 8] sam/idmap_ldap.c:ldap_get_id_from_sid(572)
  ldap_get_id_from_sid: S-1-5-21-462329909-2506264027-1013402721-8435 (group)
[2004/01/23 19:12:57, 10] sam/idmap_ldap.c:ldap_get_id_from_sid(583)
  ldap_get_id_from_sid: Searching for "(&(objectClass=sambaIdmapEntry)
(sambaSID=S-1-5-21-462329909-2506264027-1013402721-8435))"
[2004/01/23 19:12:57, 11] lib/smbldap.c:smbldap_open(822)
  smbldap_open: already connected to the LDAP server
[2004/01/23 19:12:57, 3] lib/smbldap.c:smbldap_retry_open(884)
  Sleeping for 0 milliseconds before reconnecting
[2004/01/23 19:12:57, 11] lib/smbldap.c:smbldap_open(822)
  smbldap_open: already connected to the LDAP server
Assertion failed: 0, file io.c, line 701
[2004/01/23 19:12:57, 11] lib/smbldap.c:smbldap_open(822)
  smbldap_open: already connected to the LDAP server
Assertion failed: 0, file io.c, line 701
Comment 2 John Klinger 2004-03-12 09:48:15 UTC
Although we are not pursuing this method any longer, I thought I'd do a little
more testing using the newer ldap version 2.1.25. Following the same process as
before, I setup the environment with multiple SIDs mapping to the same GID,
truss'd winbindd, performed an "id -a <globaluser>", waited over 10 minutes, and
did another "id -a <globaluser>". Running a few times came up with two different
behaviours.

The first result was identical to the previous attempts, where, on the second id
attempt, winbindd core dumped on an Assert in openLDAP. The logs matched what is
shown in the previous comments.

The second result was a hang in winbind. The following is the tail of a level
100 debug up to when the stall occurred. The last "already connected" line shown
is the last output to the log when the process stalled.


[2004/03/12 16:00:23, 8] sam/idmap_ldap.c:ldap_get_id_from_sid(572)
  ldap_get_id_from_sid: S-1-5-21-1482476501-746137067-682003330-512 (group)
[2004/03/12 16:00:23, 10] sam/idmap_tdb.c:internal_get_id_from_sid(225)
  internal_get_id_from_sid: record S-1-5-21-1482476501-746137067-682003330-1365
not found
[2004/03/12 16:00:23, 10] sam/idmap_ldap.c:ldap_get_id_from_sid(583)
  ldap_get_id_from_sid: Searching for
"(&(objectClass=sambaIdmapEntry)(sambaSID=S-1-5-21-1482476501-746137067-682003330-512))"
[2004/03/12 16:00:23, 8] sam/idmap_ldap.c:ldap_get_id_from_sid(572)
  ldap_get_id_from_sid: S-1-5-21-1482476501-746137067-682003330-1365 (group)
[2004/03/12 16:00:23, 11] lib/smbldap.c:smbldap_open(822)
[2004/03/12 16:00:23, 10] sam/idmap_ldap.c:ldap_get_id_from_sid(583)
  smbldap_open: already connected to the LDAP server
  ldap_get_id_from_sid: Searching for
"(&(objectClass=sambaIdmapEntry)(sambaSID=S-1-5-21-1482476501-746137067-682003330-1365))"
[2004/03/12 16:00:23, 11] lib/smbldap.c:smbldap_open(822)
  smbldap_open: already connected to the LDAP server


Here is the tail of the truss when the stall occurred:

3876:   write(3, " [ 2 0 0 4 / 0 3 / 1 2  ".., 58)      = 58
3876:   getuid()                                        = 0 [0]
3876:   write(3, "     s m b l d a p _ o p".., 53)      = 53
3876:   time()                                          = 1079108128
3876:   write(7, " 081C9020202B9 c81C204 &".., 204)     = 204
3878:   poll(0xFFBECE60, 1, -1)                         = 1 
3878:   read(7, " 081EB020202B9 d", 8)                  = 8
3878:   read(7, "81E404 \ s a m b a S I D".., 230)      = 230
3878:   poll(0xFFBECE60, 1, -1)                         = 1
3878:   read(7, " 0\r020202B9 e07", 8)                  = 8
3878:   read(7, "\n01\004\004\0", 7)                    = 7 
3876:   poll(0xFFBED8C8, 1, -1)                         = 1
3876:   read(7, 0x0029053F, 8)          (sleeping...)   
3878:   poll(0xFFBECE60, 1, -1)         (sleeping...)   


Doing a "pkill winbindd" using the default SIGTERM awakened winbindd, and it
continued processing the id request. Unfortunately, I formally killed it with a
SIGKILL before I noticed the SIGTERM had revived it, so I don't know if it truly
recovered.

I performed another run that stalled at the same place. This time I did a pkill
-HUP winbindd, and winbindd began running again, only to die at another assert.
The following are the log tails for this final run.


[2004/03/12 17:32:26, 10] nsswitch/winbindd_group.c:winbindd_getgrent(648)
  entry_index = 30, num_entries = 382
[2004/03/12 17:32:26, 10] sam/idmap_util.c:idmap_sid_to_gid(179)
  sid_to_gid: sid = [S-1-5-21-1482476501-746137067-682003330-1136]
[2004/03/12 17:32:26, 10] sam/idmap_tdb.c:db_get_id_from_sid(315)
  db_get_id_from_sid
[2004/03/12 17:32:26, 10] sam/idmap_tdb.c:internal_get_id_from_sid(221)
  internal_get_id_from_sid: fetching record
S-1-5-21-1482476501-746137067-682003330-1136 of type 0x2
[2004/03/12 17:32:26, 10] sam/idmap_tdb.c:internal_get_id_from_sid(225)
  internal_get_id_from_sid: record S-1-5-21-1482476501-746137067-682003330-1136
not found
[2004/03/12 17:32:26, 8] sam/idmap_ldap.c:ldap_get_id_from_sid(572)
  ldap_get_id_from_sid: S-1-5-21-1482476501-746137067-682003330-1136 (group)
[2004/03/12 17:32:26, 10] sam/idmap_ldap.c:ldap_get_id_from_sid(583)
  ldap_get_id_from_sid: Searching for
"(&(objectClass=sambaIdmapEntry)(sambaSID=S-1-5-21-1482476501-746137067-682003330-1136))"
[2004/03/12 17:32:26, 11] lib/smbldap.c:smbldap_open(822)
  smbldap_open: already connected to the LDAP server
Assertion failed: ber->ber_buf == NULL, file io.c, line 508


4033:   write(3, " [ 2 0 0 4 / 0 3 / 1 2  ".., 58)      = 58
4033:   getuid()                                        = 0 [0]
4033:   write(3, "     s m b l d a p _ o p".., 53)      = 53
4033:   time()                                          = 1079112746
4033:   write(7, " 081C80201 # c81C204 & o".., 203)     = 203
4033:   poll(0xFFBECE60, 1, -1)                         = 1
4033:   write(2, " A s s e r t i o n   f a".., 60)      = 60
4033:   sigaction(SIGABRT, 0x00000000, 0xFFBED288)      = 0
4033:   llseek(0, 0, SEEK_CUR)                          = 0
4033:   sigaction(SIGABRT, 0xFFBED158, 0xFFBED1D8)      = 0
4033:   sigprocmask(SIG_UNBLOCK, 0xFFBED208, 0x00000000) = 0
4033:   getpid()                                        = 4033 [4031]
4033:   kill(4033, SIGABRT)                             = 0
4033:       Received signal #6, SIGABRT [default]
4033:         siginfo: SIGABRT pid=4033 uid=0
4033:           *** process killed ***
4031:           *** process killed ***
Comment 3 Gerald (Jerry) Carter 2004-10-29 07:52:19 UTC
should be fixed in 3.0.8
Comment 4 Gerald (Jerry) Carter 2005-08-24 10:16:18 UTC
sorry for the same, cleaning up the database to prevent unecessary reopens of bugs.