Bug 4920 - winbind cache segmentation fault when running wbinfo -g
Summary: winbind cache segmentation fault when running wbinfo -g
Alias: None
Product: Samba 3.0
Classification: Unclassified
Component: winbind
Version: 3.0.22
Hardware: Other HP-UX
Importance: P3 regression
Target Milestone: none
Assignee: Samba Bugzilla Account
QA Contact: Samba QA Contact
Depends on:
Reported: 2007-08-24 11:01 UTC by Ying Li
Modified: 2011-03-06 09:18 UTC

Last part of log.winbindd. (6.26 KB, text/plain)
2007-08-24 15:14 UTC, Ying Li
winbindd_cache.tdb.gz (415.32 KB, application/octet-stream)
2007-08-24 15:24 UTC, Ying Li

Description Ying Li 2007-08-24 11:01:56 UTC
We have a customer whose winbindd crashes in the winbind cache code, in enum_local_groups(), when running wbinfo -g on 3.0.22. I checked the latest release, samba-3.0.25b: the function has not changed at all, so it probably has the same issue.

   931          status = domain->backend->enum_local_groups(domain, mem_ctx, num_entries, info);
   933          /* and save it */
   934          refresh_sequence_number(domain, False);
   935          centry = centry_start(domain, status);
   936          if (!centry)
   937                  goto skip_save;
   938          centry_put_uint32(centry, *num_entries);
   939          for (i=0; i<(*num_entries); i++) {
   940                  centry_put_string(centry, (*info)[i].acct_name);
   941                  centry_put_string(centry, (*info)[i].acct_desc);
   942                  centry_put_uint32(centry, (*info)[i].rid);    <====
   943          }
   944          centry_end(centry, "GL/%s/local", domain->name);
   945          centry_free(centry);
   947  skip_save:
   948          return status;
   949  }

Here is the backtrace.
Core was generated by `winbindd'.
Program terminated with signal 6, Aborted.
SI_UNKNOWN - signal of unknown origin
#0 0xc020cdc0 in kill+0x10 () from /usr/lib/libc.2
(gdb) bt
#0 0xc020cdc0 in kill+0x10 () from /usr/lib/libc.2
#1 0xc01a72e4 in raise+0x24 () from /usr/lib/libc.2
#2 0xc01e8638 in abort_C+0x160 () from /usr/lib/libc.2
#3 0xc01e8694 in abort+0x1c () from /usr/lib/libc.2
#4 0xca9e4 in smb_panic2 () at lib/util.c:1620
#5 0xca764 in smb_panic () at lib/util.c:1506
#6 0xae74c in fault_report () at lib/fault.c:42
#7 0xae7dc in sig_fault () at lib/fault.c:65
#8 <signal handler called>
#9 0x4ddd8 in enum_local_groups () at nsswitch/winbindd_cache.c:942
#10 0x45cbc in get_sam_group_entries () at nsswitch/winbindd_group.c:525
#11 0x4698c in winbindd_list_groups () at nsswitch/winbindd_group.c:837
#12 0x3fb3c in process_request () at nsswitch/winbindd.c:329
#13 0x40898 in request_recv () at nsswitch/winbindd.c:615
#14 0x40770 in request_main_recv () at nsswitch/winbindd.c:576
#15 0x3feb0 in rw_callback () at nsswitch/winbindd.c:408
#16 0x40ea4 in process_loop () at nsswitch/winbindd.c:829
#17 0x4185c in main () at nsswitch/winbindd.c:1077

   security = domain
   idmap uid = 10000-3000000
   idmap gid = 10000-3000000
   idmap backend = rid:MYDOM=10000-3000000
   winbind enum groups = Yes
Comment 1 Ying Li 2007-08-24 13:18:37 UTC
I forgot to mention: this happened against a domain controller with a large number of accounts, for example 18000 groups or 100000 users. Winbind saves the whole user or group list (UL or GL) into the cache tdb as a single cache entry. I think this is risky, because such a large entry could easily lead to a winbindd_cache.tdb crash. Secondly, the problem persists even if we disable the winbind cache; it seems winbindd -n does not fully disable cache usage.
We are hoping for a way, such as an option in smb.conf, to fully disable the winbind cache, or to not save the user/group list at all.

Comment 2 Jeremy Allison 2007-08-24 13:23:00 UTC
Looks like this might be a corrupt cache file. Can you try using 3.0.25c, deleting the cache file and re-testing ?
Comment 3 Ying Li 2007-08-24 14:04:09 UTC
Hi Jeremy,

Unfortunately, I couldn't verify with 3.0.25c, but I took a quick look at the code path.
The only change is that centry_expand() now calls SMB_REALLOC_ARRAY() instead of SMB_REALLOC(). There are no other changes, so the issue with saving a large entry in the winbind cache, for both UL and GL, seems to still be there.

Comment 4 Ying Li 2007-08-24 15:14:17 UTC
Created attachment 2884
Last part of log.winbindd.
Comment 5 Jeremy Allison 2007-08-24 15:14:48 UTC
Yeah there doesn't look like there's anything wrong with the code. Check the tdb cache file. Does it reoccur if you delete the tdb cache file ?
Comment 6 Ying Li 2007-08-24 15:24:26 UTC
Created attachment 2885

I used tdbbackup to check winbindd_cache.tdb, and it appears to be OK. It has only the 3 records below, but is 11182080 bytes in size. Let me upload the tdb.
key(7) = "DR/7979"
key(12) = "GL/EU/domain"
key(10) = "SEQNUM/EU\00"
Comment 7 Ying Li 2007-08-24 15:27:48 UTC
>Does it reoccur if you delete the tdb cache file ?
Yes. Even after clearing out all the tdbs, the issue reoccurred.
Comment 8 Jeremy Allison 2007-08-24 15:44:42 UTC
Ah, I think you're running into this:

#define MAX_TALLOC_SIZE 0x10000000

        if (count >= MAX_TALLOC_SIZE/el_size) {
                return NULL;
        }

Looks like the size of the entry is larger than 0x10000000. Can you confirm ?
Comment 9 Ying Li 2007-08-24 16:33:22 UTC
How can I confirm the size of the GL entry? 

I know that winbindd_cache.tdb is 11182080 bytes in size and contains 3 entries (DR, GL, SEQNUM). 0x10000000 is a much bigger number, 268435456 in decimal, so the large entry should be within the limit.

What can I do to help with this?
Comment 10 Jeremy Allison 2007-08-24 16:45:54 UTC
You need to add "panic action = /bin/sleep 9999999" to the smb.conf, attach to the crashed winbindd and get a proper backtrace with local variables. I need to know *exactly* what is causing this to go down.
Comment 11 Ying Li 2007-08-24 23:06:49 UTC
I'm unable to reproduce it. I'm not sure whether the customer can give it a try next Monday.
Comment 12 Jeremy Allison 2007-08-25 12:32:31 UTC
Hmmm. That's suspicious. Is the customer on a small-memory system ? If it were a generic bug with that winbindd cache tdb then you should be able to easily reproduce with that cache file.
Comment 13 Ying Li 2007-08-26 09:45:15 UTC
Hi Jeremy,

I found two reasons why I cannot reproduce this, even with the same smb.conf.
- I don't have their trusted domain environment. When using the rid idmap backend, we always set allow trusted domains = no, since trusted domains and the rid backend are mutually exclusive. Unfortunately, rescan_trusted_domains() does not check this option, so it initializes a domain list containing all trusted domains. When executing wbinfo -g, winbindd_list_groups() then looks up groups from all trusted domains. Their cache data shows that they successfully saved more than 30000 groups from a trusted domain in a GL/domain cache entry. Since I don't have their trust data, I only saved about 18000 of their groups in the cache. I think we need to check the allow trusted domains option, so that the initialized domain list contains only the primary domain and no trusted domains. Otherwise allow trusted domains = no is not honored.

- I noticed that both of us can successfully execute enum_dom_groups() and save the GL/domain entry into the cache. In my testing enum_local_groups() is never called, as I end up with no GL/local cache entry. They, however, appear to run enum_local_groups(), which generated the core. I'm trying to find what drives the different result. Look at the code in get_sam_group_entries().

        /* get the domain local groups if we are a member of a native win2k domain
           and are not using LDAP to get the groups */

        if ( ( lp_security() != SEC_ADS && domain->native_mode
                && domain->primary) || domain->internal )
This condition is probably what produces the different result. I wonder why they end up calling enum_local_groups() when I don't. They have already successfully retrieved the groups with enum_dom_groups(), so why send RPC requests again to look up groups? It seems unnecessary to me. However, the condition above does not seem to account for SEC_DOMAIN, and appears to assume that enum_dom_groups() always uses an LDAP query. I have no data on how the condition evaluates in their environment, but it looks like that is where we differ.
Could you please look at this?

Thank you very much.
Comment 14 Karolin Seeger 2009-12-11 07:57:09 UTC
Is this still an issue with Samba 3.4.3?
Comment 15 Volker Lendecke 2011-03-06 09:18:30 UTC
No feedback for over a year. Closing.