We have a customer who hit a winbind cache crash in enum_local_groups() when running wbinfo -g on 3.0.22. I checked the latest release, samba-3.0.25b; the function has not changed at all, so it probably has the same issue.

931     status = domain->backend->enum_local_groups(domain, mem_ctx, num_entries, info);
932
933     /* and save it */
934     refresh_sequence_number(domain, False);
935     centry = centry_start(domain, status);
936     if (!centry)
937             goto skip_save;
938     centry_put_uint32(centry, *num_entries);
939     for (i=0; i<(*num_entries); i++) {
940             centry_put_string(centry, (*info)[i].acct_name);
941             centry_put_string(centry, (*info)[i].acct_desc);
942             centry_put_uint32(centry, (*info)[i].rid);      <====
943     }
944     centry_end(centry, "GL/%s/local", domain->name);
945     centry_free(centry);
946
947 skip_save:
948     return status;
949 }

Here is the backtrace.

Core was generated by `winbindd'.
Program terminated with signal 6, Aborted.
SI_UNKNOWN - signal of unknown origin
#0  0xc020cdc0 in kill+0x10 () from /usr/lib/libc.2
(gdb) bt
#0  0xc020cdc0 in kill+0x10 () from /usr/lib/libc.2
#1  0xc01a72e4 in raise+0x24 () from /usr/lib/libc.2
#2  0xc01e8638 in abort_C+0x160 () from /usr/lib/libc.2
#3  0xc01e8694 in abort+0x1c () from /usr/lib/libc.2
#4  0xca9e4 in smb_panic2 () at lib/util.c:1620
#5  0xca764 in smb_panic () at lib/util.c:1506
#6  0xae74c in fault_report () at lib/fault.c:42
#7  0xae7dc in sig_fault () at lib/fault.c:65
#8  <signal handler called>
#9  0x4ddd8 in enum_local_groups () at nsswitch/winbindd_cache.c:942
#10 0x45cbc in get_sam_group_entries () at nsswitch/winbindd_group.c:525
#11 0x4698c in winbindd_list_groups () at nsswitch/winbindd_group.c:837
#12 0x3fb3c in process_request () at nsswitch/winbindd.c:329
#13 0x40898 in request_recv () at nsswitch/winbindd.c:615
#14 0x40770 in request_main_recv () at nsswitch/winbindd.c:576
#15 0x3feb0 in rw_callback () at nsswitch/winbindd.c:408
#16 0x40ea4 in process_loop () at nsswitch/winbindd.c:829
#17 0x4185c in main () at nsswitch/winbindd.c:1077

smb.conf:

security = domain
idmap uid = 10000-3000000
idmap gid = 10000-3000000
idmap backend = rid:MYDOM=10000-3000000
winbind enum groups = Yes
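For context, the centry_put_* helpers serialize each field into a single growing buffer, which centry_end() later writes out as one tdb record. A simplified sketch of that pattern (not the exact Samba source; names and layout are abbreviated):

/* Simplified sketch of the winbind cache serialization pattern; not the
 * exact Samba code. Every centry_put_* call appends to one in-memory
 * buffer that eventually becomes a single tdb record. */
struct cache_entry_sketch {
        unsigned char *data;   /* growing serialization buffer */
        unsigned int len;      /* allocated size */
        unsigned int ofs;      /* bytes written so far */
};

static void centry_expand_sketch(struct cache_entry_sketch *c, unsigned int need)
{
        unsigned char *p;
        if (c->len - c->ofs >= need)
                return;
        c->len *= 2;
        if (c->len < c->ofs + need)
                c->len = c->ofs + need;
        p = (unsigned char *)realloc(c->data, c->len);
        if (!p) {
                abort();   /* the real code panics via smb_panic() here */
        }
        c->data = p;
}

static void centry_put_uint32_sketch(struct cache_entry_sketch *c, unsigned int v)
{
        centry_expand_sketch(c, 4);
        c->data[c->ofs++] = v & 0xff;
        c->data[c->ofs++] = (v >> 8) & 0xff;
        c->data[c->ofs++] = (v >> 16) & 0xff;
        c->data[c->ofs++] = (v >> 24) & 0xff;
}

Because every group's name, description and rid are appended to this one buffer, a domain with tens of thousands of groups produces a single very large GL/<domain> record.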
I forgot to mention: this happened with a very large number of accounts on the domain controller, for example 18000 groups and 100000 users in the DC. Winbind saves the whole user/group list under a single UL or GL cache entry in the cache tdb. I feel this is not a good design, because it can easily lead to winbindd_cache.tdb crashes. Secondly, even if we disable the winbind cache the problem still persists; it seems "winbindd -n" cannot fully disable cache usage. We are hoping there is a way, such as an option in smb.conf, to fully disable the winbind cache, or to not save the user/group list at all.
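To get a feel for how big one GL entry can become with this single-record design, here is a rough back-of-the-envelope estimate; the per-group name/description lengths and the one-byte string length prefix are assumptions, not measured values:

#include <stdio.h>

/* Rough estimate of one serialized GL/<domain> cache entry.
 * Average string lengths are assumed; real values vary per domain. */
int main(void)
{
        unsigned long num_groups = 30000;  /* primary + trusted domains */
        unsigned long avg_name   = 24;     /* assumed acct_name length */
        unsigned long avg_desc   = 80;     /* assumed acct_desc length */
        /* per group: length byte + name, length byte + desc, 4-byte rid */
        unsigned long per_group  = (1 + avg_name) + (1 + avg_desc) + 4;
        unsigned long total      = 4 /* num_entries */ + num_groups * per_group;

        printf("estimated GL entry size: %lu bytes (~%.1f MB)\n",
               total, total / (1024.0 * 1024.0));
        return 0;
}

Even if that stays well under any allocation limit, a multi-megabyte single record per domain makes the cache tdb grow quickly.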
Looks like this might be a corrupt cache file. Can you try using 3.0.25c, deleting the cache file and re-testing ? Jeremy.
Hi Jeremy, unfortunately I couldn't verify 3.0.25c. I took a quick look at the code path: the only small change is that centry_expand() now calls SMB_REALLOC_ARRAY() instead of SMB_REALLOC(); there are no other changes. It still seems to have the issue when saving a large entry (either UL or GL) in the winbind cache. Thanks.
Created attachment 2884 [details] Last part of log.winbindd.
Yeah, it doesn't look like there's anything wrong with the code. Check the tdb cache file. Does it reoccur if you delete the tdb cache file? Jeremy.
Created attachment 2885 [details] winbindd_cache.tdb.gz

I used tdbbackup to check winbindd_cache.tdb and it appears to be OK. It only has the 3 records below, but it is 11182080 bytes in size. Let me upload the tdb.

key(7) = "DR/7979"
key(12) = "GL/EU/domain"
key(10) = "SEQNUM/EU\00"
> Does it reoccur if you delete the tdb cache file?

Yes. Even after clearing all the tdbs, the issue reoccurred.
Ah, I think you're running into this:

#define MAX_TALLOC_SIZE 0x10000000

	if (count >= MAX_TALLOC_SIZE/el_size) {
		return NULL;
	}

It looks like the size of the entry is larger than 0x10000000. Can you confirm? Jeremy.
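For reference, that check rejects an array allocation once count * el_size would reach 0x10000000 (256 MB). A toy illustration of the guard's arithmetic; the element size used here is an assumption, not taken from the actual build:

#include <stdio.h>

#define MAX_TALLOC_SIZE 0x10000000   /* 256 MB, matching the check quoted above */

int main(void)
{
        /* Assumed element size, purely for illustration. */
        unsigned long el_size = 516;
        unsigned long limit   = MAX_TALLOC_SIZE / el_size;
        unsigned long count   = 18000;   /* groups reported in this bug */

        printf("max elements of %lu bytes: %lu\n", el_size, limit);
        printf("%lu elements would be %s by the check\n",
               count, count >= limit ? "rejected" : "allowed");
        return 0;
}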
How can I confirm the size of the GL entry? I know the winbindd_cache.tdb size is 11182080 bytes, containing 3 entries (DR, GL, SEQNUM). 0x10000000 is a much bigger number, about 268435456 in decimal, so the large entry should be well within the limit. What can I do to help with this?
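One way to measure the raw size of the GL record would be to fetch it with the standalone tdb API and print its dsize. A small sketch; the key name comes from the tdbbackup listing above, while the cache file path is an assumption and depends on the build:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <tdb.h>

int main(void)
{
        TDB_CONTEXT *tdb;
        TDB_DATA key, data;
        char *keystr = "GL/EU/domain";   /* key seen in the dump above */

        /* Path is an assumption; use the cache directory of your install. */
        tdb = tdb_open("/var/lib/samba/winbindd_cache.tdb", 0, TDB_DEFAULT,
                       O_RDONLY, 0);
        if (!tdb) {
                perror("tdb_open");
                return 1;
        }

        key.dptr = (void *)keystr;
        key.dsize = strlen(keystr);

        data = tdb_fetch(tdb, key);
        if (data.dptr) {
                printf("GL entry is %lu bytes\n", (unsigned long)data.dsize);
                free(data.dptr);
        } else {
                printf("GL entry not found\n");
        }

        tdb_close(tdb);
        return 0;
}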
You need to add "panic action = /bin/sleep 9999999" to the smb.conf, attach to the crashed winbindd, and get a proper backtrace with local variables. I need to know *exactly* what is causing this to go down. Jeremy.
I'm unable to reproduce this. I'm not sure whether the customer can give it a try next Monday. Thanks.
Hmmm. That's suspicious. Is the customer on a small-memory system ? If it were a generic bug with that winbindd cache tdb then you should be able to easily reproduce with that cache file. Jeremy.
Hi Jeremy,

I found two reasons why I cannot reproduce this, even when using the same smb.conf.

- I do not have their trust environment. When using the rid idmap backend we always set allow trusted domains = no. Unfortunately, rescan_trusted_domains() does not check that option, so it initializes a domain list containing all trusted domains, and when wbinfo -g is executed, winbindd_list_groups() looks up groups from all trusted domains. Their cache data showed they successfully saved more than 30000 groups from trusted domains into the GL/domain cache entry; since I don't have their trust data, I only saved about 18000 of their groups in my cache. I think we probably need a check on the allow trusted domains option, so that the initialized domain list contains only the primary domain and no trusted domains (see the sketch after this list); otherwise allow trusted domains = no is not honored.

- I noticed that both of us can successfully execute enum_dom_groups() and save the GL/domain entry into the cache. In my testing, enum_local_groups() is never called, so I have no GL/local cache entry; for them, however, it seems enum_local_groups() was running, and that is what generated the core. I'm trying to find out what drives the different result. Look at the code in get_sam_group_entries():

	/* get the domain local groups if we are a member of a native win2k
	   domain and are not using LDAP to get the groups */
	if ( ( lp_security() != SEC_ADS && domain->native_mode
		&& domain->primary) || domain->internal )

  This condition is probably what produces the different result. I'm wondering why they need the enum_local_groups() call when I didn't; they had already successfully retrieved the groups from enum_dom_groups(), so why send RPC requests again to look up groups? It seems unnecessary to me. Also, the condition above doesn't cover SEC_DOMAIN, and it assumes enum_dom_groups() always uses an LDAP query. I have no data about this condition.

Looks like we have some difference here. Could you please look at this? Thank you very much. -Ying
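A sketch of the kind of guard being suggested for the trusted-domain rescan. The surrounding function is simplified and the exact placement inside rescan_trusted_domains() is an assumption; lp_allow_trusted_domains() is the loadparm accessor for the allow trusted domains option:

/* Sketch only: skip enumerating trusted domains entirely when
 * "allow trusted domains = no" is set. Assumes Samba's internal
 * headers; the real rescan_trusted_domains() does much more work. */
static void rescan_trusted_domains_sketch(void)
{
	if (!lp_allow_trusted_domains()) {
		/* honour "allow trusted domains = no": keep only the
		 * primary domain in the winbindd domain list */
		return;
	}

	/* ... existing logic: enumerate trusted domains via the primary
	 * DC and add each one to the winbindd domain list ... */
}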
Is this still an issue with Samba 3.4.3?
No feedback for over a year. Closing.