Bug 4917 - High winbind memory usage
Summary: High winbind memory usage
Status: CLOSED FIXED
Alias: None
Product: Samba 3.0
Classification: Unclassified
Component: winbind (show other bugs)
Version: 3.0.25c
Hardware: Other Linux
: P3 normal
Target Milestone: none
Assignee: Samba Bugzilla Account
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-08-23 15:00 UTC by Patrick Rynhart
Modified: 2007-09-10 03:24 UTC (History)
1 user (show)

See Also:


Attachments
proposed patch (938 bytes, patch)
2007-08-24 06:27 UTC, Volker Lendecke
no flags Details
smbcontrol output (333.27 KB, application/x-gzip)
2007-08-31 03:40 UTC, Patrick Rynhart
no flags Details
Next memleak patch (687 bytes, patch)
2007-08-31 04:42 UTC, Volker Lendecke
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Patrick Rynhart 2007-08-23 15:00:02 UTC
I'm running a Samba 3.0.25c domain comprised of a PDC, BDC, W2K3 member servers and XP clients (~200) in a one way NT trust relationship with a W2K3 domain.  Staff log on to the Windows domain interactively while Students log on to the Samba domain.

On both DCs, one of the winbind processes becomes very large in terms of memory usage which eventually leads to physical and virtual memory exhaustion.

Example: 

# ps -elfy | grep winbindd

S root      4189     1  0  75   0  4368  2789 -      Aug23 ?        00:02:33 /usr/sbin/winbindd -D
S root      4191  4189  0  75   0  3608  2614 429496 Aug23 ?        00:00:00 /usr/sbin/winbindd -D
S root      4194  4189  0  75   0 427288 108442 429496 Aug23 ?      00:04:00 /usr/sbin/winbindd -D
S root      4195  4189  0  76   0  2644  2355 429496 Aug23 ?        00:00:00 /usr/sbin/winbindd -D
S root      4222  4189  0  75   0  3356  2439 429496 Aug23 ?        00:00:01 /usr/sbin/winbindd -D
S root      4380  4189  0  75   0  1916  1928 429496 Aug23 ?        00:00:00 /usr/sbin/winbindd -D

i.e. one of the processes is currently using over 400 MB of memory.

The memory usage is approximately constant overnight (possibly because Staff are not logging on ?) but then grows during the day.  I have

winbind enum users = no
winbind enum groups = no

set in smb.conf

I should also note that this doesn't seem to be related to the number of smbd processes.  I have a deadtime set, and currently there are only 7 smbd processes.

I can't spot anything related to this in the logs, and the domain is otherwise running fine.

Regards,

Patrick
Comment 1 Volker Lendecke 2007-08-23 15:08:17 UTC
Can you see which type of winbind this is? Most easily figured out by doing a

ls -l /proc/4194/fd

There you can see which logfile it has open. I'm curious if this is maybe the idmap child, then it should have some logfile like 'log.winbind-imap' or so open.

Volker
Comment 2 Patrick Rynhart 2007-08-23 15:59:10 UTC
Hi Volker,

Thank you for your reply.  The process 4194 is indeed the idmap child:

ist-dc1:/var/log/samba# ls -l /proc/4194/fd
total 25
lrwx------ 1 root root 64 2007-08-24 08:51 0 -> /dev/null
lrwx------ 1 root root 64 2007-08-24 08:51 1 -> /dev/null
lrwx------ 1 root root 64 2007-08-24 08:51 10 -> /usr/local/samba/var/locks/netsamlogon_cache.tdb
lrwx------ 1 root root 64 2007-08-24 08:51 11 -> socket:[31157]
lrwx------ 1 root root 64 2007-08-24 08:51 12 -> socket:[31159]
lr-x------ 1 root root 64 2007-08-24 08:51 13 -> pipe:[31164]
lrwx------ 1 root root 64 2007-08-24 08:51 14 -> socket:[31162]
l-wx------ 1 root root 64 2007-08-24 08:51 15 -> pipe:[31164]
lrwx------ 1 root root 64 2007-08-24 08:51 16 -> /usr/local/samba/var/locks/winbindd_cache.tdb
lrwx------ 1 root root 64 2007-08-24 08:51 17 -> socket:[31194]
l-wx------ 1 root root 64 2007-08-24 08:51 18 -> /usr/local/samba/var/log.winbindd-idmap
lrwx------ 1 root root 64 2007-08-24 08:51 19 -> socket:[31193]
l-wx------ 1 root root 64 2007-08-24 08:51 2 -> /usr/local/samba/var/log.winbindd-idmap
l-wx------ 1 root root 64 2007-08-24 08:51 20 -> pipe:[31196]
lrwx------ 1 root root 64 2007-08-24 08:51 21 -> socket:[31199]
lrwx------ 1 root root 64 2007-08-24 08:51 22 -> socket:[31201]
lrwx------ 1 root root 64 2007-08-24 08:51 23 -> socket:[31203]
lrwx------ 1 root root 64 2007-08-24 08:51 24 -> /usr/local/samba/var/locks/login_cache.tdb
lr-x------ 1 root root 64 2007-08-24 08:51 3 -> pipe:[31196]
lrwx------ 1 root root 64 2007-08-24 08:51 4 -> /usr/local/samba/private/secrets.tdb
lr-x------ 1 root root 64 2007-08-24 08:51 5 -> /dev/urandom
lrwx------ 1 root root 64 2007-08-24 08:51 6 -> /usr/local/samba/var/locks/gencache.tdb
lrwx------ 1 root root 64 2007-08-24 08:51 7 -> /usr/local/samba/var/locks/idmap_cache.tdb
l-wx------ 1 root root 64 2007-08-24 08:51 8 -> /usr/local/samba/var/locks/winbindd.pid
lrwx------ 1 root root 64 2007-08-24 08:51 9 -> /usr/local/samba/var/locks/messages.tdb

Best Regards,

Patrick
Comment 3 Volker Lendecke 2007-08-23 16:09:09 UTC
Ok, thanks. It's the idmap child. Do you have any special idmap options set?

Volker
Comment 4 Patrick Rynhart 2007-08-23 16:35:09 UTC
Hi Volker,

The idmap configuration is standard, i.e.,

ist-dc1:/var/log/samba# grep idmap /etc/samba/smb.conf

ldap idmap suffix = ou=idmap
idmap backend = ldap:ldap://127.0.0.1
idmap uid = 10000-29000
idmap gid = 10000-29000

Regards,

Patrick


Comment 5 Volker Lendecke 2007-08-24 06:27:16 UTC
Created attachment 2880 [details]
proposed patch

This was checked in with r24649.

Thanks!

Volker
Comment 6 Volker Lendecke 2007-08-24 06:28:38 UTC
I'm closing this, because the attached patch fixes a severe memory leak for me. I'm certainly not 100% sure it fixes all, so please re-open if you still get high memory usage.

Volker
Comment 7 Patrick Rynhart 2007-08-30 17:14:54 UTC
Hi Volker,

Thank you for looking into this bug so promptly.  I have applied your patch to our 3.0.25c domain controllers which has significantly decreased the memory leak of the winbind idmap child process.

However, I am still seeing a steady memory leak of approximately 20 MB per day (per domain controller) in our production environment.  After 4 working days:

# ps -elfy | grep winbind
S root     10613     1  0  75   0  2660  2222 -      Aug25 ?        00:02:32 /usr/sbin/winbindd -D
S root     10615 10613  0  78   0  3552  2845 429496 Aug25 ?        00:00:00 /usr/sbin/winbindd -D
S root     10618 10613  0  75   0 85424 23222 429496 Aug25 ?        00:04:12 /usr/sbin/winbindd -D
S root     10619 10613  0  82   0   936  1999 429496 Aug25 ?        00:00:00 /usr/sbin/winbindd -D
S root     10622 10613  0  75   0  3356  2668 429496 Aug25 ?        00:00:06 /usr/sbin/winbindd -D
S root     10933 10613  0  75   0  1880  2147 429496 Aug25 ?        00:00:00 /usr/sbin/winbindd -D

Please let me know if you require any further information.

Best Regards,

Patrick

Comment 8 Volker Lendecke 2007-08-31 00:22:50 UTC
Can you do a "smbcontrol 10618 pool-usage" and upload the output?

Thanks,

Volker
Comment 9 Patrick Rynhart 2007-08-31 01:44:56 UTC
Hi Volker,

When I run that command, I get:

ist-dc2:~# smbcontrol 10618 pool-usage
No replies received

Regards,

Patrick
Comment 10 Volker Lendecke 2007-08-31 01:53:13 UTC
Hmmm. Then that # is not the winbind child anymore... Or it wants to send so much that the smbcontrol timeout is not enough. Do you see a spike in CPU usage of that winbind process you're sending that smbcontrol to? If so, it might help to see that output after it ran for some hours, not with 80 megs.

For a sample output you might send that control to other winbind processes.

Volker
Comment 11 Patrick Rynhart 2007-08-31 02:21:50 UTC
Hi Volker,

Thanks for your reply.  The process 10618 is the idmap child:

# ls -l /proc/10618/fd | grep idmap
l-wx------ 1 root root 64 2007-09-01 06:46 18 -> /usr/local/samba/var/log.winbindd-idmap
l-wx------ 1 root root 64 2007-09-01 06:46 2 -> /usr/local/samba/var/log.winbindd-idmap
lrwx------ 1 root root 64 2007-09-01 06:46 7 -> /usr/local/samba/var/locks/idmap_cache.tdb

When I run "smbcontrol 10618 pool-usage" I do see a spike in CPU usage which continues after the message "No replies received" is displayed (i.e. the smbcontrol timeout occurs before the RPC call completes).

Interestingly, the smbcontrol command seems aggravate the leak:

# ps -elfy | grep winbind | grep 10618
S root     10618 10613  0  76   0 113592 308608 429496 Aug25 ?      00:05:39 /usr/sbin/winbindd -D

I will restart Samba services and post the output of the smbcontrol command on the new idmap child when the total memory usage is smaller (would around 10 - 20 MB be good?).

Thank you again for your help.

Regards,

Patrick

Comment 12 Volker Lendecke 2007-08-31 02:24:13 UTC
Anything that shows some unusal pattern where some things are repeated a lot would be fine. It might even be after an hour or so, depending on the activity.

Volker
Comment 13 Patrick Rynhart 2007-08-31 02:37:29 UTC
Hi again,

Just a thought - Could the timeout that smbcontrol is using be changed to say 300 seconds ?  I could recompile and run the new smbcontrol binary (i.e "make" but no "make install") against process 10618 (i.e. no restart) - which would allow us to get some data (although I have no idea where the timeout that smbcontrol is using is in the Samba source!)

Otherwise, I can go with the original plan (i.e. restart Samba and poll when the idmap child winbind process is smaller).

Regards,

Patrick

Comment 14 Volker Lendecke 2007-08-31 02:40:34 UTC
Sure, but this would be overkill. Looking at the source code of smbcontrol I found that it has a "-t <seconds>" parameter. Guess what -t stands for :-))

Volker
Comment 15 Patrick Rynhart 2007-08-31 03:40:33 UTC
Created attachment 2907 [details]
smbcontrol output

smbcontrol output
Comment 16 Patrick Rynhart 2007-08-31 03:41:06 UTC
Bravo!  Thanks Volker. 

The output of the smbcontrol command is 1.1 million lines long!  I have therefore decided to create an attachment, rather than paste it below :-)

Cheers,

Patrick
Comment 17 Volker Lendecke 2007-08-31 04:42:00 UTC
Created attachment 2908 [details]
Next memleak patch

Attached find a patch that should fix this new memleak. I'm leaving this bug open this time, maybe there's more :-)

Volker
Comment 18 Patrick Rynhart 2007-08-31 08:17:37 UTC
Hi Volker,

Brilliant!  Thank you for this.  I've applied the patch to our DCs and I will keep an eye on the memory usage of winbind over the next couple of days.  I'll hopefully then be back to close this bug ;-)

Note that group 1000, the repeated line in the output of smbcontrol, was defined on our system, but in /etc/group rather than within LDAP.

Regards,

Patrick
Comment 19 Patrick Rynhart 2007-09-09 15:50:32 UTC
Dear Volker,

Your patches have resolved this bug.  Thank you very much for your prompt & helpful assistance!

Regards,

Patrick
Comment 20 Volker Lendecke 2007-09-10 03:24:54 UTC
Thanks for the feedback. As Jerry says in his announcements: Our Code, Our Bugs, Our Responsibility.

Volker