The Samba-Bugzilla – Bug 11232
memory leak in the ldap process
Last modified: 2015-07-31 20:49:32 UTC
We get frequent OOM kills of Samba's LDAP server process on our new AD server.
The version is samba-4.1.17, the system is RHEL 7. Build options are the same as in the Fedora package, but with the DC and built-in Kerberos enabled. The system has 8G RAM and 1G swap.
Today I managed to `gdb -p` the process in time and thus have a core dump (8G, compressed to 1G), but I have no idea if that could be of any use to you. Please advise on how to debug this.
Oh, I forgot to add: this is for the National Library of Technology, Prague, Czech Republic. We have about 60k user accounts and the database is approximately 1G in size.
If you can reproduce on git master, then (finally!) we have the ability to report the talloc pool usage with:
smbcontrol $PID pool-usage
That would be the single fastest way to find and fix this.
Without being able to do that, the other option is to get into the big process with gdb, and run talloc_report_full().
This link should give you some suggestions on how to run that (but smbcontrol is much easier, if you can):
(In reply to Andrew Bartlett from comment #2)
I've tried several times, but I would literally have to sit in front of htop the whole day to catch this one in time. It exhausts main memory in something like 30 seconds and fills the 1G swap soon after.
Would it be possible to hack up a very simple tripwire that would check memory usage on every allocation and abort() with a full dump when it hits ~7G? I would then attach gdb in screen and leave it running.
(I've tried just that via gdb + setrlimit now. Let's see where that gets us.)
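A rough sketch of such a tripwire, done from outside the process rather than on every allocation: it polls VmRSS in /proc/<pid>/status and sends SIGABRT once a threshold is crossed. This is polling, not a per-allocation hook, and the function names (`parse_vmrss_kib`, `over_limit`, `watch`) are made up for illustration. Whether the abort actually leaves a usable core depends on the core size limit and on how the process handles SIGABRT, so treat this as an assumption to verify.

```python
import os
import signal
import time


def parse_vmrss_kib(status_text):
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line looks like: "VmRSS:   7500000 kB"
            return int(line.split()[1])
    return None


def over_limit(status_text, limit_kib):
    """True if the resident set size has reached limit_kib."""
    rss = parse_vmrss_kib(status_text)
    return rss is not None and rss >= limit_kib


def watch(pid, limit_kib, interval=1.0):
    """Poll the process; send SIGABRT once RSS crosses the limit.

    Returns True if the signal was sent, False if the process went away.
    """
    path = "/proc/%d/status" % pid
    while True:
        try:
            with open(path) as f:
                text = f.read()
        except (FileNotFoundError, ProcessLookupError):
            return False  # process has exited
        if over_limit(text, limit_kib):
            os.kill(pid, signal.SIGABRT)
            return True
        time.sleep(interval)
```

For a ~7G threshold this would be called as `watch(pid, 7 * 1024 * 1024)`, left running in screen. A one-second poll should be enough given that memory fills in roughly 30 seconds.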
I have failed to reproduce it with the setrlimit method mentioned above. Instead, the process slowly ground to a crawl. In any case, at the point I had to restart it, the memory situation looked something like this:
full talloc report: 1M in 27042 blocks
actual heap usage: 4G
mmapped files: ~1G
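For anyone wanting to get the same breakdown on a live process, here is a minimal sketch that sums resident memory per mapping type from /proc/<pid>/smaps. The categories and the function names (`classify`, `rss_by_kind`) are my own choice, not anything Samba provides. Note that glibc malloc can also serve allocations from anonymous mmap regions, so "heap" plus "anon" is the closer estimate of malloc usage.

```python
import re
from collections import defaultdict

# A mapping header starts with an address range such as "00400000-00452000 ".
HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s")


def classify(header_line):
    """Classify a mapping header line as heap, file, anon or other."""
    parts = header_line.split(None, 5)
    name = parts[5].strip() if len(parts) > 5 else ""
    if name == "[heap]":
        return "heap"
    if name.startswith("/"):
        return "file"   # file-backed mapping (binaries, libraries, databases)
    if name == "":
        return "anon"   # anonymous mmap
    return "other"      # [stack], [vdso], ...


def rss_by_kind(smaps_text):
    """Sum the Rss fields (in KiB) of smaps text, grouped by mapping kind."""
    totals = defaultdict(int)
    kind = None
    for line in smaps_text.splitlines():
        if HEADER.match(line):
            kind = classify(line)
        elif line.startswith("Rss:") and kind is not None:
            totals[kind] += int(line.split()[1])
    return dict(totals)
```

It would be fed the contents of /proc/<pid>/smaps, e.g. `rss_by_kind(open("/proc/%d/smaps" % pid).read())`, and the result compared against the total from the talloc report.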
Just touching base on this again. If we have the core dump, could we run the talloc report on that, re-animated under gdb? (Well away from your production server, but with the same binaries)
It seems to me that the memory may not be allocated with talloc, it might be something else (eg malloc memory).
Are you able to get a network trace when it blows up, are there any other clues as to what triggers this?
(In reply to Andrew Bartlett from comment #5)
> Just touching base on this again. If we have the core dump, could we run the talloc report on that, re-animated under gdb?
AFAIK you can't really do that.
> It seems to me that the memory may not be allocated with talloc, it might be something else (eg malloc memory).
Well, since talloc report clearly states *much* lower usage than what is observed, that would be my first guess too.
> Are you able to get a network trace when it blows up, are there any other clues as to what triggers this?
Well, we use LSC to sync users from our primary LDAP to Samba DC for the library terminal solution. That's somewhere around 65k accounts, but we do restrict that to changed ones only, which yields around 1k daily. It seems that samba simply slowly accumulates leaked memory and then dies.
Since then, we mitigated the problem by not syncing all accounts, but only around 13k active ones, which is not ideal, but it basically fixed the problem for us.
For debugging / testing I would suggest just creating a script that creates a lot of Samba users via LDAP and then updates them in cycles until Samba dies. Alas, I don't have the time necessary to write it myself. You could also use valgrind that way.
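A possible starting point for such a script, as a sketch only: it generates LDIF that adds a batch of users and then rewrites an attribute on each of them for a number of cycles. The output could then be fed to an LDIF client such as ldapmodify against a test DC. The base DN, the `loadtest` account names, the use of `description` as the attribute to churn, and the minimal attribute set are all assumptions for illustration; a real AD schema may require more.

```python
def add_record(i, base_dn):
    """LDIF record that adds one hypothetical test user."""
    name = "loadtest%05d" % i
    return "\n".join([
        "dn: CN=%s,%s" % (name, base_dn),
        "changetype: add",
        "objectClass: user",
        "sAMAccountName: %s" % name,
        "",
    ])


def modify_record(i, cycle, base_dn):
    """LDIF record that replaces the description of one test user."""
    name = "loadtest%05d" % i
    return "\n".join([
        "dn: CN=%s,%s" % (name, base_dn),
        "changetype: modify",
        "replace: description",
        "description: update cycle %d" % cycle,
        "-",
        "",
    ])


def build_ldif(count, cycles, base_dn):
    """Return LDIF text: `count` adds followed by `cycles` rounds of updates."""
    records = [add_record(i, base_dn) for i in range(count)]
    for cycle in range(1, cycles + 1):
        records.extend(modify_record(i, cycle, base_dn) for i in range(count))
    # Records are separated by a blank line, as LDIF requires.
    return "\n".join(records)
```

For example, `build_ldif(60000, 10, "CN=Users,DC=example,DC=com")` would roughly mimic the account count mentioned above, with ten rounds of updates. Watching the server's memory while the updates are applied should show whether the leak tracks the number of modifications.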