We get frequent OOM kills of Samba's LDAP server process on our new AD server.
The version is samba-4.1.17 and the system is RHEL 7. Build options are the same as in the Fedora package, but with the DC and builtin Kerberos enabled. The system has 8G RAM and 1G swap.
Today I managed to `gdb -p` the process in time and thus have a core dump (8G, compressed to 1G), but I have no idea if that could be of any use to you. Please advise on how to debug this.
Oh, I forgot to add: this is for the National Library of Technology, Prague, Czech Republic. We have about 60k user accounts and the database is approximately 1G in size.
If you can reproduce on git master, then (finally!) we have the ability to report the talloc pool usage with:
smbcontrol $PID pool-usage
That would be the single fastest way to find and fix this.
Without being able to do that, the other option is to get into the big process with gdb, and run talloc_report_full().
This link should give you some suggestions on how to run that (but smbcontrol is much easier, if you can):
(In reply to Andrew Bartlett from comment #2)
I've tried several times, but I would literally have to sit in front of htop the whole day to catch this one in time. It exhausts main memory in something like 30 seconds and fills the 1G swap soon after.
Would it be possible to hack up some very simple tripwire that would check memory usage on every allocation and abort() with full dump when it hits ~7G? I would then attach gdb in screen and leave it running.
(I've tried just that via gdb + setrlimit now. Let's see where that gets us.)
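For anyone who wants to try the tripwire idea without patching the allocator, here is a minimal sketch in Python. It is an assumption-laden substitute for the per-allocation check asked about above: it polls `VmRSS` in `/proc/<pid>/status` from the outside and sends SIGABRT once the ~7G threshold is crossed. The function names, the polling interval and the choice of SIGABRT are all illustrative. Whether a core file is actually written depends on the core size limit and on how the target process handles the signal.

```python
import os
import signal
import time

LIMIT_KB = 7 * 1024 * 1024  # ~7 GiB, the threshold suggested above


def rss_kb(status_text):
    """Return the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0  # some processes (e.g. kernel threads) have no VmRSS line


def watch(pid, limit_kb=LIMIT_KB, interval=1.0):
    """Poll the process and send SIGABRT once its RSS reaches the limit."""
    while True:
        with open(f"/proc/{pid}/status") as f:
            if rss_kb(f.read()) >= limit_kb:
                os.kill(pid, signal.SIGABRT)
                return
        time.sleep(interval)
```

Calling `watch(pid)` with the PID of the LDAP server process, from a screen session, would leave it running unattended. Since memory fills in about 30 seconds, a one-second interval should be tight enough, but that is a guess.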
I have failed to reproduce it with the setrlimit method mentioned above. Instead, it slowly ground to a crawl. In any case, at the point I had to restart it, the memory situation looked something like this:
full talloc report: 1M in 27042 blocks
actual heap usage: 4G
mmapped files: ~1G
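A breakdown like the one above can be collected from `/proc/<pid>/smaps`. The following is a rough Python sketch, not the method used for the numbers in this comment: it sums the `Rss` of every mapping and groups it into `[heap]`, file-backed and anonymous memory. The function name and the three categories are my own choice.

```python
import re
from collections import defaultdict

# A mapping header looks like:
# "01000000-01100000 rw-p 00000000 00:00 0    [heap]"
HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s+\S+\s+\S+\s+\S+\s+\S+\s*(.*)$")


def rss_by_kind(smaps_text):
    """Sum resident memory (kB) per mapping kind from /proc/<pid>/smaps text."""
    totals = defaultdict(int)
    kind = "anonymous"
    for line in smaps_text.splitlines():
        m = HEADER.match(line)
        if m:
            path = m.group(1).strip()
            if path == "[heap]":
                kind = "heap"
            elif path.startswith("/"):
                kind = "file-backed"
            else:
                kind = "anonymous"  # unnamed mappings, [stack], [vdso], ...
        elif line.startswith("Rss:"):
            totals[kind] += int(line.split()[1])
    return dict(totals)
```

Note that glibc serves large malloc() requests with mmap(), so those show up under "anonymous" rather than "heap". A large gap between these totals and the talloc report points at memory that talloc does not know about.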
Just touching base on this again. If we have the core dump, could we run the talloc report on that, re-animated under gdb? (Well away from your production server, but with the same binaries)
It seems to me that the memory may not be allocated with talloc, it might be something else (eg malloc memory).
Are you able to get a network trace when it blows up, are there any other clues as to what triggers this?
(In reply to Andrew Bartlett from comment #5)
> Just touching base on this again. If we have the core dump, could we run the talloc report on that, re-animated under gdb?
AFAIK you can't really do that.
> It seems to me that the memory may not be allocated with talloc, it might be something else (eg malloc memory).
Well, since talloc report clearly states *much* lower usage than what is observed, that would be my first guess too.
> Are you able to get a network trace when it blows up, are there any other clues as to what triggers this?
Well, we use LSC to sync users from our primary LDAP to Samba DC for the library terminal solution. That's somewhere around 65k accounts, but we do restrict that to changed ones only, which yields around 1k daily. It seems that samba simply slowly accumulates leaked memory and then dies.
Since then, we mitigated the problem by not syncing all accounts, but only around 13k active ones, which is not ideal, but it basically fixed the problem for us.
For debugging / testing I would suggest just writing a script that creates a lot of Samba users via LDAP and then updates them in cycles until Samba dies. Alas, I don't have the time necessary to write it myself. You could also run it under valgrind that way.
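As a starting point for such a script, here is a small Python sketch that only generates the LDIF, so it needs nothing beyond the standard library. Everything specific in it is an assumption: the base DN `CN=Users,DC=example,DC=com`, the `stress` name prefix, the minimal attribute set for a user, and the use of `description` as the attribute to churn. The output would be fed to an LDAP client such as ldapmodify against a test DC, never production.

```python
BASE_DN = "CN=Users,DC=example,DC=com"  # hypothetical, adjust to the test domain


def add_records(count, base_dn=BASE_DN, prefix="stress"):
    """Yield one LDIF 'add' record per test user."""
    for i in range(count):
        name = f"{prefix}{i:05d}"
        yield (
            f"dn: CN={name},{base_dn}\n"
            "changetype: add\n"
            "objectClass: user\n"
            f"sAMAccountName: {name}\n"
        )


def modify_records(count, cycle, base_dn=BASE_DN, prefix="stress"):
    """Yield one LDIF 'modify' record per test user for a given update cycle."""
    for i in range(count):
        name = f"{prefix}{i:05d}"
        yield (
            f"dn: CN={name},{base_dn}\n"
            "changetype: modify\n"
            "replace: description\n"
            f"description: update cycle {cycle}\n"
            "-\n"
        )


def to_ldif(records):
    """Join records with the blank line that LDIF requires between them."""
    return "\n".join(records)
```

One run of `to_ldif(add_records(65000))` followed by repeated `to_ldif(modify_records(1000, n))` for increasing `n` would roughly mimic the sync pattern described above (about 65k accounts, about 1k changes per day).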
(In reply to Jan Dvořák from comment #6)
There is a reasonable chance that this has improved with Samba 4.5 with a number of the other fixes we have made. In those cases it wasn't a strict memory leak, but issues with the way we handled linked attributes when we have a large number of group members.
I expect that many aspects of Samba will have improved at the 65k account scale, and we are continuing to improve that code with Samba 4.6 and beyond.
(In reply to Andrew Bartlett from comment #7)
We are currently on 4.4.4 and don't really want to use non-packaged samba. Right now we only sync about 13k accounts and restart every day at 2AM. As soon as I have some more time on my hands, I will try building a newer release and use that.
Hello! Let me tell my story... I don't know if it is related to this problem, but it resolved the problems on my setup.
I work at a public university where student accounts are not deleted. So, basically, the database never stops growing. We have a subscription with Sernet and started with Samba version 4.2. At the time of our initial deployment, we had 90k user accounts. We first started with 2 controllers (VMs) using 4 vCPUs and 4 GB RAM. As the number of workstations grew, we had to add new controllers, and so our setup grew to 5 controllers (3 sites, 1-2-2 on the sites). With our database growing, our problems started. The LDAP service was consuming all RAM and swap until it crashed.
We updated to Samba 4.4 and increased our controllers' RAM to 8GB. The problem persisted. So we added a job to restart Samba periodically: first every 2 hours, later every hour and finally every 15 minutes. LDAP still kept freezing. Our memory grew to 16 GB and later to 24GB.
With Samba 4.5, memory consumption decreased a little, but not enough. With 4.6 we changed the job to run the script every minute and restart Samba only when a controller has less than 2GB of RAM free. With that in place, we decreased RAM to 20GB and our setup was "stable".
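For readers who want a similar stopgap, a monitor job along these lines could look like the Python sketch below. It is not the script we run. It assumes "free" means the `MemAvailable` figure in `/proc/meminfo`, and the restart command is passed in by the caller because the service name depends on the packaging.

```python
import subprocess

THRESHOLD_KB = 2 * 1024 * 1024  # 2 GiB, the limit mentioned above


def available_kb(meminfo_text):
    """Return MemAvailable (in kB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    raise ValueError("no MemAvailable line found")


def should_restart(meminfo_text, threshold_kb=THRESHOLD_KB):
    """True when available memory has dropped below the threshold."""
    return available_kb(meminfo_text) < threshold_kb


def check_once(restart_cmd):
    """Run one check; restart_cmd is an argument list supplied by the caller."""
    with open("/proc/meminfo") as f:
        low = should_restart(f.read())
    if low:
        subprocess.run(restart_cmd, check=True)
    return low
```

Run from cron every minute, `check_once([...])` with the site's own restart command reproduces the behaviour described above. It treats the symptom only.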
With Samba 4.7 we did not notice any change.
When we upgraded to Samba 4.8 (off-topic: the in-place upgrade did not work), the problems escalated again. The script was unable to restart the services and Samba just kept crashing. So we started to increase the controllers' memory. We took the two busiest controllers and increased them to 32GB and later 64GB. With 64GB, memory consumption peaked at around 43GB, but Samba kept crashing. Connections were dropped, timed out and refused, even with free memory on the controllers. While one of these events was occurring, smbstatus did not give any response. So we tried smbstatus with the numeric option and THAT WAS IT!!! It responded normally. So we disabled Winbind resolution in nsswitch, restarted the Samba services and BINGO!!! Problem solved! Memory consumption went down to < 1GB. We are now running on controllers with 8GB RAM and have kept the monitor job active. Our database has ~130k user accounts and ~1.6k workstations. From time to time, at peak moments, we see the Samba services get restarted by the script, but the setup is pretty stable now.
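To check whether a controller is in the same situation, it helps to see which nsswitch databases still list winbind. A small Python sketch, assuming the usual `database: source source ...` layout of `/etc/nsswitch.conf` (the function name is made up for this example):

```python
def winbind_databases(nsswitch_text):
    """Return the nsswitch databases whose source list includes winbind."""
    found = []
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        database, sources = line.split(":", 1)
        if "winbind" in sources.split():
            found.append(database.strip())
    return found
```

Reading `/etc/nsswitch.conf` and passing its contents to `winbind_databases()` lists the entries (typically `passwd` and `group`) that would need editing to disable Winbind resolution.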
So, I do not know if this information helps you solve this problem, but I hope it does.
Thank you for your attention and this great software.
I'm going to mark this bug closed.
Samba 4.9, 4.10 and the new 4.11 due in September each have numerous bug fixes for large databases.
In particular Samba 4.11 has much improved memory behaviour for large search replies and large group memberships.
I hope you have managed to persevere with Samba in the meantime and encourage an upgrade!
We've restricted the database to ~23000 accounts and it's been mostly smooth sailing since then. The OOM incident has occurred about three times since my last message. Thanks for letting me know and thanks for your hard work as well.
(In reply to Jan Dvořák from comment #11)
I think 0559430ab6e5c48d6e853fda0d8b63f2e149015c in particular would make quite a difference in your case.
The ldap server previously had the property of consuming memory that would be left allocated in the process (and therefore unavailable to the kernel) but not enrolled in malloc() or talloc() any longer.
We changed it to only allocate one copy of the LDAP reply, being the final network packet, and to re-use the rest of memory to work on the other responses.
This will be in Samba 4.11. If it is still not resolved, do let me know!