Created attachment 14698 [details]
pmap output from Samba PID

OS + App info:
Running CentOS 7.5 (3.10.0-693.17.1.el7.x86_64)
Samba 4.8.6 + CTDB + Winbind built from source
Gluster 3.12.14

We're seeing very high memory usage on multiple servers across the cluster, especially when there are lots of write operations occurring on the share. We have three servers with 64GB of RAM each, handling around 120 connections in total (roughly 40 per node). Running htop or top shows that the smbd processes are using 36GB of RAM, and running pmap on multiple PIDs confirms this (attached). A few systems are running at 100% RAM usage and 100% swap usage.

Is anyone able to point us in the right direction to gather more data, or to assist with troubleshooting this potential memory leak?
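For reference, the figures above came from standard tools; something along these lines reproduces them (the PID is a placeholder, and the attached pmap output may have been gathered with slightly different options):

    # per-mapping memory breakdown for one smbd
    pmap -x <smbd-pid>
    # overall RAM and swap picture
    free -m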
Can you do a "smbcontrol <smbd-pid> pool-usage" on one of the large smbds and upload the output? This will show the core smbd memory usage.

Also, there have been gluster versions out there that are quite memory-hungry when used from many independent processes, which is how smbd uses gluster. You might want to clarify with the gluster team whether that is the case with your version.
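As a sketch, one way to pick a large smbd and capture that report (the output file name is arbitrary):

    # smbd processes sorted by resident set size, largest first
    ps -C smbd -o pid,ppid,rss,cmd --sort=-rss | head
    # dump Samba's internal talloc report for one of them
    smbcontrol <smbd-pid> pool-usage > pool-usage.txt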
Hi Volker,

Thanks for the quick reply. The result is the same for multiple PIDs, for example:

    Command: smbcontrol 8496 pool-usage
    Return code: 1
    Return output: (No output)

I've opened a bugzilla ticket with Gluster too; hopefully they will be able to shed some light on this. Please let me know if you need any more data.

Many thanks,
Ryan
How long does the smbcontrol command take? You might want to add a "-t 300" for a 5-minute timeout.
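For example (placeholder PID), the suggested timeout can be added like this:

    smbcontrol -t 300 <smbd-pid> pool-usage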
The command completes almost instantly. I've re-run it with the -t 300 argument but I get the same result (running 'echo $?' afterwards returns 1).

Best regards,
Ryan
Hmm. Strange. Is 8496 really an smbd? Can you run

    strace -ttT -f -o /tmp/control.strace smbcontrol 8496 pool-usage

and upload /tmp/control.strace?
Created attachment 14699 [details] strace 01
I've attached that to the ticket. Best, Ryan
Created attachment 14700 [details] strace02
Looking through the strace, I can see that it was adding a double // to the CTDB private directory path. This was because the 'private dir' value in smb.conf had a trailing /. I've removed this, reloaded the config, and run strace again (strace02). I'm still not getting any output from the 'smbcontrol <pid> pool-usage' command.

Best,
Ryan
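To illustrate the change (the path is the one visible in the strace; assuming the parameter in question is 'private dir' as described above):

    # before - trailing slash produced a doubled // in the socket path
    private dir = /mnt/ctdbv01/ctdb/
    # after
    private dir = /mnt/ctdbv01/ctdb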
65216 12:08:29.580634 connect(10, {sa_family=AF_LOCAL, sun_path="/mnt/ctdbv01/ctdb/msg.sock/8496"}, 110) = -1 ENOENT (No such file or directory) <0.000804>

That line says that 8496 is not an smbd, or the configuration is so vastly different between smbcontrol and smbd that the msg.sock subdirectory is located somewhere else for the smbd in question.
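A quick way to cross-check which PIDs actually have a messaging socket (directory taken from the strace line above; adjust if your private dir differs):

    # one socket per smbd, named after its PID
    ls /mnt/ctdbv01/ctdb/msg.sock/
    # list the smbd processes Samba itself knows about
    smbstatus -p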
Created attachment 14701 [details] strace03
Created attachment 14702 [details] SMB process forked with high RAM
That PID seemed to be a forked thread belonging to a main process. I've run the command on a main PID and the output looks more promising: running 'smbcontrol 175380 pool-usage' resulted in a message saying 'No replies'. I've attached a new strace from this command.

This is a live system and that node has now become unresponsive, so I've disabled CTDB on it, which has reduced the impact the issue is having on the customer. However, this means some of the processes with very high RAM usage have been killed off, as the client connections have moved to a different node.
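For what it's worth, htop lists individual threads by default, and thread IDs cannot be messaged with smbcontrol. A sketch of how to list only the real smbd processes (nlwp is the thread count per process):

    ps -C smbd -o pid,nlwp,rss,cmd

Only the PIDs in the first column should have a msg.sock entry that smbcontrol can reach.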
Please upload the stdout of the smbcontrol run that resulted in strace03.
Also, are you using a configuration with vfs_glusterfs, or do you re-export a FUSE mount?
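For clarity, the two setups look roughly like this in smb.conf (share and volume names below are made up):

    # vfs_glusterfs: smbd talks to gluster via libgfapi
    [share]
        path = /
        vfs objects = glusterfs
        glusterfs:volume = gvol0

    # re-exported FUSE mount: smbd just sees a local filesystem
    [share]
        path = /mnt/glusterfuse/share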
I was unable to run the command on that PID, as it was killed when I disabled the node via CTDB. I've run the test on another PID and the command times out after the 300-second timeout value.

We're sharing Gluster via Samba with the VFS module, not the FUSE mount. However, we do also have the volume mounted on the node via FUSE.

Many thanks,
Ryan
Ok, using a FUSE mount is probably a valid workaround to stay operational while this is being researched further with Volker. There is a Red Hat knowledgebase article that explains the necessary steps; most of it should be applicable to your setup as well, I guess.

"High memory consumption when using RHGS with Samba"
https://access.redhat.com/solutions/2969381
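The switch itself amounts to roughly the following (mount point and volume name are placeholders; the KB article above has the authoritative steps):

    # mount the volume via FUSE on each node
    mount -t glusterfs localhost:/gvol0 /mnt/glusterfuse

    # then point the share at the mount instead of at libgfapi
    [share]
        path = /mnt/glusterfuse/share
        # and drop 'vfs objects = glusterfs' plus the glusterfs:* options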
Thanks Guenther, is there any ETA on when this code base optimisation would be completed? I'll keep an eye on the customer's system and will raise a change request to move to a FUSE mount if the RAM usage gets too high. Please let me know if you need any more data to assist with progressing a solution to this issue.

Many thanks,
Ryan
Hello,

Is there any further testing we can do on this? I don't believe the issue is only down to this:

"High memory consumption when using RHGS with Samba"
https://access.redhat.com/solutions/2969381

as it mentions 500MB being used per thread, whereas I'm seeing at least 17GB of memory usage.
(In reply to ryan from comment #19)
> Is there any further testing we can do on this?
> I don't believe the issue is only to do with this:

We are still waiting for the stdout of the strace03 run. See comment 14.
Hi Volker,

The command times out after the 300-second timeout value.

Best,
Ryan
(In reply to ryan from comment #21)
> The command times out after the 300 second timeout value.

Well, we need that output. If that times out after 300 seconds and the smbd is still spinning at 100% CPU, increase the timeout. If the smbd you're investigating is not spinning CPU, the problem is somewhere else.

I think you must let someone take a look at the server directly. https://samba.org/samba/support has companies offering support with an NDA, such that you can let them log into your box.

Volker
Hi Volker,

I will try and run the command with a longer timeout value and get you some data. We are actively looking into Sernet Samba; however, in the meantime your assistance is much appreciated. I'll reply tomorrow with some more data.

Many thanks,
Ryan
Created attachment 14733 [details] PID 04: 'smbcontrol 98757 pool-usage' output
Created attachment 14734 [details] PID 04: 'strace -ttT -f -o /tmp/control.strace smbcontrol 98757 pool-usage' output
Hi Volker,

Please find attached the output of the commands you requested before. Unfortunately the PID from strace03 had died, but the PID I've run the following commands on is also having the issue:

    smbcontrol 98757 pool-usage
    strace -ttT -f -o /tmp/control.strace smbcontrol 98757 pool-usage

This time the smbcontrol command returned instantly without any timeout.

Many thanks for the assistance,
Best regards,
Ryan
Samba's own internal bookkeeping says it's using about 390 kilobytes:

    full talloc report on 'null_context' (total 390744 bytes in 4144 blocks)

Even allowing a factor of 2 for overhead, that is roughly one megabyte. I really don't think this is a core Samba issue; it's an issue with the gluster module. Yes, we do ship vfs_glusterfs as part of Samba, but the main memory consumption at least used to be outside of Samba's control.

Have you tried the FUSE workaround that Günther asked about in comment 15?
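To make the same comparison on a live process, the talloc total from pool-usage can be set against the kernel's view of that process (placeholder PID):

    smbcontrol <smbd-pid> pool-usage | head -1
    ps -o pid,rss,vsz -p <smbd-pid>

If RSS is orders of magnitude larger than the talloc total, the memory is being allocated outside Samba's talloc pools, e.g. by a library such as libgfapi.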
Hi Volker,

Good to hear the SMB core isn't the issue; hopefully that narrows down the scope of the troubleshooting. I've tried using the FUSE workaround that Günther suggested, but we're seeing worse issues with that, for which I have opened a ticket on the Red Hat bugzilla:

https://bugzilla.redhat.com/show_bug.cgi?id=1657743

Many thanks,
Ryan
Closing, as this is obviously not a core Samba issue, so there will be no further progress here. Feel free to share any news on this issue though if you like :-)