I am using SAMBA 3.2.4-37 (installed from sernet’s samba3 rpms) on a CentOS 5 machine (uname –a: Linux filer1 2.6.18-92.1.18.el5 #1 SMP Wed Nov 12 09:19:49 EST 2008 x86_64 x86_64 x86_64 GNU/Linux) . It is in the back end of a small network of windows servers with a trivial active directory domain (no forests, single site, very few users, windows 2003 DCs, samba is a member server) and I am using it as the remote UNC storage for my websites(IIS is configured to access these files using domain users). This works well 99.9% of the time, but on occasion, one of my web servers (happens on both server, and in arbitrary times including times with very little traffic) stops communicating with the UNC shares. Tracing the network with wireshark from the web servers shows something peculiar – the web server sends a “Session Setup AndX Request: NTLMSSP_NEGOTIATE” and the server responds (every time) with a “Session Setup AndX Response Error: STATUS_INVALID_PARAMETER” Tracing the same situation when the session is not in this bizarre state shows that the same NTLMSSP_NEGOTIATE request gets answered (as it should) with an NTLMSSP_CHALLENGE and the session gets built correctly. On the samba server, if I then use smbstatus to identify the PID of the smbd process serving the active session from the server that can’t connect and “kill” it (just a normal kill, no “-9” necessary) the web server immediately is able to setup the session correctly and resume operation (at least for several hours until this happens again) I have gone over the windows configuration with Microsoft supporters and they have concluded that the windows setup and configuration (iis, AD, smb, networking) is good. Here are a couple of exerpts form network traces filtered for SMB Command 0x73 that show the problem: In these traces: 10.0.0.115 is the samba server 10.0.2.1 is the web server affected by the issue when trace is taken 10.0.1.1 is the web server NOT affected by the issue when trace is taken (before you ask... subnet mask is 255.255.0.0) Trace 1: web2 when the issue occurs ( all session setup requests denied without a challenge) =========================================================================== 21:48:59.199 0.000000 1301 4.005879 {NbtSS:157, TCP:156, IPv4:155} 10.0.2.1 10.0.0.115 SMB SMB:C; Session Setup Andx, NTLM NEGOTIATE MESSAGE 21:48:59.200 0.000157 1302 4.006036 {NbtSS:157, TCP:156, IPv4:155} 10.0.0.115 10.0.2.1 SMB SMB:R; Session Setup Andx - NT Status: System - Error, Code = (13) STATUS_INVALID_PARAMETER 21:48:59.360 0.160932 1368 4.166968 {TCP:156, IPv4:155} 10.0.2.1 10.0.0.115 TCP TCP: [Bad CheckSum]Flags=...A...., SrcPort=10044, DstPort=Microsoft-DS(445), PayloadLen=0, Seq=3500048810, Ack=2385985999, Win=65365 21:49:00.255 0.894679 1632 5.061647 {NbtSS:157, TCP:156, IPv4:155} 10.0.2.1 10.0.0.115 SMB SMB:C; Session Setup Andx, NTLM NEGOTIATE MESSAGE 21:49:00.255 0.000177 1633 5.061824 {NbtSS:157, TCP:156, IPv4:155} 10.0.0.115 10.0.2.1 SMB SMB:R; Session Setup Andx - NT Status: System - Error, Code = (13) STATUS_INVALID_PARAMETER 21:49:00.371 0.115845 1693 5.177669 {TCP:156, IPv4:155} 10.0.2.1 10.0.0.115 TCP TCP: [Bad CheckSum]Flags=...A...., SrcPort=10044, DstPort=Microsoft-DS(445), PayloadLen=0, Seq=3500049080, Ack=2385986038, Win=65326 21:49:01.284 0.912846 1839 6.090515 {NbtSS:157, TCP:156, IPv4:155} 10.0.2.1 10.0.0.115 SMB SMB:C; Session Setup Andx, NTLM NEGOTIATE MESSAGE 21:49:01.284 0.000213 1840 6.090728 {NbtSS:157, TCP:156, IPv4:155} 10.0.0.115 10.0.2.1 SMB SMB:R; Session Setup Andx - NT Status: System - Error, Code = (13) STATUS_INVALID_PARAMETER 21:49:01.473 0.188502 1967 6.279230 {TCP:156, IPv4:155} 10.0.2.1 10.0.0.115 TCP TCP: [Bad CheckSum]Flags=...A...., SrcPort=10044, DstPort=Microsoft-DS(445), PayloadLen=0, Seq=3500049350, Ack=2385986077, Win=65287 Trace 2: (for comparison) web1 not experiencing the problem at the same time (NTLMSSP_NEGOTIATE requests are responded as expected with a CHALLENGE) ======================================================================== 22:14:57.916 0.000000 84 0.017983 {NbtSS:8, TCP:7, IPv4:6} 10.0.1.1 10.0.0.115 SMB SMB:C; Session Setup Andx, NTLM NEGOTIATE MESSAGE 22:14:57.917 0.000267 85 0.018250 {NbtSS:8, TCP:7, IPv4:6} 10.0.0.115 10.0.1.1 SMB SMB:R; Session Setup Andx, NTLM CHALLENGE MESSAGE - NT Status: System - Error, Code = (22) STATUS_MORE_PROCESSING_REQUIRED 22:14:57.917 0.000168 86 0.018418 {NbtSS:8, TCP:7, IPv4:6} 10.0.1.1 10.0.0.115 SMB SMB:C; Session Setup Andx, NTLM AUTHENTICATE MESSAGE, Domain: BAYNET, User: IUSR_BAYNET, Workstation: WEB1 22:14:57.927 0.009865 95 0.028283 {NbtSS:8, TCP:7, IPv4:6} 10.0.0.115 10.0.1.1 SMB SMB:R; Session Setup Andx, SpnegoNegTokenResp 22:14:58.349 0.422286 921 0.450569 {NbtSS:8, TCP:7, IPv4:6} 10.0.1.1 10.0.0.115 SMB SMB:C; Session Setup Andx, NTLM NEGOTIATE MESSAGE 22:14:58.349 0.000244 923 0.450813 {NbtSS:8, TCP:7, IPv4:6} 10.0.0.115 10.0.1.1 SMB SMB:R; Session Setup Andx, NTLM CHALLENGE MESSAGE - NT Status: System - Error, Code = (22) STATUS_MORE_PROCESSING_REQUIRED 22:14:58.349 0.000178 926 0.450991 {NbtSS:8, TCP:7, IPv4:6} 10.0.1.1 10.0.0.115 SMB SMB:C; Session Setup Andx, NTLM AUTHENTICATE MESSAGE, Domain: BAYNET, User: IUSR_BAYNET, Workstation: WEB1 22:14:58.359 0.009884 955 0.460875 {NbtSS:8, TCP:7, IPv4:6} 10.0.0.115 10.0.1.1 SMB SMB:R; Session Setup Andx, SpnegoNegTokenResp 22:14:58.547 0.187521 1847 0.648396 {NbtSS:8, TCP:7, IPv4:6} 10.0.1.1 10.0.0.115 SMB SMB:C; Session Setup Andx, NTLM NEGOTIATE MESSAGE 22:14:58.547 0.000242 1848 0.648638 {NbtSS:8, TCP:7, IPv4:6} 10.0.0.115 10.0.1.1 SMB SMB:R; Session Setup Andx, NTLM CHALLENGE MESSAGE - NT Status: System - Error, Code = (22) STATUS_MORE_PROCESSING_REQUIRED 22:14:58.547 0.000215 1855 0.648853 {NbtSS:8, TCP:7, IPv4:6} 10.0.1.1 10.0.0.115 SMB SMB:C; Session Setup Andx, NTLM AUTHENTICATE MESSAGE, Domain: BAYNET, User: IUSR_BAYNET, Workstation: WEB1 22:14:58.557 0.009704 1932 0.658557 {NbtSS:8, TCP:7, IPv4:6} 10.0.0.115 10.0.1.1 SMB SMB:R; Session Setup Andx, SpnegoNegTokenResp
Created attachment 3770 [details] log.smbd (level: 10) while issue is happenning briefly setting the log level to 10 while the issue is happening on one of my servers yielded a huge log.smbd file (very active server), but no entries that looked particularly bad to my untrained eye. I have attachet a gzipped copy in case it helps
Created attachment 3771 [details] zipped pcap network trace from web2 while seesions are denied Network trace taken on WEB2 (10.0.2.1) while it's session requests are denied by the samba server (\\filer.baynet.com = 10.0.0.115)
Just in case it is of interest, here is my smb.conf file (the poor thing is riddled with comments of block after block of configuration comments we have tried to add/remove to overcome this situation over the last 9+ months) [global] workgroup = BAYNET ##Added 07-03-08## #netbios name = filer netbios aliases = filer1 filer2 filer filer.baynet.com 10.0.0.115 reset on zero vc = yes ###################### server string = CentOS 5 File Server timestamp logs = yes dos filetimes = yes dos filemode = yes inherit acls = yes name cache timeout = 0 winbind uid = 101-6000000 winbind gid = 101-6000000 local master = no preferred master = no # guest account = guest map to guest = Never unix charset = CP1252 dos charset = CP850 allow trusted domains = yes security = ADS disable netbios = no password server = * encrypt passwords = yes username level = 5 load printers = no printing = none printcap name = /dev/null disable spoolss = yes # utmp directory = /var/run # utmp = yes name resolve order = host lmhosts wins bcast ldap timeout = 5 restrict anonymous = 0 password server = nas2.baynet.com pdc1.baynet.com use kerberos keytab = true # panic action = /usr/bin/backtrace %d > /tmp/segv_samba_%d.out 2>&1 # root preexec = /usr/local/samba/bin/log_connect.sh '%u' '%m' '%I' '%S' # root postexec = /usr/local/samba/bin/log_disconnect.sh '%u' '%m' '%I' '%S' wins server = 10.0.0.200 10.0.0.201 username level = 0 realm = BAYNET.COM ###ADDED winbind enum users = yes winbind enum groups = yes client use spnego = yes client ntlmv2 auth = yes winbind use default domain = yes restrict anonymous = 2 domain master = no local master = no preferred master = no os level = 0 disable netbios = no ###END ###Added 05-07-08 bind interfaces only = yes interfaces = 10.0.0.115 debug level = 0 ###END ###Added 8/7/2008 winbind cache time = 600 winbind refresh tickets = Yes # note that "winbind offline logon = Yes" # requires also a change to /etc/security/pam_winbind.conf : "cached_login = yes" needs to be enabled winbind offline logon = Yes ###END smb ports = 445 max mux = 32767 log level = 1 winbind:5 socket options = TCP_NODELAY SO_RCVBUF=8192 SO_SNDBUF=8192 IPTOS_LOWDELAY SO_KEEPALIVE read raw = 1 write raw = 1 oplocks = no level 2 oplocks = no #oplocks = yes max xmit = 65535 dead time = 15 getwd cache = 1 [MLSData] path = /mnt/MLSData read only = no comment = MLS Data and Images follow symlinks = yes printable = no browseable = yes map acl inherit = yes create mask = 0777 security mask = 0777 directory mask = 0777 directory security mask = 0777 force create mode = 0777 force security mode = 0777 force directory mode = 0777 force directory security mode = 0777 oplocks = no level 2 oplocks = no #oplocks = yes guest ok = yes valid users = \ @"BAYNET\Domain Users" \ @"AllLocalUsers" \ @"BAYNET\Domain Admins" \ @"BAYNET\SQLServerExec" \ write list = \ @"BAYNET\Domain Users" \ @"AllLocalUsers" \ @"BAYNET\Domain Admins" \ @"BAYNET\SQLServerExec" [App] path = /mnt/App read only = no comment = Application Data follow symlinks = yes printable = no browseable = yes map acl inherit = yes create mask = 0777 security mask = 0777 directory mask = 0777 directory security mask = 0777 force create mode = 0777 force security mode = 0777 force directory mode = 0777 force directory security mode = 0777 guest ok = yes writeable = yes oplocks = no level 2 oplocks = no #oplocks = yes valid users = \ @"BAYNET\Domain Users" \ @"AllLocalUsers" \ @"BAYNET\Domain Admins" \ @"BAYNET\WEB Users" write list = \ @"BAYNET\Domain Users" \ @"AllLocalUsers" \ @"BAYNET\Domain Admins" \ @"BAYNET\WEB Users"
No NT_STATUS_INVALID_PARAMETER in that logfile. You might want to increase the debug level of a single smbd with smbcontrol <pid> debug 10
Thanks, I wasn't aware of that command, and will try to capture a log with it the next time this issue happens. I will post it when it is collected
Created attachment 3772 [details] log with single failing smbd proc at level 10 showing INVALID PARAMETER The issue occurred again 10 minutes ago (14:20) I identified the process (pid=19949 FWIW) and used smbcontrol to switch it to log level 10. the attached file is the tail of that log since 14:20. I checked and it has several errors with "invalid parameter" logged. Hope this will be helpful in understanding this issue. as before, killing the smbd process caused the windows machine to immediately re-connect. logging for that process was still at level 10 when I killed it, so that might provide additional info in the log file... Thank you again for your help.
Volker, I have attached a log file with the failing smbd daemon process only at debug level 10 and verified that it contains multiple NT_STATUS_INVALID_PARAMETER problems logged. Is there *any* additional way I can help figure this out? any more information I can collect or detail that can help? My samba server has been running into this problem on average every 4-6 hours (bad...). Before upgrading to version 3.2.4-37 (sernet samba3 rpms) I was using 3.2.4-28 (sernet samba3 rpms as well) which also had the same problem but back then it happened roughly "only" every 12-24 hours. The RPMs that came from the CentOS rpm repository also exhibited the same issue.
No, for now that is sufficient. I can understand that you're desparate, but right now I am extremely busy, sorry. Paying customers come first :-) -- Volker
I found the problem and believe that the code in register_existing_vuid() is incorrect and num_validated_vuids++; is not needed and effectively make us run out of vuid. For every session, we increment num_validated_vuids twice but only decrement it once and this is wrong. Additional notes: 1. This bug was also confirmed using smbd compiled from 3.2.6 latest sources 2. Applying this fix on the 3.2.6 sources and then running our production samba server with the fix the STATUS_INVALID_PARAMETER issue hasn't been noticed in over 72 hours (vs. 4-6 hours between failures with the vuid leak in place)
Thanks for that analysis. Got it reproduce here now, looking at your suggested fix. Volker
Created attachment 3836 [details] patch Thanks for that bug report. You were fully correct, incrementing num_validated_vuids is pointless. Testing with BENCH-SESSSETUP from master showed another flaw in allocating vuids though that the attached patch should fix. I haven't pushed that yet, still doing a recompile and running a "make test". Can you test that patch? Thanks, Volker
Created attachment 3837 [details] corrected patch Corrected patch -- thanks to Stefan Metzmacher :-)
For me it's fixed in all branches now. Please re-open if you still have problems. Thanks, Volker
Volker, My engineer looked at the patch, and it looks very similar to the solution we came up with. However, my engineer thinks the patched code is still not 100% right because the new function increment_next_vuid(&next_vuid) must also be called in register_existing_vuid(). Once this will be added, it should fix the problem He thinks smbd will still have vuid wrapping issues with the last patch if this suggestion is not implemented. To test for this last wrapping issue, please test with at least 64k time session setups and close to make sure vuid wraps up nicely. Ofer
Could you please ask your engineer for a closer explanation of the reasons why we should also increment next_vuid in register_existing_vuid()? According to what I've seen in sesssetup.c, we always call register_initial_vuid() before register_existing_vuid(), so I think to get a 1-step incremented vuid by each complete session setup sequence I think it is sufficient to only increment next_vuid only once. This view is also supported by a quick test: I've run smbtorture //127.0.0.1/tmp -Uuser%pw sesssetup_bench -o 100000 from master, and I ended up with a final vuid of 34666. 34666+65536=100202, which is pretty much what I would expect. Can you please ask your engineer to give me more exact instructions how to reproduce the problem he still sees? Thanks a lot, Volker
After further testing I can also confirm that the patch works in all our test cases. Although slightly different than the code we used to fix the problem it does solve the issue. re-closing the bug :)