The Samba-Bugzilla – Bug 10099
DNS updates processed twice
Last modified: 2013-08-20 21:19:38 UTC
TLDR: Samba's internal dns server processes dns updates twice, this can result in dns responses backing up and causing timeouts. See the following conversation for details.
<Paruza> anyone run into issues with samba4's internal dns server not responding to queries in a timely fashion?
<Paruza> i'm having such an issue, it takes quite some time for it to resolve, even local names
<Poster> do you have a secondary nameserver defined on your client?
<Paruza> and the server is local
<Paruza> it does work, but it takes like 10 seconds to return a result
<Poster> right, so you only have 1 DNS server defined on your client, which points to the samba server's LAN IP address?
<Poster> do you have "dns forwarder =" defined in your samba configuration?
<Paruza> though as i mentioned, the same thing happens for local names
<Poster> how about samba to itself?
<Paruza> like the domain name of the samba domain
<Paruza> i will check
<Paruza> it is slow in the same way
<Poster> ok, are you using samba integrated DNS?
<Paruza> enabling logging now
<Poster> ok you're not running any other DNS server, BIND, dnsmasq, nsd, etc?
<Paruza> no, i will double check that nothing is running though
<Paruza> named is installed but shouldnt be running
<Paruza> yeah its not running
* gregoryo (~email@example.com) has joined
<Poster> what is the Samba host OS?
<Paruza> latest version of server lts
<Paruza> samba though was compiled from source
<Paruza> root@main:/usr/local/samba/sbin# ./samba --version
<Paruza> Version 4.0.7
<Poster> ok, the IP you have defined for: dns forwarder , is it your ISP or something external?
<Paruza> which is external yes
<Poster> ok, try this
<Poster> dig @ip.of.isp.ns www.google.com
<Poster> see how it responds, you'll get a query time in the bottom
<Paruza> its very fast
<Paruza> dig @192.168.0.10
<Paruza> however which is the samba server is slow
<Paruza> the external dns server itself is fine
<Paruza> but yeah even pulling the root records like that is slow
<Poster> root's not likely going to be fast
<Paruza> well i mean similar delay, multiple seconds, vs doing the same thing to the isp which is less than a second
<Paruza> this is slow as in like 5-10 seconds, long enouph for things to timeout, users to give up, etc keep in mind of course that many websites need many lookups, and so often this is resulting in "page cannot be displayed" errors on the clientside, etc
<Poster> ok try stopping samba completely
<Poster> check the process list to make sure there are no lingering processes
<Paruza> hmm interestingly the samba processes are taking quite a bit of cpu
<Poster> ok try: netstat -an | grep 53
<Poster> something else might be wrong if the CPU usage is high
<Poster> I've got /var/log/samba4, but I am running on FreeBSD 9.1
<Paruza> ill disable logging quick to make sure that change didnt cause taht
<Poster> might vary slightly for your installer, /var/log or possibly /usr/local/var/log
<Poster> I am not sure
<Paruza> thats the tail -f log* of samba starting and running a little bit
<Poster> I'm not 100%, but I think this may be contributing: Cannot contact any KDC for requested realm
* err-or_ (~err-or@gateway/tor-sasl/err-or) has joined
* err-or has quit (Ping timeout: 240 seconds)
<Poster> it's a long shot, but maybe check to see if it has entries in /etc/hosts for itself
<Paruza> was actually just looking at that
* _____mavrick61 has quit (Remote host closed the connection)
* _____mavrick61 (~firstname.lastname@example.org) has joined
<Paruza> hmm ok so the name in hosts was spelled incorrectly, I resolved that and restarted samba but i still have the same issue looks like
<Paruza> looking into kerberos some more
<Poster> double check the system clock
<Poster> ntp, all that
<RiXtEr> sorry I missed the conversation.
<RiXtEr> the kdc error is likely a misconfigured krb5.conf
<Paruza> clock is correct, but it wasnt at the time of the samba install
<RiXtEr> Paruza, what is the output of cat /etc/krb5.conf
<Paruza> default_realm = INGEBRETSEN.LOCAL
<Paruza> dns_lookup_realm = false
<Paruza> dns_lookup_kdc = true
<RiXtEr> Paruza, what about netstat -anp | grep 53 | grep "LISTEN "
<RiXtEr> Paruza, please use pastebin or something similar if its more than 5 lines.
<Paruza> tcp 0 0 192.168.0.10:53 0.0.0.0:* LISTEN 17519/samba
<Paruza> root@main:/usr/local/samba/sbin# kinit
<Paruza> kinit: Cannot resolve servers for KDC in realm "INGEBRETSEN.LOCAL" while getting initial credentials
<RiXtEr> Paruza, what about netstat -anp | grep 88 | grep "LISTEN "
<Paruza> tcp 0 0 192.168.0.10:88 0.0.0.0:* LISTEN 17513/samba
<Paruza> the above error btw also appears in the samba log
<RiXtEr> Paruza, killall samba && sleep 3 && samba && tail -n 50 /usr/local/samba/var/log.samba
<RiXtEr> (thats one command)
<RiXtEr> Paruza, what does your smb.conf look like (/usr/local/samba/etc/smb.conf)
<Paruza> hmm does kerberos have demons that should be running?
<RiXtEr> samba spawns them.
<RiXtEr> I'd expect to see lines like 0.0.0.0:53 and 0.0.0.0:88 from your netstat output
<RiXtEr> but if you have a bind interfaces only line then it could be the ip
<Paruza> yeah i do
<RiXtEr> Paruza, ok, what about cat /etc/resolv.conf | grep nameserver
<RiXtEr> thats your problem.
<Paruza> ill change it quick
<RiXtEr> nameserver needs to be nameserver 192.168.0.10
<RiXtEr> that will speed up your other name resolution too
<Paruza> well the issue is that name resolution on .10 is extremely slow / nearly disfunctional
<RiXtEr> right because your nameserver line is wrong
<RiXtEr> so the lookups are done incorrectly
<RiXtEr> (at least that is my thinking on the subject)
<Paruza> testing atm
<RiXtEr> Paruza, also, its not a good idea to have ingebretsen.local 192.168.0.10 in your hosts file
<RiXtEr> makes finding dns issue quite hard.
<RiXtEr> as some things will look like they work (when they may not)
* kukks has quit (Quit: Going home ....)
<Paruza> well the issue appears to be remaining, one of the samba processes is consuming lots of cpu, and dns resolution is still very slow
<Paruza> i suspect its slow because of the cpu usage on the process
<Paruza> root 17767 61.4 10.3 1005408 417708 ? R 21:51 2:42 ./samba
<RiXtEr> did you change any of the following lines after you did the provision?
<RiXtEr> workgroup = INGEBRETSEN
<RiXtEr> realm = ingebretsen.local
<RiXtEr> netbios name = MAIN
<RiXtEr> so provision does a lot behind the scenes, those lines should not be changed after setting up samba.
<RiXtEr> (with samba 3 it was no issue, but samba 4 as an ad dc just doesn't like it)
<abartlet> yeah, really, really doesn't like it...
<RiXtEr> abartlet, Hi!
<Paruza> is there a way to confirm theres no issues there?
<RiXtEr> Long time no talk to abartlet
<RiXtEr> Paruza, is this a fresh install anyway?
<RiXtEr> or have you setup users and connected machines already?
<Paruza> there are users working
<Paruza> its mostly fresh it was an upgrade from samba 3
<RiXtEr> Paruza, samba-tool domain classicupgrade ?
<Paruza> using that classic upgrade process, or the other one, i forget
<RiXtEr> Paruza, one min let me boot up my vm
<Paruza> at one point the domain was mispelled ingebretson insted of sen, so i had wiped the /usr/local/samba dir and reinstalled/reprovisioned from scratch
<abartlet> Paruza: 'the other one'
<RiXtEr> Paruza, mind stopping samba and doing a netstat -anp | grep "LISTEN " | grep "135\|445\|88\|53\|636\|389"
<Paruza> wasnt there another upgrade option besides classicupgrade?
<RiXtEr> Paruza, please make sure samba is stopped with ps ax | grep samba after you killall samba
<RiXtEr> before running that command.
<RiXtEr> abartlet, i think he is referring to samba3upgrade (same codepath)
<Paruza> does not return anything
<RiXtEr> Paruza, start samba and do that again
<Paruza> right that was the other, i dont recall which this install was
<Paruza> i suspect classic
<RiXtEr> Paruza, they are an alias to one another
<RiXtEr> so they do the exact same thing.
<Paruza> this is 4.0.7 btw
<RiXtEr> right caught that :)
<Paruza> i had read there was a dos bug that was resolved in .8, but that was a ram consuming issue, so i imagine it has nothing to do with this
<RiXtEr> its great... you got a process with '88' in the middle of it :)
<Paruza> lol yeah
<RiXtEr> Paruza, and cat /etc/resolv.conf still shows 192.168.0.10 as the nameserver
<RiXtEr> (it can change without notice if its not done through resolvconf)
<Paruza> it is still
<Paruza> ill change it using the tools so its perm quick
<RiXtEr> and `ip a` shows that the 192.168.0.10 as the address to eth0 ?
<RiXtEr> (or some eth at least)
<Paruza> correct, and that is the only active interface
<Paruza> well... theres lo and ppp1
<RiXtEr> ok so lets test this... comment out interfaces = and bind interfaces only = lines in the smb.conf and restart samba
<RiXtEr> then give me the tail -n 50 /var/log/smaba
<RiXtEr> tail -n 50 /usr/local/samba/var/log.samba
<RiXtEr> what about kinit administrator
<Paruza> pauses for awhile, then prompts for password
<RiXtEr> give it your administrator password
<RiXtEr> hopefully it will issue you a ticket.
<Paruza> i assume this is the correct error for incorrect password? kinit: Preauthentication failed while getting initial credentials
<RiXtEr> its your administrator password
<Paruza> well kinit winadmin with my pass returns with no error
<RiXtEr> Paruza, what about klist now
<Paruza> i'm not sure if the administrator by that name exists
<Paruza> Default principal: winadmin@INGEBRETSEN.LOCAL
<Paruza> Valid starting Expires Service principal
<Paruza> 19/08/2013 22:14 20/08/2013 08:14 krbtgt/INGEBRETSEN.LOCAL@INGEBRETSEN.LOCAL
<Paruza> renew until 20/08/2013 22:13
<RiXtEr> Hey, that means your KDC server is working.
<RiXtEr> is the dns still slow?
<Paruza> yes :/
<RiXtEr> explain that a bit more to me.
<Paruza> dig @192.168.0.10 takes about 10+ seconds, sometimes times out
<Paruza> usually works however after that length of time
<Paruza> dig @anywhere else works fine, like 184.108.40.206, etc
<Paruza> the first took about 16 secs before timeout, the next took about 13 then worked
<Paruza> root 18047 59.0 6.8 1032436 277012 ? R 22:09 7:18 ./samba
<Paruza> i imagine its because of that cpu usage though
<Paruza> normally it should sit at around 0, 1 or so i imagine?
<Paruza> there is nothing going on at that location atm, theres only 10 workstations or so
* OldEnK (~OldEnK@63-152-103-245.cdrr.qwest.net) has joined
<Paruza> the server is only used for dhcp, samba, pptpd, and the install is new as of about a week ago
<RiXtEr> how many users do you have total?
<Paruza> this issue has been going on that whole week, however we initially assumed it was related to their internet connection, which we were replacing anyway (with fiber) which is in place now.
<Paruza> not many, its a little retail shop
<Paruza> less than 10
<Paruza> usually no more than 2-3 at any time
<Paruza> (actually doing work on computers)
<RiXtEr> Not sure that a fresh provision wouldn't help more than anything (especially if you are not sure about if you changed those lines or not)
<Paruza> that would mean they arent working in the morning if I did that, since i'd have to go rejoin all the pc's
<RiXtEr> yep :(
<Paruza> is there anyway we can dig into the ldap or something and see if theres anything out of place there?
<RiXtEr> you can use the ldb tools
<Paruza> i dont really know my way around samba4 well yet
<RiXtEr> something like ldbsearch -H /usr/local/samba/private/sam.ldb
<RiXtEr> I am afraid I am of little help when it comes to messing with the internal DB
<Paruza> i will search for my spelling mistake
<abartlet> Paruza: what is in your resolv.conf?
<Paruza> ok cool yeah theres no spelling mistakes in the domain name in that output
<Paruza> ldbsearch -H /usr/local/samba/private/sam.ldb
<RiXtEr> Paruza, add a --cross-ncs to that as well.
<Paruza> nameserver 192.168.0.10
<Paruza> search ingebretsen.local
<abartlet> Paruza: looks plausable...
<abartlet> is the server CPU busy during the delays, or idle?
<abartlet> (what does top show)
<Paruza> its always busy because the samba process is always chewing cpu
<abartlet> ok, so that's the issue
<abartlet> we had one other person report a spinner recently, but they never got the info I needed
<Paruza> ill get whatever you want XD
<abartlet> got time to learn about 'perf', and gdb?
<abartlet> if we have a CPU spinner, we need to know where the cpu time is being spent
<Paruza> i do some c++ but i'm not very good with those tool so i'd be happy to know more
<abartlet> with a *lot* of spare disk space, if you run Samba under perf as 'perf record -g samba'
<abartlet> and then after it spins, kill it off and run 'perf report -g', it gives a tree output showing where the CPU time is spent, and what the callers are at the time
<Paruza> i have 185 gigs
<abartlet> it's pretty low level, but it is the best clue at this poing
<abartlet> the other approach (perhaps try this first) is to run samba, and then attach to the busy process with 'gdb -p PID' where PID is the PID of the busy process
<abartlet> and then 'bt full', hopefully catching it in the 'busy' routine
* OldEnK (~OldEnK@63-152-103-245.cdrr.qwest.net) has left ("Leaving")
<abartlet> then 'step', and see if you can sense a loop in the code flow
<abartlet> hmm, but that loop is 'normal'
<abartlet> your database is small, right?
<Paruza> like 3 user accounts
<abartlet> ok, get me the 'perf report -g' output
<Paruza> is it usual for perf.data not to grow much?
<Paruza> its like 20k
<abartlet> that doesn't seem right
<RiXtEr> Paruza, any chance the /var/run/lock is full when samba is running and you do a df -h ?
<RiXtEr> sorry thats just /run/lock
<Paruza> its not
<abartlet> Paruza: is samba busy again?
<Paruza> oh um anyway the report shows 50% usage on ld-2.15.so
<Paruza> 10% on libsamba-util.so.0.0.1
<abartlet> yeah, did samba actually start?
<abartlet> or did it just exit?
<Paruza> it started as usual, and the one process was high cpu
<Paruza> root@main:/usr/local/samba/sbin# perf_3.2.0-52 record -g ./samba
<Paruza> [ perf record: Woken up 1 times to write data ]
<Paruza> [ perf record: Captured and wrote 0.017 MB perf.data (~721 samples) ]
<Paruza> it returns to the prompt rather quickly
<abartlet> what is in the logs?
<abartlet> ok, you need to work out how to make perf trace a background/child process
<abartlet> perhaps try this, start samba
<abartlet> and then run perf record -p PID
<abartlet> where PID is the CPU spinner
* Spydor has quit (Ping timeout: 264 seconds)
<Paruza> want me to upload this file? its 5 megs
<RiXtEr> Paruza, how big is it if you tar --xz -cf somefile.tar.xz perffile
<Paruza> its not big, sec ill upload it and give you a link
<Paruza> 93k btw
<RiXtEr> abartlet, ^^
<RiXtEr> Paruza, might be good to prefix people when you want them alerted of something :)
<Paruza> oh sorry
<RiXtEr> Paruza, no problem, just good to do (its essential in a busier chat room)
<abartlet> Paruza: I'll need you to run the 'perf report' side too, as it is based on your actual build
<abartlet> hopefully you built with debug symbols
<Paruza> pretty much just did ./configure && make && make install or somesuch
<Paruza> so I doubt it
<RiXtEr> Paruza, ./configure-developer is how you'd build with symbols
<RiXtEr> anyway I am out.
<Paruza> thanks for the help
<RiXtEr> no problem.
<RiXtEr> I am leaving you in my trainers hands :)
<Paruza> ill build a copy with symbols quick
<abartlet> I think you already did, the previous gdb output shoes up fine
<Paruza> ah ok
<abartlet> but if you have re-built, you will need to re-run
<Paruza> i did not
<abartlet> so, to use this tool
<abartlet> the plus signs allow you to explode a call graph
<abartlet> working out what is calling that function in libc for 36.11% of the time will be the important bit
<RiXtEr> Paruza, before I go to bed, what distro and version of it are you using (did you mention Ubuntu?)
<Paruza> it started yes
<Paruza> sorry was responding to something further up
<Paruza> yeah sec ill get the exact version
* Poster has quit (Ping timeout: 246 seconds)
<RiXtEr> Paruza, cat /etc/lsb-release I think
<Paruza> Ubuntu 12.04.2 LTS
<RiXtEr> is that 64 bit?
<Paruza> i think so yes
<RiXtEr> ok I am just finishing up my vm, but its 12.04.1
<Paruza> i suspect the updates will increment that
<RiXtEr> so I will update before I go to bed)
<Paruza> abartlet, http://focusedsupport.it/stuff/samba01.png
<Paruza> the text formatting was messing up so I just made an image
<abartlet> Paruza: is there any unusual network traffic?
<abartlet> this is a *lot* of ldb searches
<Paruza> i can take tcpdump quick, i havnt looked yet
<abartlet> also drill down to work out who is calling search_func eventually, in the first few branches
* Poster (~email@example.com) has joined
<Paruza> abartlet, http://focusedsupport.it/stuff/samba02.png
* Poster has quit (Remote host closed the connection)
* amitay has quit (Ping timeout: 248 seconds)
<abartlet> Paruza: ahah!
<abartlet> does your network trace show DNS traffic?
<abartlet> and which version is this?
<Paruza> well the samba logs show mostly dns updates
<abartlet> so, the DNS traffic is causing database updates, including deletes
<abartlet> which is where the cpu time is being spent
<abartlet> there are two things to note: The cod in master is a little faster for this, and nothing should ever be doing so many deletes that we can notice it in the call graph like this...
* geek has quit (Ping timeout: 276 seconds)
<Paruza> abartlet, there doesnt seem to be a very impressive amount of traffic
<Paruza> restarting samba and taking another sample
<abartlet> Paruza: thanks
<Paruza> i can send this cap if you would like to look at it, i'm not sure how interesting it is, theres some dns traffic but doesnt look terribly odd
<abartlet> also check localhost loopback
<Paruza> there wasnt any traffic on lo
<Paruza> within about 60 seconds or so
<Paruza> tested with a ping, the cap on lo does work, so theres no traffic there
<abartlet> so, I wonder where the DNS load is coming from
<abartlet> can you get back to the gdb, and get me the bt full?
<Paruza> just gdb samba pid , bt full?
<abartlet> it's clearly in there. I trust it wasn't just luck, can you run 'c'
<abartlet> can then ctrl-c it
<abartlet> and then bt full again
<abartlet> my guess is that we somehow process the same packet over and over again
<abartlet> kblin: happen to be around?
<abartlet> Paruza: while doing this, can you prove there is no additional DNS packets inbound?
<Paruza> i msged you the cap, that looks to be fairly representative
<Paruza> there is dns traffic i suspect all the time, but it didnt look like a suspicious quantity
<Paruza> we could poentially put in a iptables rule to block inbound dns for testing...
<Paruza> just to verify it quiets the process?
* violetina has quit (Remote host closed the connection)
* Devastator has quit (Ping timeout: 240 seconds)
* Devastator (~firstname.lastname@example.org) has joined
<Paruza> iptables -A INPUT -i eth0 -p tcp --dport 53 -j DROP
<Paruza> iptables -A INPUT -i eth0 -p udp --dport 53 -j DROP
<Paruza> its down to 6% cpu
<Paruza> and dns is pretty normal speed
<abartlet> so, one thing I noticed
<abartlet> it take so long to handle the DNS packet that the client sends another one
<abartlet> and then this backs it up further, and it becomes loop
<abartlet> each additional packet is processed in turn, and slows it down even more
<Paruza> if you are interested in the specs: http://codepad.org/JfLxneWh
<Paruza> not a terribly strong server, its getting replaced in a few weeks
<Paruza> but still, dns shouldnt kill it
<abartlet> so, the main issue is that each DNS packet is being processed twice
<abartlet> that's the real bug here
<Paruza> though it wouldnt get processed twice if it didnt take so long the first time right?
<abartlet> no, that's additional
<Paruza> ah ok
<abartlet> not only is it processed twice, the client gives up on the reply and tries again
<abartlet> Paruza: can you please file a bug "DNS updates processed twice"?
<Paruza> in the meantime do you think i should look at switching to the bind backend?
<abartlet> yeah, that should fix this
<abartlet> and knowing that would help confirm the bug, make sure to note that
* gregoryo has quit (Quit: Ex-Chat)
<Paruza> do you want me to paste this conversation in the bug report? or how detailed do you want this?
Note: I switched to the bind backend as a workaround and its working without issue.