Bug 7191 - WINS doesn't respond after > 86 #1c registrations. Patch attached
WINS doesn't respond after > 86 #1c registrations. Patch attached
Status: ASSIGNED
Product: Samba 3.3
Classification: Unclassified
Component: Nmbd
3.3.10
All All
: P3 normal
: ---
Assigned To: Jeremy Allison
Samba QA Contact
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2010-03-01 21:11 UTC by Craig Miskell
Modified: 2010-03-28 17:36 UTC (History)
1 user (show)

See Also:


Attachments
Patch to limit #1c registrations to 25 (3.18 KB, patch)
2010-03-01 21:11 UTC, Craig Miskell
no flags Details
Slightly modified version. (3.09 KB, patch)
2010-03-02 16:14 UTC, Jeremy Allison
no flags Details
Updated to fix the crash reported with the previous patch. (4.07 KB, patch)
2010-03-03 18:18 UTC, Jeremy Allison
no flags Details
Fix crash and a pathalogical case when registering 1b. (4.13 KB, patch)
2010-03-03 18:34 UTC, Jeremy Allison
no flags Details
Incorporated last suggestion. (4.19 KB, patch)
2010-03-03 19:55 UTC, Jeremy Allison
jra: review? (craig.miskell)
Details
Updated patch that also limits the number of records to 85 (5.52 KB, patch)
2010-03-04 14:14 UTC, Craig Miskell
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Craig Miskell 2010-03-01 21:11:01 UTC
Hi,
The Samba WINS server doesn't limit the list of #1c registrations to 25 as Windows apparently does (see http://lists.samba.org/archive/samba-technical/2001-October/016236.html, and mentions on MS technet).  When the list gets bigger than 86, the reply packet is > 576 bytes and hits security-type code in reply_netbios_packet that simply stops and never sends the reply packet.

We have hit this in production (~100 SMB servers all registering as #1c for the domain).

I'll attach a patch shortly which fixes it; it's running on our WINS server here with good success so far, although it's entirely possible I've missed some ancillary task (maybe to do with wins.dat or wins.tdb) that I'm not aware I need to do.
Comment 1 Craig Miskell 2010-03-01 21:11:56 UTC
Created attachment 5437 [details]
Patch to limit #1c registrations to 25
Comment 2 Jeremy Allison 2010-03-02 16:14:10 UTC
Created attachment 5443 [details]
Slightly modified version.

Hi Craig, can you take a look at this version of the patch. I made a few small changes.

1). Formatted more to 80 columns (Samba standard).
2). Added a call to :

wins_delete_all_tmp_in_memory_records();

just before the call to :

fetch_all_active_wins_1b_names();

as this is a temporary list used in normal 1b query processing, and the wins_delete_all_tmp_in_memory_records() ensures we don't have two copies on the list.

3). Added a call to:

wins_hook("delete", namerec, 0);

in the "if(!name1brec) {" part of the branch, as you were doing that in the other branch (and I think it needs doing every time we call remove_ip_from_name_record(namerec, namerec->data.ip[0]);.

Can you review for me please ? If you're ok with these changes I'll push into master and flag this for inclusion in 3.5.1.

Thanks a *lot* for this fix !

Jeremy.
Comment 3 Craig Miskell 2010-03-02 16:17:13 UTC
Hi,
Thanks for the feedback.  Your updated patch looks generally good; I'll get it into production today and let you know tomorrow if we run into any problems or not.
Comment 4 Jeremy Allison 2010-03-02 16:25:09 UTC
Thanks for the quick turn around on test. I'll await your reply and then push to master (if it passes your tests).

Cheers,

Jeremy.
Comment 5 Craig Miskell 2010-03-02 21:05:01 UTC
Just noted a segfault crash, although I wasn't running a debug build so didn't get a proper stacktrace/core file.  Running it again, will see how long it survives.  If it fails again, I'll revert to my patch and see if the problem remains.  

As a side note in case it gets hairy: what's the easiest way to get a debug build that will give a core file with a decent back trace?
Comment 6 Craig Miskell 2010-03-03 14:38:13 UTC
Found the problem with the updated patch.  The call to wins_delete_all_tmp_in_memory_records completely wipes the wins_server_subnet list, freeing all the records in it.  Unfortunately, one of those is the current 'namerec', so when we then try to use it later (i.e. finding/removing the oldest record), we segfault accessing the namerec (or it's data/ip addresses).  

What I think we need is a new delete function that just clears out 0x1b records, not the whole shebang.  We call that before doing the fetch_all_active_wins_1b_names.  It does seem odd to me that in the other places that do the fetch they delete *all* the records first.  Is this a different bug, or a mis-understanding on my part?

Note that the only reason I put the fetch there in the first place was because if I didn't, usually I wouldn't even get a 1b record that I knew to be there.  I hunted for other bits of code and found the call to fetch.  So, such a call might not even be necessary, if there's another way to find the 1b record.

I'm happy to code my suggested changes and test them, but I would really appreciate some advice on the sensibility of the approach first, as I'm starting to mess with more than just an inserted bit of code in the registration function.
Comment 7 Jeremy Allison 2010-03-03 18:07:15 UTC
Great catch. Thanks for testing. It isn't a bug in the other places as they haven't already found a namerec off the list.

I'll code up the fix and update the patch for you to test.

Jeremy.
Comment 8 Jeremy Allison 2010-03-03 18:18:19 UTC
Created attachment 5449 [details]
Updated to fix the crash reported with the previous patch.

Fixed version to avoid the crash. Please test and let me know !
Jeremy.
Comment 9 Jeremy Allison 2010-03-03 18:34:25 UTC
Created attachment 5450 [details]
Fix crash and a pathalogical case when registering 1b.

Ok, one more fix to ensure we cannot remove namerec and crash under any circumstances.

Please test and let me know.

Thanks.

Jeremy
Comment 10 Craig Miskell 2010-03-03 18:57:33 UTC
Bingo; it was crashing in the first few minutes before, but now it's survived for 5 already, with a number of registrations coming through.  I'll continue monitoring and will confirm in about 24 hours that it's all good (but I see no reason why it won't be now).

Thanks for your help.
Comment 11 Craig Miskell 2010-03-03 19:14:35 UTC
Just a thought:  The limitation to 25 entries is only supposed to be for 0x1c type records.  Should we have an

if(namerec->name.name_type == 0x1c)

wrapped around the 'while' section?  We could then remove that additional !=0x1b check you added.

It means other name_types are still subject to the "bug" with >86 entries,  but I think that fixing that should probably be done in send_wins_name_query_response instead; the 0x1c name_type will never have >25 entries, but others should be chopped off at the 86th entry (we have very little other choice of how to handle that many, in the absence of any clear specification of the protocol).

Your thoughts?

PS: I now see with my own eyes exactly why reverse-implementing existing protocols like this can be such a daunting task :)
Comment 12 Jeremy Allison 2010-03-03 19:55:38 UTC
Created attachment 5451 [details]
Incorporated last suggestion.

Try this fix - incorporates your last suggestion to restrict this behavior to 1c names only.

Jeremy.
Comment 13 Craig Miskell 2010-03-04 14:14:29 UTC
Created attachment 5459 [details]
Updated patch that also limits the number of records to 85

Sorry, not sure how to set the review flags properly for your processes.  I took your update and added in an 85 record limitation (based on an experienced 86-record limit, despite the 96-record theoretical limit).  I'm testing it now, and it looks good.  Will confirm later today if it experiences no problems.
Comment 14 Craig Miskell 2010-03-04 20:39:23 UTC
Hi,
Patched version has been running all day without problems so I'm pretty happy with the latest patch.  Will update/re-open if I do see any further issues, but I'm not expecting any.
Comment 15 Jeremy Allison 2010-03-15 18:22:30 UTC
Patch went into git master tree. It'll be in 3.6.0. Let me know if you need it in an earlier version.

Jeremy.
Comment 16 Lesley Walker 2010-03-15 18:41:18 UTC
Craig's just gone on leave for two weeks, but I'm pretty sure we would be happy to apply the patch ourselves if we happen to go to any intermediate versions. Many thanks for your good work.
Comment 17 Lesley Walker 2010-03-25 16:28:00 UTC
Update: We've found a problem. Somehow we ended up with two "1b" records for two different servers. One was the the correct PDC, the other was one of the BDCs. Discussed it with Craig on the phone, and he says he knows which bit of code to look at.  He'll be back in the office on Monday.
Comment 18 Craig Miskell 2010-03-28 17:36:56 UTC
I was thinking of something else; the only way I can see for the behaviour to occur is for more than one server to actually register as a 1b server.  Don't know how that happened, but I think it was just something we never noticed before, possibly because our WINS server wasn't properly responding in the past.  

I'll monitor for a while with some extra logging, and if it happens again, maybe I'll get closer to seeing why it does.

BTW, I've found a reference from MS for the implemented behaviour:
http://support.microsoft.com/kb/167365
It's slightly different to that described in the samba mailing list, but I think our current implementation is close enough.