Bug 8197 - winbind does not properly detect when a DC connection is dead
Summary: winbind does not properly detect when a DC connection is dead
Alias: None
Product: Samba 3.6
Classification: Unclassified
Component: Winbind (show other bugs)
Version: 3.6.0rc1
Hardware: All All
: P5 regression
Target Milestone: ---
Assignee: Karolin Seeger
QA Contact: Samba QA Contact
Depends on:
Blocks: 7803
  Show dependency treegraph
Reported: 2011-06-01 09:53 UTC by Volker Lendecke
Modified: 2011-09-08 10:25 UTC (History)
1 user (show)

See Also:

Possible better patch for master (needs testing) (4.38 KB, patch)
2011-06-01 11:31 UTC, Stefan Metzmacher
no flags Details
git-am fix for 3.6.0 (2.68 KB, patch)
2011-06-03 17:28 UTC, Jeremy Allison
metze: review+
Volker's fix for 3.6.x. (4.60 KB, patch)
2011-06-06 16:51 UTC, Jeremy Allison
no flags Details
Patchset (87 bytes, patch)
2011-06-07 12:38 UTC, Volker Lendecke
no flags Details
Patchset (5.67 KB, patch)
2011-06-07 12:39 UTC, Volker Lendecke
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Volker Lendecke 2011-06-01 09:53:52 UTC
5f6f719 is needed in 3.6.
Comment 1 Stefan Metzmacher 2011-06-01 11:31:54 UTC
Created attachment 6510 [details]
Possible better patch for master (needs testing)

Hi Volker,

5f6f71956460d6840c1433b59e20555268b622ac implements a different logic for this
compared to the select and epoll backends of tevent.

I think it would be much better if all backends have the same behavior.

Does the attached patch also fixes the problem for you?
If so I'd add a similar fix for the tevent poll backend.
If not I'm fine with having 5f6f71956460d6840c1433b59e20555268b622ac
as a solution for 3.6.x. But the longterm fix should look different to 5f6f71956460d6840c1433b59e20555268b622ac.

Comment 2 Volker Lendecke 2011-06-01 11:43:33 UTC
Ok, I'll add this as a customer specific patch then. Sorry for the noise.
Comment 3 Jeremy Allison 2011-06-01 16:43:15 UTC
Re-opening. We do need *some* fix for this in 3.6.0 - we can't just leave it as a master-only fix.

Comment 4 Volker Lendecke 2011-06-01 19:45:13 UTC
This isn't in master either. I reverted the patch.

Comment 5 Volker Lendecke 2011-06-01 19:45:47 UTC
BTW, I agreed with metze that this has to wait until winbind is fully async, i.e. has no child processes anymore.

Comment 6 Jeremy Allison 2011-06-01 21:40:20 UTC
Ok, after discussion with Volker, re-opening this as we need to address this in 3.6.0. I'll work with Metze to get a patch approved.

Comment 7 Jeremy Allison 2011-06-02 21:07:05 UTC
Ok, (trying to move this one along :-). I've read and evalutated Volker's patch and it seems fine to me (although why do you care that "smbd monitors EVENT_FD_WRITE in its callback, doing nothing" ? If the callback does nothing there's no harm in it calling that function, and the next time smbd tries to read from the socket it will return the error and close it.

Anyway, on to evaluating metze's patch next :-).

Comment 8 Jeremy Allison 2011-06-02 21:50:07 UTC
FYI. This :


is a good resource for looking at these kinds of things !

Comment 9 Volker Lendecke 2011-06-02 22:54:12 UTC
If you always make a socket writable also on POLLERR|POLLHUP, then you end up with a spinning smbd if a socket dies. Poll detects writability, jumps into smbd_server_connection_handler which first checks for writability. It jumps into smbd_server_connection_write_handler which does nothing but return back to the event loop. We poll again, detect POLLERR|POLLHUP and so on. smbd will never even attempt to read from the socket because when we detect writability we will not even look at it.

Comment 10 Jeremy Allison 2011-06-02 23:20:54 UTC
Doh - thanks ! Makes sense now. I think I have a really elegant+small fix for this bug though. Still working on it.

Comment 11 Jeremy Allison 2011-06-02 23:27:49 UTC
Ok, a second fix we need is to remove smbd_server_connection_write_handler() I think :-)

We need to remove the code chunks from smbd/process.c that call:

        if (flags & EVENT_FD_WRITE) {

as well as smbd_server_connection_write_handler(). But that's a fix for master only (and not part of this bug) I think :-).

Comment 12 Jeremy Allison 2011-06-03 00:48:14 UTC
Damn damn damn. My elegant fix doesn't (and can't) work :-(. Shame. Ok, looking closer.

Comment 13 Jeremy Allison 2011-06-03 01:19:20 UTC
Oh, wait a minute... I have a cunning plan :-). My elegant fix may work after all...

Comment 14 Jeremy Allison 2011-06-03 17:28:34 UTC
Created attachment 6515 [details]
git-am fix for 3.6.0

Ok, here is my "cunning plan" :-) to fix this. Metze is correct in that we should only detect fd errors on selecting for read, as this is how the select backend works (and most callers). So inside writev_send() in lib/async_req/async_sock.c ensure we always *start* by selecting for read - even if err_on_readability was set to false. We can detect the error condition in the write callback function instead.

I'm planning to add Metze's patch to master, as it's correct, but in order to fix the immediate winbindd issue I think this patch will do the job.

Please review and comment. Feel free to call me if you have any questions as I really enjoyed working on this one :-).

Comment 15 Jeremy Allison 2011-06-03 17:30:04 UTC
Comment on attachment 6515 [details]
git-am fix for 3.6.0

Adding metze as patch reviewer also.
Comment 16 Stefan Metzmacher 2011-06-06 13:32:50 UTC
Comment on attachment 6515 [details]
git-am fix for 3.6.0

If this is tested I guess it's fine, but please also get an ack from Volker.
Comment 17 Volker Lendecke 2011-06-06 13:36:22 UTC
Right now I'm experimenting to solve it inside async_smb exclusively. All other options seem too much under discussion.

Comment 18 Stefan Metzmacher 2011-06-06 13:37:47 UTC
Comment on attachment 6515 [details]
git-am fix for 3.6.0

If we take this change, this should also go into master and we should pick dbcdf3e39c359241b7 and 3c9b3b2befc524f21c to 3.6.
Comment 19 Stefan Metzmacher 2011-06-06 13:39:03 UTC
I think we should also fix this for 3.5.x (and maybe 3.4.x) as we're also using poll there.

Comment 20 Jeremy Allison 2011-06-06 16:51:53 UTC
Created attachment 6528 [details]
Volker's fix for 3.6.x.

Moved patch to correct bug.

Comment 21 Jeremy Allison 2011-06-06 18:24:30 UTC
Ok, in testing against Windows with vl's patch and my patch installed I get NT_STATUS_IO_TIMEOUT when using smbclient to put a large file. Testing with my patch only succeeded. Than my virtual machine crashed hard and took out my desktop with it :-).

I'm working this one hard today to get the correct patch reviewed.

Comment 22 Volker Lendecke 2011-06-06 19:01:57 UTC
Comment on attachment 6528 [details]
Volker's fix for 3.6.x.

Jeremy found it's broken. I'll submit a new one once I get around to it.
Comment 23 Jeremy Allison 2011-06-06 20:03:52 UTC
Ok - Karolin - the 3 patches are now in master.

Please cherry-pick :


from master to v3-6-test and this one is fixed.

Comment 24 Karolin Seeger 2011-06-07 07:16:08 UTC
(In reply to comment #23)
> Ok - Karolin - the 3 patches are now in master.
> Please cherry-pick :
> dbcdf3e39c359241b743a9455ae695e14a30caa9
> 3c9b3b2befc524f21c59f46ea9be1602b4b1bfe8
> 0efcc94fb834aeb03e8edc3034aa0cdeefdc0985
> from master to v3-6-test and this one is fixed.
> Jeremy.

Closing out bug report.

Comment 25 Volker Lendecke 2011-06-07 12:38:35 UTC
Created attachment 6537 [details]

New Patchset with fix. The first one is the same as the one previously sent, the second one is the fix for the problem Jeremy detected yesterday. For upstream I would squash the two.

Comment 26 Volker Lendecke 2011-06-07 12:39:39 UTC
Created attachment 6538 [details]

Gna, this time with --stdout.
Comment 27 Stefan Metzmacher 2011-09-08 10:25:56 UTC
Comment on attachment 6538 [details]

Volker what's the status of the 2nd part of this attachment, does commit d41d2e93f4e13e7975bcd8d4b7dc125f81ef2559 in master also fixes the problem? What was the situation where cli_smb_req_unset_pending() got called twice?