Bug 9991 - tevent needs to use monotonic clock (winbind stuck coming online)
Status: NEW
Alias: None
Product: Samba 4.0
Classification: Unclassified
Component: Winbind
Version: 4.0.6
Hardware: All All
Importance: P5 normal
Target Milestone: ---
Assignee: Stefan Metzmacher
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-07-02 13:43 UTC by David Woodhouse
Modified: 2013-07-09 12:03 UTC

See Also:


Description David Woodhouse 2013-07-02 13:43:54 UTC
Machine boots, winbind starts up. Not on VPN so should be offline.

User joins VPN. A NetworkManager dispatcher script runs 'smbcontrol winbind online', but nothing seems to happen because:


[2013/07/02 16:04:55.541651, 10, pid=1115, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:205(fork_child_dc_connect)
  fork_child_dc_connect: pid 1635 already checking for DC's.

This seems to persist forever. Taking it offline and online again with smbcontrol just yields more log messages which, half an hour later, still report the same thing:

[2013/07/02 16:39:34.433513, 10, pid=1115, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:205(fork_child_dc_connect)
  fork_child_dc_connect: pid 1635 already checking for DC's.


This is the offending pid 1635:

(gdb) bt
#0  0x0000003f0a2ea8e0 in __poll_nocancel () from /lib64/libc.so.6
#1  0x0000003f3a043958 in s3_event_loop_once () from /lib64/libsmbconf.so.0
#2  0x0000003f28e03bcd in _tevent_loop_once () from /lib64/libtevent.so.0
#3  0x0000003f28e04c0f in tevent_req_poll () from /lib64/libtevent.so.0
#4  0x0000003f2e8010be in tevent_req_poll_ntstatus () from /lib64/libtevent-util.so.0
#5  0x0000003f44a1310f in name_resolve_bcast () from /usr/lib64/samba/libgse.so
#6  0x0000003f44a14327 in internal_resolve_name () from /usr/lib64/samba/libgse.so
#7  0x0000003f44a14aa9 in get_dc_list () from /usr/lib64/samba/libgse.so
#8  0x0000003f44a159df in get_sorted_dc_list () from /usr/lib64/samba/libgse.so
#9  0x0000003f0e21b428 in get_dc_name () from /usr/lib64/samba/libads.so
#10 0x000000000043f3ca in get_dcs ()
#11 0x000000000043fa08 in check_domain_online_handler ()
#12 0x0000003f28e079ef in tevent_common_loop_timer_delay () from /lib64/libtevent.so.0
#13 0x0000003f3a043799 in run_events_poll () from /lib64/libsmbconf.so.0
#14 0x000000000044fe9a in fork_domain_child ()
#15 0x0000000000450cc5 in wb_child_request_trigger ()
#16 0x0000003f28e043f4 in tevent_common_loop_immediate () from /lib64/libtevent.so.0
#17 0x0000003f3a04360c in run_events_poll () from /lib64/libsmbconf.so.0
#18 0x0000003f3a0438f4 in s3_event_loop_once () from /lib64/libsmbconf.so.0
#19 0x0000003f28e03bcd in _tevent_loop_once () from /lib64/libtevent.so.0
#20 0x000000000042008a in main ()
Comment 1 David Woodhouse 2013-07-02 13:48:40 UTC
Looking through logs, it seems that time is jumping backwards by about three hours. The user's clock is probably off by that amount, and is being corrected by NTP during the boot process after winbind has already started up.

Could the offending 'stuck' winbind process actually be waiting for three hours for a timeout to expire, which was really only supposed to last a few seconds?

POSIX provides monotonic clocks which can be used for timeouts; it isn't necessary to use wall-clock time...
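
A minimal sketch of that idea, assuming a POSIX system with clock_gettime(). The helper names deadline_after() and deadline_expired() are hypothetical and are not part of tevent or Samba. Because the deadline is taken from CLOCK_MONOTONIC, an NTP step of the wall clock does not stretch or shrink the timeout:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper: a deadline "seconds" from now on the monotonic clock. */
static struct timespec deadline_after(time_t seconds)
{
	struct timespec ts = {0, 0};

	clock_gettime(CLOCK_MONOTONIC, &ts);
	ts.tv_sec += seconds;
	return ts;
}

/* Hypothetical helper: true once the monotonic clock has reached the deadline. */
static bool deadline_expired(const struct timespec *deadline)
{
	struct timespec now = {0, 0};

	clock_gettime(CLOCK_MONOTONIC, &now);
	if (now.tv_sec != deadline->tv_sec) {
		return now.tv_sec > deadline->tv_sec;
	}
	return now.tv_nsec >= deadline->tv_nsec;
}
```

A caller would compute the deadline once and poll deadline_expired() from its event loop, instead of comparing against gettimeofday().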
Comment 2 Björn Jacke 2013-07-02 19:27:54 UTC
Samba uses the monotonic clock in many places. In the most important place,
however (libtevent), monotonic time is not being used. Fixing this would mean
a major rework of tevent. I discussed this with metze when I did other monotonic
clock fixes, as the tevent changes would affect all parts of Samba that use
tevent. Unfortunately the tevent rework was postponed.
Comment 3 Simo Sorce 2013-07-02 20:09:58 UTC
(In reply to comment #2)
> Samba uses the monotonic clock at many places. At the most important place
> however (libtevent) monotonic time is not being used. This would mean major
> rework of the tevent. I discussed this with metze when I did other monotonic
> clock fixes as the tevent changes would affect all parts of samba that use
> tevent. Unfortunately the tevent rework was postponed.

I do not fully agree that it is always in tevent's interest to use a monotonic clock, although if CLOCK_BOOTTIME is used it might work well for most uses.

Maybe we could have a tevent switch that we can tweak at initialization time to decide what to use?
Comment 4 Björn Jacke 2013-07-02 20:23:53 UTC
As I discussed a while ago with metze, tevent would require two different interfaces:

(1) one for events that should happen at a certain time, and

(2) one for events that should happen after a certain amount of time.

Currently tevent is entirely based on the (1) event model even though most tevent "consumers" actually need (2). And for (2) the monotonic clock is needed.

For events of type (1) the realtime clock is needed. You cannot bring both together with CLOCK_BOOTTIME without running into other major problems, I'm pretty sure. (CLOCK_BOOTTIME is also Linux-only and only available in kernels from 2011 (2.6.39) onward.)
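
The two models can be sketched roughly as follows. This only illustrates the distinction and is not tevent's API: the names sketch_timer, sketch_timer_at(), sketch_timer_after() and sketch_timer_is_due() are hypothetical. Each timer remembers which clock its due time belongs to, so type (1) is checked against CLOCK_REALTIME and type (2) against CLOCK_MONOTONIC:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <time.h>

enum sketch_timer_kind {
	SKETCH_AT_WALL_TIME,	/* (1) fire at a certain calendar time */
	SKETCH_AFTER_DELAY	/* (2) fire after a certain amount of time */
};

struct sketch_timer {
	enum sketch_timer_kind kind;
	struct timespec due;	/* measured on the clock selected by kind */
};

static clockid_t sketch_clock(enum sketch_timer_kind kind)
{
	return kind == SKETCH_AT_WALL_TIME ? CLOCK_REALTIME : CLOCK_MONOTONIC;
}

/* Type (1): due at an absolute wall-clock time (seconds since the epoch). */
static struct sketch_timer sketch_timer_at(time_t wall_seconds)
{
	struct sketch_timer t = { SKETCH_AT_WALL_TIME, { wall_seconds, 0 } };
	return t;
}

/* Type (2): due after delay_seconds have elapsed, whatever the wall clock does. */
static struct sketch_timer sketch_timer_after(time_t delay_seconds)
{
	struct sketch_timer t = { SKETCH_AFTER_DELAY, { 0, 0 } };

	clock_gettime(CLOCK_MONOTONIC, &t.due);
	t.due.tv_sec += delay_seconds;
	return t;
}

/* Compare "now" on the timer's own clock against its due time. */
static bool sketch_timer_is_due(const struct sketch_timer *t)
{
	struct timespec now = {0, 0};

	clock_gettime(sketch_clock(t->kind), &now);
	if (now.tv_sec != t->due.tv_sec) {
		return now.tv_sec > t->due.tv_sec;
	}
	return now.tv_nsec >= t->due.tv_nsec;
}
```

With this split, a wall-clock jump moves type (1) timers (which is what their callers want) and leaves type (2) timers untouched.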
Comment 5 Simo Sorce 2013-07-02 20:56:32 UTC
(In reply to comment #4)
> as I discussed a while ago with metze, tevent would require two different
> interfaces for events that should happen
> 
> (1) at a certain time and one for events
> 
> (2) that happen after a certain amount of time

I agree on these.

> Currently tevent is entirely based on the (1) event model even though most
> tevent "consumers" actually need (2). And for (2) the monotonic clock is
> needed. 
> 
> For events of type (1) the realtime clock is needed.

By realtime, do you mean CLOCK_REALTIME or time(NULL)?
I can see how different programs may want one or the other.

> You cannot bring both
> together with CLOCK_BOOTTIME without running into other major problems, I'm
> pretty sure. (CLOCK_BOOTTIME is also Linux only and only available in kernels
> after 2011)

For (2), CLOCK_BOOTTIME is what you want in many cases (for example SSSD), with a fallback to CLOCK_MONOTONIC if CLOCK_BOOTTIME is not available.
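
A rough sketch of such a fallback, assuming clock_gettime() is available. The function name relative_clock_now() is hypothetical. CLOCK_BOOTTIME is tried first where the headers define it, and CLOCK_MONOTONIC is used if the headers or the running kernel do not support it:

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/*
 * Hypothetical helper: read a clock suitable for relative timeouts.
 * Returns 0 on success, -1 on failure (errno set by clock_gettime).
 */
static int relative_clock_now(struct timespec *ts)
{
#ifdef CLOCK_BOOTTIME
	/* Preferred: also counts time spent suspended (Linux-specific). */
	if (clock_gettime(CLOCK_BOOTTIME, ts) == 0) {
		return 0;
	}
	/* An older kernel may reject the clock id; fall through. */
#endif
	return clock_gettime(CLOCK_MONOTONIC, ts);
}
```

Values from the two clocks are not comparable with each other, so a real implementation would probably decide once at startup which clock to use rather than on every call.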
Comment 6 Björn Jacke 2013-07-03 08:30:50 UTC
(In reply to comment #5)
> By realtime you mean CLOCK_REALTIME or time(NULL) ?
> I can see how different programs may want one or the other.

They use the same clock; clock_gettime(CLOCK_REALTIME) just has sub-second resolution.
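
As a quick illustration (a sketch only; the helper name realtime_vs_time_skew() is made up), the seconds field of clock_gettime(CLOCK_REALTIME) can be compared with time(NULL). On a normally running system the two should agree to within a second:

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Whole seconds by which clock_gettime(CLOCK_REALTIME) is ahead of time(NULL). */
static long realtime_vs_time_skew(void)
{
	struct timespec ts = {0, 0};
	time_t t = time(NULL);

	clock_gettime(CLOCK_REALTIME, &ts);
	return (long)(ts.tv_sec - t);
}
```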

> For 2 CLOCK_BOOTTIME is what you want in many cases (for example SSSD), with
> fallback to CLOCK_MONOTONIC if CLOCK_BOOTTIME is not available.

CLOCK_BOOTTIME is monotonic too, which is a good thing, and it can be used instead of CLOCK_MONOTONIC. But in which situations is this useful in sssd? You have a monotonic clock, but it says nothing about the system's actual time. Imagine the system booted with the realtime clock set to a week ago and NTP set the time right some time after booting. Why would you want to bother about the old realtime that CLOCK_BOOTTIME returns?
Comment 7 Simo Sorce 2013-07-03 15:49:19 UTC
(In reply to comment #6)
> CLOCK_BOOTTIME is monotonic, too, which is a good thing and can be used instead
> of CLOCK_MONOTONIC but in which situations this is useful in sssd? You have a
> monotonic clock but it says nothing about the system actual time. Imagine the
> system booted with the realtime set to a week ago and NTP set the time
> right some time after booting. Why would you want to bother about the old
> realtime that CLOCK_BOOTTIME returns?

The only difference between MONOTONIC and BOOTTIME is that BOOTTIME also accounts for the time the laptop is sleeping.

If you suspend the laptop for 2 hours and I had a job to be executed after X hours, I want it to execute at the right time, not at X+2 hours.

CLOCK_MONOTONIC does not advance during sleep.

Both sssd and winbindd, being client daemons, are often suspended on laptops, so they probably want to use CLOCK_BOOTTIME in preference.
Comment 8 Björn Jacke 2013-07-03 16:19:49 UTC
Ah, thanks - a nice detail! This is what we would generally prefer for time_mono() and clock_gettime_mono() in Samba as well; I'll make a patch for that.

Back to tevent - it would be a good idea if we could introduce such an additional "event to happen after $time" model as soon as possible. (Would this require raising the library version number?) We could then change the callers in Samba to the right event timer model one after the other later on.
Comment 9 Simo Sorce 2013-07-03 19:43:02 UTC
(In reply to comment #8)
> Back to tevent - it would be a good idea, if we could introduce such an
> additional "event to happen after $time" model as soon as possible. (Would this
> require a raising of the library version number?) We could change the callers
> in samba to the right event timer model one after the other later on.

The issue is not really about version, but whether we want to stay ABI compatible. I think we do.

In order to stay ABI compatible we would need to introduce new public interfaces to get a different input type.

I prefer adding new APIs in order to be able to use both absolute and relative times at the same time and to be backwards ABI compatible.

In this case we can decide to raise either the minor or the major version, but we should not raise the SONAME.