Bug 12853 - smbd panic if open() fails with errno = EINTR
smbd panic if open() fails with errno = EINTR
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services
All All
: P5 normal
: ---
Assigned To: Jeremy Allison
Samba QA Contact
Depends on:
  Show dependency treegraph
Reported: 2017-06-20 14:13 UTC by Lev
Modified: 2017-06-22 20:28 UTC (History)
3 users (show)

See Also:

Log file (23.06 KB, text/plain)
2017-06-20 14:13 UTC, Lev
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Lev 2017-06-20 14:13:26 UTC
Created attachment 13296 [details]
Log file

Errno EINTR is converted into NT_STATUS_RETRY. When open_file_ntcreate() gets status NT_STATUS_RETRY from open_file(), it calls schedule_async_open(). schedule_async_open() creates timer that panics smbd in 20sec:

(gdb) bt
#0  0x00007fbef1d9ac37 in read_alias_file (fname=<optimized out>, fname_len=<optimized out>) at localealias.c:335
#1  0x00007fbef34a349a in smb_panic_s3 (why=0x7fbef5c8666a "async open timeout") at ../source3/lib/util.c:822
#2  0x00007fbef5f8c14d in smb_panic (why=0x7fbef5c8666a "async open timeout") at ../lib/util/fault.c:166
#3  0x00007fbef5b34a43 in exit_server_common (how=SERVER_EXIT_ABNORMAL, reason=0x7fbef5c8666a "async open timeout") at ../source3/smbd/server_exit.c:240
#4  0x00007fbef5b34afa in smbd_exit_server (explanation=0x7fbef5c8666a "async open timeout") at ../source3/smbd/server_exit.c:258
#5  0x00007fbef2e60e1b in exit_server (reason=0x7fbef5c8666a "async open timeout") at ../source3/lib/smbd_shim.c:123
#6  0x00007fbef5aac0a5 in schedule_async_open_timer (ev=0x7fbef7946e20, te=0x7fbef79d1d70, current_time=..., private_data=0x7fbef7a489c0) at ../source3/smbd/open.c:2607
#7  0x00007fbef4f39ec3 in tevent_common_loop_timer_delay (ev=0x7fbef7946e20) at ../lib/tevent/tevent_timed.c:341
#8  0x00007fbef4f3b9d9 in epoll_event_loop (epoll_ev=0x7fbef79497b0, tvalp=0x7ffd48f56f70) at ../lib/tevent/tevent_epoll.c:659
#9  0x00007fbef4f3c234 in epoll_event_loop_once (ev=0x7fbef7946e20, location=0x7fbef5c95788 "../source3/smbd/process.c:4125") at ../lib/tevent/tevent_epoll.c:926
#10 0x00007fbef4f3911e in std_event_loop_once (ev=0x7fbef7946e20, location=0x7fbef5c95788 "../source3/smbd/process.c:4125") at ../lib/tevent/tevent_standard.c:114
#11 0x00007fbef4f328c3 in _tevent_loop_once (ev=0x7fbef7946e20, location=0x7fbef5c95788 "../source3/smbd/process.c:4125") at ../lib/tevent/tevent.c:533
#12 0x00007fbef4f32b0d in tevent_common_loop_wait (ev=0x7fbef7946e20, location=0x7fbef5c95788 "../source3/smbd/process.c:4125") at ../lib/tevent/tevent.c:637
#13 0x00007fbef4f391c0 in std_event_loop_wait (ev=0x7fbef7946e20, location=0x7fbef5c95788 "../source3/smbd/process.c:4125") at ../lib/tevent/tevent_standard.c:145
#14 0x00007fbef4f32bd8 in _tevent_loop_wait (ev=0x7fbef7946e20, location=0x7fbef5c95788 "../source3/smbd/process.c:4125") at ../lib/tevent/tevent.c:656
#15 0x00007fbef5adde16 in smbd_process (ev_ctx=0x7fbef7946e20, msg_ctx=0x7fbef79473a0, sock_fd=41, interactive=false) at ../source3/smbd/process.c:4125
#16 0x00007fbef6a4ed16 in smbd_accept_connection (ev=0x7fbef7946e20, fde=0x7fbef795f820, flags=1, private_data=0x7fbef795f790) at ../source3/smbd/server.c:1023
#17 0x00007fbef4f3bc16 in epoll_event_loop (epoll_ev=0x7fbef7946fa0, tvalp=0x7ffd48f57330) at ../lib/tevent/tevent_epoll.c:728
#18 0x00007fbef4f3c234 in epoll_event_loop_once (ev=0x7fbef7946e20, location=0x7fbef6a552b7 "../source3/smbd/server.c:1390") at ../lib/tevent/tevent_epoll.c:926
#19 0x00007fbef4f3911e in std_event_loop_once (ev=0x7fbef7946e20, location=0x7fbef6a552b7 "../source3/smbd/server.c:1390") at ../lib/tevent/tevent_standard.c:114
#20 0x00007fbef4f328c3 in _tevent_loop_once (ev=0x7fbef7946e20, location=0x7fbef6a552b7 "../source3/smbd/server.c:1390") at ../lib/tevent/tevent.c:533
#21 0x00007fbef4f32b0d in tevent_common_loop_wait (ev=0x7fbef7946e20, location=0x7fbef6a552b7 "../source3/smbd/server.c:1390") at ../lib/tevent/tevent.c:637
#22 0x00007fbef4f391c0 in std_event_loop_wait (ev=0x7fbef7946e20, location=0x7fbef6a552b7 "../source3/smbd/server.c:1390") at ../lib/tevent/tevent_standard.c:145
#23 0x00007fbef4f32bd8 in _tevent_loop_wait (ev=0x7fbef7946e20, location=0x7fbef6a552b7 "../source3/smbd/server.c:1390") at ../lib/tevent/tevent.c:656
#24 0x00007fbef6a4fb26 in smbd_parent_loop (ev_ctx=0x7fbef7946e20, parent=0x7fbef794a5d0) at ../source3/smbd/server.c:1390
#25 0x00007fbef6a51579 in main (argc=2, argv=0x7ffd48f57938) at ../source3/smbd/server.c:2051
Comment 1 Jeremy Allison 2017-06-20 16:07:07 UTC
this is by design. Under what circumstances can an open() on a disk file return -1,EINTR ?

Note the man page:

       EINTR  While blocked waiting to complete  an  open  of  a  slow  device
              (e.g.,  a FIFO; see fifo(7)), the call was interrupted by a sig‐
              nal handler; see signal(7).

are you doing an open on a slow device here ?
Comment 2 Ralph Böhme 2017-06-20 16:55:23 UTC
(In reply to Jeremy Allison from comment #1)
Be careful, this is man 2 open on Linux, POSIX leaves this door wide open and just states:

A signal was caught during open().

It seems we can't rely on open not returning -1/EINTR for a filesystem open.

Maybe we should use a different errno from vfs_aio_pthread to signal *real* async open in-progress, so we can distinguish it from a failing open with -1/EINTR.
Comment 3 Jeremy Allison 2017-06-20 16:59:49 UTC
Yeah let's find out the *exact* circumstances before we start rushing to make changes here. What OS, what filesystem, what type of file etc.

This code has been robust for a *long* while, so what changed for them ?
Comment 4 Jeremy Allison 2017-06-20 17:02:57 UTC
To be blunt - if open (2) on a filesystem file can return EINTR then we're screwed in many more code paths than just this one.

This has never been the case on just about any system I've ever worked on (and I've worked on a lot :-). On a fifo or filsysytem pipe etc. maybe. But not on a regular file.
Comment 5 Jeremy Allison 2017-06-20 17:11:37 UTC
If a system returns -1,EINTR on open (2)
the fix should be a custom patch that does:

$ git diff
diff --git a/source3/modules/vfs_default.c b/source3/modules/vfs_default.c
index d339f39aa06..cb87ad2b026 100644
--- a/source3/modules/vfs_default.c
+++ b/source3/modules/vfs_default.c
@@ -566,7 +566,10 @@ static int vfswrap_open(vfs_handle_struct *handle,
                goto out;

-       result = open(smb_fname->base_name, flags, mode);
+       do {
+               result = open(smb_fname->base_name, flags, mode);
+       } while (result == -1 && errno == EINTR);
        return result;

because that system is broken :-).
Comment 6 Ralph Böhme 2017-06-22 17:11:07 UTC
I guess what happens is that we somehow mistakenly did a blocking open on a file that is kernel oplocked by another process.

We do a blocking open on a file when be the oplock holding smbd released the oplock and we got signalled via dbwrap_watch on the locking.tdb record of the file.

So the real question is: why is the second blocking open still being blocked? Race condition? Something else?

This would also mean that checking for EINTR in vfs_default is probably not the correct fix.
Comment 7 Jeremy Allison 2017-06-22 17:33:04 UTC
But we shouldn't be doing the open on a file with a kernel oplock, we should have requested the holding smbd to remove that oplock before trying to do an open.

Looking briefly at the logic inside oplock.c we do remove the kernel oplock under lock, so the sending smbd shouldn't be proceeding until the kernel oplock is gone.
Comment 8 Ralph Böhme 2017-06-22 20:26:52 UTC
(In reply to Jeremy Allison from comment #7)
Exactly, something is fishy here. For some unknown reason we are doing a blocking open on a kernel oplocked file that later get's the EINTR.
Comment 9 Jeremy Allison 2017-06-22 20:28:31 UTC
Can we get a reproducer that shows how we end up in this position ?

I'm glad though - I was pretty sure that under normal conditions no system returns -1,EINTR on a disk file :-).