I have a problem that looks somewhat like https://bugzilla.samba.org/show_bug.cgi?id=3271. But I don't have enough detail on 3271 to say for sure, so here goes.
I capture the running pid of daemon server processes and register them via the ODBC-dblog patch. My application occasionally grabs the pid from the database and issues a kill for the registered server process.
The server runs on a Linux 2.6.16-based kernel.
I frequently see the kill terminate the receiver process, leaving the generator process stuck in sleep with the traceback detailed below. Occasionally the generator terminates instead and the receiver is left running, similarly stuck in select(). I don't have a traceback for that case at the moment, but I can probably recreate it if needed.
I've found that after adding an exception fd_set to both the receiver's and the generator's select() calls, both processes seem to flush successfully. A patch is attached.
Here's the generator's traceback:
#0 0x473ad402 in __kernel_vsyscall ()
#1 0x4748cddd in ___newselect_nocancel () from /lib/libc.so.6
#2 0x0806a397 in writefd_unbuffered (fd=0, buf=0xbfec9dfc "]", len=97)
#3 0x080692e9 in mplex_write (code=Variable "code" is not available.
) at io.c:1169
#4 0x080693db in io_multiplex_write (code=MSG_ERROR,
buf=0xbfecae88 "rsync error: received SIGINT, SIGTERM, or SIGHUP
(code 20) at rsync.c(260) [generator=2.6.8]\n", len=93) at io.c:1377
#5 0x0805c5b2 in rwrite (code=FERROR, buf=0xbfecae88 "rsync error:
received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(260)
len=93) at log.c:272
#6 0x0805c935 in rprintf (code=4294966782, format=0x8088eec "rsync
error: %s (code %d) at %s(%d) [%s=%s]\n") at log.c:385
#7 0x0805d86c in log_exit (code=20, file=0x8086ecb "rsync.c", line=260)
#8 0x08051e3d in _exit_cleanup (code=20, file=0x8086ecb "rsync.c",
line=260) at cleanup.c:154
#9 0x0804b395 in sig_int (val=15) at rsync.c:260
#10 <signal handler called>
#11 0x473ad402 in __kernel_vsyscall ()
#12 0x4748cddd in ___newselect_nocancel () from /lib/libc.so.6
#13 0x0806a397 in writefd_unbuffered (fd=0, buf=0xbfecc7bc "h\002",
len=620) at io.c:1074
#14 0x080692e9 in mplex_write (code=Variable "code" is not available.
) at io.c:1169
#15 0x0806937b in io_flush (flush_it_all=-4) at io.c:1191
#16 0x0806943b in read_timeout (fd=8, buf=0xbfeccd7c "", len=4) at
#17 0x080699d6 in read_loop (fd=8, buf=0xbfeccd7c "", len=4) at io.c:735
#18 0x08069a22 in read_msg_fd () at io.c:246
#19 0x08069e4d in get_redo_num (itemizing=1, code=FLOG) at io.c:406
#20 0x0804f9eb in generate_files (f_out=0, flist=0x82d8730,
local_name=0x0) at generator.c:1613
#21 0x08058bb1 in do_recv (f_in=0, f_out=0, flist=0x82d8730,
local_name=0x0) at main.c:715
#22 0x0805902a in start_server (f_in=0, f_out=0, argc=1, argv=Variable
"argv" is not available.
) at main.c:796
#23 0x080771cd in start_daemon (f_in=0, f_out=0) at clientserver.c:756
#24 0x08077b72 in daemon_main () at clientserver.c:882
#25 0x08059f54 in main (argc=0, argv=0x0) at main.c:1291
Created attachment 2169
Add exception fd_sets to receiver, generator's select()
I originally added the additional fd_sets to try to better identify the problem. I don't see the added "select error:" messages. But the process doesn't seem to hang anymore.
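For anyone reading without the attachment, the general idea can be sketched roughly as below. This is not the attached patch and not rsync's actual io.c code: the helper name wait_writable(), its return convention, and the log text are made up for illustration. It only shows select() being given an exception fd_set alongside the write fd_set, with an exception being logged and nothing more.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper (not rsync code): wait until fd is writable while
 * also watching it in the exception set.
 * Returns 1 if fd is writable, 0 on timeout or if fd was not reported
 * writable, and -1 if select() itself failed. */
static int wait_writable(int fd, int timeout_secs)
{
    fd_set w_fds, e_fds;
    struct timeval tv;
    int count;

    /* FD_SET() is undefined for descriptors outside [0, FD_SETSIZE). */
    assert(fd >= 0 && fd < FD_SETSIZE);

    for (;;) {
        /* select() modifies the sets and may modify the timeout,
         * so rebuild all of them on every pass. */
        FD_ZERO(&w_fds);
        FD_SET(fd, &w_fds);
        FD_ZERO(&e_fds);
        FD_SET(fd, &e_fds);
        tv.tv_sec = timeout_secs;
        tv.tv_usec = 0;

        count = select(fd + 1, NULL, &w_fds, &e_fds, &tv);
        if (count < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        if (count == 0)
            return 0;           /* timed out */

        /* Only log the exception; take no other action. */
        if (FD_ISSET(fd, &e_fds))
            fprintf(stderr, "select exception on fd %d\n", fd);

        return FD_ISSET(fd, &w_fds) ? 1 : 0;
    }
}
```

A caller would check for a return value of 1 before calling write() on the descriptor, and treat 0 or -1 as "not ready" according to its own timeout policy.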
Good idea -- I should have done that long ago.
I have checked in some changes very similar to your patch. The code currently just logs the warning (like you did) without taking any other action. We can decide later if some exceptions should be fatal.