Here is what's in smbd.log:

[2015/03/30 02:49:06.359810, 1, pid=45373] ../source3/lib/messages_dgm.c:269(messaging_dgm_init)
  unix_msg_init failed: Address already in use
[2015/03/30 02:49:06.366289, 0, pid=45373] ../source3/lib/messages.c:346(messaging_reinit)
  messaging_dgm_init failed: Address already in use
[2015/03/30 02:49:06.366463, 0, pid=45373] ../source3/lib/util.c:480(reinit_after_fork)
  messaging_reinit() failed: NT_STATUS_ADDRESS_ALREADY_ASSOCIATED
[2015/03/30 02:49:06.366687, 0, pid=45373] ../source3/smbd/server.c:623(smbd_accept_connection)
  reinit_after_fork() failed
[2015/03/30 02:49:06.366874, 0, pid=45373] ../source3/lib/util.c:788(smb_panic_s3)
  PANIC (pid 45373): reinit_after_fork() failed
[2015/03/30 02:49:06.368445, 0, pid=45373] ../source3/lib/util.c:899(log_stack_trace)
  BACKTRACE: 10 stack frames:
   #0 /tmw-nas-3p/samba/lib/libsmbconf.so.0'log_stack_trace+0x26 [0xfffffd7ff95275e6]
   #1 /tmw-nas-3p/samba/lib/libsmbconf.so.0'smb_panic_s3+0x21 [0xfffffd7ff95276b2]
   #2 /tmw-nas-3p/samba/lib/libsamba-util.so.0.0.1'smb_panic+0x31 [0xfffffd7ffa93e2c1]
   #3 /tmw-nas-3p/samba/sbin/smbd'smbd_accept_connection+0x341 [0x40bc11]
   #4 /tmw-nas-3p/samba/lib/libsmbconf.so.0'run_events_poll+0x168 [0xfffffd7ff953c808]
   #5 /tmw-nas-3p/samba/lib/libsmbconf.so.0's3_event_loop_once+0xd3 [0xfffffd7ff953ca63]
   #6 /tmw-nas-3p/samba/lib/private/libtevent.so.0.9.22'_tevent_loop_once+0x8d [0xfffffd7ffa0579cd]
   #7 /tmw-nas-3p/samba/lib/private/libtevent.so.0.9.22'tevent_common_loop_wait+0x23 [0xfffffd7ffa057b93]
   #8 /tmw-nas-3p/samba/sbin/smbd'main+0x138a [0x40d6ba]
   #9 /tmw-nas-3p/samba/sbin/smbd'_start+0x6c [0x4085dc]
[2015/03/30 02:49:06.369651, 0, pid=45373] ../source3/lib/dumpcore.c:318(dump_core)
  dumping core in /tmw-nas-3p/samba/var/cores/smbd
We've struggled with this issue for a few weeks. Our debug build of samba shows that the socket file samba/var/cache/msg/<pid> does not exist before the bind() call, yet bind() sporadically fails with EADDRINUSE, which is extremely weird.

At first we thought it might be an OS issue, so we dug into the kernel implementation of bind() and related code. We found that bind() uses the vnode pointer of the socket file as a hash key; if that key already exists in the hash table, bind() returns EADDRINUSE. The key is removed from the hash table only when the socket fd is closed.

So things can go wrong if the socket file is unlinked first and the socket fd is closed afterwards: once the file is unlinked, the OS can reuse the vnode pointer for a newer socket file while the old key is still in the hash table, and a bind() on the new file then collides with the stale entry.

Here is where the socket file is removed:

source3/lib/unix_msg/unix_msg.c

    static int unix_dgram_free(struct unix_dgram_ctx *ctx)
    {
        if (ctx->send_queues != NULL) {
            return EBUSY;
        }

        if (ctx->send_pool != NULL) {
            int ret = pthreadpool_destroy(ctx->send_pool);
            if (ret != 0) {
                return ret;
            }
            ctx->ev_funcs->watch_free(ctx->pool_read_watch);
        }

        ctx->ev_funcs->watch_free(ctx->sock_read_watch);

        if (getpid() == ctx->created_pid) {
            /* If we created it, unlink. Otherwise someone else might
             * still have it open */
            unlink(ctx->path);
        }

        close(ctx->sock);
        free(ctx->recv_buf);
        free(ctx);
        return 0;
    }

I believe the unlink() should be moved after close(ctx->sock), so that the kernel drops the hash entry before the vnode can be reused. This should fix the issue.

My co-worker Ken Harris has a C program which can be used to reproduce this behavior. If needed, we can post it here.
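For reference, the proposed reordering (close the socket fd first, then unlink the socket file) can be sketched in isolation. This is only an illustration and not the Samba patch: struct dgram_endpoint and the two dgram_endpoint_* functions are hypothetical stand-ins for the relevant parts of struct unix_dgram_ctx and unix_dgram_free(), and the pthreadpool and event-watch handling is left out.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical, reduced stand-in for struct unix_dgram_ctx. */
struct dgram_endpoint {
    int sock;                 /* bound AF_UNIX datagram socket */
    pid_t created_pid;        /* pid that created the socket file */
    struct sockaddr_un addr;  /* addr.sun_path is the socket file */
};

/* Bind a datagram socket to path. Returns 0 or an errno value. */
static int dgram_endpoint_init(struct dgram_endpoint *ep, const char *path)
{
    assert(ep != NULL && path != NULL);

    if (strlen(path) >= sizeof(ep->addr.sun_path)) {
        return ENAMETOOLONG;
    }
    memset(&ep->addr, 0, sizeof(ep->addr));
    ep->addr.sun_family = AF_UNIX;
    strcpy(ep->addr.sun_path, path);  /* length checked above */

    ep->sock = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (ep->sock == -1) {
        return errno;
    }
    if (bind(ep->sock, (struct sockaddr *)&ep->addr,
             sizeof(ep->addr)) == -1) {
        int err = errno;
        close(ep->sock);
        ep->sock = -1;
        return err;
    }
    ep->created_pid = getpid();
    return 0;
}

/* Tear down in the proposed order. Returns 0 or an errno value. */
static int dgram_endpoint_free(struct dgram_endpoint *ep)
{
    int ret = 0;

    assert(ep != NULL);

    /* Close first, so the kernel releases its binding for this
     * socket while the socket file still exists. */
    if (close(ep->sock) == -1) {
        ret = errno;
    }
    ep->sock = -1;

    /* Only then remove the name, and only if we created it;
     * otherwise someone else might still have it open. */
    if (getpid() == ep->created_pid) {
        unlink(ep->addr.sun_path);
    }
    return ret;
}
```

A caller would pair dgram_endpoint_init(&ep, path) with dgram_endpoint_free(&ep). With this order the binding is released before the file name goes away, so a later bind() on a fresh socket file should not collide with a stale entry, assuming the kernel behaviour described above.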
(In reply to YOUZHONG YANG from comment #1) Wow, great analysis, thanks! To be honest I'd call this a Solaris bug. Of course I don't see a real reason to do it in the order we do it. Can you change the order and give it a try for a while and tell us if it gets fixed? Thanks, Volker
Just to reinforce Volker, once you've confirmed the change, send in the diff and we'll get it into master and all releases. Thanks a *LOT* for your analysis ! Jeremy.
Created attachment 10969
patch
We ran stress tests last weekend. With the corrected ordering (close the socket fd first, then unlink the socket file) there were no more smbd panics! The proposed patch is attached for your reference. Thanks a lot.
Created attachment 10980
git-am fix for 4.2.next. Cherry-pick of the fix that went into master.
Karolin please push for 4.2.next. Thanks !
This patch did not make it into 4.2.2. I see that the Status is NEEDINFO, but it looks like the information has been provided. Is anything else needed?
(In reply to Tom Schulz from comment #8) Looks like it was just something that Karolin missed (she is superhuman, but not infallible :-). Karolin, could you push this patch for 4.2.next ? Thanks, Jeremy.
(In reply to Jeremy Allison from comment #9) Wow, I have no idea why this one did not show up in my "assigned list"... Sorry for missing it!
Pushed to autobuild-v4-2-test.
(In reply to Karolin Seeger from comment #11) Pushed to v4-2-test. Closing out bug report. Thanks!