Bug 11217 - smbd crashes sporadically on Solaris platform - bind() fails with EADDRINUSE in unix_dgram_init()
Summary: smbd crashes sporadically on Solaris platform - bind() fails with EADDRINUSE ...
Status: RESOLVED FIXED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: Other (show other bugs)
Version: 4.2.0
Hardware: All All
: P5 normal (vote)
Target Milestone: ---
Assignee: Karolin Seeger
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-04-14 19:01 UTC by YOUZHONG YANG
Modified: 2015-06-17 18:02 UTC (History)
1 user (show)

See Also:


Attachments
patch (954 bytes, patch)
2015-04-20 13:50 UTC, YOUZHONG YANG
no flags Details
git-am fix for 4.2.next. (1.23 KB, patch)
2015-04-22 23:53 UTC, Jeremy Allison
vl: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description YOUZHONG YANG 2015-04-14 19:01:42 UTC
Here is what's in smbd.log: 

[2015/03/30 02:49:06.359810,  1, pid=45373] ../source3/lib/messages_dgm.c:269(messaging_dgm_init) 
  unix_msg_init failed: Address already in use 
[2015/03/30 02:49:06.366289,  0, pid=45373] ../source3/lib/messages.c:346(messaging_reinit) 
  messaging_dgm_init failed: Address already in use 
[2015/03/30 02:49:06.366463,  0, pid=45373] ../source3/lib/util.c:480(reinit_after_fork) 
  messaging_reinit() failed: NT_STATUS_ADDRESS_ALREADY_ASSOCIATED 
[2015/03/30 02:49:06.366687,  0, pid=45373] ../source3/smbd/server.c:623(smbd_accept_connection) 
  reinit_after_fork() failed 
[2015/03/30 02:49:06.366874,  0, pid=45373] ../source3/lib/util.c:788(smb_panic_s3) 
  PANIC (pid 45373): reinit_after_fork() failed 
[2015/03/30 02:49:06.368445,  0, pid=45373] ../source3/lib/util.c:899(log_stack_trace) 
  BACKTRACE: 10 stack frames: 
   #0 /tmw-nas-3p/samba/lib/libsmbconf.so.0'log_stack_trace+0x26 [0xfffffd7ff95275e6] 
   #1 /tmw-nas-3p/samba/lib/libsmbconf.so.0'smb_panic_s3+0x21 [0xfffffd7ff95276b2] 
   #2 /tmw-nas-3p/samba/lib/libsamba-util.so.0.0.1'smb_panic+0x31 [0xfffffd7ffa93e2c1] 
   #3 /tmw-nas-3p/samba/sbin/smbd'smbd_accept_connection+0x341 [0x40bc11] 
   #4 /tmw-nas-3p/samba/lib/libsmbconf.so.0'run_events_poll+0x168 [0xfffffd7ff953c808] 
   #5 /tmw-nas-3p/samba/lib/libsmbconf.so.0's3_event_loop_once+0xd3 [0xfffffd7ff953ca63] 
   #6 /tmw-nas-3p/samba/lib/private/libtevent.so.0.9.22'_tevent_loop_once+0x8d [0xfffffd7ffa0579cd] 
   #7 /tmw-nas-3p/samba/lib/private/libtevent.so.0.9.22'tevent_common_loop_wait+0x23 [0xfffffd7ffa057b93] 
   #8 /tmw-nas-3p/samba/sbin/smbd'main+0x138a [0x40d6ba] 
   #9 /tmw-nas-3p/samba/sbin/smbd'_start+0x6c [0x4085dc] 
[2015/03/30 02:49:06.369651,  0, pid=45373] ../source3/lib/dumpcore.c:318(dump_core) 
  dumping core in /tmw-nas-3p/samba/var/cores/smbd
Comment 1 YOUZHONG YANG 2015-04-14 19:22:26 UTC
We've struggled with this issue for a few weeks.

Our debug version of samba shows that the socket file samba/var/cache/msg/<pid> file does not exist before the bind() call, but bind() fails with EADDRINUSE very sporadically, which is extremely weird.

First we thought it might be an OS issue, so we dug deeper into the kernel implementation of the bind() and related stuff, we found out that when doing bind(), the kernel uses the vnode pointer of the socket file as a hash key, if such a key already exists in the hash table, it returns EADDRINUSE. The key is removed from the hash table when closing the socket fd.

Apparently things can go wrong if the socket file is unlinked first, and then socket fd is closed, since the OS can reuse the vnode pointer for newer socket file once it is unlinked.

Here is when the socket file is removed:
source3/lib/unix_msg/unix_msg.c
static int unix_dgram_free(struct unix_dgram_ctx *ctx)
{
	if (ctx->send_queues != NULL) {
		return EBUSY;
	}

	if (ctx->send_pool != NULL) {
		int ret = pthreadpool_destroy(ctx->send_pool);
		if (ret != 0) {
			return ret;
		}
		ctx->ev_funcs->watch_free(ctx->pool_read_watch);
	}

	ctx->ev_funcs->watch_free(ctx->sock_read_watch);

	if (getpid() == ctx->created_pid) {
		/* If we created it, unlink. Otherwise someone else might
		 * still have it open */
		unlink(ctx->path);
	}

	close(ctx->sock);
	free(ctx->recv_buf);
	free(ctx);
	return 0;
}

I believe the unlink() should be moved after close(ctx->sock), which makes sense to me. This should fix the issue.

My co-worker Ken Harris has a C program which can be used to reproduce this behavior. If needed, we can post it here.
Comment 2 Volker Lendecke 2015-04-14 19:51:39 UTC
(In reply to YOUZHONG YANG from comment #1)

Wow, great analysis, thanks!

To be honest I'd call this a Solaris bug. Of course I don't see a real reason to do it in the order we do it. Can you change the order and give it a try for a while and tell us if it gets fixed?

Thanks,

Volker
Comment 3 Jeremy Allison 2015-04-15 20:22:17 UTC
Just to reinforce Volker, once you've confirmed the change, send in the diff and we'll get it into master and all releases.

Thanks a *LOT* for your analysis !

Jeremy.
Comment 4 YOUZHONG YANG 2015-04-20 13:50:19 UTC
Created attachment 10969 [details]
patch
Comment 5 YOUZHONG YANG 2015-04-20 13:53:57 UTC
We performed stress testing last weekend, with the correct ordering of closing fd and unlinking socket file, no more smbd panics!

The proposed patch is attached for your reference.

Thanks a lot.
Comment 6 Jeremy Allison 2015-04-22 23:53:17 UTC
Created attachment 10980 [details]
git-am fix for 4.2.next.

Cherry-pick of fix that went into master.
Comment 7 Jeremy Allison 2015-04-23 19:52:43 UTC
Karolin please push for 4.2.next.

Thanks !
Comment 8 Tom Schulz 2015-06-09 14:01:30 UTC
This patch did not make it into 4.2.2. I see that the Status is NEEDINFO, but it looks like the information has been provided. Is anything else needed?
Comment 9 Jeremy Allison 2015-06-09 16:23:11 UTC
(In reply to Tom Schulz from comment #8)

Looks like it was just something that Karolin missed (she is superhuman, but not infallible :-).

Karolin, could you push this patch for 4.2.next ?

Thanks,

Jeremy.
Comment 10 Karolin Seeger 2015-06-09 19:22:28 UTC
(In reply to Jeremy Allison from comment #9)
Wow, I have no idea why this one did not show up in my "assigned list"...
Sorry for missing it!
Comment 11 Karolin Seeger 2015-06-09 19:23:21 UTC
Pushed to autobuild-v4-2-test.
Comment 12 Karolin Seeger 2015-06-17 18:02:04 UTC
(In reply to Karolin Seeger from comment #11)
Pushed to v4-2-test.
Closing out bug report.

Thanks!