11217 – smbd crashes sporadically on Solaris platform - bind() fails with EADDRINUSE in unix_dgram_init()

Bug 11217 - smbd crashes sporadically on Solaris platform - bind() fails with EADDRINUSE in unix_dgram_init()

Summary: smbd crashes sporadically on Solaris platform - bind() fails with EADDRINUSE ...

Status:	RESOLVED FIXED

Alias:	None

Product:	Samba 4.1 and newer
Classification:	Unclassified
Component:	Other (show other bugs)
Version:	4.2.0
Hardware:	All All

Importance:	P5 normal (vote)
Target Milestone:	---
Assignee:	Karolin Seeger
QA Contact:	Samba QA Contact

URL:
Keywords:

Depends on:
Blocks:

Reported:	2015-04-14 19:01 UTC by YOUZHONG YANG
Modified:	2015-06-17 18:02 UTC (History)
CC List:	1 user (show)

See Also:

Attachments
patch (954 bytes, patch) 2015-04-20 13:50 UTC, YOUZHONG YANG	no flags	Details
git-am fix for 4.2.next. (1.23 KB, patch) 2015-04-22 23:53 UTC, Jeremy Allison	vl: review+	Details
Show Obsolete (1) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description YOUZHONG YANG 2015-04-14 19:01:42 UTC

Here is what's in smbd.log: 

[2015/03/30 02:49:06.359810,  1, pid=45373] ../source3/lib/messages_dgm.c:269(messaging_dgm_init) 
  unix_msg_init failed: Address already in use 
[2015/03/30 02:49:06.366289,  0, pid=45373] ../source3/lib/messages.c:346(messaging_reinit) 
  messaging_dgm_init failed: Address already in use 
[2015/03/30 02:49:06.366463,  0, pid=45373] ../source3/lib/util.c:480(reinit_after_fork) 
  messaging_reinit() failed: NT_STATUS_ADDRESS_ALREADY_ASSOCIATED 
[2015/03/30 02:49:06.366687,  0, pid=45373] ../source3/smbd/server.c:623(smbd_accept_connection) 
  reinit_after_fork() failed 
[2015/03/30 02:49:06.366874,  0, pid=45373] ../source3/lib/util.c:788(smb_panic_s3) 
  PANIC (pid 45373): reinit_after_fork() failed 
[2015/03/30 02:49:06.368445,  0, pid=45373] ../source3/lib/util.c:899(log_stack_trace) 
  BACKTRACE: 10 stack frames: 
   #0 /tmw-nas-3p/samba/lib/libsmbconf.so.0'log_stack_trace+0x26 [0xfffffd7ff95275e6] 
   #1 /tmw-nas-3p/samba/lib/libsmbconf.so.0'smb_panic_s3+0x21 [0xfffffd7ff95276b2] 
   #2 /tmw-nas-3p/samba/lib/libsamba-util.so.0.0.1'smb_panic+0x31 [0xfffffd7ffa93e2c1] 
   #3 /tmw-nas-3p/samba/sbin/smbd'smbd_accept_connection+0x341 [0x40bc11] 
   #4 /tmw-nas-3p/samba/lib/libsmbconf.so.0'run_events_poll+0x168 [0xfffffd7ff953c808] 
   #5 /tmw-nas-3p/samba/lib/libsmbconf.so.0's3_event_loop_once+0xd3 [0xfffffd7ff953ca63] 
   #6 /tmw-nas-3p/samba/lib/private/libtevent.so.0.9.22'_tevent_loop_once+0x8d [0xfffffd7ffa0579cd] 
   #7 /tmw-nas-3p/samba/lib/private/libtevent.so.0.9.22'tevent_common_loop_wait+0x23 [0xfffffd7ffa057b93] 
   #8 /tmw-nas-3p/samba/sbin/smbd'main+0x138a [0x40d6ba] 
   #9 /tmw-nas-3p/samba/sbin/smbd'_start+0x6c [0x4085dc] 
[2015/03/30 02:49:06.369651,  0, pid=45373] ../source3/lib/dumpcore.c:318(dump_core) 
  dumping core in /tmw-nas-3p/samba/var/cores/smbd

Comment 1 YOUZHONG YANG 2015-04-14 19:22:26 UTC

We've struggled with this issue for a few weeks.

Our debug version of samba shows that the socket file samba/var/cache/msg/<pid> file does not exist before the bind() call, but bind() fails with EADDRINUSE very sporadically, which is extremely weird.

First we thought it might be an OS issue, so we dug deeper into the kernel implementation of the bind() and related stuff, we found out that when doing bind(), the kernel uses the vnode pointer of the socket file as a hash key, if such a key already exists in the hash table, it returns EADDRINUSE. The key is removed from the hash table when closing the socket fd.

Apparently things can go wrong if the socket file is unlinked first, and then socket fd is closed, since the OS can reuse the vnode pointer for newer socket file once it is unlinked.

Here is when the socket file is removed:
source3/lib/unix_msg/unix_msg.c
static int unix_dgram_free(struct unix_dgram_ctx *ctx)
{
	if (ctx->send_queues != NULL) {
		return EBUSY;
	}

	if (ctx->send_pool != NULL) {
		int ret = pthreadpool_destroy(ctx->send_pool);
		if (ret != 0) {
			return ret;
		}
		ctx->ev_funcs->watch_free(ctx->pool_read_watch);
	}

	ctx->ev_funcs->watch_free(ctx->sock_read_watch);

	if (getpid() == ctx->created_pid) {
		/* If we created it, unlink. Otherwise someone else might
		 * still have it open */
		unlink(ctx->path);
	}

	close(ctx->sock);
	free(ctx->recv_buf);
	free(ctx);
	return 0;
}

I believe the unlink() should be moved after close(ctx->sock), which makes sense to me. This should fix the issue.

My co-worker Ken Harris has a C program which can be used to reproduce this behavior. If needed, we can post it here.

Comment 2 Volker Lendecke 2015-04-14 19:51:39 UTC

(In reply to YOUZHONG YANG from comment #1)

Wow, great analysis, thanks!

To be honest I'd call this a Solaris bug. Of course I don't see a real reason to do it in the order we do it. Can you change the order and give it a try for a while and tell us if it gets fixed?

Thanks,

Volker

Comment 3 Jeremy Allison 2015-04-15 20:22:17 UTC

Just to reinforce Volker, once you've confirmed the change, send in the diff and we'll get it into master and all releases.

Thanks a *LOT* for your analysis !

Jeremy.

Comment 4 YOUZHONG YANG 2015-04-20 13:50:19 UTC

Created attachment 10969 [details]
patch

Comment 5 YOUZHONG YANG 2015-04-20 13:53:57 UTC

We performed stress testing last weekend, with the correct ordering of closing fd and unlinking socket file, no more smbd panics!

The proposed patch is attached for your reference.

Thanks a lot.

Comment 6 Jeremy Allison 2015-04-22 23:53:17 UTC

Created attachment 10980 [details]
git-am fix for 4.2.next.

Cherry-pick of fix that went into master.

Comment 7 Jeremy Allison 2015-04-23 19:52:43 UTC

Karolin please push for 4.2.next.

Thanks !

Comment 8 Tom Schulz 2015-06-09 14:01:30 UTC

This patch did not make it into 4.2.2. I see that the Status is NEEDINFO, but it looks like the information has been provided. Is anything else needed?

Comment 9 Jeremy Allison 2015-06-09 16:23:11 UTC

(In reply to Tom Schulz from comment #8)

Looks like it was just something that Karolin missed (she is superhuman, but not infallible :-).

Karolin, could you push this patch for 4.2.next ?

Thanks,

Jeremy.

Comment 10 Karolin Seeger 2015-06-09 19:22:28 UTC

(In reply to Jeremy Allison from comment #9)
Wow, I have no idea why this one did not show up in my "assigned list"...
Sorry for missing it!

Comment 11 Karolin Seeger 2015-06-09 19:23:21 UTC

Pushed to autobuild-v4-2-test.

Comment 12 Karolin Seeger 2015-06-17 18:02:04 UTC

(In reply to Karolin Seeger from comment #11)
Pushed to v4-2-test.
Closing out bug report.

Thanks!