Bug 11624 - Memory leak after smbc_free_context() on the session while accessing shares on DFS junctions
Memory leak after smbc_free_context() on the session while accessing shares o...
Status: RESOLVED FIXED
Product: Samba 4.1 and newer
Classification: Unclassified
Component: libsmbclient
4.1.12
All All
: P5 major
: ---
Assigned To: Karolin Seeger
Samba QA Contact
:
: 11195 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2015-11-30 09:15 UTC by hargagan
Modified: 2016-01-07 17:35 UTC (History)
4 users (show)

See Also:


Attachments
Proposed patch for fixing the memory leak due to left over cli connection structures after smbc_free_context() (2.66 KB, patch)
2015-11-30 09:15 UTC, hargagan
no flags Details
git-am fix for master (1.27 KB, patch)
2015-12-16 19:14 UTC, Jeremy Allison
no flags Details
git-am fix for 4.3.next, 4.2.next. (1.52 KB, patch)
2015-12-21 23:56 UTC, Jeremy Allison
vl: review+
jra: review? (shargagan)
Details

Note You need to log in before you can comment on or make changes to this bug.
Description hargagan 2015-11-30 09:15:54 UTC
Created attachment 11629 [details]
Proposed patch for fixing the memory leak due to left over cli connection structures after smbc_free_context()

There are AD DFS junctions and accessing using smbclient library. Noticing a gradual increase in memory and doubted a leak. After running in valgrind noticed these even after smbc_free_context(). Expected that cli_shutdown() in SMBC_remove_unused_server(). One such valgrind error shows this :

==11691== 24,800 (125 direct, 24,675 indirect) bytes in 1 blocks are definitely lost in loss record 1,038 of 1,045
==11691==    at 0x4C297EE: malloc (vg_replace_malloc.c:299)
==11691==    by 0xB447D42: talloc_strdup (in /usr/lib64/libtalloc.so.2.1.1)
==11691==    by 0x95909F8: cli_set_mntpoint (in /usr/lib64/samba/liblibsmb.so)
==11691==    by 0x9592ED4: cli_resolve_path (in /usr/lib64/samba/liblibsmb.so)
==11691==    by 0x8AFE372: cacl_get (in /usr/lib64/libsmbclient.so.0.2.1)
==11691==    by 0x8B000F8: SMBC_getxattr_ctx (in /usr/lib64/libsmbclient.so.0.2.1)

After analysing the code found that it doesn't cleanup all the cli connections in the srv->cli. It goes to head and matches if the 'cli' is equal to 'cli_head'. If not it doesn't cleanup all the subsidiary connections.

Proposing a patch which is cleaning up cli connections in the cli-list during the SMBC_remove_unused_server(). In this giving a new function cli_cleanup() which will do this job and is called from SMBC_remove_unused_server() instead of cli_shutdown().
Comment 1 Jeremy Allison 2015-12-10 00:48:07 UTC
How are you calling smbc_free_context() ? Note that the docs for that function state:

-----------------------------------------------------------
 * @note            It is advised to clean up all the contexts with shutdown_ctx set to 1
 *                  just before exit()'ing. When shutdown_ctx is 0, this function can be
 *                  use in periodical cleanup functions for example.
-----------------------------------------------------------

When is the bug being reported ? Is it when you're exiting the program ? Are you calling smbc_free_context(ctx, 1) before exit ?

Can you show me your code that calls smbc_free_context() please ?
Comment 2 hargagan 2015-12-10 08:41:52 UTC
Hi Jeremy,

Thanks for looking at the bug. In our code using SMBC_free_context() with '0' and it is called for the cleanup of the context.
>>>>
SMBC_FreeCtx((SMBCCTX*)smbcContext, 0);
>>>>

This is called from the cleanup routine periodically, which is being called for a session context once it is idle for some time or not used.

Here is what I think is happening. The srv->cli is the list and it is updated on every new connection created through this, for eg, DFS targets. I have seen having one connection for IPC$, other for root server, and other for DFS server. It is clearing only one of them and not freeing the remaining two for every 'srv' connection, which over a long run and multiple sessions, is causing leak. As per my analysis here the 'srv->cli' that is used in the cli_shutdown() which is called from SMBC_remove_unused_server() is not passing the head of the list ('DLIST_HEAD') and hence cli_shutdown doesn't clear off all the subsidiary cli(s) in the list. It only clears off last added 'cli' entry in the srv->cli list. For a share access on a DFS setup, I hope I am able to explain the problem that I am seeing.

Thanks,
Comment 3 Jeremy Allison 2015-12-10 18:25:35 UTC
So this is a long-lived application. Is the context being re-used at this point or is it being shut down ?

I ask as calling:

smbc_free_context(ctx, 1);

would seem to do the complete shutdown required.

Once you've called smbc_free_context() on the ctx, do you re-use it or do you create a new one ?
Comment 4 Jeremy Allison 2015-12-10 22:34:15 UTC
(In reply to hargagan from comment #2)

OK, I've looked into this really carefully, and I don't see where srv->cli can be anything but the primary connection to a server to resolve a path. srv->cli should always be the list head for subsidiary DFS connections - it's never accessed or changed by any DLIST_XXX macros.

Do you check the return from smbc_free_context() in your code ?

If it returns 1 then there are outstanding files open on one of the connections, and thus the underlying memory won't get freed.

See the comment in the description of smbc_free_context():

/*
 * Free a context
 *
 * Returns 0 on success. Otherwise returns 1, the SMBCCTX is _not_ freed
 * and thus you'll be leaking memory if not handled properly.
 *
 */
int
smbc_free_context(SMBCCTX *context,
                  int shutdown_ctx)

Are you checking for a return of 1 ?
Comment 5 Chris 2015-12-14 12:50:14 UTC
Sounds like this may be related to my issue reported here https://bugzilla.samba.org/show_bug.cgi?id=11195
Comment 6 Jeremy Allison 2015-12-15 05:01:15 UTC
Yes, that looks similar - but I still can't see where srv->cli can be anything but the primary connection to a server to resolve a path.

As I said in the previous comment, srv->cli should always be the list head for subsidiary DFS connections. Can you show me any internal libsmbclient code that modifies srv->cli once the connection is made ? I just don't see it.

If you can give me a simple method to reproduce this (i.e. a simple primary server + DFS subsidiaries) that I can see this problem in action then I can reproduce and fix.
Comment 7 hargagan 2015-12-15 07:32:10 UTC
(In reply to Jeremy Allison from comment #4)
Hi Jeremy,

This is a long lived application and the context is getting freed and not re-used thereafter. For the next request, there is a new context created. I don't have the return code check but I feel the problem is because of incomplete free of the cli lists. The valgrind logs also shows the leak of elements of 'struct cli' and the cli itself. The logs doesn't show any leak for the file structures. Those look pretty clean. 

As chris has also noticed in his setup, there is a fd leak seen in the DFS scenario. This is the same which I am also seeing along with the memory leak in a long running system. The current logic of cli_shutdown() cleans only the cli entry or all its subsidiary based on if it is the DLIST_HEAD().

On your question regarding the modification of the cli list, one such place is cli_cm_connect() called from cli_resolve_path() for resolving the DFS junction. Here  is a sample gdb trace after there are series of smb calls for a file location :

***********************************************************************************************************************************************************************************************************
(gdb) p *((SMBCCTX*)0x6b5df0)->internal->server_cache
$5 = {server_name = 0x6c0350 "WIN2K8HGS", share_name = 0x6be640 "*IPC$", workgroup = 0x6c2ab0 "MY2K8AD", username = 0x6c2a40 "aduser2@my2k8ad.com", server = 0x6c00e0, next = 0x6b6980, 
  prev = 0x6b6980}
(gdb) p *((SMBCCTX*)0x6b5df0)->internal->server_cache->next
$6 = {server_name = 0x6bda90 "my2k8ad.com", share_name = 0x6b9460 "TESTNS_MEM_LEAK", workgroup = 0x6b9e10 "MY2K8AD", username = 0x6a5130 "aduser2@my2k8ad.com", server = 0x6bda40, 
  next = 0x0, prev = 0x6c0a10}
(gdb) p *((SMBCCTX*)0x6b5df0)->internal->server_cache->next->server
$7 = {cli = 0x6b7e10, dev = 18446744072946730101, no_pathinfo = false, no_pathinfo2 = false, no_pathinfo3 = false, no_nt_session = false, pol = {handle_type = 0, uuid = {time_low = 0, 
      time_mid = 0, time_hi_and_version = 0, clock_seq = "\000", node = "\000\000\000\000\000"}}, last_echo_time = 1801213, next = 0x0, prev = 0x6c00e0}
(gdb) p *((SMBCCTX*)0x6b5df0)->internal->server_cache->next->server->cli
$8 = {prev = 0x6c1a00, next = 0x0, rap_error = 0, raw_status = {v = 3221225506}, map_dos_errors = true, domain = 0x6b7c10 "MY2K8AD", user_name = 0x6b92d0 "aduser2@my2k8ad.com", 
  password = 0x6b9250 "password@123", server_type = 0x6a5120 "", server_os = 0x6b9be0 "", server_domain = 0x6b7550 "MY2K8AD", share = 0x6bcfc0 "TESTNS_MEM_LEAK", dev = 0x0, timeout = 20000, 
  initialised = 1, win95 = 0, is_guestlogin = false, server_posix_capabilities = 0, requested_posix_capabilities = 0, backup_intent = false, pipe_list = 0x0, use_kerberos = false, 
  fallback_after_kerberos = false, use_ccache = true, pw_nt_hash = false, got_kerberos_mechanism = true, use_oplocks = false, dfs_mountpoint = 0x6b7f50 "", conn = 0x6b9f00, 
  remote_realm = 0x0, smb1 = {pid = 4345, vc_num = 4345, session = 0x6ba670, tcon = 0x6ba3b0}, smb2 = {session = 0x6bc6b0, tcon = 0x6b9800, open_handles = 0x0}}
(gdb) p *((SMBCCTX*)0x6b5df0)->internal->server_cache->next->server->cli->prev
$9 = {prev = 0x6c07d0, next = 0x6b7e10, rap_error = 0, raw_status = {v = 3221225701}, map_dos_errors = true, domain = 0x6c5e40 "M0x6c1a00, Y2K8AD", user_name = 0x6c57c0 "aduser2@my2k8ad.com", 
  password = 0x6c5740 "password@123", server_type = 0x6c3430 "", server_os = 0x6c1990 "", server_domain = 0x6c2440 "MY2K8AD", share = 0x6c56c0 "target_ms1", dev = 0x0, timeout = 20000, 
  initialised = 1, win95 = 0, is_guestlogin = false, server_posix_capabilities = 0, requested_posix_capabilities = 0, backup_intent = false, pipe_list = 0x0, use_kerberos = false, 
  fallback_after_kerberos = false, use_ccache = true, pw_nt_hash = false, got_kerberos_mechanism = true, use_oplocks = false, dfs_mountpoint = 0x6c5500 "\\TESTNS_MEM_LEAK\\test_folder", 
  conn = 0x6c4230, remote_realm = 0x0, smb1 = {pid = 4345, vc_num = 4345, session = 0x6c22d0, tcon = 0x6c2220}, smb2 = {session = 0x6c0b70, tcon = 0x6c7430, open_handles = 0x0}}
(gdb) p *((SMBCCTX*)0x6b5df0)->internal->server_cache->next->server->cli->prev->prev
$10 = {prev = 0x0, next = 0x6c1a00, rap_error = 0, raw_status = {v = 3221225701}, map_dos_errors = true, domain = 0x6c1bb0 "MY2K80x6c1a00, AD", user_name = 0x6c1530 "aduser2@my2k8ad.com", 
  password = 0x6c14b0 "password@123", server_type = 0x6c0340 "", server_os = 0x6c02d0 "", server_domain = 0x6c2920 "MY2K8AD", share = 0x6c3800 "IPC$", dev = 0x0, timeout = 20000, 
  initialised = 1, win95 = 0, is_guestlogin = false, server_posix_capabilities = 0, requested_posix_capabilities = 0, backup_intent = false, pipe_list = 0x0, use_kerberos = false, 
  fallback_after_kerberos = false, use_ccache = true, pw_nt_hash = false, got_kerberos_mechanism = true, use_oplocks = false, dfs_mountpoint = 0x6c28b0 "", conn = 0x6c2540, 
  remote_realm = 0x0, smb1 = {pid = 4345, vc_num = 4345, session = 0x6c0d40, tcon = 0x6c09b0}, smb2 = {session = 0x6c15b0, tcon = 0x6c4880, open_handles = 0x0}}
***********************************************************************************************************************************************************************************************************


As you can see, the smbc_context has two servers in the server_cache. I am going into second server entry and there I have found three cli entries. The one pointed by server->cli is 0x6b7e10 but that is not the DLIST_HEAD. That is somewhere down which is 0x6c07d0 which is for IPC$ share. Also there is one more intermediate one, 0x6c1a00, which is for "target_ms1" the DFS target. While doing cli_shutdown() it is only cleaning server->cli, 0x6b7e10 leaving behind the other two entries.
Comment 8 Jeremy Allison 2015-12-15 18:54:19 UTC
> On your question regarding the modification of the cli list, one such place is
> cli_cm_connect() called from cli_resolve_path() for resolving the DFS junction.

No, that's not correct.

srv->cli

is passed into here as referring_cli. The list manipulation here is: 
DLIST_ADD_END(referring_cli, cli, struct cli_state *), which add the next connection cli to the end of the list headed by referring_cli. So srv->cli should still point to referring_cli as the head of the list, and all subsidiary DFS connections should be on the list with srv->cli as the list head.

You still haven't shown me where this happens.

Note I'm not denying that it does happen - just that the place you think it does isn't the right place.

I need to find the right place that is breaking the assumption that srv->cli is head of the list:

srv->cli --> dfs_cli1 --> dfs_cli2 --> .... --> dfs_cliN --> NULL

That is an assumption that the code depends upon to work correctly. When we find what is breaking that assumption, we'll find the underlying real bug.

Secondly you *MUST* check error returns from functions such as smbc_free_context() or you can't possibly have robust code to ship to customers. These functions return errors for a reason.

You still haven't given me what I asked for, which is sample code and an easy setup that will allow me to reproduce. When I have that, finding the code that breaks the list assumption should be easy.
Comment 9 hargagan 2015-12-16 13:52:40 UTC
Hi Jeremy,

Looked at the code and found this. The issue that I am seeing is due to the macro DLIST_ADD_END() :

#define DLIST_ADD_END(list, p, type)                    \
do { \
        if (!(list)) { \
                DLIST_ADD(list, p); \
        } else { \
                DLIST_ADD_AFTER(list, p, (list)->prev); \
        } \
} while (0)

#define DLIST_ADD_AFTER(list, p, el) \
do { \
        if (!(list) || !(el)) { \
                DLIST_ADD(list, p); \
        } else { \
                (p)->prev = (el);   \
                (p)->next = (el)->next;         \
                (el)->next = (p);               \
                if ((p)->next) (p)->next->prev = (p);   \
                if ((list)->prev == (el)) (list)->prev = (p); \
        }\
} while (0)

#define DLIST_ADD(list, p) \
do { \
        if (!(list)) { \
                (p)->prev = (list) = (p);  \
                (p)->next = NULL; \
        } else { \
                (p)->prev = (list)->prev; \
                (list)->prev = (p); \ ##### it is adding p to list->prev ???
                (p)->next = (list); \
                (list) = (p); \
        } \
} while (0)

The (list)->prev is causing the nodes to be added before 'srv->cli'. So I add new node (c1) to srv->cli, where both prev and next is null, it will get added before srv->cli in the list, so the new list will look like this :

c1-> (srv->cli)

similarly : c1->c2->(srv->cli)

Now I do DLIST_HEAD() on srv->cli, it will give 'c1' as the head, hence the leak is seen.

We are using 'list->prev = p' in DLIST_ADD(), this could be the problem. I may be wrong here, but please see if this makes sense.

Thanks,
Comment 10 Jeremy Allison 2015-12-16 19:04:03 UTC
This makes sense. The underlying issue is we're doing the following inside SMBC_server_internal():

srv->cli = c;

to initialize the list, which doesn't correctly set srv->cli->prev == srv->cli.

What we need to be doing is:

DLIST_ADD(srv->cli, c);

which will correctly initialize the list head, and then everything else should work as expected.

I'll upload a patch shortly, please test it to make sure it fixes the issue.

Thanks a *LOT* for your patience on this, I knew there was an underlying reason I was missing !

Cheers,

Jeremy.
Comment 11 Jeremy Allison 2015-12-16 19:14:56 UTC
Created attachment 11730 [details]
git-am fix for master

This should be the minimal fix I think. Can you test this and get back to me ?
Comment 12 Jeremy Allison 2015-12-16 19:16:06 UTC
*** Bug 11195 has been marked as a duplicate of this bug. ***
Comment 13 Jeremy Allison 2015-12-21 23:56:46 UTC
Created attachment 11738 [details]
git-am fix for 4.3.next, 4.2.next.

Directly cherry-picked from master.
Comment 14 Jeremy Allison 2015-12-22 23:45:54 UTC
Re-assigning to Karolin for inclusion in 4.3.next, 4.2.next.
Comment 15 Karolin Seeger 2016-01-05 08:45:45 UTC
(In reply to Jeremy Allison from comment #14)
Pushed to autobuild-v4-[2|3]-test.
Comment 16 Karolin Seeger 2016-01-07 17:35:47 UTC
(In reply to Karolin Seeger from comment #15)
Pushed to both branches.
Closing out bug report.

Thanks!