Bug 8308 - rsync: exclude.c:532: change_local_filter_dir: Assertion `dir_depth < 4096/2+1' failed
Summary: rsync: exclude.c:532: change_local_filter_dir: Assertion `dir_depth < 4096/2+...
Status: RESOLVED FIXED
Alias: None
Product: rsync
Classification: Unclassified
Component: core (show other bugs)
Version: 3.0.8
Hardware: x86 Linux
: P5 major (vote)
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-07-17 02:03 UTC by Dave Johnson
Modified: 2014-05-26 21:19 UTC (History)
0 users

See Also:


Attachments
Don't pass an inactive dir to change_local_filter_dir() (335 bytes, patch)
2014-04-20 21:10 UTC, Wayne Davison
no flags Details
Add some debugging if the prior patch didn't fix anything (850 bytes, patch)
2014-04-20 21:20 UTC, Wayne Davison
no flags Details
More inactive-flist entry ignoring (1.69 KB, patch)
2014-04-20 21:41 UTC, Wayne Davison
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Dave Johnson 2011-07-17 02:03:35 UTC
I'm seeing the following assert:

rsync --archive --devices --specials --sparse --one-file-system --verbose --whole-file --numeric-ids --delete-before --hard-links --delete-excluded --exclude-from=/share/alliance-bind-local2/rsync.exclude.all --include-from=/share/alliance-bind-local2/rsync.include.all --include-from=/share/alliance-bind-local2/rsync.include.drive_a '--include=*/' '--exclude=*' --prune-empty-dirs /share/alliance-bind-local2/ /media/backup1/alliance-bind-local2/
building file list ... done
rsync: exclude.c:532: change_local_filter_dir: Assertion `dir_depth < 4096/2+1' failed.
rsync: connection unexpectedly closed (9 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(601) [sender=3.0.8]



The include/exclude rules I am using are trying to accomplish the
following:
 * exclude everything listed in rsync.exclude.all
 * include everything listed in either rsync.include.all or include.drive_a
 * exclude everything else

The 3 files contain:

$ cat /share/alliance-bind-local2/rsync.exclude.all
/backup/snapshots/**
$ cat /share/alliance-bind-local2/rsync.include.all
/rsync.**
$ cat /share/alliance-bind-local2/rsync.include.drive_a
/videos/dvd/**


Note that the paths of any file in the source and dest (both included or
excluded) are nowhere near 2K in length.  the longest path is probably
about 300 characters.

running in gdb seems to indicate the file in question is in the root directory
and matches the 1st include rule (/rsync.**)


Any thoughts?



$ rsync --version
rsync  version 3.0.8  protocol version 30
Copyright (C) 1996-2011 by Andrew Tridgell, Wayne Davison, and others.
Web site: http://rsync.samba.org/
Capabilities:
    64-bit files, 64-bit inums, 32-bit timestamps, 64-bit long ints,
    socketpairs, hardlinks, symlinks, IPv6, batchfiles, inplace,
    append, no ACLs, xattrs, iconv, symtimes

rsync comes with ABSOLUTELY NO WARRANTY.  This is free software, and you
are welcome to redistribute it under certain conditions.  See the GNU
General Public Licence for details.


(gdb) c
Continuing.

Breakpoint 6, change_local_filter_dir (
    dname=0xfffd3c5c "rsync.include.drive_b", dlen=21, dir_depth=3695)
    at exclude.c:532
532             assert(dir_depth < MAXPATHLEN/2+1);
(gdb) bt
#0  change_local_filter_dir (dname=0xfffd3c5c "rsync.include.drive_b", 
    dlen=21, dir_depth=3695) at exclude.c:532
#1  0x080544c3 in do_delete_pass () at generator.c:548
#2  0x08059ac0 in generate_files (f_out=1, local_name=0x0) at generator.c:2251
#3  0x08064c9e in do_recv (f_in=0, f_out=1, local_name=0x0) at main.c:843
#4  0x080656a4 in start_server (f_in=0, f_out=1, argc=1, argv=0xfffd6e30)
    at main.c:953
#5  0x08066738 in child_main (argc=2, argv=0xfffd6e30) at main.c:960
#6  0x0807c92e in local_child (argc=2, argv=0xfffd6e30, f_in=0xfffd8ed0, 
    f_out=0xfffd8ecc, child_main=0x8066710 <child_main>) at pipe.c:156
#7  0x080633e9 in do_cmd (cmd=0x0, machine=0x0, user=0x0, 
    remote_argv=0x80b39b4, remote_argc=0, f_in_p=0xfffd8ed0, 
    f_out_p=0xfffd8ecc) at main.c:469
#8  0x08065f4b in main (argc=Cannot access memory at address 0x0
) at main.c:1275
(gdb) p dir_depth
$5 = 3695
(gdb) up
#1  0x080544c3 in do_delete_pass () at generator.c:548
548                             change_local_filter_dir(fbuf, strlen(fbuf), F_DEPTH(file));
(gdb) info local
file = (struct file_struct *) 0xf73f738c
fbuf = "rsync.include.drive_b\000b"
st = {st_dev = 65024, __pad1 = 0, __st_ino = 323281, st_mode = 16877, 
  st_nlink = 3, st_uid = 0, st_gid = 0, st_rdev = 0, __pad2 = 0, 
  st_size = 4096, st_blksize = 4096, st_blocks = 8, st_atim = {
    tv_sec = 1310865354, tv_nsec = 0}, st_mtim = {tv_sec = 1305164231, 
    tv_nsec = 0}, st_ctim = {tv_sec = 1305556305, tv_nsec = 0}, 
  st_ino = 323281}
j = 1295
(gdb) p *file
$6 = {dirname = 0x0, modtime = 0, len32 = 1290, mode = 0, flags = 0, 
  basename = ""}
(gdb) p *((union file_extras*)(file) - (1))
$7 = {num = 3695, unum = 3695}
(gdb)
Comment 1 John 2014-04-09 01:31:48 UTC
I also see this bug with the latest sources for: 3.1.1pre1 and i have a directory tree i can reproduce this bug with easily, on a local machine. No remote transfer required.

It is related to the --prune-empty-dirs option. Removing this from my command line makes the error go away.

http://sourceforge.net/p/luckybackup/discussion/873564/thread/864b0b61/
"Indeed, rsync seems to bring this error up when specific filter rules (include/exclude patterns) are used and then it tries to delete files at the destination."

here is what gdb gives me (note, i have added some debug to the source and recompiled, so line 623 in exclude.c was the assert statement at around line 613 in the original source:

Program received signal SIGABRT, Aborted.
0x0000003e66032925 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x0000003e66032925 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003e66034105 in abort () at abort.c:92
#2  0x0000003e6602ba4e in __assert_fail_base (fmt=<value optimized out>, assertion=0x45073c "dir_depth < 4096/2+1", file=0x4505c1 "exclude.c",
    line=<value optimized out>, function=<value optimized out>) at assert.c:96
#3  0x0000003e6602bb10 in __assert_fail (assertion=0x45073c "dir_depth < 4096/2+1", file=0x4505c1 "exclude.c", line=623, function=0x450ca0 "change_local_filter_dir")
    at assert.c:105
#4  0x00000000004183d0 in change_local_filter_dir (dname=0x7fffffff8bb0 "contourd.ss/.Rational/Version_History/rivers/process_connection_utils.h,v", dlen=73,
    dir_depth=3005) at exclude.c:623
#5  0x000000000040cd1b in do_delete_pass () at generator.c:362
#6  0x0000000000411fa3 in generate_files (f_out=1, local_name=<value optimized out>) at generator.c:2180
#7  0x000000000041d161 in do_recv (f_in=<value optimized out>, f_out=1, local_name=0x0) at main.c:945
#8  0x000000000041d7f5 in do_server_recv (f_in=0, f_out=1, argc=<value optimized out>, argv=<value optimized out>) at main.c:1057
#9  start_server (f_in=0, f_out=1, argc=<value optimized out>, argv=<value optimized out>) at main.c:1091
#10 0x000000000041d935 in child_main (argc=<value optimized out>, argv=<value optimized out>) at main.c:1064
#11 0x00000000004380b8 in local_child (argc=2, argv=0x7fffffffae10, f_in=0x7fffffffde7c, f_out=0x7fffffffde78, child_main=0x41d920 <child_main>) at pipe.c:167
#12 0x000000000041bafa in do_cmd (cmd=<value optimized out>, machine=0x0, user=<value optimized out>, remote_argv=<value optimized out>,
    remote_argc=<value optimized out>, f_in_p=0x7fffffffde7c, f_out_p=0x7fffffffde78) at main.c:535
#13 0x000000000041e758 in start_client (argc=3, argv=0x680230) at main.c:1393
#14 main (argc=3, argv=0x680230) at main.c:1633

Note also, that in the source code, I've added this code directly above the assert that triggers this failure:

	if ( dir_depth >= MAXPATHLEN/2+1 ) {
	  rprintf(FINFO, "dname=[%s] dlen=[%d] dir_depth=[%d] cur_depth=[%d] MAXPATHLEN/2+1=[%d] \n", dname, dlen, dir_depth, cur_depth, MAXPATHLEN/2+1);
	}

And this is never triggered. I have  therefore removed the if, and printed this info unconditionally, and i find that the immediately before the assert fails, dir_depth still only has a value of 1.

I have also noticed that sometimes the rsync aborts without actually triggering the assert. And likewise, if the assert is removed, rsync still aborts at much the same place in the code.

This all seems very odd to me. It's as if something is modifying the value of dir_depth underneath this code... 


I am very rusty with my gdb usage - will try to get a breakpoint in and catch it before it aborts and get some better info out of it.
Comment 2 John 2014-04-09 01:38:35 UTC
Should've actually added a comment on this bit:

http://sourceforge.net/p/luckybackup/discussion/873564/thread/864b0b61/
"Indeed, rsync seems to bring this error up when specific filter rules
(include/exclude patterns) are used and then it tries to delete files at the
destination."


I included mention of that, to show that those people are also seeing this bug.
Comment 3 John 2014-04-09 02:04:50 UTC
Some slightly better info from gdb, caught program with bad value of dir_depth, before crash. There is no way this dir suddenly out of the blue has a depth of 1622.

Looks like i need to recompile my rsync with debug info etc thoughm to get better info:

(gdb) break exclude.c:615 if dir_depth >= 1000
Breakpoint 2 at 0x4182f9: file exclude.c, line 615.
(gdb) c
Continuing.

Breakpoint 2, change_local_filter_dir (dname=0x7fffffff8bb0 "RPMs/rpm_lib.ss/.Rational/Version_History/ran-lads2/specs/LADS-rl2gs-lib-1-mbt_ws.spec,v", dlen=88,
    dir_depth=1622) at exclude.c:615
615             if ( dir_depth >= MAXPATHLEN/2+1 ) {
(gdb) p dir_depth
$2 = 1622
(gdb) bt
#0  change_local_filter_dir (dname=0x7fffffff8bb0 "RPMs/rpm_lib.ss/.Rational/Version_History/ran-lads2/specs/LADS-rl2gs-lib-1-mbt_ws.spec,v", dlen=88, dir_depth=1622)
    at exclude.c:615
#1  0x000000000040cd1b in do_delete_pass () at generator.c:362
#2  0x0000000000411fa3 in generate_files (f_out=1, local_name=<value optimized out>) at generator.c:2180
#3  0x000000000041d161 in do_recv (f_in=<value optimized out>, f_out=1, local_name=0x0) at main.c:945
#4  0x000000000041d7f5 in do_server_recv (f_in=0, f_out=1, argc=<value optimized out>, argv=<value optimized out>) at main.c:1057
#5  start_server (f_in=0, f_out=1, argc=<value optimized out>, argv=<value optimized out>) at main.c:1091
#6  0x000000000041d935 in child_main (argc=<value optimized out>, argv=<value optimized out>) at main.c:1064
#7  0x00000000004380b8 in local_child (argc=2, argv=0x7fffffffae10, f_in=0x7fffffffde7c, f_out=0x7fffffffde78, child_main=0x41d920 <child_main>) at pipe.c:167
#8  0x000000000041bafa in do_cmd (cmd=<value optimized out>, machine=0x0, user=<value optimized out>, remote_argv=<value optimized out>,
    remote_argc=<value optimized out>, f_in_p=0x7fffffffde7c, f_out_p=0x7fffffffde78) at main.c:535
#9  0x000000000041e758 in start_client (argc=3, argv=0x680230) at main.c:1393
#10 main (argc=3, argv=0x680230) at main.c:1633
(gdb) local
Undefined command: "local".  Try "help".
(gdb) info local
cur_depth = 1
filt_array = {0x0 <repeats 2049 times>}
__PRETTY_FUNCTION__ = "change_local_filter_dir"
(gdb) up
#1  0x000000000040cd1b in do_delete_pass () at generator.c:362
362                             change_local_filter_dir(fbuf, strlen(fbuf), F_DEPTH(file));
(gdb) info local
file = 0x7ffff0694380
fbuf = "RPMs/rpm_lib.ss/.Rational/Version_History/ran-lads2/specs/LADS-rl2gs-lib-1-mbt_ws.spec,v\000tes\000l\000es\000l\000\000l\000\000s\000l\000es\000l\000\000l\000tes\000l\000\000v\000tes\000l\000s\000l", '\000' <repeats 858 times>"\220, \226\377\377\377\177\000\000\200\226\377\377\377\177\000\000\377\377\377\377\000\000\000\000\360\253\377\377\377\177\000\000\312\352D\000\000\000\000\000gD\004f>\000\000\000\024", '\000' <repeats 383 times>"\336, \352D", '\000' <repeats 25 times>, "\002", '\000' <repeats 539 times>...
st = {st_dev = 2059, st_ino = 39845946, st_nlink = 2, st_mode = 17917, st_uid = 1070, st_gid = 1062, __pad0 = 0, st_rdev = 0, st_size = 4096, st_blksize = 4096,
  st_blocks = 8, st_atim = {tv_sec = 1396936899, tv_nsec = 233215828}, st_mtim = {tv_sec = 1396936901, tv_nsec = 344215827}, st_ctim = {tv_sec = 1396936901,
    tv_nsec = 344215827}, __unused = {0, 0, 0}}
j = <value optimized out>
(gdb) p file
$3 = (struct file_struct *) 0x7ffff0694380
(gdb) p F_DEPTH(file)
No symbol "F_DEPTH" in current context.
Comment 4 John 2014-04-09 03:05:08 UTC
from rsync.h:
	union file_extras {
		int32 num;
		uint32 unum;
	};
	#define REQ_EXTRA(f,ndx) ((union file_extras*)(f) - (ndx))
	#define F_DEPTH(f) REQ_EXTRA(f, 1)->num


from generator.c:
	struct file_struct *file = cur_flist->sorted[j];
	rprintf(FINFO, "Sizeof pointer to file struct = [%d]\n", sizeof (file) );
	f_name(file, fbuf);
	if (!(file->flags & FLAG_CONTENT_DIR)) {
	change_local_filter_dir(fbuf, strlen(fbuf), F_DEPTH(file));

When the program runs we see this from the code i added to generator.c:
	Sizeof pointer to file struct =[8]

So... my C is pretty rusty... and it is difficult to make any sense of this code, but this looks like madness to me... you're using crazy macros to do 32bit pointer arithmetic on a 64-bit pointer.

ie when you run F_DEPTH(file), you're subtracting ndx, in this case 1, from the file pointer, but you're doing so with a macro that casts the the pointer to union file_extras* which is a 32bit pointer, not a 64bit one. so you're going to end up subtracting 4 bytes from the pointer where you needed to subtract 8.

Madness!
Comment 5 John 2014-04-09 04:03:11 UTC
woops, it's not just a 64bit pointer being treated as 32.

When you run macro F_DEPTH(file) you're actually subtracting the size of a 32bit integer, ie 4 bytes, but file is a pointer to structures like this:

struct file_struct {
	const char *dirname;	/* The dir info inside the transfer */
	time_t modtime;		/* When the item was last modified */
	uint32 len32;		/* Lowest 32 bits of the file's length */
	uint16 mode;		/* The item's type and permissions */
	uint16 flags;		/* The FLAG_* bits for this item */
	const char basename[1];	/* The basename (AKA filename) follows */
};

The size of file_struct is 32 bytes

So when you use the F_DEPTH macro you're subtracting 4 bytes from a pointer that should be decremented 32 bytes at a time.... 

Still crazy.
Comment 6 John 2014-04-09 04:38:00 UTC
Oh god this is awful. But it is making more sense now - these macros are supposed to be accessing various members internal to the file_struct structure.

This is hideous though. I cannot believe what a mess this is.
Comment 7 Wayne Davison 2014-04-20 21:10:54 UTC
Created attachment 9862 [details]
Don't pass an inactive dir to change_local_filter_dir()

See if this patch fixes the issue you're seeing.
Comment 8 Wayne Davison 2014-04-20 21:17:58 UTC
The per-file data structure is a combination of a C struct and an array of extra (32-bit) values that are stored right before the struct.  The array of values allows the size to vary based on the options, which saves memory for a really large file list.

I added a patch to the bug with a guess at what might be wrong.  See if that makes any difference.  If it does not, I'm going to attach a debugging patch that will allow us to see if the depth values are getting corrupted by the pruning code (since it abuses the DEPTH value when figuring out if a directory chain should be removed or not).  Note that it uses fprintf(stderr, ...) calls instead of rprintf(), which helps them to get output even if rsync dies (your early debugging that uses rprintf() probably left the overflow output into the protocol buffers when the assert caused rsync to die).
Comment 9 Wayne Davison 2014-04-20 21:20:02 UTC
Created attachment 9863 [details]
Add some debugging if the prior patch didn't fix anything

If there is still a bug, use this patch to add some debugging info and we will see if the F_DEPTH() values are OK prior to the pruning code, and corrupt afterwards.
Comment 10 Wayne Davison 2014-04-20 21:41:53 UTC
Created attachment 9864 [details]
More inactive-flist entry ignoring

This patch adds a few more calls to F_IS_ACTIVE() than by first one did.