Bug 8654 - link-by-hash: Fix (non-exploitable) buffer overflow
Summary: link-by-hash: Fix (non-exploitable) buffer overflow
Status: RESOLVED FIXED
Alias: None
Product: rsync
Classification: Unclassified
Component: core (show other bugs)
Version: 3.1.0
Hardware: All All
: P5 minor (vote)
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-12-12 02:50 UTC by Chris Dunlop
Modified: 2011-12-17 03:06 UTC (History)
0 users

See Also:


Attachments
link-by-hash: Fix (non-exploitable) buffer overflow (937 bytes, patch)
2011-12-12 02:50 UTC, Chris Dunlop
no flags Details
Improve generation of hash file name (1.50 KB, patch)
2011-12-12 03:07 UTC, Chris Dunlop
no flags Details
Improve documentation (847 bytes, patch)
2011-12-12 03:44 UTC, Chris Dunlop
no flags Details
Expand the --link-by-hash manual entry (1.10 KB, patch)
2011-12-12 06:16 UTC, Chris Dunlop
no flags Details
Improve generation of hash file name (1.50 KB, patch)
2011-12-13 00:02 UTC, Chris Dunlop
no flags Details
Improve generation of hash file name (1.50 KB, patch)
2011-12-13 00:13 UTC, Chris Dunlop
no flags Details
Improve generation of hash file name (1.62 KB, patch)
2011-12-13 02:48 UTC, Chris Dunlop
no flags Details
Expand the --link-by-hash manual entry (1.10 KB, patch)
2011-12-13 03:38 UTC, Chris Dunlop
no flags Details
Code cleanup (1.79 KB, patch)
2011-12-14 03:40 UTC, Chris Dunlop
no flags Details
link-by-hash: checksum hash required when copying file (2.16 KB, patch)
2011-12-17 03:06 UTC, Chris Dunlop
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Chris Dunlop 2011-12-12 02:50:29 UTC
Created attachment 7167 [details]
link-by-hash: Fix (non-exploitable) buffer overflow

The link-by-hash.diff patch contains a buffer overflow: the size of the 'hash' buffer on the stack allows for one extra character beyond the size needed for the text version of the hash but it needs two: one for the '/' directory separator and another for the trailing null.

This was caught by noticing a compile warning:

gcc -std=gnu99 -I. -I../rsync -g -O2 -DHAVE_CONFIG_H -Wall -W  -c ../rsync/hashlink.c -o hashlink.o
../rsync/hashlink.c: In function ‘make_hash_name’:
../rsync/hashlink.c:46: warning: array subscript is above array bounds

The overflow is non-exploitable as it harmlessly overwrites the following dst variable with a null just before using asprintf() to set the dst variable again.

But it should be fixed - see attached patch, against "packaging/branch-from-patch ../rsync-patches/link-by-hash.diff"
Comment 1 Chris Dunlop 2011-12-12 03:07:19 UTC
Created attachment 7168 [details]
Improve generation of hash file name

...and whilst we're in there fixing the buffer overflow, perhaps consider the attached patch to clean up the same make_hash_name() function.
Comment 2 Chris Dunlop 2011-12-12 03:44:11 UTC
Created attachment 7169 [details]
Improve documentation

...another clean up, to clarify that link may be done by MD5 hash
Comment 3 Chris Dunlop 2011-12-12 06:16:28 UTC
Created attachment 7170 [details]
Expand the --link-by-hash manual entry

Another documentation improvement
Comment 4 Chris Dunlop 2011-12-13 00:02:55 UTC
Created attachment 7173 [details]
Improve generation of hash file name

An improved version of the improvement to make_hash_name()

Changes from previous version:

Kill unused variable
Use assert() to check for buffer overflow
Further code clean up to reduce "increment noise"
Comment 5 Chris Dunlop 2011-12-13 00:13:09 UTC
Created attachment 7174 [details]
Improve generation of hash file name

Augh! Sorry for the noise. The previous "Improve generation of hash file name" changed the length of the top level directory name from 8 characters to 2 characters (part of some tests I was doing). This version doesn't have that inadvertent change but contains the other improvements.
Comment 6 Chris Dunlop 2011-12-13 02:48:36 UTC
Created attachment 7175 [details]
Improve generation of hash file name

That will teach me to patch on the run.

The previous version didn't compile as it left out an external variable. And the assert() was incorrect and would trigger every time.

This version is compile and run tested.
Comment 7 Chris Dunlop 2011-12-13 03:23:19 UTC
Comment on attachment 7169 [details]
Improve documentation

Correction: it's always an MD4 hash
Comment 8 Chris Dunlop 2011-12-13 03:38:27 UTC
Created attachment 7176 [details]
Expand the --link-by-hash manual entry

Correction: it's always an MD4 hash
Comment 9 Chris Dunlop 2011-12-14 03:40:08 UTC
Created attachment 7177 [details]
Code cleanup

Make local functions static
Comment 10 Wayne Davison 2011-12-16 03:11:55 UTC
Thanks for pointing out the various problems with the patch, and providing some suggested fixes.

I made several changes:

- I changed the code to use the file's MD5 checksum when talking to a modern rsync (3.0.0 and newer).  This means one less checksum to compute, and most people don't use an rsync older than 3.x these days.  It does allow the use of an MD4 checksum for supporting an older sender, but it requires the user to force this mode using --checksum-seed=1 --protocol=29 (which will make the old-style MD4 checksums consistent).  The documentation recommends using one or the other -- obviously using both would waste space.  This change is incompatible with an old directory setup, but someone could convert the link hierarchy via a simple script.

- I unified the checksum's conversion to hex in a single function that log.c and linkhash.c use.  I don't like the use of sprintf() in a loop, so the code continues to do its own hex conversion, just with better size checking.

- I included your daemon setting support, your manpage additions, your static-function tweaks, and hopefully covered all the issues you found.

- I made the hashlinks.c routines less chatty by putting the output into a new --debug=hashlink option.

You can see the latest version of the patch via gitweb:

http://gitweb.samba.org/?p=rsync-patches.git

Thanks!
Comment 11 Chris Dunlop 2011-12-17 03:06:25 UTC
Created attachment 7202 [details]
link-by-hash: checksum hash required when copying file

(In reply to comment #10)
> Thanks for pointing out the various problems with the patch, and providing some
> suggested fixes.
> 
> I made several changes:

Looks good!

Your additional --debug=hashlink option was almost word for word identical to a patch I had sitting here. (I'm learning to not sent off patches as soon as I create them!)

BTW, referencing your rsync-patches commit message, the surname is "Dunlop" (like the tyres) rather than "Dunlap" :-)

See additional patch here: link-by-hash requires the checksum hash when it's copying a file so that it can properly link into the link farm. In the absence of this patch I'm seeing it trying to link to the all-zeros hash name (perhaps if a previous file had carried a checksum the copy file would link to that previous checksum?).


Given existing link farms are already affected by the change to the MD5 hash, it's worth considering another change...

There's a general issue with the link farm directory structure, currently using an 8-char top level directory and 24-char second level directory, before getting to the actual link files.

As the whole point of md5 is to evenly distribute the hash, this means you'll essentially end up with a bucket-load of top level directories (up to 16^8), each containing one second level directory, each containing one link file.  That's pretty expensive in terms of directory inodes and blocks and associated I/O etc.

Perhaps a better structure would be a two-char top level directory (giving 256 in all) and 30-char second level directory, so the top level directories each contain a reasonable number of second level directories.

That would still leave all your second level directories with one or very few link files, so maybe even better would be to replace the second level directories directly with the link files, with a file extension to indicate different files mapping to the same hash. E.g. instead of hash[0-7]/hash[8-32]/0, have hash[0-1]/hash[2-32]_00. Even though the top level directories could end up with large numbers of files, if you have that many files in the first place you're likely using a non-relic file system with b-tree indexed directories.