7454 – assertion failed in finish_hard_link()

Status:	RESOLVED FIXED

Alias:	None

Product:	rsync
Classification:	Unclassified
Component:	core
Version:	3.0.8
Hardware:	x86 Linux

Importance:	P3 normal
Target Milestone:	---
Assignee:	Wayne Davison
QA Contact:	Rsync QA Contact

Reported:	2010-05-26 06:29 UTC by Bas Zoetekouw
Modified:	2011-06-15 19:26 UTC
CC List:	1 user

Description Bas Zoetekouw 2010-05-26 06:29:15 UTC

I am rsyncing a local backup to a remote machine.  The local machine is running Debian stable on amd64; the remote is running FreeBSD on i386.  Both machines have rsync 3.0.7.

I consistently get the following error after a few minutes rsyncing:

| Assertion failed: (node != NULL && node->data != NULL), function finish_hard_link, file hlink.c, line 536.
| rsync: connection unexpectedly closed (1544 bytes received so far) [sender]
| rsync error: unexplained error (code 255) at io.c(601) [sender=3.0.7]

Sometimes this happens after 4 minutes, sometimes after 25 minutes, and never reproducibly on the same file.

Please let me know if you need some further information to debug this.

Comment 1 Wayne Davison 2010-05-29 09:51:02 UTC

I have never seen this bug triggered, so if you can distill things down to a set of sharable files that will demonstrate the bug, that would be great.  Also, what are your command-line options?

Comment 2 Wayne Davison 2010-05-29 10:02:46 UTC

Another way to potentially help is to upgrade both sides to 3.1.0dev (via the git repo or the latest nightly tar file) and use --debug=hlink3 for the transfer.  Capture all the output, and if rsync fails, compress it, and email it to me.

Comment 3 Bas Zoetekouw 2010-07-12 07:30:58 UTC

Sorry for my long silence on this bug.  It's rather tricky to track down, as I can only trigger it with a large tree (approx. 1M files) and lots and lots of hard links.
I can still reliably reproduce it, though, also with a recent git version of rsync, both when transferring linux-to-linux and linux-to-freebsd.

I'll let you know when I can reproduce it with a generated tree of files.
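
A generated tree of the kind mentioned above could be built along these lines. This is only a minimal sketch: the function name `build_snapshot_tree`, the `daily.N` layout, and the file names are hypothetical, and nothing in this report confirms that such a tree triggers the assertion.

```python
import os


def build_snapshot_tree(root, snapshots=3, files=5):
    """Create `snapshots` directories named daily.N under `root`.

    daily.0 holds the real files; every later snapshot hard-links each
    file back to daily.0, so each inode ends up with `snapshots` links.
    Returns the list of file names created in each snapshot.
    """
    first = os.path.join(root, "daily.0")
    os.makedirs(first)
    names = []
    for i in range(files):
        name = f"file{i:06d}.txt"
        with open(os.path.join(first, name), "w") as fh:
            fh.write(f"payload {i}\n")
        names.append(name)
    for s in range(1, snapshots):
        snap = os.path.join(root, f"daily.{s}")
        os.makedirs(snap)
        for name in names:
            # Hard link: same inode, additional directory entry.
            os.link(os.path.join(first, name), os.path.join(snap, name))
    return names
```

Scaling `snapshots` and `files` up would approach the "large tree with lots of hard links" described in this report; the sizes needed, if any work at all, are unknown.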

Comment 4 Richard 2011-01-29 07:28:12 UTC

I have the same issue, only triggered with a large tree with lots of hard links. One side is Linux (QNAP NAS) and the other side is OS X both running 3.0.7.

Comment 5 Wayne Davison 2011-01-29 21:53:08 UTC

(In reply to comment #4)
> I have the same issue, only triggered with a large tree with lots of hard
> links. One side is Linux (QNAP NAS) and the other side is OS X both running
> 3.0.7.

What are the options that you are using?  Are you using --delete without --owner (-o, implied by -a), perchance?

Comment 6 Bas Zoetekouw 2011-01-30 02:22:38 UTC

I am indeed.  I've also noticed that the bug doesn't occur when using -a.

Comment 7 Bas Zoetekouw 2011-01-30 02:24:37 UTC

I noticed I haven't reported the command line I'm using yet.  Here it is:

 rsync -rlHpztqP --delete --bwlimit=90 --exclude '/*/foo/' --exclude '/*/bar/' --exclude '.ssh/id*' --exclude '.gnupg/secring.gpg' daily.* weekly.* monthly.*  user@remotehost:

Comment 8 Wayne Davison 2011-01-30 19:57:00 UTC

Thank goodness!  I had been totally stumped as to how this bug could be happening beyond data corruption occurring, and it would appear that both of you are being bitten by a corruption bug that can occur with an incremental-recursion transfer when --delete is used but --owner is not active.  This bug was just recently identified and fixed, and will be released in 3.0.8.

I'm going to go out on a limb and say that this bug is also now fixed (in git).  If someone tests the git code and disagrees, we can open this back up (but I'm doubtful that is the case).
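
For anyone wanting to check their own invocation against the combination described above (--delete in use while --owner is not active), here is a rough sketch. The helper `delete_without_owner` is hypothetical and deliberately simplistic: it scans options left to right, treats -a/--archive as enabling --owner, handles only a few deletion spellings, and ignores whether incremental recursion is actually in effect.

```python
# Deletion options considered by this sketch; other variants are ignored.
DELETE_OPTS = {"--delete", "--del", "--delete-during", "--delete-delay"}


def delete_without_owner(args):
    """Return True if `args` request deletion while --owner is not active.

    `args` is a list of rsync command-line words (options only matter).
    Assumption: short-option bundles such as -rlHpztqP carry no inline
    option arguments, so any 'a' or 'o' in them means -a or -o.
    """
    owner = False
    delete = False
    for arg in args:
        if arg in DELETE_OPTS:
            delete = True
        elif arg in ("--owner", "--archive"):
            owner = True
        elif arg in ("--no-owner", "--no-o"):
            owner = False
        elif arg.startswith("-") and not arg.startswith("--"):
            # Short-option bundle: -a implies -o.
            if "a" in arg[1:] or "o" in arg[1:]:
                owner = True
    return delete and not owner
```

Applied to the options from comment 7 (`-rlHpztqP --delete ...`) this returns True, while `-a --delete` returns False, which matches the observation that the failure does not occur with -a.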

Comment 9 Bas Zoetekouw 2011-01-31 13:33:24 UTC

I think it is indeed fixed in latest git.  I started a run this afternoon, and it is still running after 4 hours, whereas it would abort within ten minutes before.

Thanks!

Comment 10 jamiec 2011-06-11 06:43:15 UTC

Hi Wayne,

I am curious how this issue was solved, so I checked the git log to find the diff.
Was this issue solved in the following commit?
     commit 57edc4808f566fbaa58ec96bc7e543b1ccb92ab9
     Author: Matt McCutchen <matt@mattmccutchen.net>
     Date:   Sat Jan 29 19:25:53 2011 -0800
     Avoid changing file_extra_cnt during deletion.

Thanks in advance. 

Best Regards,
Jamie Chen

Comment 11 Wayne Davison 2011-06-15 19:26:28 UTC

Yes, 57edc48 was the 3.1.0dev commit on the master branch.  The commits on the b3.0.x branch were more contorted, since I first fixed the bug one way, and then changed to use Matt's idiom.  You can see that set of changes as a single diff if you do this:

git diff c825514 83b94ef