8075 – getting "failed verification -- update discarded" errors

Status:	NEW

Alias:	None

Product:	rsync
Classification:	Unclassified
Component:	core
Version:	3.0.8
Hardware:	x64 Linux

Importance:	P5 normal
Target Milestone:	---
Assignee:	Wayne Davison
QA Contact:	Rsync QA Contact

URL:
Keywords:

Depends on:
Blocks:

Reported:	2011-04-11 14:33 UTC by Mark Overmeer
Modified:	2011-06-04 21:19 UTC
CC List:	0 users

See Also:

Description Mark Overmeer 2011-04-11 14:33:42 UTC

After months of flawless mirroring, the rsync daemon process stopped working. Some files are downloaded but then removed at the end with "failed verification -- update discarded (will try again)".

The problem showed up when I was running a 3.0.2-3.0.7 connection, but it is still present with 3.0.8 on both sides (which is used below). The files are a few hundred megabytes each; they do not change and show no disk read errors. Some files work, some don't... it might be related to a pattern within the file or the filename.

Running the process with -vvv shows this:

got file_sum
recv_generator(2882ns_envisat_isabel/ASA_IMS_1PNDPA20060528_190038_000000162048_00085_22183_1026.N1,131)
recv_files(2882ns_envisat_isabel/ASA_IMS_1PNDPA20061015_190042_000000162052_00085_24187_1024.N1)
2882ns_envisat_isabel/ASA_IMS_1PNDPA20061015_190042_000000162052_00085_24187_1024.N1
got file_sum
recv_generator(2882ns_envisat_isabel/ASA_IMS_1PNDPA20061015_190042_000000162052_00085_24187_1024.N1,132)
[receiver] _exit_cleanup(code=2, file=rsync.c, line=652): about to call exit(2)
[generator] _exit_cleanup(code=12, file=io.c, line=601): about to call exit(12)
WARNING: 2882ns_envisat_isabel/ASA_IMS_1PNDPA20060528_190038_000000162048_00085_22183_1026.N1 failed verification -- update discarded (will try again).
WARNING: 2882ns_envisat_isabel/ASA_IMS_1PNDPA20061015_190042_000000162052_00085_24187_1024.N1 failed verification -- update discarded (will try again).
File-list index 133 not in 960 - 1144 (read_ndx_and_attrs) [receiver]
rsync error: protocol incompatibility (code 2) at rsync.c(652) [receiver=3.0.8]
rsync: connection unexpectedly closed (998 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(601) [generator=3.0.8]
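
To see at a glance which files were discarded in a long -vvv log like the one above, the warnings can be pulled out with a few lines of Python. This is only a sketch: the pattern is derived from the two WARNING lines shown here, not from rsync's source, and the function name failed_files is made up for illustration.

```python
import re

# Pattern taken from the WARNING lines in the log above (an assumption:
# other rsync versions may word the message differently).
FAILED_RE = re.compile(
    r"^WARNING: (?P<path>.+) failed verification -- update discarded"
)


def failed_files(log_text):
    """Return the paths named in 'failed verification' warnings, in log order."""
    paths = []
    for line in log_text.splitlines():
        match = FAILED_RE.match(line.strip())
        if match:
            paths.append(match.group("path"))
    return paths
```

Feeding the saved log text to failed_files() gives the list of affected files, which makes it easier to check whether the same files fail on every run.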

I hope this is sufficient information to find the cause.

Comment 1 Wayne Davison 2011-06-04 15:45:26 UTC

This is probably a read error on the sending side.  When rsync gets a read error from the OS, it ensures that the checksum won't match the data that was sent so that the file will be discarded by the receiver.  Look for earlier errors in the transfer, or errors in the logs.
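
As a rough illustration of the mechanism described in this comment, the sketch below shows a sender that deliberately spoils the whole-file checksum after a read error, and a receiver that discards the file when the checksum does not match. It is not rsync's code: the function names, the chunk size, the FlakyStream helper and the use of SHA-256 are all assumptions chosen to keep the example self-contained.

```python
import hashlib
import io

CHUNK = 32 * 1024  # illustrative chunk size, not rsync's


def send_file(stream):
    """Read the stream in chunks; return (chunks, file_sum).

    If the OS reports a read error, the checksum is altered on purpose so
    that it cannot match the data that was actually sent.
    """
    digest = hashlib.sha256()  # stand-in for the checksum rsync really uses
    chunks = []
    read_failed = False
    while True:
        try:
            chunk = stream.read(CHUNK)
        except OSError:
            read_failed = True
            break
        if not chunk:
            break
        digest.update(chunk)
        chunks.append(chunk)
    file_sum = digest.digest()
    if read_failed:
        # Flip one bit: the receiver's checksum of the data will now differ.
        file_sum = bytes([file_sum[0] ^ 0x01]) + file_sum[1:]
    return chunks, file_sum


def receive_file(chunks, file_sum):
    """Return the file contents, or None if verification fails."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    if digest.digest() != file_sum:
        return None  # corresponds to "failed verification -- update discarded"
    return b"".join(chunks)


class FlakyStream(io.BytesIO):
    """Hypothetical test double: every read after the first raises OSError."""

    def __init__(self, data):
        super().__init__(data)
        self.reads = 0

    def read(self, size=-1):
        self.reads += 1
        if self.reads > 1:
            raise OSError("simulated read error")
        return super().read(size)
```

With a healthy stream, receive_file() returns the original bytes; with FlakyStream it returns None, mirroring the warning in the report. If this is what is happening, the useful evidence is on the sending side, where the read error should have been logged.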

Comment 2 Mark Overmeer 2011-06-04 21:05:07 UTC

(In reply to comment #1)
> This is probably a read error on the sending side.  When rsync gets a read
> error from the OS, it ensures that the checksum won't match the data that was
> sent so that the file will be discarded by the receiver.  Look for earlier
> errors in the transfer, or errors in the logs.

Read errors would show in dmesg, but there are no read errors. I can also access these files directly without errors. Besides, it is very reproducible. All files are hundreds of megs; it may have something to do with that.  Or, with a newer rsync on an older glibc.

If you want, I can provide access to the rsyncd.

Comment 3 Kevin Korb 2011-06-04 21:09:54 UTC

I have seen this happen when either of the systems has a bad RAM chip.  Often this causes files beyond a certain size to checksum incorrectly.  I would suggest running memtest86 on both systems (especially if one is not using ECC RAM) just to make sure that there isn't a flaky DIMM causing this issue.
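
Before scheduling a full memtest86 run, a quick check is to hash one of the affected files several times on the same machine and see whether the digest is stable. The sketch below assumes the file is not being modified in the meantime; the names file_digest and digests_agree are made up for this example, and SHA-256 is an arbitrary choice. A stable digest does not prove the RAM is good, so this complements a real memory test rather than replacing it.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digests_agree(path, rounds=3):
    """Hash the file `rounds` times; True if every digest is identical."""
    return len({file_digest(path) for _ in range(rounds)}) == 1
```

If digests_agree() returns False for an unchanged file, the data is being corrupted somewhere between disk and CPU, which points at hardware rather than at rsync. Running the same check on both the sender and the receiver and comparing the digests shows which side disagrees.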

Comment 4 Mark Overmeer 2011-06-04 21:19:35 UTC

(In reply to comment #3)
> I have seen this happen when either of the systems has a bad RAM chip.  Often
> this causes files beyond a certain size to checksum incorrectly.  I would
> suggest running memtest86 on both systems (especially if one is not using ECC
> RAM) just to make sure that there isn't a flaky DIMM causing this issue.

There is a chance that this is causing the problem, although after a reboot the same files showed the problem. Those were huge files, which increases the chance of hitting a memory flaw... I will attempt a memory check (a bit hard: I have no physical access to the machine myself). It will take me a few days.

Thanks for the hint.