4168 – Random file has vanished when syncing clock with ntpdate/hwclock ?

Bug 4168 - Random file has vanished when syncing clock with ntpdate/hwclock ?

Summary: Random file has vanished when syncing clock with ntpdate/hwclock ?

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	rsync
Classification:	Unclassified
Component:	core (show other bugs)
Version:	2.6.4
Hardware:	x86 Linux

Importance:	P3 major (vote)
Target Milestone:	---
Assignee:	Wayne Davison
QA Contact:	Rsync QA Contact

URL:
Keywords:

Depends on:
Blocks:

Reported:	2006-10-12 09:04 UTC by Guy Baconniere
Modified:	2009-12-13 00:31 UTC (History)
CC List:	2 users (show)

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Guy Baconniere 2006-10-12 09:04:59 UTC

We have one master server with 2 GB and 1'331'703 inodes of files
rsynced in parallels by two slave server at the same time (by a cron job).

Randomly the rsync failed on both slaves with a list of files
who sems to be vanished but the list of files are not the same on
the two slaves who rsync the same source, the same time.
The master source don't have any modification.

On the beginning "ignore-errors" was used so all vanished files was
removed from all slaves but still exist on the master.

file has vanished: "xxxxxxxxxxxxxxx13"
file has vanished: "xxxxxxxxxxxxxxx42" done
deleting xxxxxxxxxxxxxxx13.
deleting xxxxxxxxxxxxxxx42.

After reading your docs I have removed "ignore-errors" to avoid
removing files when file vanished detected !

file has vanished: "xxxxxxxxxxxxxxx13"
file has vanished: "xxxxxxxxxxxxxxx42" 
done
IO error encountered -- skipping file deletion

After some research on all bugs reports about file vanished
I found https://bugzilla.samba.org/show_bug.cgi?id=2607

Several months ago I have switched the cron job of ntpdate
from cron.daily to cron.hourly.

My master source and two slaves launch cron ntpdate/hwclock
every hours at 17 *. The rsync batch starts at 00 * and takes
about 20 to 30 minutes to finish.

I suspect rsync to failed with random vanished file during the
clock sync.

Please could you confirm this theory ?


Debian Sarge with rsync 2.6.4-6
rsync --recursive -e ssh \
--rsh="ssh" \
--verbose \
--recursive \
--links \
--perms \
--times \
--omit-dir-times \
--whole-file \
--one-file-system \
--ignore-existing \
--delete-before \
--partial \
--timeout=10800 \
--exclude=_TRANSFERED \
--no-blocking-io \
--ipv4 \
$user@$host:$masterdir/ \
$localdir/

rsync starts at the same time on both slaves

slave1
file has vanished: "xxxxxxxxxxxxxxx13"
file has vanished: "xxxxxxxxxxxxxxx42" 
done
IO error encountered -- skipping file deletion

slave2
file has vanished: "xxxxxxxxxxxxxxx03"
file has vanished: "xxxxxxxxxxxxxxx56" 
done
IO error encountered -- skipping file deletion

Thanks for your feedback

--

Infomaniak Network SA
Guy Baconniere <baco@infomaniak.ch>
Unix System Administrator
Certified Linux Engineer (RHCE, LPIC-2)
Avenue de la Praille 26
1227 Carouge (Geneva)
Switzerland (CH)
Phone +41 (0)22 820 3541
Fax +41 (0)22 820 3546
AS29222 / BACO-RIPE

Comment 1 Guy Baconniere 2006-10-13 03:30:06 UTC

I have setup a rsync of the same source on two servers
just before the sync of the clock by ntpdate/hwclock

I have random vanished files and all vanished files wasn't
modified. Both server who rsync at the same times report
the same files modified but not the same files vanished !

This is a sample output of my rsync test. You can found the
rsync command line on my main post.

        SLAVE1 (RSYNC BY CRON 08:16:03=>08:34:21 + NTPDATE 08:17:00)
       /
MASTER 
       \SLAVE2 (RSYNC BY CRON 08:16:02=>08:34:11 + NTPDATE 08:17:00)
       

*** SLAVE2 OUTPUT ***

Received: from slave2 (localhost [127.0.0.1])
	by slave2 (8.13.4/8.13.4) with ESMTP id k9D6YGxA012270
	for <admin@slave2>; Fri, 13 Oct 2006 08:34:21 +0200
Received: (from admin@localhost)
	by slave2 (8.13.4/8.13.4/Submit) id k9D6G2dj011455
	for admin; Fri, 13 Oct 2006 08:16:02 +0200
Date: Fri, 13 Oct 2006 08:16:02 +0200
Subject: Cron <admin@slave2> /usr/local/bin/dosync-debug.sh

receiving file list ... 
file has vanished: "/files/test/skiWusho"
file has vanished: "/files/test/AskeotGa"
file has vanished: "/files/test/SiNegwak"
done
IO error encountered -- skipping file deletion
test/ourmyisA
test/bueHojNo
test/onHedkad

sent 125 bytes  received 20720380 bytes  18948.79 bytes/sec
total size is 1190536  speedup is 0.06
rsync warning: some files vanished before they could be transferred (code 24) at main.c(1166)
receiving file list ... done

sent 20 bytes  received 743 bytes  508.67 bytes/sec
total size is 0  speedup is 0.00

*** SLAVE1 OUTPUT ***

Received: from slave1 (localhost [127.0.0.1])
	by slave1 (8.13.4/8.13.4) with ESMTP id k9D6YAKA021356
	for <admin@slave1>; Fri, 13 Oct 2006 08:34:11 +0200
Received: (from admin@localhost)
	by slave1 (8.13.4/8.13.4/Submit) id k9D6G3Pb020551
	for admin; Fri, 13 Oct 2006 08:16:03 +0200
Date: Fri, 13 Oct 2006 08:16:03 +0200
Subject: Cron <admin@slave1> /usr/local/bin/dosync-debug.sh

receiving file list ... 
file has vanished: "/files/test/CimitAju"
file has vanished: "/files/test/Vibnewyi"
done
IO error encountered -- skipping file deletion
test/ourmyisA
test/bueHojNo
test/onHedkad

sent 125 bytes  received 20720590 bytes  19088.64 bytes/sec
total size is 1190536  speedup is 0.06
rsync warning: some files vanished before they could be transferred (code 24) at main.c(1166)
receiving file list ... done

sent 20 bytes  received 743 bytes  508.67 bytes/sec
total size is 0  speedup is 0.00

Comment 2 Wayne Davison 2006-10-25 21:48:51 UTC

What seems to be happening here is that rsync gets the name of a file via readdir(), and then tries to stat the file to figure out its type and other attirbutes, and the stat() (or lstat()) call fails.  This causes rsync to leave the file out of the file list.  So, the real question is: what is affecting the stat() calls on your system?  This doesn't sound like an rsync problem.  Are the files coming from an NFS filesystem?  Perhaps the tweaking of the time causes a glitch in the NFS protocol?

Comment 3 Guy Baconniere 2006-10-26 05:30:20 UTC

The master server isn't a NFS Server nor a NFS mounted volume.
All files are stored on a XFS partition with "noatime" flag.

The rsync use ssh as transport maybe this could also affect
rsync ?

So this random problem with stat() call can be related to
a bug in XFS filesystem or maybe to the kernel cache of the
filesystem.

The cumulative size of all files is 123M but the numer of 
files is very important about 638'860 inodes

The server is running a Kernel 2.6.13.1. I will try to upgrade 
to the 2.6.18.1.

Comment 4 roland 2009-08-07 04:21:35 UTC

>I have setup a rsync of the same source on two servers
>just before the sync of the clock by ntpdate/hwclock

you should not sync the clock via ntpdate at runtime, as your system time will not be continous and do some jump into the future or into the past. this can lead to all sorts of weird side effects. use ntpd instead. 
you can use ntpdate on bootup, but i would not use it at runtime.


please disable ntpdate for testing and prove that the issue disappears. 

if it still persists, then it`s another problem, not related to ntp(date).
if it disappears, then it`s NOT a rsync problem but a problem introduced by using a sub-optimal method of setting/syncing the system time.

mind that ntpdate is deprecated, too.

this is from the ntpd manpage:

     In  some  cases  it may not be practical for ntpd  to run continuously. A common workaround has
       been to run the ntpdate  program from a cron  job at designated times.  However,  this  program
       does not have the crafted signal processing, error checking and mitigation algorithms of ntpd .
       The -q  option is intended for this purpose. Setting this option will cause ntpd  to exit  just
       after  setting  the  clock for the first time. The procedure for initially setting the clock is
       the same as in continuous mode; most applications will probably  want  to  specify  the  iburst
       keyword  with  the  server   configuration  command. With this keyword a volley of messages are
       exchanged to groom the data and the clock is set in about a minute. If nothing is heard after a
       couple  of  minutes,  the  daemon times out and exits. After a suitable period of mourning, the
       ntpdate  program may be retired.

Comment 5 Wayne Davison 2009-12-13 00:31:40 UTC

Unlikely to be an rsync bug.