Bug 7545 - Fwd: having ~/.ccache on NFS makes 'gcc --version' take 30 seconds
Status: CLOSED FIXED
Alias: None
Product: ccache
Classification: Unclassified
Component: ccache
Version: 3.0pre1
Hardware: Other Linux
Importance: P3 normal
Target Milestone: 3.1
Assignee: Joel Rosdahl
QA Contact: Joel Rosdahl
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-07-03 07:25 UTC by Ville Skyttä
Modified: 2010-09-16 12:40 UTC

Description Ville Skyttä 2010-07-03 07:25:44 UTC
Forwarding bug report from Red Hat Bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=610853

---

Description of problem:
I had my home directory (and thus ~/.ccache) mounted on an NFS share so that I
could share it between machines.  While running the autoconf testsuite with
high parallelism (number of active cores + 2), I noticed periods where
processor utilization dropped to nearly 0% and tests took a very long time to
complete.  Upon investigation, I found that while the tests were sluggish,
'time gcc --version' took 30 seconds.

Version-Release number of selected component (if applicable):
$ rpm -q gcc ccache
gcc-4.4.4-10.fc14.x86_64
ccache-3.0-0.2.pre1.fc14.x86_64

How reproducible:
very

Steps to Reproduce:
1. Point ~/.ccache to an NFSv3 mount.
2. git clone git://git.sv.gnu.org/autoconf.git
3. cd autoconf
4. autoreconf -vfi
5. make
6. make check TESTSUITEFLAGS=-j$(($(nproc) + 2))
7. monitor processor utilization during the exercise

Actual results:
During sequences where multiple processes are trying to use gcc at once (around
test 250 or so in the autoconf testsuite), I noticed that processor utilization
was severely dropping, and tests were taking forever to complete. 
Investigating partial test output to date showed that tests were getting stuck
on 'gcc --version', and I was able to reproduce this in another console, with
'time gcc --version' showing 30 seconds of elapsed time.

Using both strace and ltrace showed that a slow 'gcc --version' was invariably
getting stuck on the fcntl() call in this portion of the process:
open("/home/remote/eblake/.ccache/stats", O_RDWR) = 4
fcntl(4, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0

In other words, locking ~/.ccache/stats causes lock contention over NFS:
concurrent ccache processes wait a long time for their turn at the lock, and
ccache performance suffers needlessly as a result.
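
For illustration, the locking pattern seen in the strace output above can be
sketched in Python. This is a minimal sketch, not ccache's code: the function
name bump_counter and the one-integer file format are assumptions made for the
example. Python's fcntl.lockf() wraps fcntl() record locking, and LOCK_EX
without LOCK_NB blocks the same way F_SETLKW does.

```python
import fcntl
import os
import tempfile


def bump_counter(stats_path: str) -> int:
    """Increment an integer stored in stats_path while holding a write lock.

    Hypothetical helper: the single-integer file format is an assumption
    for this sketch, not the real ccache stats format.
    """
    fd = os.open(stats_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Blocking exclusive lock on one byte at offset 0, i.e. the
        # equivalent of F_SETLKW with type=F_WRLCK, start=0, len=1.
        fcntl.lockf(fd, fcntl.LOCK_EX, 1, 0, os.SEEK_SET)
        raw = os.read(fd, 64).decode().strip()
        value = int(raw or "0") + 1
        # Rewrite the file from the start with the new value.
        os.lseek(fd, 0, os.SEEK_SET)
        os.ftruncate(fd, 0)
        os.write(fd, str(value).encode())
        return value
    finally:
        # Closing the descriptor releases the fcntl lock.
        os.close(fd)


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    demo_path = os.path.join(demo_dir, "stats")
    bump_counter(demo_path)
    print(bump_counter(demo_path))
```

Every process that calls such a function on the same file has to wait for the
lock, which is where the serialization described above comes from.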

Expected results:
Testsuite should complete within a few minutes, with nearly 100% processor
utilization on all cores during the test.

File locking should NOT cause such a severe performance degradation,
particularly for something as trivial as 'gcc --version'.  Furthermore, using
fcntl for file locking is inherently broken:
http://0pointer.de/blog/projects/locking.html
If ccache needs locking, it should use alternatives such as atomic mkdir() or
symlink() calls, rather than fcntl() locking, particularly if ~/.ccache is not
a local drive.
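
As a rough sketch of the symlink() alternative suggested above (an
illustration only, not ccache's implementation): creating a symlink fails if
the name already exists, so whichever process creates it first holds the
lock. The names acquire_lock and release_lock, the timeout values, and the
"host:pid" link target are assumptions for the example, and stale-lock
handling is deliberately left out.

```python
import os
import socket
import tempfile
import time


def acquire_lock(lock_path: str, timeout: float = 5.0, poll: float = 0.01) -> bool:
    """Try to take the lock by creating a symlink at lock_path.

    Returns True on success and False if the timeout expires first.
    Hypothetical helper; the timeout and poll defaults are arbitrary.
    """
    # The link target only records the owner; it does not need to exist.
    owner = f"{socket.gethostname()}:{os.getpid()}"
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Fails with FileExistsError if another process holds the lock.
            os.symlink(owner, lock_path)
            return True
        except FileExistsError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll)


def release_lock(lock_path: str) -> None:
    """Release the lock by removing the symlink."""
    os.unlink(lock_path)


if __name__ == "__main__":
    demo_lock = os.path.join(tempfile.mkdtemp(), "stats.lock")
    if acquire_lock(demo_lock):
        try:
            print("lock held by", os.readlink(demo_lock))
        finally:
            release_lock(demo_lock)
```

A real implementation would also need a way to detect and break locks left
behind by crashed processes; recording the owner in the link target is one
possible starting point for that.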

Additional info:
I was able to work around the issue by replacing ~/.ccache with a symlink to
a local directory, at which point NFS locking speed no longer interferes, and
my autoconf testsuite completed faster.
Comment 1 Joel Rosdahl 2010-08-01 10:56:07 UTC
ccache 3.1 will contain two changes to tackle the problem:

1. For things like "gcc --version", update the stats file in one of the 16 subdirectories $CCACHE_DIR/[0-9a-f] (selected pseudo-randomly) instead of $CCACHE_DIR/stats. This will reduce lock contention.

2. As suggested, use symlinks for locking instead of POSIX locks.
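
The idea behind the first change can be sketched as follows. This is a minimal
Python illustration under stated assumptions, not the ccache source: the
helper names (pick_stats_file, bump, total) and the one-integer file format
are hypothetical, and locking is omitted to keep the example short.

```python
import os
import random
import tempfile

# The 16 subdirectory names, matching $CCACHE_DIR/[0-9a-f].
SUBDIRS = "0123456789abcdef"


def pick_stats_file(cache_dir: str) -> str:
    """Return the stats file of one pseudo-randomly chosen subdirectory."""
    return os.path.join(cache_dir, random.choice(SUBDIRS), "stats")


def read_count(path: str) -> int:
    """Read the counter from path, treating a missing file as zero."""
    try:
        with open(path) as f:
            return int(f.read().strip() or "0")
    except FileNotFoundError:
        return 0


def bump(cache_dir: str) -> None:
    """Increment the counter in one randomly selected stats file."""
    path = pick_stats_file(cache_dir)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    value = read_count(path) + 1
    with open(path, "w") as f:
        f.write(str(value))


def total(cache_dir: str) -> int:
    """Sum the counters of all 16 stats files."""
    return sum(
        read_count(os.path.join(cache_dir, sub, "stats")) for sub in SUBDIRS
    )


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    for _ in range(100):
        bump(demo_dir)
    print(total(demo_dir))
```

Because concurrent updates are spread over 16 files, two processes contend for
the same lock only when they happen to pick the same subdirectory, while the
overall count is still recoverable by summing the files.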
Comment 2 Joel Rosdahl 2010-09-06 12:13:33 UTC
Implemented and will be in ccache 3.1.