Bug 7816 - get_tmpname() can create invalid UTF-8 filenames
Status: RESOLVED FIXED
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 3.0.8
Hardware: Sparc Solaris
Importance: P3 minor
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
Reported: 2010-11-24 09:12 UTC by Michael Salmon
Modified: 2011-01-03 22:13 UTC

Attachments
A simple heuristic that tries to avoid split high-bit characters (495 bytes, patch)
2010-11-24 13:32 UTC, Wayne Davison

Description Michael Salmon 2010-11-24 09:12:29 UTC
get_tmpname() creates filenames consisting of the directory, a dot, some bytes from the filename, and .XXXXXX\0. It does not take into account that UTF-8 characters can be several bytes long, so truncating the name at an arbitrary byte can produce an invalid UTF-8 sequence. Normally this isn't a problem, but if the filesystem strictly enforces UTF-8, the temp file cannot be created and the transfer fails.

An example of the problem is:

sending incremental file list
MS_Röj.icon
rsync: mkstemp "/fan/data/.MS_R\#303.001058" failed: Permission denied (13)

ö in UTF-8 is the two bytes \#303\#266 (octal), so the temp name in the error message ends in the middle of that character.
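
To make the failure concrete, here is a minimal sketch in C of a check for whether a byte-wise cut lands inside a multi-byte UTF-8 character. The helper name cut_splits_utf8() and its parameters are hypothetical and are not part of rsync; the check relies only on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not rsync code.
 * Returns 1 if keeping only the first `keep` bytes of `name` would drop a
 * UTF-8 continuation byte (10xxxxxx) as the first trimmed byte, which means
 * the cut falls inside a multi-byte character. Returns 0 otherwise. */
static int cut_splits_utf8(const char *name, size_t name_len, size_t keep)
{
    assert(name != NULL);
    if (keep >= name_len)
        return 0; /* nothing is trimmed */
    return ((unsigned char)name[keep] & 0xC0) == 0x80;
}
```

For the name in the example, "MS_R\303\266j.icon" (12 bytes), keeping 5 bytes retains \303 and drops \266, so the check reports a split; keeping 4 or 6 bytes does not.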

We got around the problem by specifying --inplace which avoids the temp file.

I think the easiest way to handle the problem is to replace every byte in the file name that has bit 7 set with #.
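
A rough sketch of this suggestion in C, for illustration only. The function name mask_high_bit_bytes() is made up here and this is not what rsync ended up doing; it simply overwrites each high-bit byte in place, so the result is pure ASCII and therefore always valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not rsync code.
 * Overwrites every byte that has bit 7 set with '#', in place, and returns
 * the number of bytes replaced. The string length does not change. */
static size_t mask_high_bit_bytes(char *name)
{
    size_t replaced = 0;

    assert(name != NULL);
    for (char *p = name; *p != '\0'; p++) {
        if ((unsigned char)*p & 0x80) {
            *p = '#';
            replaced++;
        }
    }
    return replaced;
}
```

Because the replacement is per byte, a two-byte character such as ö becomes two # characters: "MS_R\303\266j.icon" turns into "MS_R##j.icon".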
Comment 1 Wayne Davison 2010-11-24 13:32:17 UTC
Created attachment 6086
A simple heuristic that tries to avoid split high-bit characters
Comment 2 Wayne Davison 2010-11-24 13:35:37 UTC
Most of the time the name won't be trimmed, as it only happens if the path is long enough that the temp name needs more room to add the unique suffix.  The attached patch is a simple heuristic that triggers if the name gets trimmed and there is a high-bit character as both the first-trimmed character and the last retained character.  In such a case, we'll just make the name shorter (removing all dangling high-bit characters).  If we end up with just a leading dot for the name, the trimming will stop, and the name will be kinda sad, but still usable.
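
The heuristic described above might look roughly like the following sketch in C. This is a reading of the description in this comment, not the contents of attachment 6086; the function name trim_keep_len(), the IS_HIGH_BIT macro, and the convention that the length excludes the leading dot are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, not taken from the rsync patch. */
#define IS_HIGH_BIT(c) (((unsigned char)(c) & 0x80) != 0)

/* Returns how many bytes of `name` to keep when at most `max_keep` bytes fit.
 * `name` is the file name without the leading dot, so a result of 0 means
 * the temp name is left with just the dot and the unique suffix. */
static size_t trim_keep_len(const char *name, size_t name_len, size_t max_keep)
{
    size_t keep = max_keep;

    assert(name != NULL);
    if (keep >= name_len)
        return name_len; /* the name fits, so nothing is trimmed */

    /* Trigger only when both the first trimmed byte and the last retained
     * byte have the high bit set, i.e. the cut may split a character. */
    if (keep > 0 && IS_HIGH_BIT(name[keep]) && IS_HIGH_BIT(name[keep - 1])) {
        /* Drop all dangling high-bit bytes at the end of the kept part. */
        while (keep > 0 && IS_HIGH_BIT(name[keep - 1]))
            keep--;
    }
    return keep;
}
```

With "MS_R\303\266j.icon" and room for 5 bytes, the sketch backs off to 4 bytes ("MS_R"), which avoids the dangling \303 from the original report. Note that the heuristic can discard whole characters that would have fit, since it does not decode UTF-8; that matches the "simple heuristic" framing rather than a full decoder.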
Comment 3 Wayne Davison 2011-01-03 22:13:03 UTC
This fix will be in 3.0.8.