Bug 2790 - Add support for converting filenames into different encodings
Summary: Add support for converting filenames into different encodings
Status: RESOLVED FIXED
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 2.6.5
Hardware: All (OS: Mac OS X)
Importance: P3 enhancement
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Duplicates: 3362
Depends on:
Blocks:
 
Reported: 2005-06-11 11:40 UTC by Bob Friesenhahn
Modified: 2009-11-01 16:17 UTC
Description Bob Friesenhahn 2005-06-11 11:40:21 UTC
When mirroring a directory tree with "rsync -aq --delete-after", there are many
error messages reported similar to:
rsync: mkstemp "/backup/home/kuwe/.kdb/.\376\254\026T.zTE32v" failed: Invalid
argument (22)
It seems that extended characters are converted to octal codes by the protocol
but have not been converted back to 8-bit codes prior to passing to mkstemp, so
mkstemp fails under Mac OS X.  The problem does not occur under FreeBSD or
Solaris.  Apparently Mac OS X rejects the backslashes in file names while
FreeBSD and Solaris accept them.
The same problem is observed with the rsync that came with Mac OS X (2.6.2).
Comment 1 Wayne Davison 2005-06-11 23:40:19 UTC
Rsync does not change the filenames it transfers in any way, so if the OS
refuses to store a certain sequence of characters, that is currently out of
rsync's hands. (The backslashes you see are just rsync's way of outputting
high-bit characters in a visible manner.)  It would be nice if rsync supported
some kind of filename transformation so that conversion to and from
UTF-8 (or whatever) would be possible.
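
To illustrate the point about the backslashes: the sketch below is not rsync's actual logging code, only a minimal Python model of the idea, with a hypothetical helper named visible(). It renders every byte outside printable ASCII as a three-digit octal escape, which reproduces the name shown in the error message from the raw bytes that are really on disk.

```python
def visible(name: bytes) -> str:
    """Render a filename for log output.

    Printable ASCII bytes are kept as-is; every other byte is shown
    as a backslash followed by three octal digits.
    """
    out = []
    for b in name:
        if 0x20 <= b < 0x7F:
            out.append(chr(b))
        else:
            out.append("\\%03o" % b)
    return "".join(out)


# The raw bytes behind the name in the reported error message.
raw = b".\xfe\xac\x16T.zTE32v"
print(visible(raw))  # prints .\376\254\026T.zTE32v
```

The filename itself contains no backslashes; they exist only in the printed form.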

OS X is known to reject certain multi-byte high-bit characters that aren't
compatible with its own high-bit character encoding.  Your current choices are
to (1) change the character encoding on the source FS to match the encoding of
the destination FS, making the names compatible; (2) not use high-bit character
sequences that conflict between OSes; (3) pre-process the files to convert
high-bit characters into sequences that won't fail; (4) use the fname-conv.diff
patch in the patches dir to enhance rsync with some basic name-conversion
support; (5) help to create a better filename-conversion solution.

I didn't really like the solution in the fname-conv.diff because it typically
results in a huge number of forked command calls, one for each filename
processed.  It is a very versatile solution, but is probably overkill for what
rsync really needs: the optional(!) ability to use iconv() on the filenames it
sends (transferring names in UTF-8 and converting the names via library calls to
the local encoding needed).  If someone would like to work on a solution for
this, please let me know.
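
A minimal sketch of the in-process conversion described above, written in Python rather than C and using Python's codec machinery as a stand-in for iconv(). The helper name convert_name() and the sample filename are assumptions made for illustration. The point is that each name is converted by a library call, with no command forked per file.

```python
def convert_name(name: bytes, src: str, dst: str) -> bytes:
    """Convert a filename from encoding `src` to encoding `dst`.

    Raises UnicodeError if the name is not valid in `src` or cannot
    be represented in `dst`.
    """
    return name.decode(src).encode(dst)


# A name as stored on an ISO-8859-1 filesystem.
local = "Gr\u00fc\u00dfe.txt".encode("iso-8859-1")

# Sender side: local encoding to UTF-8 for the transfer.
wire = convert_name(local, "iso-8859-1", "utf-8")

# Receiver side: UTF-8 back to the local encoding.
restored = convert_name(wire, "utf-8", "iso-8859-1")

print(local, wire, restored)
```

A real implementation also has to decide what to do with names that fail to convert; this sketch simply lets the exception propagate.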
Comment 2 Bob Friesenhahn 2005-06-12 09:18:36 UTC
If this is indeed an artifact of the receiving OS, then I suggest an addition in
the rsync FAQ to cover this issue.  In particular, even though a Mac OS X system
may have large disks handy, it may not be an ideal OS to support periodic
"backups" via rsync for European users or other users who may naturally use
extended characters.
Comment 3 Wayne Davison 2006-01-16 20:51:15 UTC
I've been working on a filename-conversion solution that uses the iconv() function.  After putting a bunch of thought into various designs, I think I have a good solution that I have coded up as a patch for the latest CVS version (also available in the latest "nightly" tar file).  You can grab the patch here:

http://opencoder.net/iconv.diff

This is a fairly early version, so if anyone would like to help with the testing, please be sure to be fairly cautious with it at first.

The patch doesn't have any changes to the configure script only because I already checked those into CVS (since they were pretty small -- they just check for things like iconv_open(), iconv.h, etc.).
Comment 4 Wayne Davison 2006-01-20 11:02:05 UTC
*** Bug 3362 has been marked as a duplicate of this bug. ***
Comment 5 Tomasz Chmielewski 2006-03-29 04:11:51 UTC
I'm just curious.

Suppose I have rsyncd running on Windows, and want to copy files from there to a Linux machine.
The problem is that some characters (like German umlauts - ü, ö, ä etc.) are changed to ? when saved on the Linux side.

Where should I apply this patch?
rsync on a Windows side? rsync on a Linux side? Both?
Comment 6 Tomasz Chmielewski 2006-03-29 05:57:19 UTC
It seems it has to be applied on both sides :)

Anyway, it seems to be impossible to build rsync on Windows (Cygwin) with this patch:

gcc -I. -I. -g -O2 -DHAVE_CONFIG_H -Wall -W  -c log.c -o log.o
log.c: In function `rwrite':
log.c:231: error: `iconv_t' undeclared (first use in this function)
log.c:231: error: (Each undeclared identifier is reported only once
log.c:231: error: for each function it appears in.)
log.c:231: error: parse error before "ic"
make: *** [log.o] Error 1

This patch is surely only useful for Linux <-> Windows or Linux <-> Mac transfers.
Comment 7 Wayne Davison 2006-04-02 00:02:34 UTC
(In reply to comment #6)
> Anyway, it seems to be impossible to build rsync on Windows (Cygwin)

Make sure that you have whatever "dev" package is needed to compile a program that uses the iconv library.  Without that, the configure script will not detect the iconv_open() function that is needed to enable all the iconv code.  Then, be sure to re-run configure.
Comment 8 Akinori MUSHA 2007-02-16 05:39:38 UTC
For platforms where libiconv functions are prefixed with `libiconv_' instead of `iconv_', HAVE_ICONV_OPEN is undefined while HAVE_ICONV is defined, and the build fails.

I think `#include <iconv.h>' should be added to iconv_open() detection or HAVE_LIBICONV_OPEN should be added and or'd with HAVE_ICONV_OPEN.

I hit the problem on OS X, worked around it simply by adding `#define HAVE_ICONV_OPEN 1' to config.h, and after that transfers between FreeBSD and OS X worked like a charm.

Thanks for the work!
I'd really like to see this included in the near future.

Regards,
Comment 9 Carsten Bormann 2007-10-30 11:09:13 UTC
The current solution appears to be somewhat confused about what it is trying to solve.

There are three filename encodings: the one in the client fs, the transfer encoding, the one in the server fs.
Client needs to know client-fs and transfer, server needs to know server-fs and transfer.
Trying to mush up any two of the three leads to pain.

There are also three scenarios:

-- sane: common transfer encoding (UTF-8 in NFC).  Server and client need to know local conventions; as in current --iconv=., they probably can figure that out.

-- compatible: The server may not know about iconv.  So the client has to do all the conversions.  This is almost supported now, except that the client sends an iconv option that the server does not understand.

-- fast: if both sides have the same encoding, the whole thing should be skipped.  This is also compatible (it is the way it works right now).

Because of compatibility, "sane" probably needs an option to switch it on. It may also need client-side and server-side overrides to help these two out if they can't guess or guess wrong.
Compatible also needs an option to switch it on, and parameters to control the conversion.  It is by definition client-side only; the client needs to be told what the server needs (and also may need help in guessing its own encoding).  (For symmetry, it is also conceivable to add a server-side compatible option as part of the ssh-options.)
Fast is the current (2.x) default and probably should stay the default for compatibility.

So I propose (names are descriptive, but not optimal yet):

--encoding-aware: Switches on sane.
--client-encoding: supplies (overrides) value for client-side encoding for sane.
--server-encoding: supplies (overrides) value for server-side encoding for sane.
--transfer-encoding: overrides the transfer-encoding (default: UTF-8 NFC).
--server-encoding-unaware: don't tell the server anything, but do everything on client-side.
--client-encoding-unaware: inverse (if you want to do that).

Maybe combining --encoding-aware and --server-encoding-unaware into one --client-encoding-aware is better.
Maybe combining --encoding-aware and --client-encoding-unaware into one --server-encoding-aware is better.
In both cases, this is somewhat confusing, because you want to keep the sane transfer coding unless you are in the compatible case.

The only switch that needs a single-character form is --encoding-aware, which should become part of finger memory like -a for most rsync users.

Comment 10 Matt McCutchen 2007-10-30 17:15:03 UTC
(In reply to comment #9)
> The current solution appears to be somewhat confused about what it is trying to
> solve.

Rather, you appear to be overcomplicating the problem.

> There are three filename encodings: the one in the client fs, the transfer
> encoding, the one in the server fs.
> Client needs to know client-fs and transfer, server needs to know server-fs and
> transfer.
> Trying to mush up any two of the three leads to pain.

Rsync isn't like MySQL, which tags every string value with its encoding, and I don't see why we would want to make it that way.  Instead, the rsync sender and receiver each treat filenames as plain sequences of bytes, in accordance with the POSIX filesystem API on which rsync relies so heavily.  --iconv merely allows you to make the sender and receiver byte sequences differ by an encoding conversion because this is often useful.

> -- compatible: The server may not know about iconv.  So the client has to do
> all the conversions.  This is almost supported now, except that the client sends
> an iconv option that the server does not understand.

This is the only thing you propose that rsync does not already support, and I think it is a natural addition to rsync.  Currently, if iconv is enabled, each process converts strings from its local encoding to UTF-8 before sending them over the wire and converts strings from UTF-8 to its local encoding after reading them from the wire.  Rsync should let the user specify another encoding in place of UTF-8.

Specifically, I propose two options to specify the conversion, if any, to be applied on each end: --iconv-client=CLIENT,WIRE and --iconv-server=WIRE,SERVER .  (There's no reason rsync shouldn't allow the two values of WIRE to be different, although this would rarely be useful.)  --iconv=CLIENT,SERVER then stands for --iconv-client=CLIENT,UTF-8 --iconv-server=UTF-8,SERVER .  A "compatible" copy with a UTF-8 client and an ISO-8859-1 server could be achieved by --iconv-client=UTF-8,ISO-8859-1 .
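
The proposal above can be modelled in a few lines of Python. This is only a sketch of the semantics: --iconv-client and --iconv-server are proposed options, not existing rsync flags, and the names expand_iconv() and send_name() are hypothetical helpers. A conversion is a (from, to) pair, and a side that does no conversion is represented by None.

```python
from typing import Optional, Tuple

Pair = Tuple[str, str]  # (from_encoding, to_encoding)


def expand_iconv(client: str, server: str, wire: str = "UTF-8") -> Tuple[Pair, Pair]:
    """Model --iconv=CLIENT,SERVER as the two proposed per-side conversions."""
    return (client, wire), (wire, server)


def send_name(name: bytes, client_conv: Optional[Pair],
              server_conv: Optional[Pair]) -> bytes:
    """Follow one filename from the client's filesystem to the server's."""
    if client_conv:
        name = name.decode(client_conv[0]).encode(client_conv[1])
    if server_conv:
        name = name.decode(server_conv[0]).encode(server_conv[1])
    return name


name = "\u00fcber.txt".encode("UTF-8")

# Both sides convert, with UTF-8 on the wire.
both = send_name(name, *expand_iconv("UTF-8", "ISO-8859-1"))

# "Compatible" case: only the client converts, straight to the
# server's encoding, and the server passes the bytes through.
compat = send_name(name, ("UTF-8", "ISO-8859-1"), None)

print(both, compat)
```

Both calls end with the same bytes on the server; they differ only in which side does the work and in what travels over the wire.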

> The only switch that needs a single-character form is --encoding-aware, which
> should become part of finger memory like -a for most rsync users.

I think --iconv=. or --encoding-aware is too special-purpose to "need" a single-character form in the main version of rsync.  If you use it frequently, you can always define your own popt alias.  This is what Wayne recommended for my favorite "sane" option, --chmod=ugo=rwX .
Comment 11 Carsten Bormann 2007-10-30 19:22:30 UTC
> Currently, if iconv is enabled, each
> process converts strings from its local encoding to UTF-8 before sending them
> over the wire and converts strings from UTF-8 to its local encoding after
> reading them from the wire.  

I must admit I didn't get this at all from the documentation (I assume when you say "UTF-8" you mean "UTF-8 NFC", which may be its conventional meaning in the Samba world, I don't know).
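
The distinction between "UTF-8" and "UTF-8 NFC" matters because the same visible name can have more than one UTF-8 byte sequence, depending on the Unicode normalization form. A short Python illustration, using only the standard unicodedata module and a made-up sample name:

```python
import unicodedata

name = "\u00fcber"  # "über" with a precomposed u-umlaut

# NFC keeps the precomposed character; NFD splits it into "u"
# followed by a combining diaeresis (U+0308).
nfc = unicodedata.normalize("NFC", name).encode("utf-8")
nfd = unicodedata.normalize("NFD", name).encode("utf-8")

print(nfc)  # b'\xc3\xbcber'
print(nfd)  # b'u\xcc\x88ber'
```

Both byte strings are valid UTF-8 and display identically, but a byte-wise comparison treats them as two different filenames.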

> Specifically, I propose two options to specify the conversion, if any, to be
> applied on each end: --iconv-client=CLIENT,WIRE and --iconv-server=WIRE,SERVER

Sounds good to me.  WIRE might default to "UTF-8" to make things even simpler for the most sane cases.

> I think --iconv=. or --encoding-aware is too special-purpose to "need" a
> single-character form in the main version of rsync.  If you use it frequently,
> you can always define your own popt alias.  This is what Wayne recommended for
> my favorite "sane" option, --chmod=ugo=rwX .

As I live in heterogeneous environments, I'm not so sure about that, but then there is always RSYNC_ICONV (which would need to be able to set the new options, too, hmm).

Thanks for the quick and sane reply!

Comment 12 Matt McCutchen 2009-10-30 10:48:20 UTC
The stable rsync has had --iconv for a while now.  Is there a reason this bug isn't marked fixed?  (If someone wants to pursue my idea from comment #10, they should enter a new bug or at least narrow the summary of this one.)
Comment 13 Wayne Davison 2009-11-01 16:17:31 UTC
You're right, this bug should be closed, so I'm doing so now.

The idea of doing an rsync version that does all the conversions on one side so that it can work with an older, non-iconv rsync is not something that I plan to implement.

Note that it would require that the converting side cache both versions of the names (because it needs to be able to sort by the remote names, and access files via the local names), and thus bloat the amount of required memory for the transfer.
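
A small Python sketch of why both versions of each name would have to be cached. It is not based on rsync's file-list code: the Entry type and build_file_list() are hypothetical, and cp437 is chosen as the remote encoding only because it makes the ordering difference easy to see.

```python
from typing import Iterable, List, NamedTuple


class Entry(NamedTuple):
    local: bytes   # name used to open the file on this side
    remote: bytes  # name as the non-converting peer sees it


def build_file_list(local_names: Iterable[bytes], local_enc: str,
                    remote_enc: str) -> List[Entry]:
    """Keep both names per file and order the list by the remote name."""
    entries = [Entry(n, n.decode(local_enc).encode(remote_enc))
               for n in local_names]
    entries.sort(key=lambda e: e.remote)
    return entries


names = ["\u00e4.txt".encode("utf-8"), "\u00e9.txt".encode("utf-8")]
entries = build_file_list(names, "utf-8", "cp437")

for e in entries:
    print(e.local, e.remote)
```

Sorted by their UTF-8 bytes, the "ä" name comes first; sorted by their cp437 bytes, the "é" name comes first. The converting side therefore cannot sort by the local names and convert later, and every entry carries two byte strings instead of one.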