Bug 14401 - unicode character conversion problem from MacOS to Linux despite iconv
Summary: unicode character conversion problem from MacOS to Linux despite iconv
Status: RESOLVED WORKSFORME
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 3.1.3
Hardware: All All
Importance: P5 normal
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-06-04 16:22 UTC by Tormen
Modified: 2020-06-06 22:15 UTC

Attachments
File breaking the rsync --iconv=utf-8-mac,utf-8 -- "ls" in "Terminal" App (311.23 KB, image/png)
2020-06-04 17:58 UTC, Tormen
File breaking the rsync --iconv=utf-8-mac,utf-8 (186.52 KB, application/x-gzip)
2020-06-04 18:00 UTC, Tormen
File breaking the rsync --iconv=utf-8-mac,utf-8 -- in Finder (1.72 MB, image/png)
2020-06-04 18:11 UTC, Tormen

Description Tormen 2020-06-04 16:22:13 UTC
// SOURCE (initiating rsync):
ProductName:	Mac OS X
ProductVersion:	10.15.4
BuildVersion:	19E287

Homebrew rsync:
rsync  version 3.1.3  protocol version 31
Copyright (C) 1996-2018 by Andrew Tridgell, Wayne Davison, and others.
Web site: http://rsync.samba.org/
Capabilities:
    64-bit files, 64-bit inums, 64-bit timestamps, 64-bit long ints,
    socketpairs, hardlinks, symlinks, IPv6, batchfiles, inplace,
    append, ACLs, xattrs, iconv, symtimes, no prealloc, file-flags

// TARGET:
Debian Linux 10.4
Linux 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2 (2020-04-29) x86_64 GNU/Linux

Debian rsync:
rsync  version 3.1.3  protocol version 31
Copyright (C) 1996-2018 by Andrew Tridgell, Wayne Davison, and others.
Web site: http://rsync.samba.org/
Capabilities:
    64-bit files, 64-bit inums, 64-bit timestamps, 64-bit long ints,
    socketpairs, hardlinks, symlinks, IPv6, batchfiles, inplace,
    append, ACLs, xattrs, iconv, symtimes, prealloc

rsync comes with ABSOLUTELY NO WARRANTY.  This is free software, and you
are welcome to redistribute it under certain conditions.  See the GNU
General Public Licence for details.

Problem:
2020/06/04 17:12:21 [12205] [sender] cannot convert filename: Users/me/Library/Mail/V7/59923E9C-ACCC-45B0-B179-4CD4EA4D87D5/Sent Messages.mbox/DEED0205-D544-48AF-BDB2-40C0E6D5380C/Data/5/3/5/1/Attachments/1535951/3/<F0><9F><9B><84> Danke! Ihre Buchung ist besta<CC><88>tigt: Dolly Waikiki.eml (Illegal byte sequence)
2020/06/04 17:12:21 [12205] IO error encountered -- skipping file deletion

ls on the filename shows:
Comment 1 Wayne Davison 2020-06-04 17:38:34 UTC
Macs use a weird utf-8-mac encoding that you need to make sure you're specifying. If the iconv library complains about the encoding, then either the encoding name is wrong or you have an invalid file that isn't named with the specified encoding.

So, if you're running rsync on a mac and copying to/from a linux host, you should be able to specify:

--iconv=utf-8-mac,utf-8

to specify the local & remote charset.
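
A minimal sketch of the most visible difference between the two charsets, assuming that utf-8-mac names are (approximately) decomposed and that plain utf-8 names on Linux are usually precomposed. It uses Python's `unicodedata.normalize` as a stand-in for the conversion; this is an illustration, not what rsync or iconv actually run, and `compose_name` is a hypothetical helper name.

```python
import unicodedata


def compose_name(name: str) -> str:
    """Approximate utf-8-mac -> utf-8 by composing to NFC (an assumption, not iconv's code)."""
    return unicodedata.normalize("NFC", name)


# "a" followed by U+0308 COMBINING DIAERESIS, as in the decomposed form of "ä".
mac_name = "besta\u0308tigt"
linux_name = compose_name(mac_name)

# The decomposed form contains the bytes 61 cc 88; the composed form contains c3 a4.
print(mac_name.encode("utf-8").hex(" "))
print(linux_name.encode("utf-8").hex(" "))
```

Both strings render the same on screen, but they are different byte sequences, which is why a charset pair has to be given at all.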
Comment 2 Tormen 2020-06-04 17:57:40 UTC
Hi @wayne. I only now noticed that my comment got truncated at the "ls" part.

I had also included the rsync command I used:

'/usr/local/bin/rsync'  -e 'ssh -p 53146 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null' -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --partial --delete --iconv=utf-8-mac,utf-8 --compress --log-file '/Users/admin/.rsync-tmbackup/2020-06-04-195558.log' --exclude-from '/LINKS/etc/rsync-tmbackup.exclude-from.macado'  -- '/System/Volumes/Data/' 'root@jolie:/jbackup/macado/System.Volumes.Data/2020-06-04-1955.50___FULL-BACKUP/'

As you can see I already use(d) "--iconv=utf-8-mac,utf-8"

Since the file is an email stored by the macOS Mail app, its name should use the standard macOS encoding and be nothing unusual for a Mac.
Comment 3 Tormen 2020-06-04 17:58:42 UTC
Created attachment 16018
File breaking the rsync --iconv=utf-8-mac,utf-8 -- "ls" in "Terminal" App
Comment 4 Tormen 2020-06-04 18:00:50 UTC
Created attachment 16019
File breaking the rsync --iconv=utf-8-mac,utf-8
Comment 5 Tormen 2020-06-04 18:04:10 UTC
> If the iconv library complains about the encoding, then either the encoding name is wrong or you have an invalid file that isn't named with the specified encoding.

How can I "zoom in" here to verify whether the filename is encoded in utf-8-mac or not?
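
One possible way to inspect such a name, sketched in Python under two assumptions: the raw filename bytes are available (for example from `os.listdir(b".")`), and "is it utf-8-mac" can be approximated by "is it valid UTF-8 and already in NFD form". Apple's variant differs from strict NFD in some ranges, so the check is only a heuristic. `describe_name` and `is_decomposed` are hypothetical helper names.

```python
import unicodedata


def describe_name(raw: bytes) -> list:
    """Return one line per code point: code point, its UTF-8 bytes, and its Unicode name."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError if the bytes are not valid UTF-8
    lines = []
    for ch in text:
        utf8_hex = ch.encode("utf-8").hex(" ")
        lines.append(f"U+{ord(ch):04X}  {utf8_hex}  {unicodedata.name(ch, '<unnamed>')}")
    return lines


def is_decomposed(raw: bytes) -> bool:
    """Heuristic: True if the name is unchanged by NFD normalization."""
    text = raw.decode("utf-8")
    return unicodedata.normalize("NFD", text) == text


# The first bytes of the failing name, taken from the log line in the description.
sample = b"\xf0\x9f\x9b\x84 besta\xcc\x88tigt"
for line in describe_name(sample):
    print(line)
print("decomposed:", is_decomposed(sample))
```

If the bytes decode cleanly and the name is already decomposed, the name itself is probably fine and the converter is the more likely culprit.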
Comment 6 Tormen 2020-06-04 18:11:17 UTC
Created attachment 16020
File breaking the rsync --iconv=utf-8-mac,utf-8 -- in Finder
Comment 7 SATOH Fumiyasu 2020-06-05 06:14:22 UTC
The `
Comment 8 SATOH Fumiyasu 2020-06-05 06:16:18 UTC
U+1F6C4 (BAGGAGE CLAIM emoji, <F0><9F><9B><84> in UTF-8) is a Unicode character outside the Basic Multilingual Plane, so it is represented as a surrogate pair in UTF-16, but the UTF-8-MAC encoding in macOS's iconv(3) does not support surrogate pairs.

Try compiling your rsync binary against my hacked GNU libiconv:
https://github.com/fumiyas/libiconv-utf8mac

...

Bugzilla does not support surrogate pairs...
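
To make the surrogate-pair point concrete, here is a small sketch that computes the UTF-16 surrogate pair for U+1F6C4 by hand and prints it next to the UTF-8 bytes seen in the log. `surrogate_pair` is a hypothetical helper written for this illustration; it says nothing about how any iconv implementation is structured internally.

```python
def surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary-plane code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point must be in the range U+10000..U+10FFFF")
    offset = code_point - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 bits into the low surrogate.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


emoji = "\U0001F6C4"
high, low = surrogate_pair(ord(emoji))

print(f"UTF-8 : {emoji.encode('utf-8').hex(' ')}")
print(f"UTF-16: {high:04X} {low:04X}")
```

A converter that works through 16-bit units internally has to handle this pair explicitly, which is the kind of gap the comment above describes.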
Comment 9 Wayne Davison 2020-06-05 06:26:17 UTC
One other thing you could do when sending files to Linux is to not translate the names. This works because Linux (unlike macOS) can create filenames with oddball byte sequences, so it can store and retrieve the raw filenames just fine for something like backup and restore. The only effect is that some names display with "?" characters when listed on Linux, or with escape sequences when listed with `ls -b`:

\360\237\233\204\ Danke!\ Ihre\ Buchung\ ist\ bestätigt:\ Dolly\ Waikiki.eml

In any case, this sounds like an iconv issue, not an rsync issue.
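
The `ls -b` escapes shown above are just the raw filename bytes written in octal. A rough sketch of turning them back into bytes, assuming only three-digit octal escapes and backslash-escaped literal characters (such as `\ ` for a space) occur; C-style escapes like `\n` are not handled. `decode_ls_b` is a hypothetical helper.

```python
import re


def decode_ls_b(escaped: str) -> bytes:
    """Undo `ls -b` style escaping: \\NNN octal escapes and backslash-escaped literals."""
    data = escaped.encode("utf-8")

    def replace(match):
        if match.group(1) is not None:
            # Three octal digits encode one raw byte.
            return bytes([int(match.group(1), 8)])
        # Otherwise keep the escaped character itself (for example a space).
        return match.group(2)

    return re.sub(rb"\\(?:([0-3][0-7]{2})|(.))", replace, data, flags=re.DOTALL)


raw = decode_ls_b(r"\360\237\233\204\ Danke!")
print(raw.hex(" "))
print(raw.decode("utf-8"))
```

The four octal escapes decode to the same four bytes that appear as <F0><9F><9B><84> in the rsync log, so the name is stored on Linux unchanged.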
Comment 10 Tormen 2020-06-05 10:10:08 UTC
Thank you very much for your comments!

I agree that the problem I pointed out would then be a limitation of iconv.

The comment about not converting was really helpful. I was not sure about leaving --iconv off, but it makes sense. I will go down this road and hope that I always have a Mac to restore to ;)

This ticket can now be closed... but I left it open, as I was not sure what status to pick in this case.