The hash2 mangling method marks certain characters, such as \ and |, as FLAG_ILLEGAL in init_tables(). But at least in Japanese encodings these byte values (in the range 0x40-0x7e) can occur as the second byte of a multibyte CHARACTER. So we have to set "mangling method = hash" to handle Japanese correctly. At the very least this should be documented in the Samba HOWTO Collection. The same issue exists in the Samba 2.2 series, but Samba 2.2.x's default is "hash". In the hash2 method, converting the string to Unicode first and matching the ILLEGAL chars/filenames against that would probably solve this problem.
I changed the severity of this bug from minor to major, because "mangling method = hash" causes another problem (see Bug 190). This means that we cannot support Japanese correctly.
In addition, hash2 counts the length of the string in the "unix charset" encoding, but it should count the length in the "dos charset" encoding. In Japanese, some characters have different byte lengths in "dos charset" and "unix charset". This may not cause any real problems at present, but it is not correct.
Here is a temporary patch. I verified that all Japanese characters are displayed correctly with "unix charset = CP932" and "unix charset = EUCJP-MS" on Intel Linux. This patch probably does not work on big-endian machines yet, so I will check on a big-endian machine this weekend.

----- CUT HERE -----
--- mangle_hash2.c.org	Sun Jun 29 00:43:11 2003
+++ mangle_hash2.c	Tue Jul  1 06:57:05 2003
@@ -444,32 +444,42 @@
 static BOOL is_legal_name(const char *name)
 {
-	const char *dot_pos = NULL;
+	smb_ucs2_t *dot_pos = NULL;
 	BOOL alldots = True;
 	size_t numdots = 0;
+	smb_ucs2_t *ucs2name, *ucs2namep;
 
-	while (*name) {
-		if (FLAG_CHECK(name[0], FLAG_ILLEGAL)) {
-			return False;
-		}
-		if (name[0] == '.') {
-			dot_pos = name;
+	ucs2name = acnv_uxu2(name);
+	ucs2namep = ucs2name;
+
+	while (*ucs2namep) {
+		if (ucs2namep[0] < 0x80) {
+			if (FLAG_CHECK(ucs2namep[0], FLAG_ILLEGAL)) {
+				SAFE_FREE(ucs2name);
+				return False;
+			}
+			if (ucs2namep[0] == UCS2_CHAR('.')) {
+				dot_pos = ucs2namep;
 				numdots++;
 			} else {
 				alldots = False;
 			}
-		name++;
+		}
+		ucs2namep++;
 	}
 
 	if (dot_pos) {
-		if (alldots && (numdots == 1 || numdots == 2))
+		if (alldots && (numdots == 1 || numdots == 2)) {
+			SAFE_FREE(ucs2name);
 			return True; /* . or .. is a valid name */
-
+		}
 		/* A valid long name cannot end in '.' */
-		if (dot_pos[1] == '\0')
+		if (dot_pos[1] == '\0') {
+			SAFE_FREE(ucs2name);
 			return False;
+		}
 	}
-
+	SAFE_FREE(ucs2name);
 	return True;
 }
----- CUT HERE -----
moving to extended character component
Reassign this to me and accept bug.
Doing a ucs2 conversion here is *very* expensive (we call this a *lot*). However, if we are going to do one, it would also solve the related issue that we need to mangle invalid multibyte strings.
>Doing a ucs2 conversion here is *very* expensive (we call this a *lot*)

I agree. From the viewpoint of multibyte (non-ASCII) processing, I think fixing Samba's internal encoding to UCS-2 is the best approach (currently it depends on the unix charset). I hope this happens in a future release: Samba 3.0.x, or Samba 3.2.x, or 4.x.

Now:
  wire          - UCS-2 (default)
  internal      - depends on unix charset
  filename      - depends on unix charset
  file contents (smb.conf/username map file/log files etc.) - depends on unix charset
  SWAT          - depends on display charset

What I think is best:
  wire          - UCS-2
  internal      - UCS-2
  SWAT          - UTF-8
  file contents - UTF-8
  filename      - depends on unix charset

Using UCS-2 as the internal encoding:
1) we avoid the expensive (and meaningless) conversions between the unix charset and UCS-2 when calling string functions;
2) most of the string-length problems I have reported are avoided automatically, and I think a fixed-width charset will reduce other bugs.

Using UTF-8 for file contents:
1) ASCII characters still display as ASCII;
2) no UCS-2 information is lost in conversion, because UTF-8 is computed arithmetically from UCS-2;
3) we do not need to convert (or rebuild) the contents of files when the unix charset is changed.

In any case, the unix charset should be used only for filenames and other strings that UNIX users are directly exposed to. smb.conf, log files, and username map files are used directly only by Samba administrators. And if SWAT emits UTF-8, most browsers can render it.
Jeremy, here's another multibyte issue
Ok - the simplest fix seems to be to prefix the ascii check for an illegal character with a check for a multi-byte string. I'll add this in. Jeremy.
Ok, I have checked in a fix for this. Please CVS checkout the SAMBA_3 tree and let me know if this is fixed. Jeremy
Why don't we check the full multibyte validity of the string? If the charset is utf8, this would be the ideal point to ensure we mangle invalid utf8, rather than sending the unconverted string to the client (which is what will happen if invalid unix-charset strings are allowed to reach our new 'improved' convert_string() routines).
I checked the current SAMBA_3_0 branch and it works well in the following Japanese environments, all using the Samba 3.0.0 CVS version plus a patched libiconv-1.8:

1) Debian GNU/Linux 3.0 + unix charset = CP932
2) Debian GNU/Linux 3.0 + unix charset = EUCJP-MS
3) Debian GNU/Linux 3.0 + unix charset = CP932 + vfs_cap
4) Solaris 9 SPARC Edition + unix charset = CP932
5) Solaris 9 SPARC Edition + unix charset = EUCJP-MS

Thanks!
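For reference, a minimal smb.conf fragment for one of the tested setups might look like this (a sketch assuming Samba 3.0 parameter names; configuration 3 additionally loads the cap VFS module on the share):

```
[global]
    dos charset  = CP932
    unix charset = CP932

[share]
    path = /export/share
    vfs objects = cap
```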
monyo says ok.
originally reported against one of the 3.0.0rc[1-4] releases. Cleaning up non-production versions.
Sorry for the spam; cleaning up the database to prevent unnecessary reopens of bugs.