Bug 186 - mangling method = hash2 can't work with multibyte strings
Summary: mangling method = hash2 can't work with multibyte strings
Status: CLOSED FIXED
Alias: None
Product: Samba 3.0
Classification: Unclassified
Component: Extended Characters (show other bugs)
Version: 3.0.0preX
Hardware: All All
: P2 major
Target Milestone: 3.0.0rc2
Assignee: Jeremy Allison
QA Contact:
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2003-06-23 10:25 UTC by TAKAHASHI Motonobu
Modified: 2005-08-24 10:16 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description TAKAHASHI Motonobu 2003-06-23 10:25:53 UTC
hash2 defines some characters such as \, | are FLAG_ILLEGAL at init_tables().
But at least in Japanese these characters (between 0x40 and 0x7e) occur the 2nd 
byte of a CHARACTER.
So we need to set "mangling method = hash" to treat Japanese correctly.
At least it should be written in Samba HOWTO collection.

This also happens in Samba 2.2 series, but Samba 2.2.x's default is "hash".

In hash2 method, once to convert string to Unicode and to match the ILLEGAL 
chars/filenames, this problem may probably be solved.
Comment 1 TAKAHASHI Motonobu 2003-06-26 09:52:51 UTC
I changed the severity of this bug from minor to major, 
because "mangling method = hash" causes
another problem (see Bug 190).
This means that we cannot support Japanese correctly.

 
Comment 2 TAKAHASHI Motonobu 2003-06-29 09:36:53 UTC
In addition, hash2 counts the length of "unix charset" string, but this should
be that of "dos charset" strings.

In Japanese, some characters have different length between "dos charset" and
"unix charset".

Currently, this may not cause any real problems, but is not good.
Comment 3 TAKAHASHI Motonobu 2003-07-01 10:56:53 UTC
Here is a temporary patch.
I checked all Japanese character in "unix charset = CP932/EUCJP-MS" are 
displayed correctly on Intel Linux.

Currently this patch should not work on Big Endian machine, so I will check on 
the Big Endian machine this weekend.

----- CUT HERE -----
--- mangle_hash2.c.org  Sun Jun 29 00:43:11 2003
+++ mangle_hash2.c      Tue Jul  1 06:57:05 2003
@@ -444,32 +444,42 @@

 static BOOL is_legal_name(const char *name)
 {
-       const char *dot_pos = NULL;
+       smb_ucs2_t *dot_pos = NULL;
        BOOL alldots = True;
        size_t numdots = 0;
+       smb_ucs2_t *ucs2name, *ucs2namep;

-       while (*name) {
-               if (FLAG_CHECK(name[0], FLAG_ILLEGAL)) {
-                       return False;
-               }
-               if (name[0] == '.') {
-                       dot_pos = name;
+       ucs2name = acnv_uxu2(name);
+       ucs2namep = ucs2name;
+
+       while (*ucs2namep) {
+         if (ucs2namep[0] < 0x80) {
+           if (FLAG_CHECK(ucs2namep[0], FLAG_ILLEGAL)) {
+             SAFE_FREE(ucs2name);
+             return False;
+           }
+       if (ucs2namep[0] == UCS2_CHAR('.')) {
+                       dot_pos = ucs2namep;
                        numdots++;
                } else {
                        alldots = False;
                }
-               name++;
+         }
+         ucs2namep++;
        }

        if (dot_pos) {
-               if (alldots && (numdots == 1 || numdots == 2))
+         if (alldots && (numdots == 1 || numdots == 2)) {
+       SAFE_FREE(ucs2name);
                        return True; /* . or .. is a valid name */
-
+         }
                /* A valid long name cannot end in '.' */
-               if (dot_pos[1] == '\0')
+         if (dot_pos[1] == '\0') {
+       SAFE_FREE(ucs2name);
                        return False;
+         }
        }
-
+       SAFE_FREE(ucs2name);
        return True;
 }
----- CUT HERE -----
Comment 4 Gerald (Jerry) Carter (dead mail address) 2003-07-15 09:34:20 UTC
moving to extended character component
Comment 5 Alexander Bokovoy 2003-07-15 11:01:54 UTC
Reassign this to me and accept bug.
Comment 6 Andrew Bartlett 2003-08-06 21:57:27 UTC
Doing a ucs2 conversion here is *very* expensive (we call this a *lot*), however
if we are going to do one, it would solve the related issue that we need to
mangle  invalid multibyte strings.   
Comment 7 TAKAHASHI Motonobu 2003-08-10 12:10:41 UTC
>Doing a ucs2 conversion here is *very* expensive (we call this a *lot*)

I think so.
At the view of multibyte(non-ASCII chars) processing, I think to fix Samba 
internal encode to UCS-2 is the best way (Currently depends on unix charset).

I hope that for future release of Samba 3.0.x ( or Samba 3.2.x or 4.x).

Now:
wired - UCS-2 (default)
internal - depends on unix charset
filename - depends on unix charset
file contents(smb.conf/username map file/log files and etc.) - depend on unix 
charset 
swat - depend on display charset

The best way I think:
wired - UCS-2
internal - UCS2
swat - UTF-8
file contents - UTF-8
filename - depends on unix charset

using UCS-2 as internal code, 
1) we avoid the expensive (and meaningless) code conversion between unix 
charset to UCS-2 during calling string functions,
2) most string length problems that I reported will be automatically avoided 
and I think that using fixed length charset will reduce other bugs. 

using UTF-8 as file contents, 
1) we can display ASCII chars as ascii,
2) UCS-2 information is not omitted by conversion
   because UTF-8 is arithmetically calculated from UCS-2,
3) we need not convert (or rebuild) contents of files when we change the unix 
charset

Anyway, unix charset should be used only for file names or other strings that 
UNIX users are explicitly affected.
At least smb.conf, log files, username map files are directly used by Samba 
administrators only.
And if SWAT displays UTF-8 strings, most browser can recognize.
Comment 8 Gerald (Jerry) Carter (dead mail address) 2003-08-26 12:14:05 UTC
Jeremy,  here's another multibyte issue
Comment 9 Jeremy Allison 2003-08-27 11:26:55 UTC
Ok - the simplest fix seems to be to prefix the
ascii check for an illegal character with a check
for a multi-byte string. I'll add this in.

Jeremy.
Comment 10 Jeremy Allison 2003-08-27 12:02:51 UTC
Ok, I have checked in a fix for this.
Please CVS checkout the SAMBA_3 tree and let me know if this is fixed.
Jeremy
Comment 11 Andrew Bartlett 2003-08-27 15:44:06 UTC
Why don't we check for the full multi-byte status of the string?

If the charset is utf8, this would be the ideal point to ensure we mangle
invalid utf8, rather than sending the unconverted string to the client (which is
what will happen if the invalid unix charsets are allowd to get to our new
'improved' convert_string() routines).

Comment 12 TAKAHASHI Motonobu 2003-08-29 07:36:07 UTC
I checked the current SAMBA_3_0 branch and 
it works well under Japanese environments:

Commonly using Samba 3.0.0 CVS version + patched libiconv-1.8
1) Debian/GNU Linux 3.0 + unix charset = CP932
2) Debian/GNU Linux 3.0 + unix charset = EUCJP-MS
3) Debian/GNU Linux 3.0 + unix charset = CP932 + vfs_cap
4) Solaris 9 SPARC Edition + unix charset = CP932
5) Solaris 9 SPARC Edition + unix charset = EUCJP-MS

Thanks!
Comment 13 Gerald (Jerry) Carter (dead mail address) 2003-08-29 07:53:07 UTC
monyo says ok.
Comment 14 Gerald (Jerry) Carter (dead mail address) 2005-02-07 09:06:30 UTC
originally reported against one of the 3.0.0rc[1-4] releases.
Cleaning up non-production versions.
Comment 15 Gerald (Jerry) Carter (dead mail address) 2005-08-24 10:16:41 UTC
sorry for the same, cleaning up the database to prevent unecessary reopens of bugs.