When Samba parses a share name, it scans sequentially for the closing bracket that terminates the name. The closing bracket is ASCII code 0x5D. However, in some MB character sets such as Japanese CP932 and Chinese GB18030, there are characters that have 0x5D as their second or fourth byte. Because the parser reads the share name byte by byte, it cannot tell that such a 0x5D is part of a MB char and wrongly treats it as the end of the share name definition. Hence, upon loading smb.conf, if a MB char containing 0x5D is used within a share name definition, the name is cut off in the middle.
This bug is a duplicate of BUG#462.
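A small illustration of the failure described above, written in Python rather than taken from Samba. `naive_section_name` is a hypothetical helper that mimics a byte-by-byte search for 0x5D; it is only a sketch of the behaviour, not the code in param/param.c.

```python
# Hypothetical helper (not Samba code): find the section name by
# searching the raw bytes for the first 0x5D (']').
def naive_section_name(line: bytes) -> bytes:
    start = line.index(b"[") + 1
    end = line.index(b"]", start)  # first 0x5D byte, wherever it sits
    return line[start:end]

# "[<95><5d>]" as raw CP932 bytes: one two-byte character in brackets.
line = b"[\x95\x5d]"
name = naive_section_name(line)

# The trail byte 0x5D was taken as the closing bracket, so only the
# lead byte 0x95 is left in the name.
print(name.hex())
```

The name comes back as the single byte 0x95, which is not a complete CP932 character, so the share name is broken.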
Sorry, my mistake. Please ignore my previous comment.
Created attachment 373
A patch for loading smb.conf twice, once in unix charset and once in UTF-8
This patch (partially) solves BUG #962 and #957 by loading smb.conf twice. In the first phase the parser looks for unix charset; in the second phase it converts the characters into UTF-8, parses them, and converts them back to the original unix charset before passing them on to other functions. It works only if unix charset is correctly identified in the first phase. For example, if smb.conf is defined as
[global]
server string = <0xXX><0x5c>
unix charset = CP932
where <0xXX><0x5c> is a MB character whose second byte <0x5c> corresponds to '\' in ASCII, the parser fails to load unix charset in the first phase (refer to BUG #957).
I know this is not a neat solution, but there is not much we can do unless we change the way smb.conf is handled.
Are we (in the latest Samba 3.0 CVS) in better shape for this now?
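As a rough illustration of the two-phase idea in this comment (not the attached patch itself), the sketch below is written in Python. `find_unix_charset` and `section_names` are hypothetical names, the regular expression is a simplification, and backslash line continuation (the BUG #957 case) is deliberately not modelled.

```python
import re

def find_unix_charset(raw: bytes, default: str = "ascii") -> str:
    # Phase 1: look for an ASCII-only "unix charset = X" line in the raw bytes.
    m = re.search(rb"^\s*unix charset\s*=\s*([A-Za-z0-9_-]+)", raw, re.MULTILINE)
    return m.group(1).decode("ascii") if m else default

def section_names(raw: bytes):
    charset = find_unix_charset(raw)
    # Phase 2: convert to UTF-8, where bytes 0x00-0x7F never occur
    # inside a multibyte sequence, so a byte search for ']' is safe.
    utf8 = raw.decode(charset).encode("utf-8")
    names = []
    for line in utf8.splitlines():
        line = line.strip()
        if line.startswith(b"[") and b"]" in line:
            name = line[1:line.index(b"]")]
            # Convert back to the original unix charset for the caller.
            names.append(name.decode("utf-8").encode(charset))
    return names

conf = b"[global]\nunix charset = CP932\n[\x95\x5d]\ncomment = test\n"
print(section_names(conf))
```

With this input the second section name keeps both bytes, 0x95 and 0x5D, because the search for ']' happens on the UTF-8 form.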
No, unless you've done something to param/param.c, it is impossible to fix this bug. Same principle applies to Bug #957.
can't see this getting fixed in Samba 3.
> can't see this getting fixed in Samba 3.
Its severity is very high in Japan. Unless this bug is fixed, we can hardly say in Japan that Samba is i18n'ed and Japanese-ready. I think a good resolution is to require that smb.conf always be written in UTF-8, regardless of unix charset. This change will also fix BUG#1069 and BUG#496.
P.S. I think the encoding of all files such as smb.conf, tdb files, etc. should be fixed (UTF-8 is probably best) and should not depend on unix charset.
No, we can't arbitrarily force utf8 for smb.conf. This will break a lot of smb.confs. I thought we'd always specified that the "unix charset" and "dos codepage" entries *must* come first for an smb.conf to be read correctly in the native codepage. I even remember writing some docs to that effect.... If we tell users that the "unix charset" entry must come first in the smb.conf - does this fix the problem in Japan ? If so, then this is a documentation issue. As I recall this was the way it was supposed to work (unix charset must come first if you need mb characters in smb.conf). Jeremy.
Created attachment 974
How a share name with char 0x955d is broken in share list.
(In reply to comment #8)
> No, we can't arbitrarily force utf8 for smb.conf. This will
> break a lot of smb.confs.
Hmmm...,
> I thought we'd always specified that the "unix charset" and
> "dos codepage" entries *must* come first for an smb.conf to be read
> correctly in the native codepage.
Of course, yes. But this problem occurs even if we write {unix,dos} charset first in smb.conf. I attached a sample image to show how the shares are shown. The smb.conf is written like this:
-----
[global]
dos charset = CP932
unix charset = CP932
...
[<95><5b>]
comment = 0x955b
[<95><5c>]
comment = 0x955c
[<95><5d>]
comment = 0x955d
[<95><5e>]
comment = 0x955e
-----
> does this fix the problem in Japan ?
Unfortunately no.
Ok so the bug looks to be no mb processing when looking for ']' characters within smb.conf stanza processing. This is a (relatively) simple fix within param/param.c, as ']' is the only special character looked for (ok, maybe some spaces as well). All we need do is correctly change param/param.c to process the current mb unix character set and ensure all the docs say that "unix charset" must come first. Do you concur ? This is a much easier fix than loading twice or converting to utf8. I'll look at this for 3.0.12. Jeremy.
(In reply to comment #10)
> All we need do is correctly
> change param/param.c to process the current mb unix character set
> and ensure all the docs say that "unix charset" must come first.
> Do you concur ?
Yes, I think so.
> I'll look at this for 3.0.12.
> Jeremy.
OK, thanks.
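A minimal sketch of the approach proposed in this comment, assuming the unix charset is already known when the stanza header is read. It is written in Python for brevity; `find_section_end` is a hypothetical name and this is not the code in param/param.c. It walks the line one byte at a time through an incremental decoder, so a 0x5D that is the trail byte of a MB character is never seen as a ']' on its own.

```python
import codecs

def find_section_end(line: bytes, charset: str) -> int:
    """Return the byte offset of the first ']' that is a whole character, or -1."""
    decoder = codecs.getincrementaldecoder(charset)()
    for offset in range(len(line)):
        # The decoder returns "" while it is holding a lead byte, and the
        # full character once the trail byte arrives.
        ch = decoder.decode(line[offset:offset + 1])
        if ch == "]":
            return offset
    return -1

# "[<95><5d>]" in CP932: the real closing bracket is the last byte.
print(find_section_end(b"[\x95\x5d]", "cp932"))
```

For the CP932 example the function reports offset 3 (the final byte) instead of offset 2, where a plain byte search would stop.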
Created attachment 978
Committed patch.
Ok, this is the fix I've committed. Please test with Japanese character sets. Jeremy.
I think this is now fixed in SVN. Remember to set : unix charset = "XXXX" as the first entry in your [global] section in the smb.conf if you want to use MB sharenames in that character set. Jeremy.
Created attachment 987
smb.conf for testing
(In reply to comment #13)
> I think this is now fixed in SVN.
Umm..., it seems not to be fixed yet. I checked on Debian GNU/Linux 3.0 on x86. Endian issue? The attachment is my testing smb.conf.
Ok, I've checked on Fedora Core 3 and the problem seems to be that the C library doesn't recognise a locale of CP932. The new code in Samba correctly recognises the 0x955d character as one character with this smb.conf. (I checked by putting a breakpoint on FindSectionEnd in param/param.c: the first breakpoint is triggered by [global], the second by [0x955d]. When I look at where the code thinks the end of the section is, it correctly finds the second ']' character in the ascii stream, meaning it knows the 0x955d is one character.) I can't display it using smbclient or by looking at the smb.conf using gedit, as the C library on Linux complains with "Locale not supported by C library" when I set it to "cp932". How are you testing this ? Jeremy.
The new code in param/param.c is definitely parsing the 0x955d as one character. I've tried to set the code page on a Win2k3 box to 932 here at connectathon but it's failing with "invalid code page" (I'm guessing as it's not a Japanese version of Windows). I'm going to need some help debugging this if it's not displaying right on a Windows client. I know the sharename is being set correctly to 0x955d in Samba when it parses the share. Jeremy.
Ok - I checked on the wire between smbclient and the latest svn code with this smb.conf - we *are* returning 0x955d as the sharename. Also :-) I was surprised (but pleased :-) to see that smbclient *WAS* displaying the correct Japanese SJIS character 0x955d (I looked it up on the web). Then I realised that smbclient converts from DOS charset to UNIX charset when reading from the RAP call, then from UNIX charset (cp932 in this case) to *display* charset (utf8 on Fedora Core 3) when printing the string ! So it was correct. I'm still convinced this bug is fixed. Jeremy.
(In reply to comment #17)
> I'm still convinced this bug is fixed.
> Jeremy.
I'm sorry, this is my mistake. I forgot "make install" and the SVN version of Samba was not installed. Now I have checked the correct SVN version and find this bug is fixed. Sorry again.
> I've tried to set the code page on a Win2k3 box to 932 here at connectathon but
> it's failing with "invalid code page" (I'm guessing as it's not a Japanese
> version of Windows).
If you have MSDN, you can set code page 932 (or other code pages) by installing MUI (Multilingual User Interface) on the English version of Windows.
> Also :-) I was surprised
> (but pleased :-) to see that smbclient *WAS* displaying the correct Japanese
> SJIS character 0x955d (I looked it up on the web). Then I realised that
> smbclient converts from DOS charset to UNIX charset when reading from the RAP
> call, then from UNIX charset (cp932 in this case) to *display* charset (utf8 on
> Fedora Core 3) when printing the string ! So it was correct.
Thanks to dprintf() and the i18n features of Samba 3.0, the client commands display Japanese characters correctly as far as I have tested :-) Thanks.
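The conversion chain described in this comment can be imitated with Python codecs. This is only an illustration: the codecs stand in for Samba's charset conversion, and the variable names are made up for the example.

```python
# Illustration only: Python codecs stand in for Samba's charset conversion.
dos_charset = "cp932"
unix_charset = "cp932"
display_charset = "utf-8"

wire_name = b"\x95\x5d"  # share name bytes as returned by the server

# DOS charset -> UNIX charset (both CP932 in this test setup)
unix_name = wire_name.decode(dos_charset).encode(unix_charset)
# UNIX charset -> display charset (UTF-8 on the terminal)
display_name = unix_name.decode(unix_charset).encode(display_charset)

print(unix_name.hex(), display_name.hex())
```

The UTF-8 form no longer contains a 0x5D byte, which is why the character shows up correctly on a UTF-8 terminal even though the smb.conf is in CP932.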
Sorry for the same, cleaning up the database to prevent unnecessary reopens of bugs.