Bug 2040 - Samba fails with some encodings having odd case rules
Status: CLOSED FIXED
Product: Samba 3.0
Classification: Unclassified
Component: Extended Characters
Version: 3.0.7
Hardware: All Linux
Importance: P3 normal
Target Milestone: none
Assigned To: Björn Jacke
QA Contact: Samba QA Contact
Reported: 2004-11-15 20:58 UTC by Recai Oktas
Modified: 2006-02-01 10:51 UTC (History)


Attachments
Patch to fix the failure of odd case conversions (8.08 KB, patch)
2004-11-15 20:59 UTC, Recai Oktas
The correct patch (8.03 KB, patch)
2004-11-16 05:24 UTC, Recai Oktas
Revised patch (7.94 KB, patch)
2004-11-17 16:49 UTC, Recai Oktas
Test script to validate the patch (2.24 KB, text/plain)
2004-11-17 16:54 UTC, Recai Oktas
Output of the test script for the unpatched (current) case (903 bytes, text/plain)
2004-11-17 17:00 UTC, Recai Oktas
Output of the test script for the patched case (899 bytes, text/plain)
2004-11-17 17:01 UTC, Recai Oktas
locale fix for ASCII compat string functions (437 bytes, patch)
2004-12-05 09:32 UTC, Björn Jacke

Description Recai Oktas 2004-11-15 20:58:46 UTC
We have a problem with Samba's character case conversions under the Turkish
locales tr_TR (ISO-8859-9) and tr_TR.UTF-8.  Turkish has an odd property wrt
case conversions: the ASCII characters 'i' and 'I' map to non-ASCII characters
(multi-byte ones under UTF-8) when their case is changed.  This problem is so
common that I'll leave the explanation to i18nguy:

    http://www.i18nguy.com/unicode/turkish-i18n.html

As a result of this oddity, Samba (1) totally fails under tr_TR.ISO-8859-9,
even with a service name that contains neither i/I nor any special Turkish
character, since the built-in service name IPC$ has an 'I'; and (2) almost
totally fails under tr_TR.UTF-8 when using a service name with i/I chars.
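
To make the failure concrete, here is a minimal Python sketch of Turkic lowercasing applied to the built-in service name.  The `turkic_lower` helper and its mapping table are illustrative assumptions (Python's own `str.lower()` is locale-independent); this is not Samba code.

```python
# Illustrative only: a hand-written Turkic lowercase mapping, not Samba's tables.
TURKIC_LOWER = {"I": "\u0131", "\u0130": "i"}  # I -> dotless i, I-with-dot -> i


def turkic_lower(s: str) -> str:
    """Lowercase s using Turkish/Azeri rules for the i/I pairs."""
    return "".join(TURKIC_LOWER.get(ch, ch.lower()) for ch in s)


service = "IPC$"
lowered = turkic_lower(service)

print(lowered == "ipc$")            # False: a lookup for "ipc$" would miss
print(lowered.encode("iso8859_9"))  # dotless i becomes one non-ASCII byte
print(lowered.encode("utf-8"))      # dotless i becomes a two-byte sequence
```

The lowered name no longer equals the ASCII "ipc$" in either encoding, which is consistent with the failure being visible even when the configured service names contain no i/I at all.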

I've successfully tested the attached patch for Turkish.  Permanently changing
{upcase,lowcase}.dat for Turkish is not a solution, so I'm using a dynamic
scheme with a hook mechanism that is triggered under a Turkic language
environment (for now, Turkish and Azeri).  The patch provides a somewhat generic
mechanism for handling this type of oddity, so I think it may be useful for
other languages with similar requirements.
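
The hook idea can be sketched roughly as follows in Python.  This is only an illustration of layering language-specific overrides on top of unchanged default case rules; `is_turkic_locale`, `make_case_hooks` and the override tables are hypothetical names and do not come from the attached patch.

```python
# Hypothetical names throughout; this only mirrors the idea described above.
TURKIC_LANGS = {"tr", "az"}  # Turkish and Azeri, per the description

# Overrides applied on top of the default (unchanged) case rules.
TURKIC_UPPER = {"i": "\u0130", "\u0131": "I"}
TURKIC_LOWER = {"I": "\u0131", "\u0130": "i"}


def is_turkic_locale(locale_name: str) -> bool:
    """Return True for names such as 'tr_TR.UTF-8' or 'az_AZ'."""
    lang = locale_name.split(".")[0].split("_")[0].lower()
    return lang in TURKIC_LANGS


def make_case_hooks(locale_name: str):
    """Build (upper, lower) functions; defaults unless the locale is Turkic."""
    turkic = is_turkic_locale(locale_name)
    up_over = TURKIC_UPPER if turkic else {}
    lo_over = TURKIC_LOWER if turkic else {}

    def upper(s: str) -> str:
        return "".join(up_over.get(c, c.upper()) for c in s)

    def lower(s: str) -> str:
        return "".join(lo_over.get(c, c.lower()) for c in s)

    return upper, lower


upper, lower = make_case_hooks("tr_TR.UTF-8")
print(upper("fooibar"), lower("IPC$"))
```

Keeping the overrides separate from the default rules means that non-Turkic environments see no change in behaviour.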

As part of the corrections, I've also fixed some wrong assumptions in certain
string functions.  These changes may have a small negative impact on the
optimizations in the string functions, though I think it can be tolerated.
However, if you prefer to keep the optimization level, we could implement a set
of string wrappers as function pointers which would be frobnicated on startup
according to locale.  (I've prototyped and tested such code, but I believe it
would complicate things.)
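
A rough Python analogue of such startup-bound wrappers might look like this.  All names (`StringOps`, `init_string_ops`, `default_lower`, `turkic_lower`) are invented for illustration, and the prototype mentioned above may well be structured differently.

```python
# Hypothetical sketch: every name here is made up for illustration.

def default_lower(s: str) -> str:
    """Fast path used when no language-specific rules apply."""
    return s.lower()


def turkic_lower(s: str) -> str:
    """Slow path: handles the Turkic I / dotless-i pairs first."""
    table = {"I": "\u0131", "\u0130": "i"}
    return "".join(table.get(c, c.lower()) for c in s)


class StringOps:
    """Holds the 'function pointers'; bound once, then called everywhere."""
    lower = staticmethod(default_lower)


def init_string_ops(locale_name: str) -> None:
    """Call once at startup, before any string processing."""
    if locale_name.lower().startswith(("tr_", "az_")):
        StringOps.lower = staticmethod(turkic_lower)
    else:
        StringOps.lower = staticmethod(default_lower)


init_string_ops("tr_TR.UTF-8")
print(StringOps.lower("IPC$"))
```

Callers always go through `StringOps.lower`, so the per-call cost of checking the locale disappears, at the price of one more level of indirection to maintain.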

Let me know if you need extra information.  I would also be glad for suggestions.

P.S. Hope I've selected the right component for the bug report.
Comment 1 Recai Oktas 2004-11-15 20:59:52 UTC
Created attachment 775
Patch to fix the failure of odd case conversions
Comment 2 Björn Jacke 2004-11-16 04:02:09 UTC
the question is: should we do a setlocale(LC_ALL, "") at all, or should we force
an LC_ALL setlocale call for en_US.UTF-8, which enforces ASCII-compatible case
conversion? At which points do we need special locale-dependent case
conversions, which for example convert "I" to "dotless i"?
Comment 3 Recai Oktas 2004-11-16 05:19:41 UTC
First of all, I apologize for having sent the wrong patch (I've been too tired
these days).  Attached is the correct one, which also includes support for a new
locale: tr_CY[1] (Turkish locale for Cyprus).  Please ignore the first patch and
consider using the new one.

Regarding your question, please note that, as far as I observed during my tests,
charcnv.c issues a setlocale(LC_ALL, "") whenever switching to multi-byte mode
anyway.  This happens for the Turkish service names.  As for the latter part of
your question, I have traced this I->'dotless i' conversion to smbd/service.c,
make_connection-->str_lower_m(service), activated for reply_tconn_X by an
smbclient call.  For the ISO-8859-9 encoding this line simply evaluates to
"[dotless i]pc$" for the built-in IPC$.

This patch just suggests a fix for the string processing infrastructure for
Turkish-like languages.  It won't totally fix the "Turkish chars in service
names" problem, since some other things also need to change on the client
side.  The problem with 'i/I' will not be completely resolved, but the patch
will somewhat improve the situation.

For example:  
  service name --> FOO<I with dot above>BAR

  This works:
    smbclient //<netbios_name>/<service name written as it is> 
  
  But this doesn't work even after applying the patch:
    smbclient //<netbios_name>/<service name written as lower case: fooibar>

I've prepared another patch against libsmb/cliconnect.c and will submit a
separate bug report for this issue (or should I?), though I'm not so sure it
fits the rules for service names in NETBIOS protocol.

[1] tr_CY is a newly introduced locale; the relevant bugzilla entry can be
found here:

    http://sources.redhat.com/bugzilla/show_bug.cgi?id=531
Comment 4 Recai Oktas 2004-11-16 05:21:23 UTC
Comment on attachment 775
Patch to fix the failure of odd case conversions

Obsolete the wrong patch
Comment 5 Recai Oktas 2004-11-16 05:24:34 UTC
Created attachment 776
The correct patch

The correct patch which also incorporates support for tr_CY locale.
Comment 6 Björn Jacke 2004-11-16 05:41:01 UTC
how do Turkish Windows 2k/XP versions react when you have, for example, a share
called "fooibar"? Can it be reached by another Turkish Windows 2k/XP under the
name "FOOIBAR" and/or under the name "FOO<I with dot above>BAR"?
Comment 7 Recai Oktas 2004-11-16 05:52:38 UTC
Sorry, I don't use Windows/XP and don't have a Windows box at the moment.  I've
used Linux clients during the tests performed here.  Will try this later.

Some other notes... setlocale(LC_ALL, "") in the patch is only needed to
determine the so-called 'lang_speciality'.  We could temporarily switch to the
native locale and, after the lang_speciality has been determined, restore the
prior locale state.  But aside from the string operations (tolower/toupper),
we need to work in the native locale for some unusual operations, for example
when falling back to the lame case table creation in load_case_tables().

Though not evident at first, my patch also fixes another minor bug.  In the
current code base, load_case_tables() is called before the globals
initialization.  As a result, lp_use_mmap() always evaluates to False since the
Globals.bUseMmap flag has not been set, hence mmap is not utilized.
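
The ordering problem can be shown with a toy Python model.  The names echo the ones in the comment above, but the bodies are invented, and what `load_case_tables()` does when mmap is unavailable is an assumption here.

```python
# Toy model of the ordering problem; the function bodies are invented.

class Globals:
    bUseMmap = False  # value seen before the globals are initialized


def lp_use_mmap() -> bool:
    return Globals.bUseMmap


def init_globals() -> None:
    Globals.bUseMmap = True  # assumption: the configured default enables mmap


def load_case_tables() -> str:
    """Report which loading strategy would be chosen."""
    return "mmap" if lp_use_mmap() else "fallback"


# Current order: the tables are loaded before the globals are initialized.
before = load_case_tables()
init_globals()
# Order after the fix: globals first, then the tables.
after = load_case_tables()
print(before, after)
```

The flag is read correctly in both cases; only the moment at which it is read differs, which is why the bug is easy to miss.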
Comment 8 Recai Oktas 2004-11-17 16:46:07 UTC
Comment on attachment 776
The correct patch

Obsoleted by revised patch.
Comment 9 Recai Oktas 2004-11-17 16:49:31 UTC
Created attachment 782
Revised patch

I've revised the patch with some sane changes.
Comment 10 Recai Oktas 2004-11-17 16:54:11 UTC
Created attachment 783
Test script to validate the patch

This script runs 'torture/t_push_ucs2' and 't_strcmp' with some UTF-8
encoded Turkish test input.  You should run it under the tr_TR.UTF-8 locale
and check the results of the 'equal' and 'non-equal' string comparisons.
Comment 11 Recai Oktas 2004-11-17 17:00:16 UTC
Created attachment 784
Output of the test script for the unpatched (current) case

I'm attaching the output of the test script for the unpatched case, for your
convenience.
Comment 12 Recai Oktas 2004-11-17 17:01:06 UTC
Created attachment 785
Output of the test script for the patched case

And here is the one for the patched case.
Comment 13 Björn Jacke 2004-11-18 02:47:38 UTC
to know at which places we should fix things we really first need to know how
Windows reacts. Please try to find out what I wrote in #2.
Comment 14 Recai Oktas 2004-11-18 11:02:26 UTC
Well, I've finally managed to arrange some tests with Windows boxes.  Here are
the test results:
  
  Shared names: fooibar bazIbar foo<I above dot>baz
  Server side: Debian GNU/Linux Sarge with Samba 3.0.7
  Client side: Turkish WinXP and Turkish Win98

Case 1.
  Samba server locale: tr_TR.ISO-8859-9
  Server unix charset: ISO-8859-9
  Shared names were all encoded in ISO-8859-9

Result 1
  Total failure for both clients.  No machine in the network neighborhood.

Case 2
  Samba server locale: tr_TR.UTF-8
  Server unix charset: (left as default, that is, UTF-8)
  Shared names were all encoded in UTF-8

Result 2
  Machine appeared in the network neighborhood.

  WinXP:
    Shared names appeared as follows:
      fooibar             --> fooibar
      bazIbar             --> bazIbar
      foo<I above dot>baz --> foo<I above dot>baz

    Connections:
      fooibar --> failure, couldn't connect.  Logged as 'fooIbar'.
      bazIbar --> success.  Logged as it is.
      foo<I above dot>baz --> success.  Logged as it is.

  Win98:
    Shared names appeared as follows:
      fooibar             --> fooibar
      bazIbar             --> bazIbar
      foo<I above dot>baz --> foo (yes, 'foo')

    Connections:
      fooibar  --> failure, couldn't connect. Shared name logged as 'fooIbar'.
      bazIbar  --> success.  Shared name logged as it is.
      foo<I above dot>baz --> failure.  Shared name logged as 'foo'.

I couldn't repeat the tests for the patched case.  I'll do this in a few days.
Comment 15 Björn Jacke 2004-11-19 07:22:28 UTC
thanks a lot for your tests! There is however another important test we need to do:

Create a share like "fooibar" on a Turkish Windows 2k/XP and try to connect to
the share from another Turkish Windows 2k/XP machine as "FOOIBAR" and as
"fooibar" and see which of them are accessible. This dotted/dotless i is a
nightmare ;-)
Comment 16 Recai Oktas 2004-11-22 21:59:42 UTC
Hi Björn,

I've created a share named 'fooibar' on a WinXP box and attempted to connect to
it from another XP box:

  //host/fooibar OK
  //host/FOOIBAR OK
  //host/foo<I dot above>bar (or FOO<I dot above>BAR) FAILED

This seems to be the same i/<I dot above> issue experienced with DOS filenames:

  mkdir fooibar; dir fooibar --> listed as FOOIBAR
  cd FOOIBAR  --> OK, but cd FOO<I dot above>BAR --> FAILED

  mkdir FOO<I dot above>BAR; cd FOO<I dot above>BAR --> OK

Now, where should we go from here?  Should we create a very minimal patch that
just fixes this i/<I dot above> issue?  (The patch should also address the total
failure under the ISO-8859-9 locale.)
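
The Windows results above are consistent with an ASCII-only, case-insensitive match on share names.  The Python sketch below models that reading; it is an assumption drawn from the observed results, not documented Windows behaviour, and `ascii_upper` and `share_matches` are hypothetical helpers.

```python
import string

# Assumption: share lookup behaves like an ASCII-only, case-insensitive match.
_ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def ascii_upper(s: str) -> str:
    """Uppercase a-z only; every other character is left untouched."""
    return s.translate(_ASCII_UPPER)


def share_matches(requested: str, share: str) -> bool:
    return ascii_upper(requested) == ascii_upper(share)


I_DOT = "\u0130"  # capital I with dot above

print(share_matches("fooibar", "fooibar"))              # OK in the test
print(share_matches("FOOIBAR", "fooibar"))              # OK in the test
print(share_matches("FOO" + I_DOT + "BAR", "fooibar"))  # FAILED in the test
```

Under this model a name containing <I dot above> only matches a share created with that exact character, which also agrees with the mkdir/cd observations.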
Comment 17 Björn Jacke 2004-12-05 09:29:41 UTC
can you please try the attached fix?
Comment 18 Björn Jacke 2004-12-05 09:32:12 UTC
Created attachment 826
locale fix for ASCII compat string functions
Comment 19 Jeremy Allison 2004-12-09 22:41:04 UTC
Ok, I'm going to add the simple locale fix for the next release.
Can someone confirm this is all that is needed for the fix ?
Jeremy.
Comment 20 Recai Oktas 2004-12-11 07:29:22 UTC
Sorry for the late response.  I'll be able to test it in the next week.

At first glance, the patch seems fine to me; it is simple and not as
invasive as my patch.  But please note that it doesn't directly address the
Turkish irregularity in the built-in multi-byte string library of samba, which
my invasive patch targets.  But of course, it should solve the problems
experienced by Turkish users, and there is little chance of hitting another
Turkish-related bug as far as samba's string operations are concerned.

Ok, as I said before, I hope to report the test result in a few days.
Comment 21 Recai Oktas 2004-12-13 05:09:20 UTC
I confirm that this patch works.  I can now access samba shares with 'i/I'
characters.  I've tested it for the worst case, that is, smbd was running under
tr_TR (ISO-8859-9) locale.  It even works with 'idotless' and 'Idotabove'. 
Thanks for your efforts.
Comment 22 Björn Jacke 2004-12-13 05:26:49 UTC
thanks for your tests! Regarding your comment in #20 that we do not address
the i/I irregularity in the mb-functions: yes, that's true, but Windows also
does not address the i/I rules and it is ASCII-compatible, so this is the
easiest and cleanest way to go.
Comment 23 Gerald (Jerry) Carter 2005-08-24 10:19:33 UTC
sorry for the same, cleaning up the database to prevent unnecessary reopens of bugs.
Comment 24 Björn Jacke 2006-01-07 15:30:06 UTC
Recai, can you please test Samba 3.0.22 when it's out and confirm that this dotless i issue did not come up again? A change was made in the code so that we no longer switch the locale to C but use alternative case functions instead. Thanks in advance
Bjoern
Comment 25 Recai Oktas 2006-01-07 19:44:31 UTC
(In reply to comment #24)
> Recai, can you please test Samba 3.0.22 when it's out and confirm that this
> dotless i issue did not come up again? There was done a change in the code so
> that we no longer switch the locale to C but use alternative case functions
> instead. Thanks in advance

Hi Björn,

Thanks for the notice!  Regarding this issue, as your message implies, there has been a regression in the latest versions (i.e. 3.0.20b).  Unfortunately I won't be able to test for some time (due to my workload these days).  But I've contacted one of my friends, and I hope he will deal with the issue.
Comment 26 S.Çağlar Onur 2006-02-01 08:44:30 UTC
(In reply to comment #25)
> (In reply to comment #24)
> > Recai, can you please test Samba 3.0.22 when it's out and confirm that this
> > dotless i issue did not come up again? There was done a change in the code so
> > that we no longer switch the locale to C but use alternative case functions
> > instead. Thanks in advance

Sorry for the long delay; here are some test results under the tr_TR.UTF-8 locale:

caglar@pardus source $ svn info
URL: svn://svnanon.samba.org/samba/branches/SAMBA_3_0/source
Revision: 13042

With these shares, all linux to linux, linux to windows, and windows to linux cases seem to work without a problem.

[paylasim]
comment = Ortak Paylasim Alani
path = /home/samba
read only = no
guest ok = yes
create mask = 0777

[paylaşım]
comment = Ortak Paylasim Alani
path = /home/samba
read only = no
guest ok = yes
create mask = 0777

[çÇöÖşŞiİğĞüÜıI]
comment = Ortak Paylasim Alani
path = /home/samba
read only = no
guest ok = yes
create mask = 0777

but 2 more problems exist. The first one is that smbclient still can't understand utf8 chars;

pardus samba # smbclient -L pardus
Password:
Domain=[PARDUS] OS=[Unix] Server=[Samba 3.0.22pre1-SVN-build-13042]

        Sharename       Type      Comment
        ---------       ----      -------
        paylasim        Disk      Ortak Paylasim Alani
        payla           Disk      Ortak Paylasim Alani
        çÇöÖ        Disk      Ortak Paylasim Alani
        IPC$            IPC       IPC Service (pardus - is istasyonu)
        ADMIN$          IPC       IPC Service (pardus - is istasyonu)
Domain=[PARDUS] OS=[Unix] Server=[Samba 3.0.22pre1-SVN-build-13042]

        Server               Comment
        ---------            -------
        PARDUS               pardus - is istasyonu

and smbmount gives a segfault, at least for me.
Comment 27 Björn Jacke 2006-02-01 10:51:04 UTC
Thanks for the tests! For smbmount/smbfs issues see bug #1920 ...