From: Lars Hellström <lars@abel.math.umu.se>
Sender: Mailing list for the LaTeX3 project <LATEX-L@RELAY.URZ.UNI-HEIDELBERG.DE>
To: Multiple recipients of list LATEX-L
Subject: Re: Multilingual Encodings Summary 2.2
Date: Sat, 12 May 2001 16:40:32 +0100
In-Reply-To: <200105112029.f4BKT3707962@smtp.wanadoo.es>

At 23.24 +0200 01-05-11, Javier Bezos wrote:
>Frank wrote:
>> LaTeX conceptually has only three levels: source, ICR, Output
>
>However, I still think that it's necessary to separate code processing
>from text processing.  Both concepts are mixed up by TeX (and
>therefore LaTeX), making, say, uppercasing a tricky thing.  Remember
>that \uppercase only changes chars in the code, and that
>\MakeUppercase first expands the argument and then applies a set of
>transformations (including math but not including text hidden in
>protected macros!).  Well, since ocp's are applied to text after
>expansion (not including math but including actual text even if
>hidden) we are doing things at the right place and in the right way.

The problem with current Omega is that it only provides text processing via
OCPs, but no code processing. Uppercasing as a stylistic variation is
clearly text processing and appears to be handled well (and in the right
place) by the current Omega. With TeX it is best handled using special
fonts, but current LaTeX has no interface for that, and it would require a
lot of fonts. If uppercasing is done for some other reason, then Omega is no
better than TeX.
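
The difference can be seen already in standard LaTeX, without Omega. A
minimal sketch (the macro name \word is made up for illustration):

```latex
\documentclass{article}
\newcommand{\word}{caf\'e}
\begin{document}
% \uppercase changes only explicit character tokens in its argument;
% the unexpanded macro \word passes through untouched, so this
% still prints "café":
\uppercase{\word}

% \MakeUppercase first expands the argument and then applies the
% case-change tables (which leave \' alone but uppercase the e),
% so this prints "CAFÉ":
\MakeUppercase{\word}
\end{document}
```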

>Another problem is whether input encoding belongs to code transformations
>or text transformations.  Very likely you are right when you say that
>after full expansion it's too late and when reading the source file is
>too early.  An intermediate step seems more sensible, thus making wrong
>the \'e stuff discussed in the recent messages. Another useful addition
>could be an ocp-aware variant of \edef (or a similar device).

Indeed such a device is needed. Ideally it should work in the mouth (so
that it could be used without messing up the kerning).
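
To make the request concrete: \edef expands at definition time, long
before any OCPs run, so an OCP-aware variant would be a genuinely new
device. A sketch of the intended behaviour (\ocpedef is a hypothetical
name, not an existing primitive):

```latex
\def\accented{\'e}
% Plain \edef: \snapshot receives whatever \'e expands to,
% with no OCP translation applied:
\edef\snapshot{\accented}
% Hypothetical OCP-aware variant: the expanded tokens would also be
% pushed through the currently active translation processes, so
% \snapshot would end up holding the translated (e.g. Unicode) text:
%\ocpedef\snapshot{\accented}
```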

>And
>regarding font transformations, they should be handled by fonts, but
>the main problem is that metric information (i.e., tfm) cannot be
>modified from within TeX, except a few parameters; I really wonder
>if allowing more changes, mainly ligatures, is feasible (that
>solution would be better than font ocp's and vf's, I think).

I don't understand this. What kind of font transformations are you
referring to?

>>  my requirement for a usable internal representation is that I can take a
>>  single element of it at any time and it has a well-defined meaning (and a
>>  single one).
>
>Semantically or visually?

I suspect Frank considers meaning to be a semantic concept, not a visual one.

>>> at the LICR level means that the auxiliary files use the Unicode encoding;
>>> if the editor is not a Unicode one these files become unmanageable and
>>> messy.
>>
>> not true. the OICR has to be unicode (or more exactly unique and
>>well-defined
>> in the above sense, can be 20bits for all i care) if Omega ever should
>>go off
>> the ground. but the interface to the external world could apply a
>>well-defined
>> output translation to something else before writing.
>
>:-/ I meant from the user's point of view.  (Perhaps the reply was
>too quick...)  What I mean is that any LaTeX file ("main" or
>auxiliary) should follow the LaTeX syntax in a form closer to the
>"representation" selected by the user (by "representation" I mean
>input encoding and maybe a set of macros).

The problem is that in multilingual documents there may not be a single
such representation---the user can change input encoding just about
anywhere in a document. This is why current LaTeX converts everything to
LICR before it is written to the .aux file: the elements of the input
encoding (as Frank called them above) do not have a single well-defined
meaning. What has been discussed is that one might use some form of
Unicode (most likely UTF-8) in these files instead.
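
A concrete sketch of this conversion in current (8-bit) LaTeX, using
only standard packages:

```latex
\documentclass{article}
\usepackage[latin1]{inputenc}% byte "E9 in the source becomes the LICR \'e
\usepackage[T1]{fontenc}
\begin{document}
% A literal é in the source (byte "E9 under latin1) is turned into the
% encoding-independent internal form \'{e}; that LICR form, not the raw
% byte, is what ends up in the .toc and .aux files:
\section{Café}
\end{document}
```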

>======
>Lars wrote:
>
>> No it wouldn't. If \protect is not \@typeset@protect when \'e is expanded
>> then it will be written to a file as \'e.
>
>Right.  Exactly because of that we should not convert text to Unicode
>at this stage; otherwise we must change the definition depending on
>the file to be read.

We do already change e.g. the \catcode of @ for when .aux files are read.
Changing the input encoding is much more work, but not principally different.
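
This is the familiar kernel mechanism; roughly:

```latex
% What \makeatletter / \makeatother do while .aux, .sty and .cls
% files are being read:
\catcode`\@=11 % @ becomes a letter, so \internal@names are single tokens
% ... material using kernel-internal commands is read here ...
\catcode`\@=12 % @ reverts to an "other" character, as in document text
```

An input-encoding switch around file reading would be analogous in
spirit, though it affects far more characters than just @.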

>We must only move LaTeX code and its context
>information without changing it, so that if it is read correctly in the
>main file, it will be read correctly in the auxiliary file.

I believe one of the main problems for multilinguality in LaTeX today is
that there is no way of recording (or maybe even of determining) the
current context so that this information can be moved around with every
piece of code affected by it. Hence most current commands strive instead to
convert the code to a context-free representation (the LICR) by use of
protected expansion.
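
The protected expansion in question is what the kernel's
\protected@write performs when, say, \label writes to the .aux file.
A minimal sketch (\recordtitle and \storedtitle are made-up names for
illustration):

```latex
\makeatletter
% \protected@write expands its third argument with \protect set so
% that robust commands -- including LICR forms such as \'e -- survive
% as themselves instead of decaying into encoding-specific tokens:
\newcommand{\recordtitle}[1]{%
  \protected@write\@auxout{}{\string\storedtitle{#1}}}
\makeatother
```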

>> But such characters (the Spanish as well as the Hebrew) aren't allowed in
>> names in LaTeX!
>
>But they should be allowed in the future if we want a true
>multilingual environment.

Why? They are not part of any text, but part of the markup!

>> It seems to me that what you are trying to do is to use a modified LaTeX
>> kernel which still does 8-bit input and output (in particular: it encodes
>> every character it puts onto an hlist as an 8-bit quantity) on top of the
>> Omega 16-bit (or whatever it is right now) typesetting engine. Whereas this
>> is more powerful than the current LaTeX in that it can e.g. do
>> language-specific ligature processing without resorting to
>> language-specific fonts, it is no better at handling the problems related
>> to _multilinguality_ because it still cannot handle character sets that
>> span more than one (8-bit) encoding. How would for example the proposed
>> code deal with the (nonsensical but legal) input
>>    a\'{e}\k{e}\cyrya\cyrdje\cyrsacrs\cyrphk\textmu?
>
>I don't understand why you say that.

Because of the example in the summary:
====================================================
          A       B        C        D          E
----------------------------------------------------
TeX   a)  "82     \'e      *    - - - - - >   "E9
      b)  \'e     \'e      *    - - - - - >   "E9
      c)  "82     "82      *    - - - - - >   "82
====================================================
Omega a)  "82     "82     "82     "00E9       "E9
      b)  \'e     \'e     "82     "00E9       "E9

The last line shows \'e being converted to an 8-bit quantity "82
(apparently the input encoding equivalent) before it is converted to
Unicode. LaTeX lives between columns A and C, so there is no hint of any
non-8-bit processing being done.

>In fact I don't understand what you
>say :-) -- it looks very complicated to me. Anyway, it can handle two-byte
>encodings and utf8, and language style files are written using utf8
>(which are directly converted to Unicode without any intermediate
>step).

That's what I would have expected, but the example gives no hint of this
either.

>Regarding the last line, you can escape the current encoding with
>the \unichar macro (which is somewhat tricky, to avoid killing
>ligatures/kerning). As I say in the readme file, applying that trick
>to utf8 didn't work.

Isn't the \char primitive in Omega able to produce arbitrary characters
(at least arbitrary characters in the basic multilingual plane)?
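
For comparison, in 8-bit TeX \char already produces any character of
the current font; in Omega one would expect the same primitive with a
wider range (assuming a Unicode-encoded font is selected):

```latex
\char"E9    % é in an 8-bit (e.g. T1-encoded) font: codes 0-255
\char"00E9  % in Omega: 16-bit codes, so BMP code points such as
            % U+00E9 should be addressable directly
```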

>Actually, this preliminary lambda doesn't convert \'e to é, but to
>e U+0301 (i.e., the corresponding combining char). In the internal
>Unicode step, accents are normalized in this way and then recombined
>by the font ocp. The definition of \' in the la.sd file is very simple:
>
>\DeclareScriptCommand\'[1]{#1\unichar{"0301}}
>
>Very likely, this is one of the parts deserving = improvements.

It looks quite reasonable to me, and it is certainly much better than the
processing depicted in the example. Does this mean that the example should
rather be

    A     B        C          D        E
   \'e   \'e   e^^^^0301   ^^^^00e9   ^^e9

(using the ^^ notation for non-ASCII characters)?

Lars Hellström
