Subject: Re: LaTeX's internal char representation (UTF8 or Unicode?)
Date: Sun, 18 Feb 2001 11:51:51 +0100
From: "Javier Bezos"
To: "Multiple recipients of list LATEX-L"
Reply-To: "Mailing list for the LaTeX3 project"

Hi all,

>  trans A = tokenising 8 bit numbers as the corresponding 16bit numbers
>            Example:
>            if é was in the cp437 code page (German DOS) it would
>            be the 8bit char "82; that would become the 16bit token with
>            number "0082 (which is NOT é in unicode = "00E9)
>
>            if on the other hand é was in latin1 (where it is "E9) we
>            get "00E9
>
>            if the input was in utf8 you would not get unicode chars as
>            the result but sequences of 16bit chars all starting with "00
>            --- so a unicode character that is multibyte in utf8 would
>            not become the correct unicode 16-bit token but would become
>            a sequence of tokens each of the form "00

Well, you answered that :-). UTF-8 is a very interesting case because
chars can be represented with one, two, or more bytes. If you say

\def\xx#1#2{}
\xx aé  % é is the two bytes "C3 "A9 in utf8

then \xx doesn't understand it and gets #1 -> a, and #2 -> the single
byte "C3, which is wrong. If we preprocess the document to convert
utf8 to 16-bit chars, then things will work.
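For readers following along, the active-character workaround alluded to in this thread can be sketched as a toy fragment for 8-bit TeX; only the single pair "C3 "A9 is handled here, and the real machinery (what inputenc's utf8 support later does) is far more general:

```latex
% Toy sketch, 8-bit TeX: make the UTF-8 lead byte "C3 active and let
% it pick up its continuation byte at expansion time.
\catcode`\^^c3=\active
\def^^c3#1{%
  \ifx#1^^a4\"a%  "C3 "A4 is a-umlaut (Unicode "00E4)
  \else\ifx#1^^a9\'e%  "C3 "A9 is e-acute (Unicode "00E9)
  \else \errmessage{unsupported UTF-8 sequence}%
  \fi\fi}
```

Note that this only makes the pair print correctly: \xx aé above still grabs the bare active byte "C3 as #2, so the argument-splitting problem is untouched, which is part of the case against active chars made below.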
Preprocessing is also important if we want to give the possibility
of writing macro names with non-ASCII chars, like Spanish \capítulo.
This feature may seem mostly useless in Latin scripts (I could
write \capitulo), but in other scripts it is essential, particularly
those not written from left to right. Making non-ASCII chars
active means that we cannot provide this feature.

However, this kind of preprocessing poses some problems which I've
not solved yet (but I'm close, or so I hope). You provide an
example in a subsequent message:

[...]
>     % the following fails (not surprisingly)
>     % and can't be corrected later on
>
>     \def\foo{ab
>     \InputTranslation currentfile\OCPa
>     cä}
>     \show\foo
>
>
>  the second \foo will now contain the tokens
>
>    \foo=macro:
>     ->ab \InputTranslation currentfile\OCPa c^^c3^^a4.
>
[..]
>
>  Since we have been asked to provide input encoding changes for LaTeX
>  within paragraphs, eg for individual words, something like this would
>  happen if such a change appears, say, inside the argument of \section.

A system to coordinate the preprocess and the "internal" process is
necessary.

> Discussion:
> ===========
>
> The problem really is transforms of type D which are using OICR1 and
> are thus likely to break in the sense that their encoding information
> is lost in the process.

Not a problem, really. Currently the meaning of \'e depends on the
font encoding, whose `state' is always available. The same idea
can be applied to input encodings so that the encoding information
is always available.

>  bla bla \french{foo} bla bla
>
> the input encoding change would not be noticed until after "foo" has
> already been tokenised (incorrectly) --- yes, we know that this example
> could be made to work using Don's \footnote trick but as with LaTeX's
> \verb there will be situations in which even more elaborate
> implementations will still fail due to tokenisation happening before
> any macro expansion is possible.
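For readers unfamiliar with the Omega primitives quoted above, a minimal sketch of how an input translation is attached may help (the ocp file name inlat1 is made up for illustration):

```latex
% Sketch of the Omega machinery discussed above: an ocp (Omega
% Compiled Process) maps incoming bytes to 16-bit chars as the
% file is read.
\ocp\OCPlat=inlat1           % hypothetical ocp: latin1 bytes -> Unicode
\InputTranslation currentfile \OCPlat
% From this point on, the byte "E9 is tokenised as "00E9 (e-acute);
% but, as the quoted \foo example shows, material tokenised *before*
% this point (e.g. earlier in the same macro body) is unaffected.
```

This is exactly the timing problem under discussion: the translation acts at reading time, while macro-based encoding switches act only at expansion time.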
For instance

\def\bar{bla bla \french{foo} bla bla}

doesn't work. However, by using ocp's it will work.

And finally a little table:

====================================================
          A       B        C       D         E
----------------------------------------------------
TeX   a)  "82     \'e      *   - - - - - >   "E9
      b)  \'e     \'e      *   - - - - - >   "E9
      c)  "82     "82      *   - - - - - >   "82
====================================================
Omega a)  "82     "82      "82     "00E9     "E9
      b)  \'e     \'e      "82     "00E9     "E9

A: the source file
B: in a) and b), the LICR, as created by \protected@edef or
   \protected@write; except in c), where inputenc is not used
   because the font encoding is the same as the input encoding.
C: evaluation, with fully expanded tokens. In TeX this is the
   final step (= step E in Omega) with the font codes.
D: after input encoding translation.
E: after font encoding translation and the final step in Omega.

Cheers
Javier