Subject: Re: LaTeX's internal char representation (UTF8 or Unicode?)
Date: Sun, 18 Feb 2001 11:51:51 +0100
From: "Javier Bezos"
To: "Multiple recipients of list LATEX-L"
Reply-To: "Mailing list for the LaTeX3 project"

Hi all,

>  trans A = tokenising 8 bit numbers as the corresponding 16bit numbers
>            Example:
>            if é was in the cp437 code page (German DOS) it would
>            be the 8bit char "82; that would become the 16bit token with
>            number "0082 (which is NOT é in unicode = "00E9)
>
>            if on the other hand é was in latin1 (where it is "E9) we
>            get "00E9
>
>            if the input was in utf8 you would not get unicode chars as
>            the result but sequences of 16bit chars all starting with "00
>            --- so a unicode character that is multibyte in utf8 would
>            not become the correct unicode 16-bit token but would become
>            a sequence of tokens each of the form "00

Well, you answered that :-). UTF-8 is a very interesting case because
chars can be represented with one, two, or more bytes. If you say

\def\xx#1#2{}
\xx aé  % é is the two bytes "C3 "A9 in utf8

then \xx doesn't understand it and gets #1 -> a, and #2 -> the single
byte "C3, which is wrong. If we preprocess the document to convert
utf8 to 16-bit chars, then things will work.
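For readers following along, the active-character workaround alluded to in this thread can be sketched as a toy fragment for 8-bit TeX; only the single pair "C3 "A9 is handled here, and the real machinery (what inputenc's utf8 support later does) is far more general:

```latex
% Toy sketch, 8-bit TeX: make the UTF-8 lead byte "C3 active and let
% it pick up its continuation byte at expansion time.
\catcode`\^^c3=\active
\def^^c3#1{%
  \ifx#1^^a4\"a%  "C3 "A4 is a-umlaut (Unicode "00E4)
  \else\ifx#1^^a9\'e%  "C3 "A9 is e-acute (Unicode "00E9)
  \else \errmessage{unsupported UTF-8 sequence}%
  \fi\fi}
```

Note that this only makes the pair print correctly: \xx aé above still grabs the bare active byte "C3 as #2, so the argument-splitting problem is untouched, which is part of the case against active chars made below.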
Preprocessing is also important if we want to give the possibility
of writing macro names with non-ASCII chars, like Spanish \capítulo.
This feature may seem mostly useless in Latin scripts (I could
write \capitulo), but in other scripts it is essential, particularly
those not written from left to right. Making non-ASCII chars
active means that we cannot provide this feature.

However, this kind of preprocessing poses some problems which I've
not solved yet (but I'm close, or so I hope). You provide an
example in a subsequent message:

[...]
>     % the following fails (not surprisingly)
>     % and can't be corrected later on
>
>     \def\foo{ab
>     \InputTranslation currentfile\OCPa
>     cä}
>     \show\foo
>
>
>  the second \foo will now contain the tokens
>
>    \foo=macro:
>     ->ab \InputTranslation currentfile\OCPa c^^c3^^a4.
>
[..]
>
>  Since we have been asked to provide input encoding changes for LaTeX
>  within paragraphs, eg for individual words, something like this would
>  happen if such a change appears, say, inside the argument of \section.

A system to coordinate the preprocess and the "internal" process is
necessary.

> Discussion:
> ===========
>
> The problem really is transforms of type D which are using OICR1 and
> are thus likely to break in the sense that their encoding information
> is lost in the process.

Not a problem, really. Currently the meaning of \'e depends on the
font encoding, whose `state' is always available. The same idea
can be applied to input encodings so that the encoding information
is always available.

>  bla bla \french{foo} bla bla
>
> the input encoding change would not be noticed until after "foo" has
> already been tokenised (incorrectly) --- yes, we know that this example
> could be made to work using Don's \footnote trick but as with LaTeX's
> \verb there will be situations in which even more elaborate
> implementations will still fail due to tokenisation happening before
> any macro expansion is possible.
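For readers unfamiliar with the Omega primitives quoted above, a minimal sketch of how an input translation is attached may help (the ocp file name inlat1 is made up for illustration):

```latex
% Sketch of the Omega machinery discussed above: an ocp (Omega
% Compiled Process) maps incoming bytes to 16-bit chars as the
% file is read.
\ocp\OCPlat=inlat1           % hypothetical ocp: latin1 bytes -> Unicode
\InputTranslation currentfile \OCPlat
% From this point on, the byte "E9 is tokenised as "00E9 (e-acute);
% but, as the quoted \foo example shows, material tokenised *before*
% this point (e.g. earlier in the same macro body) is unaffected.
```

This is exactly the timing problem under discussion: the translation acts at reading time, while macro-based encoding switches act only at expansion time.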
For instance

\def\bar{bla bla \french{foo} bla bla}

doesn't work. However, by using ocp's it will work.

And finally a little table:

====================================================
          A       B        C       D         E
----------------------------------------------------
TeX   a)  "82     \'e      *   - - - - - >   "E9
      b)  \'e     \'e      *   - - - - - >   "E9
      c)  "82     "82      *   - - - - - >   "82
====================================================
Omega a)  "82     "82      "82     "00E9     "E9
      b)  \'e     \'e      "82     "00E9     "E9

A: the source file
B: in a) and b), the LICR, as created by \protected@edef or
   \protected@write; except in c), where inputenc is not used
   because the font encoding is the same as the input encoding.
C: evaluation, with fully expanded tokens. In TeX this is the
   final step (= step E in Omega) with the font codes.
D: after input encoding translation.
E: after font encoding translation and the final step in Omega.

Cheers
Javier