MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C098FE.4E59D880"
In-Reply-To:  <14990.31415.781462.208072@istrati.zdv.uni-mainz.de>
References: <200102122049.f1CKnvi13875@smtp.wanadoo.es>            <200102122049.f1CKnvi13875@smtp.wanadoo.es>
Content-class: urn:content-classes:message
Subject:      Re: LaTeX's internal char representation (UTF8 or Unicode?)
Date: Sat, 17 Feb 2001 17:24:45 +0100
Message-ID:  <v03110700b6b44d211cd3@[195.100.226.146]>
From: "Hans Aberg" <haberg@MATEMATIK.SU.SE>
Sender: "Mailing list for the LaTeX3 project" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
To: "Multiple recipients of list LATEX-L" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
Reply-To: "Mailing list for the LaTeX3 project" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
Status: R

This is a multi-part message in MIME format.

------_=_NextPart_001_01C098FE.4E59D880
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

At 14:20 +0100 2001/02/17, Frank Mittelbach wrote:

There appear to be two variations: one based on the original TeX, and one
based on TeX with some kind of extensions.

As for the second approach, it seems to me that the internal representation
should be 32-bit Unicode. As TeX does not seem well equipped to handle
encoding issues, one should then hook up a preprocessor providing the
suitable translations. Thus
    whatever encoding -> preprocessor -> UTeX
This easy-to-write preprocessor can merge combining characters into single
Unicode characters where possible, and otherwise write them in a form that
UTeX can easily handle, say by switching from postfix to prefix notation.
With further tweaking of the TeX engine it could even turn TeX combinations
such as "--" and "---" into single Unicode characters.
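
As a rough illustration of such a preprocessor, here is a minimal Python
sketch. It assumes that "combining where possible" can be approximated by
Unicode NFC normalization; the function name preprocess and the ligatures
flag are made up for this example, and the dash folding (which the text
above assigns to the engine) is included only to show the idea.

```python
import unicodedata

def preprocess(raw: bytes, encoding: str, ligatures: bool = False) -> str:
    """Decode input bytes and compose combining sequences where possible."""
    # "whatever encoding -> preprocessor": the caller states the source encoding.
    text = raw.decode(encoding)
    # NFC merges base + combining mark into one code point when a
    # precomposed form exists; other sequences are left as they are.
    text = unicodedata.normalize("NFC", text)
    if ligatures:
        # Assumption for illustration: map TeX dash ligatures to Unicode
        # dashes, longest match first so "---" is not split into "--" + "-".
        text = text.replace("---", "\u2014").replace("--", "\u2013")
    return text
```

For instance, the UTF-8 bytes for "e" followed by a combining acute accent
come out as the single character U+00E9, while a sequence with no
precomposed form (say "q" plus the same accent) is passed through unchanged.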

It is also easy to write such translators for reading from and writing to
files.
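
A file translator of this kind could look roughly like the following
Python sketch. The name transcode_stream is hypothetical, and the sketch
assumes both encodings are known to Python's codec machinery; it works on
any binary file-like objects, so in-memory buffers are used for the demo.

```python
import io

def transcode_stream(src, dst, src_encoding, dst_encoding="utf-8"):
    """Copy text from binary stream src to binary stream dst, re-encoding it."""
    # newline="" switches off newline translation in both directions.
    reader = io.TextIOWrapper(src, encoding=src_encoding, newline="")
    writer = io.TextIOWrapper(dst, encoding=dst_encoding, newline="")
    for line in reader:
        writer.write(line)
    writer.flush()
    # Detach so the underlying streams stay open for the caller.
    reader.detach()
    writer.detach()

source = io.BytesIO("caf\u00e9\n".encode("latin-1"))
target = io.BytesIO()
transcode_stream(source, target, "latin-1", "utf-8")
```

The same function applies to real files opened in binary mode ("rb" and
"wb").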

Thus the picture becomes

                         - trans C (eg Uppercasing)
                         |      |
                         |      | Hardwired translations
                         V      |
 whatever --> decode --> Unicode ------->  trans B  --> ^^e9
                         ^      |
                         |      |
                         |      |
                       files: Choice of say utf8, Unicode
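
To make the picture a little more concrete, here is a small Python sketch
of the "trans C" and "trans B" stages. Everything in it is illustrative:
the table contents, the function names, and the simplified OT1 fallback
(which emits \'{e} rather than the \OT1\'{e} form quoted below) are
assumptions, and only the mapping of U+00E9 to ^^e9 for T1 is taken from
the thread.

```python
import unicodedata

# Hypothetical, deliberately tiny tables.
T1_SLOTS = {"\u00e9": 0xE9}          # Unicode char -> T1 font slot
ACCENT_MACROS = {"\u0301": "\\'"}    # combining acute -> TeX accent macro

def trans_c_uppercase(text):
    """trans C: an operation carried out purely on Unicode text."""
    return text.upper()

def trans_b(ch, font_encoding):
    """trans B: map one Unicode character to something the font can set."""
    if font_encoding == "T1" and ch in T1_SLOTS:
        return "^^%02x" % T1_SLOTS[ch]
    # Fallback: decompose into base letter + accent and emit an accent macro.
    decomposed = unicodedata.normalize("NFD", ch)
    if len(decomposed) == 2 and decomposed[1] in ACCENT_MACROS:
        return "%s{%s}" % (ACCENT_MACROS[decomposed[1]], decomposed[0])
    return ch
```

With these tables, trans_b("\u00e9", "T1") yields the string ^^e9 and
trans_b("\u00e9", "OT1") yields \'{e}.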

> > Omega can represent internally non ascii chars and hence
> > actual chars are used instead of macros (with a few exceptions).
> > Trivial as it can seem, this difference is in fact a HUGE
> > difference. For example, the path followed by é will be:
> >
> >  é --an encoding ocp-|           |-- T1 font ocp-->  ^^e9
> >                      +-> U+00E9 -+
> >  \'e -fontenc (!)----|           |- OT1 font ocp -> \OT1\'{e}

Thus, one no longer bothers to represent ASCII or any other such encoding
internally. Combinations such as \'e are no longer necessary, except for
backwards compatibility. It will probably not make much difference what
you do with those combinations: documents that need absolute compatibility
for archival purposes can use the old TeX, while newer documents can be
translated when needed. (That is, this will work if the
incompatibilities are not too big.)
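
Translating such combinations in an existing document could be sketched
along these lines in Python. This is only an assumption about how a
translator might work: it covers five common accent macros applied to a
single ASCII letter, ignores everything else (for example \i or nested
groups), and the name translate_accents is invented for the example.

```python
import re
import unicodedata

# TeX accent macro character -> Unicode combining mark.
COMBINING = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    '"': "\u0308",  # diaeresis
    "^": "\u0302",  # circumflex
    "~": "\u0303",  # tilde
}

# Matches \'e as well as \'{e}; group 2 or group 3 holds the base letter.
ACCENT_RE = re.compile(r"""\\(['`"^~])\s*(?:\{([A-Za-z])\}|([A-Za-z]))""")

def translate_accents(source):
    """Replace simple accent macros by precomposed Unicode characters."""
    def repl(match):
        base = match.group(2) or match.group(3)
        return unicodedata.normalize("NFC", base + COMBINING[match.group(1)])
    return ACCENT_RE.sub(repl, source)
```

Applied to the input caf\'e this returns the word with a single U+00E9 at
the end; input it does not recognise is left untouched.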

> trans A = tokenising 8 bit numbers as the corresponding 16bit numbers
>           Example:
>           if é was in the cp437 code page (German DOS) it would
>           be the 8bit char "82; that would become the 16bit token with
>           number "0082 (which is NOT é in unicode = "00E9)
>
>           if on the other hand  é was in latin1 (where it is "E9) we
>           get "00E9

Such problems would not happen within UTeX, as the preprocessor would
provide the proper translation to \U000000E9.
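
The difference between the two behaviours can be checked with a few lines
of Python. The helper name widen is made up; it stands for "trans A" as
quoted above, i.e. copying each 8-bit value into a wider token unchanged.

```python
raw_cp437 = bytes([0x82])    # é as stored under code page 437
raw_latin1 = bytes([0xE9])   # é as stored under Latin-1

def widen(raw):
    """Naive widening: reuse each byte value as the character number."""
    return "".join(chr(b) for b in raw)

naive = widen(raw_cp437)              # U+0082, a control character, not é
decoded = raw_cp437.decode("cp437")   # U+00E9, what the preprocessor gives
latin1_ok = widen(raw_latin1) == raw_latin1.decode("latin-1")
```

Widening only gives the right answer for Latin-1, whose 256 values
coincide with the first 256 Unicode code points; for any other 8-bit code
page an explicit decoding step is needed.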

  Hans Aberg

------_=_NextPart_001_01C098FE.4E59D880--