MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C099E8.175FF180"
Content-class: urn:content-classes:message
Subject:      Re: LaTeX's internal char prepresentation (UTF8 or Unicode?)
Date: Sun, 18 Feb 2001 21:18:11 +0100
Message-ID:  <v03110701b6b5d46068b2@[195.100.226.135]>
From: "Hans Aberg" <haberg@MATEMATIK.SU.SE>
Sender: "Mailing list for the LaTeX3 project" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
To: "Multiple recipients of list LATEX-L" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
Reply-To: "Mailing list for the LaTeX3 project" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
Status: R

This is a multi-part message in MIME format.

------_=_NextPart_001_01C099E8.175FF180
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

As one input, here is a scheme for "chef" preprocessing (i.e., making
the input palatable before it reaches TeX's mouth, in order to avoid
internal indigestion) that comes to mind:

Every file that UTeX reads is required to carry, in say its first 4
bytes, information about its general encoding, interpretable as ASCII
(padded with spaces), say:
BYTE   -- eight-bit mixed encoding
UT8    -- UTF8
UT16   -- UTF16
U16    -- Unicode 16
U32    -- Unicode 32
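
A minimal sketch of how a preprocessor might read such a header, in
Python. The codec names, the big-endian byte order, the reading of
"Unicode 16"/"Unicode 32" as fixed-width forms, and the names
HEADER_CODECS and read_header are all assumptions made for illustration;
none of them is fixed by the proposal.

```python
# Hypothetical mapping from the 4-byte tag to a Python codec name.
# The codec choices and the big-endian byte order are assumptions.
HEADER_CODECS = {
    b"BYTE": None,         # mixed 8-bit encodings, needs declarations
    b"UT8 ": "utf-8",
    b"UT16": "utf-16-be",
    b"U16 ": "utf-16-be",  # "Unicode 16" approximated by UTF-16
    b"U32 ": "utf-32-be",
}


def read_header(data: bytes):
    """Return (tag, codec, rest) for data starting with a 4-byte tag."""
    tag = data[:4]
    if tag not in HEADER_CODECS:
        raise ValueError(f"unknown encoding tag: {tag!r}")
    return tag.decode("ascii").rstrip(), HEADER_CODECS[tag], data[4:]
```

A codec of None signals the BYTE case, where the declaration lines still
have to be read before anything can be decoded.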

For the last four, no further information is needed, but in the first
case, BYTE, one then has a series of lines, each one indicating an
encoding that might be used and its start sequence. -- It might be
difficult to foresee a suitable start sequence for every possible file,
so one could allow individual choices for each file. It could look like:
ASCII   <as>
Latin-1 <we>
Russian <ru>
...
(Or whatever official names one decides to have for the different =
encodings.)
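
The declaration lines could be turned into a lookup table roughly as
follows. This is only a sketch: the table CODEC_FOR_NAME, the choice of
KOI8-R for "Russian", and the rule that a blank line ends the list are
assumptions, since the proposal leaves the official names and the end of
the list open.

```python
# Hypothetical name -> Python codec table; "Russian" as KOI8-R is a guess.
CODEC_FOR_NAME = {
    "ASCII": "ascii",
    "Latin-1": "latin-1",
    "Russian": "koi8-r",
}


def parse_declarations(lines):
    """Map each start sequence (as bytes) to a codec name.

    Each line is "<name> <start sequence>". A blank line ends the list
    (an assumption; the proposal does not say how the list ends).
    """
    table = {}
    for line in lines:
        if not line.strip():
            break
        name, marker = line.split()
        table[marker.encode("ascii")] = CODEC_FOR_NAME[name]
    return table
```

With the three lines above as input, the result maps b"<as>", b"<we>"
and b"<ru>" to "ascii", "latin-1" and "koi8-r" respectively.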

The preprocessor then zips through the file, looking for the indicated
character combinations, in this case <as> (7 bit), <we> (Western =
European),
<ru> (Russian), etc., and applies the encodings to the characters that
follow. So for example, one would have to write
  <as> ...
  \def\bar{bla bla \french{<we>f=F4=F4<as>} bla bla}
as \french does not handle character encodings anymore, but only other
aspects that might be related to the use of French in TeX.
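
The scanning step described above might be sketched like this. The
function name preprocess, the default starting encoding, and the
decision to drop the start sequences from the output are assumptions;
the proposal only says that the encoding applies to the characters that
follow a start sequence.

```python
import re


def preprocess(body: bytes, markers: dict, initial: str = "ascii") -> str:
    """Decode a mixed 8-bit body to Unicode, switching codec at markers.

    `markers` maps start sequences such as b"<we>" to codec names. The
    start sequences themselves are dropped from the output.
    """
    if not markers:
        return body.decode(initial)
    # Try longer markers first so that one cannot shadow another.
    ordered = sorted(markers, key=len, reverse=True)
    pattern = re.compile(b"|".join(re.escape(m) for m in ordered))
    out, codec, pos = [], initial, 0
    for match in pattern.finditer(body):
        # Text before the marker still belongs to the current codec.
        out.append(body[pos:match.start()].decode(codec))
        codec = markers[match.group()]
        pos = match.end()
    out.append(body[pos:].decode(codec))
    return "".join(out)


table = {b"<as>": "ascii", b"<we>": "latin-1"}
line = b"<as> bla \\french{<we>f\xf4\xf4<as>} bla"
text = preprocess(line, table)
```

Here text comes out as the Unicode string " bla \french{f\u00f4\u00f4}
bla" (written with Python escapes for the two o-circumflex characters),
which is what TeX's mouth would then see.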

It is an inconvenience that one has to do this explicit markup, but it
is only needed if one is mixing encodings. In the example above, ASCII
and Latin-1 agree on the characters in the bottom 7 bits (i.e., these
have the same Unicode translation), so the markup would not have been
needed if one used Latin-1 all the time.
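
The agreement on the bottom 7 bits is easy to check with Python's
standard codecs (the variable names are just for illustration):

```python
# Every 7-bit byte decodes to the same Unicode character under both
# codecs, so pure-ASCII text needs no markup inside a Latin-1 file.
seven_bit = bytes(range(128))
same = seven_bit.decode("ascii") == seven_bit.decode("latin-1")
```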

One spin-off, though, is that it is fairly easy to convert such files to
Unicode, once Unicode editors become available. -- If, in the example
above, one instead requires the <we>...<as> markup to be a part of the
\french command, then it becomes more complicated to translate such a
file to a Unicode format.

  Hans Aberg

------_=_NextPart_001_01C099E8.175FF180--