MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C0D972.C91D1780"
Content-class: urn:content-classes:message
Subject:      Re: Multilingual Encodings Summary 2.2
Date: Thu, 10 May 2001 18:00:26 +0100
Message-ID:  <GD4PWQ$IswLQS5qBE52yfXuc2O7g_3Z0lpTwKV@wanadoo.es>
From: "jbezos" <jbezos@WANADOO.ES>
Sender: "Mailing list for the LaTeX3 project" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
To: "Multiple recipients of list LATEX-L" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
Reply-To: "Mailing list for the LaTeX3 project" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
Status: R

This is a multi-part message in MIME format.

------_=_NextPart_001_01C0D972.C91D1780
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Quick answers to a couple of points. Lars says:

>The comparison in Section 3.2.1 of how characters are processed in TeX =
and
>Omega respectively also seems strange. In Omega case (b), column C, we =
see
>that the LICR character \'e is converted to an 8-bit character "82 =
before
>some OTP converts it to the Unicode character "00E9 in column D. Surely
>this can't be right---whenever LICR is converted to anything it should =
be
>to full Unicode, since we will otherwise end up in an encoding morass =
much
>worse than that in current LaTeX.

Surely it's right :-). Remember that =E9 is not an active character in
Lambda and that OCPs are applied after expansion. Let's consider
the input =E9\'e=E9. It's expanded to the character sequence
"82 "82 "82, which is fine. If we define \'e as "00E9 the expansion
is "82 "00 "E9 "82, which is definitely wrong, since the 16-bit code
is now mixed into an 8-bit stream that the OCP still has to convert.
Further, converting the input to Unicode at the LICR level means that
the auxiliary files use the Unicode encoding; if the editor is not a
Unicode one, these files become unmanageable and messy.
LICR should preserve, IMO, the current LaTeX conventions, and =E9\'e=E9
should be written to these files in exactly that way. In other words,
any file to be read by LaTeX should follow the "external" LaTeX
conventions and be transcoded only in the mouth.
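
To make this concrete, here is a rough model in Python (not Omega
code). It assumes that "82 is e-acute in code page 437, and the names
source, expand and ocp are made up for the illustration:

```python
# Sketch only: cp437 is an assumed reading of "82, and the str item
# "\\'e" is a hypothetical stand-in for the unexpanded macro.

def source():
    # The input: raw 8-bit e-acute, the macro \'e, raw e-acute again.
    return [b"\x82", "\\'e", b"\x82"]

def expand(tokens, definition):
    # Replace the macro (str item) by the bytes it is defined as.
    return b"".join(
        definition if isinstance(t, str) else t for t in tokens
    )

def ocp(stream):
    # Stand-in for an 8-bit -> Unicode OCP applied after expansion.
    return stream.decode("cp437")

print(ascii(ocp(expand(source(), b"\x82"))))      # \'e defined as "82
print(ascii(ocp(expand(source(), b"\x00\xe9"))))  # \'e defined as "00E9
```

The first line printed holds three e-acutes. In the second, the two
bytes of "00E9 are converted once more and come out as something else.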

>As I understand the Omega draft documentation, there can be no more =
than
>one OTP (the \InputTranslation) acting on the input of LaTeX at any =
time
>and that OTP is only meant to handle the basic conversion from the =
external
>encoding (ASCII, latin-1, UTF-8, or whatever) to the internal 32-bit
>Unicode. All this happens way before the input gets tokenized, so there =
is

In fact, \InputEncoding was not intended for that, but only for
"technical" translations that apply to the whole document, such
as one byte -> two bytes or little endian -> big endian. Its main
problem is that it doesn't translate macros:
\def\myE{=C9}
\InputEncoding <an encoding>
=C9\myE

Only the explicit =C9 is transcoded. That can be desirable under
some circumstances, but only if you know in advance which encodings
will be used. More dangerous is the following:

\comenzar{enumeraci=F3n} % Spanish interface with, say, MacRoman
                       % \comenzar means \begin
\InputEncoding <iso hebrew>

\terminar{enumeraci=F3n} % <- that's transcoded using iso hebrew!
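
The effect can be imitated in Python (again only a sketch, not Omega
code). The helper names are made up, and the Python codecs mac_roman
and iso8859_8 stand in for the two input encodings:

```python
# Sketch only: one stored name, read under two input encodings.

def env_name():
    # "enumeraci\u00f3n" as its bytes would appear in a MacRoman file.
    return "enumeraci\u00f3n".encode("mac_roman")

def read_name(raw, codec):
    # Stand-in for the input transcoding; "replace" avoids exceptions
    # on bytes that the codec leaves undefined.
    return raw.decode(codec, "replace")

print(ascii(read_name(env_name(), "mac_roman")))  # as seen by \comenzar
print(ascii(read_name(env_name(), "iso8859_8")))  # as seen by \terminar
```

The two names printed differ, so the end of the environment no longer
matches its beginning.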

Regards
Javier



------_=_NextPart_001_01C0D972.C91D1780--