MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C0D983.4EA8C380"
In-Reply-To:  <GD4PWQ$IswLQS5qBE52yfXuc2O7g_3Z0lpTwKV@wanadoo.es>
Content-class: urn:content-classes:message
Subject:      Re: Multilingual Encodings Summary 2.2
Date: Thu, 10 May 2001 19:59:08 +0100
Message-ID:  <l03102800b7208a488a41@[130.239.137.13]>
From: =?iso-8859-1?Q?Lars_Hellstr=F6m?= <Lars.Hellstrom@MATH.UMU.SE>
Sender: "Mailing list for the LaTeX3 project" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
To: "Multiple recipients of list LATEX-L" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
Reply-To: "Mailing list for the LaTeX3 project" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
Status: R

This is a multi-part message in MIME format.

------_=_NextPart_001_01C0D983.4EA8C380
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

At 19.00 +0200 01-05-10, jbezos wrote:
>Quick answers to a couple of points. Lars says:
>
>>The comparison in Section 3.2.1 of how characters are processed in TeX =
and
>>Omega respectively also seems strange. In Omega case (b), column C, we =
see
>>that the LICR character \'e is converted to an 8-bit character "82 =
before
>>some OTP converts it to the Unicode character "00E9 in column D. =
Surely
>>this can't be right---whenever LICR is converted to anything it should =
be
>>to full Unicode, since we will otherwise end up in an encoding morass =
much
>>worse than that in current LaTeX.
>
>Surely it's right :-). Remember that =C8 is not an active character in
>lambda and that ocp's are applied after expansion. Let's consider
>the input =C8\'e=C8. It's expanded to the character sequence "82 "82 =
"82,
>which is fine. If we define \'e as "00E9 the expansion is "82 "00 "E9
>"82, which is definitely wrong.

Aren't you completely confusing the text file format with the internal
format (OICR or whatever) here? If \'e is defined anything like it is =
today
(via \DeclareTextComposite) it should expand to a category 11 token with
character code "00E9, i.e., the token ^^^^00e9. I see no indication in =
the
docs that Omega would convert such a token to two tokens.

>Further, converting the input to Unicode
>at the LICR level means that the auxiliary files use the Unicode =
encoding;

No it wouldn't. If \protect is not \@typeset@protect when \'e is =
expanded
then it will be written to a file as \'e.
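
A minimal sketch of that behaviour in current LaTeX (the document and
its title are hypothetical; nothing Omega-specific is assumed):

    \documentclass{article}
    \begin{document}
    % \'e is not expanded to a glyph while the title is being
    % written out, so the .aux entry still contains \'e.
    \section{Caf\'e}
    \end{document}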

>if the editor is not a Unicode one these files become unmanageable and =
messy.
>LICR should preserve, IMO, the current LaTeX conventions, and =C8\'e=C8
>should be written to these files in exactly that way.

Well, if you're not naughtily assuming that input encoding equals font
encoding, and hence use inputenc to interpret the non-ASCII-characters
(which I assume were =E9s when you sent them), then the above text is
_currently_ written to the .aux file as \'e\'e\'e, not as =E9\'e=E9.

>Or in other words,
>any file to be read by LaTeX should follow the "external" LaTeX
>conventions and only transcoded in the mouth.
>
>>As I understand the Omega draft documentation, there can be no more =
than
>>one OTP (the \InputTranslation) acting on the input of LaTeX at any =
time
>>and that OTP is only meant to handle the basic conversion from the =
external
>>encoding (ASCII, latin-1, UTF-8, or whatever) to the internal 32-bit
>>Unicode. All this happens way before the input gets tokenized, so =
there is
>
>In fact, \InputEncoding was not intended for that, but only for
>as one byte -> two byte or little endian -> big endian.

Assuming that \InputEncoding is some alias for the \InputTranslation
primitive, that's roughly what I meant; maybe translation from latin-1 =
was
a bit off the target. OTOH you seem to assume below that \InputEncoding
should also handle translations which are just as untechnical!!?

>The main
>problem of it is that it doesn't translate macros:
>\def\myE{ }
>\InputEncoding <an encoding>
> \myE
>
>only the explicit   is transcoded.

Isn't that a bit like saying "the main problem is that changing the
\catcode of @ doesn't change the categories of @ tokens in macros"?
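
A rough illustration of the \catcode analogy, assuming ordinary LaTeX
(the macro name \demo is made up for the example):

    \makeatletter   % @ now has category 11 (letter)
    \def\demo{@}    % the @ stored in \demo is a category 11 token
    \makeatother    % @ typed from here on is category 12 (other)
    \demo           % still expands to the category 11 @ stored above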

> However, that can be desirable
>under some circumstances, but you know in advance which encodings
>will be used.

That this is known sounds like a very dangerous assumption to me.

>More dangerous is the following:
>
>\comenzar{enumeraci=DBn} % Spanish interface with, say, MacRoman
>                       % \comenzar means \begin
>\InputEncoding <iso hebrew>
>
>\terminar{enumeraci=DBn} % <- that's transcoded using iso hebrew!

But such characters (the Spanish as well as the Hebrew) aren't allowed =
in
names in LaTeX!

Lars Hellstr=F6m

------_=_NextPart_001_01C0D983.4EA8C380--