MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C0D983.B9F29580"
In-Reply-To:  <GD4PWQ$IswLQS5qBE52yfXuc2O7g_3Z0lpTwKV@wanadoo.es>
References: <GD4PWQ$IswLQS5qBE52yfXuc2O7g_3Z0lpTwKV@wanadoo.es>
Content-class: urn:content-classes:message
Subject:      Re: Multilingual Encodings Summary 2.2
Date: Thu, 10 May 2001 19:59:31 +0100
Message-ID:  <15098.58643.589025.379865@istrati.zdv.uni-mainz.de>
From: "Frank Mittelbach" <frank.mittelbach@LATEX-PROJECT.ORG>
Sender: "Mailing list for the LaTeX3 project" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
To: "Multiple recipients of list LATEX-L" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
Reply-To: "Mailing list for the LaTeX3 project" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
Status: R

This is a multi-part message in MIME format.

------_=_NextPart_001_01C0D983.B9F29580
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Javier wrote in reply to Lars:

 > Quick answers to a couple of points. Lars says:
 >
 > >The comparison in Section 3.2.1 of how characters are processed in
 > >TeX and Omega respectively also seems strange. In Omega case (b),
 > >column C, we see that the LICR character \'e is converted to an
 > >8-bit character "82 before some OTP converts it to the Unicode
 > >character "00E9 in column D. Surely this can't be right---whenever
 > >LICR is converted to anything it should be to full Unicode, since
 > >we will otherwise end up in an encoding morass much worse than that
 > >in current LaTeX.

in my opinion this whole section is incorrect too, or say at best half
correct (sorry) --- however the real problem is that this whole area
inside Omega is not consistent anywhere at the moment, because the OTPs
have been hooked into the wrong places. this happened because it was
technically easy to open up the code at those points and nearly
impossible to do it elsewhere without rewriting the whole of TeX, and
because the whole method was originally intended for far simpler tasks
and not for the big picture.

conceptually LaTeX has a well-defined ICR (though with a somewhat clumsy
implementation due to the technical limitations of TeX) while at the
current point in time Omega has no such beast.

For LaTeX the line c) in your table simply doesn't exist (it is not
supported code), and the columns do not really make sense for LaTeX:
they only reflect the missing concept of an internal encoding in Omega,
since you are looking at the thing from the current Omega implementation.

LaTeX conceptually has only three levels: source, ICR, Output

and something like step C lives along the way from ICR -> Output, but is
only a technicality which is not of conceptual importance. All the
reasoning about and manipulation of text is done in only one form, which
is the ICR. and step D is in LaTeX the transformation from source to
LICR via inputenc.
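
a minimal sketch of these three levels, assuming a latin1 source file
and the T1 font encoding (both choices are only for illustration; =E9
is the byte "E9 as it appears in this quoted-printable mail):

```latex
\documentclass{article}
% source -> LICR: inputenc maps the input byte "E9 to the LICR
% object \'e (through a \DeclareInputText line in latin1.def)
\usepackage[latin1]{inputenc}
% LICR -> output: fontenc decides how \'e is realised in the font;
% in T1 it is a single glyph, in OT1 it is built with \accent
\usepackage[T1]{fontenc}
\begin{document}
% two different source spellings, one and the same LICR object
caf=E9 caf\'e
\end{document}
```

conceptually everything in between (moving arguments, marks, external
files) deals only with \'e, not with the input byte or the font slot.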

For Omega one would expect that to be the same except that the OICR
would be something like U+00E9 --- but it isn't, as it takes a long
while for text to get to this form (if ever!!!!!).

you can say it differently as follows:

 my requirement for a usable internal representation is that I can take
 a single element of it at any time and it has a well-defined meaning
 (and a single one).

now for the LICR this is the case but for Omega it is (right now) not.


as a result one ends up having to explain all those problems of
misinterpreting the internal forms if you do this or that at a certain
stage (like storing text in a token register and reusing it at some
other point, or never passing it to the hlist builder, where the OTPs
actually execute).


from your second ascii drawing in that section one would get the
impression (for a moment) that Omega has a well-defined OICR which is
U+00E9, but as we know this is unfortunately not the case --- though it
should be!!!
(and to be honest, seeing the word "fontenc" on the left makes me
shudder, though I understand why Javier put it there originally; I think
it is a horrible misinterpretation of what fontenc conceptually does)

 > Surely it's right :-). Remember that =E9 is not an active character
 > in lambda and that ocp's are applied after expansion. Let's consider

but ocp's should work on OICR and not on undefined byte sequences!
like here:

 > the input =E9\'e=E9. It's expanded to the character sequence "82 "82
 > "82, which is fine.

which is not fine, not fine at all
because of this:

 > If we define \'e as "00E9 the expansion is "82 "00 "E9
 > "82, which is definitely wrong. Further, converting the input to
 > Unicode

it is not only the latter that is wrong, the whole thing is wrong

 > at the LICR level means that the auxiliary files use the Unicode
 > encoding; if the editor is not a Unicode one these files become
 > unmanageable and messy.

not true. the OICR has to be Unicode (or more exactly unique and
well-defined in the above sense, it can be 20 bits for all I care) if
Omega is ever to get off the ground. but the interface to the external
world could apply a well-defined output translation to something else
before writing.

that could be utf8, but in fact it could be anything as long as it is
definable and controllable, so that you know what the file ends up
containing and inputting it again turns it right back into the OICR.


 > LICR should preserve, IMO, the current LaTeX conventions, and
 > =E9\'e=E9 should be written to these files in exactly that way.

not sure what you mean by "current LaTeX conventions": the current LaTeX
convention is that external files are always written in a special 7-bit
representation of the LICR (involving things like \IeC). not wonderful
but conceptually clean.
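
for illustration, assuming the latin1 option of inputenc, a heading
typed with 8-bit characters ends up in the .toc file roughly as shown
below (the section and page numbers are made up for the example):

```latex
% in the source file (latin1, 8-bit):
\section{R=E9sum=E9}
% in the .toc file after the run (7-bit only):
\contentsline {section}{\numberline {1}R\IeC {\'e}sum\IeC {\'e}}{1}
```

i.e. pure 7-bit text that reads back into the same LICR whatever input
encoding is in force on the next run.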


 > Or in other words,
 > any file to be read by LaTeX should follow the "external" LaTeX
 > conventions and only transcoded in the mouth.

????

 > >As I understand the Omega draft documentation, there can be no more
 > >than one OTP (the \InputTranslation) acting on the input of LaTeX at
 > >any time and that OTP is only meant to handle the basic conversion
 > >from the external encoding (ASCII, latin-1, UTF-8, or whatever) to
 > >the internal 32-bit Unicode. All this happens way before the input
 > >gets tokenized, so there is
 >
 > In fact, \InputEncoding was not intended for that, but only for
 > "technical" translations which applies to the whole document
 > as one byte -> two byte or little endian -> big endian. The main
 > problem of it is that it doesn't translate macros:
 > \def\myE{=C9}
 > \InputEncoding <an encoding>
 > =C9\myE

\InputEncoding is the point where one needs to go from the external
source encoding to the OICR, and that is precisely the sore spot: the
current \InputEncoding isn't doing this job fully (and, to be fair, it
is not clear how to do it properly).

but in my opinion it is absolutely essential that this all gets
disentangled and Omega ends up with a proper OICR model. Only then
could it become usable in a broader sense.

cheers
frank

ps: I would really like to thank Oliver a lot for doing this
compilation. The fact that we don't agree with some points in it only
means that the processes are so complicated that we haven't yet
understood them properly and so need to work further on them (and a
document like this does help)
pps: what might help as well is to identify the parts we feel are
controversial and actually mark them (perhaps with some marginal
notes\marginpar{FMi: bla bla}\marginpar{JLo: FMi talks
rubbish}\marginpar{LHe: they both seem to have no idea what they are
talking about} :-)
ppps: i'm off to GUTenberg so don't be surprised if flaming replies go
unanswered by me for a while

------_=_NextPart_001_01C0D983.B9F29580--