From: "Frank Mittelbach"
Sender: "Mailing list for the LaTeX3 project"
To: "Multiple recipients of list LATEX-L"
Reply-To: "Mailing list for the LaTeX3 project"
Subject: Re: Multilingual Encodings Summary 2.2
Date: Thu, 10 May 2001 19:59:31 +0100
Message-ID: <15098.58643.589025.379865@istrati.zdv.uni-mainz.de>

Javier wrote in reply to Lars:

 > Quick answers to a couple of points. Lars says:
 >
 > >The comparison in Section 3.2.1 of how characters are processed in TeX and
 > >Omega respectively also seems strange. In Omega case (b), column C, we see
 > >that the LICR character \'e is converted to an 8-bit character "82 before
 > >some OTP converts it to the Unicode character "00E9 in column D. Surely
 > >this can't be right---whenever LICR is converted to anything it should be
 > >to full Unicode, since we will otherwise end up in an encoding morass much
 > >worse than that in current LaTeX.

in my opinion this whole section is incorrect too, or say at best half correct
(sorry) --- however the real problem is that the whole area inside Omega
isn't consistent anywhere at the current point in time, because the OTPs have
been hooked into the wrong places (this is due to the technical ease of
opening up the points where the code was opened up and the near
impossibility of doing it elsewhere without rewriting the whole of TeX, and
due to the fact that originally the whole method was intended for far simpler
tasks and not for a grand picture)

conceptually LaTeX has a well-defined ICR (though with a somewhat clumsy
implementation due to the technical limitations of TeX) while at the current
point in time Omega hasn't such a beast.

For LaTeX the line c) in your table simply doesn't exist (it is not supported
code) and the columns actually do not make much sense for LaTeX, as they only
reflect the missing concept of an internal encoding in Omega: you are
looking at the thing from the current Omega implementation.

LaTeX conceptually has only three levels: source, ICR, Output

and something like step C lives along the way from ICR -> Output but is
only a technicality which is not of conceptual importance. All the reasoning
about and manipulation of text is done in only one form, which is the ICR.
and step D is in LaTeX the transformation from source to LICR via inputenc
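
The source -> LICR step can be looked at directly. What follows is only a
minimal sketch, assuming an 8-bit engine such as pdfLaTeX, the standard
inputenc package with the latin1 option, and T1 fonts. The ^^e9 notation
stands for the latin1 byte of e-acute so that the example does not depend on
how the file itself is saved; the macro names \fromsource and \fromlicr are
made up for the illustration.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[latin1]{inputenc}
\makeatletter
% 8-bit source character (byte "E9), mapped by inputenc on its way in
\protected@edef\fromsource{^^e9}
% the same character typed directly in its 7-bit LICR spelling
\protected@edef\fromlicr{\'e}
% write both internal forms to the terminal and log for comparison
\typeout{from source: \meaning\fromsource}
\typeout{typed LICR: \meaning\fromlicr}
\makeatother
\begin{document}
^^e9 and \'e should typeset the same glyph.
\end{document}
```

The exact wrapper shown in the log depends on the LaTeX release, so no
particular output is claimed here; the point is only that both spellings are
meant to end up as the same LICR object before any typesetting happens.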

For Omega one would expect that to be the same except that the OICR would be
something like U+00E9 --- but it isn't, as it takes a long while for text to
get to this form (if ever!!!!!).

you can say it differently as follows:

 my requirement for a usable internal representation is that I can take a
 single element of it at any time and it has a well-defined meaning (and a
 single one).

now for the LICR this is the case but for Omega it is (right now) not.


as a result one ends up having to explain all those problems of
misinterpreting the internal forms if you do this or that at a certain stage
(like storing text in a token register and reusing it at some other point, or
never passing it to the hlist builder, where the OTPs actually execute)
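
On the LaTeX side the token register case is harmless precisely because what
gets stored is LICR. A minimal sketch, assuming the standard fontenc package
with the OT1 and T1 encodings loaded; the register name \stored is
hypothetical:

```latex
\documentclass{article}
\usepackage[OT1,T1]{fontenc}
\newtoks\stored % hypothetical register name
\begin{document}
% held as LICR tokens, with no input or font encoding attached
\stored{caf\'e}%
% reuse under OT1: \' is realised with the OT1 definition of the accent
{\fontencoding{OT1}\selectfont \the\stored}
% reuse under T1: the very same tokens, now using the T1 definition
{\fontencoding{T1}\selectfont \the\stored}
\end{document}
```

Both groups typeset the same word although the accent is produced
differently in the two encodings; each stored token keeps a single
well-defined meaning wherever it is reused, which is the property asked of
an internal representation above.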


from your second ascii drawing in that section one would get the impression
(for a moment) that Omega has a well-defined OICR which is U+00E9, but as we
know this is unfortunately not the case --- though it should be!!!
(and to be honest, seeing the word "fontenc" on the left makes me shudder,
though I understand why Javier put it there originally; I think it is a
horrible misinterpretation of what fontenc conceptually does)

 > Surely it's right :-). Remember that é is not an active character in
 > lambda and that ocp's are applied after expansion. Let's consider

but ocp's should work on OICR and not on undefined byte sequences!
like here:

 > the input é\'eé. It's expanded to the character sequence "82 "82 "82,
 > which is fine.

which is not fine, not fine at all
because of this:

 > If we define \'e as "00E9 the expansion is "82 "00 "E9
 > "82, which is definitely wrong. Further, converting the input to Unicode

it is not just the latter that is wrong but the whole thing

 > at the LICR level means that the auxiliary files use the Unicode encoding;
 > if the editor is not a Unicode one these files become unmanageable and
 > messy.

not true. the OICR has to be unicode (or more exactly unique and well-defined
in the above sense, it can be 20 bits for all i care) if Omega is ever to get
off the ground. but the interface to the external world could apply a
well-defined output translation to something else before writing.

that could be utf8, but in fact it could be anything as long as it is
definable and controllable, so that you know what this file ends up with and
inputting it back again turns it right back into the OICR


 > LICR should preserve, IMO, the current LaTeX conventions, and é\'eé
 > should be written to these files in exactly that way.

not sure what you mean by "current LaTeX conventions": the current LaTeX
convention is that external files are always written in a special 7bit
representation of the LICR (involving things like \IeC). not wonderful but
conceptually clean.
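
To see what actually lands in an external file, one can run a small test
document and inspect the resulting .toc file. This is a sketch under the same
assumptions as any inputenc setup of that kind (8-bit engine, latin1 option,
T1 fonts); ^^e9 again stands for the latin1 byte of e-acute.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[latin1]{inputenc}
\begin{document}
\tableofcontents
% the section title contains one 8-bit source character
\section{Caf^^e9}
Some text.
\end{document}
```

With the behaviour described above, the entry in \jobname.toc is expected to
carry the 7-bit form \IeC {\'e} rather than the raw byte, so the file reads
back in to the same LICR whatever input encoding is active at that point.
Other LaTeX releases may write something different, so check the file rather
than rely on this description.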


 > Or in other words,
 > any file to be read by LaTeX should follow the "external" LaTeX
 > conventions and only transcoded in the mouth.

????

 > >As I understand the Omega draft documentation, there can be no more than
 > >one OTP (the \InputTranslation) acting on the input of LaTeX at any time
 > >and that OTP is only meant to handle the basic conversion from the external
 > >encoding (ASCII, latin-1, UTF-8, or whatever) to the internal 32-bit
 > >Unicode. All this happens way before the input gets tokenized, so there is
 >
 > In fact, \InputEncoding was not intended for that, but only for
 > "technical" translations which apply to the whole document,
 > such as one byte -> two byte or little endian -> big endian. The main
 > problem of it is that it doesn't translate macros:
 > \def\myE{É}
 > \InputEncoding <an encoding>
 > É\myE

\InputEncoding is the point where one needs to go from the external source
encoding to OICR, and that is precisely the wound: the current \InputEncoding
isn't doing this job fully (and, to be fair, it is not clear how to do it
properly)

but in my opinion it is absolutely essential that this all gets disentangled
and Omega ends up with a proper OICR model. Only then could it become usable
in a broader sense.

cheers
frank

ps: I would really like to thank Oliver a lot for doing this compilation. The
fact that we don't agree with some points in it only means that the processes
are so complicated that we haven't yet understood them properly and so need to
work further on them (and a document like this does help)
pps: what might help as well is to identify the parts we feel are
controversial and actually mark them (perhaps with some marginal
notes\marginpar{FMi: bla bla}\marginpar{JLo: FMi talks rubbish}\marginpar{LHe:
they both seem to have no idea what they are talking about}) :-)
ppps: i'm off to GUTenberg so don't be surprised if flaming replies go
unanswered by me for a while
