From: Lars Hellström
Sender: "Mailing list for the LaTeX3 project"
To: "Multiple recipients of list LATEX-L"
Subject: Re: Multilingual Encodings Summary 2.2
Date: Thu, 10 May 2001 19:59:08 +0100

At 19.00 +0200 01-05-10, jbezos wrote:
>Quick answers to a couple of points. Lars says:
>
>>The comparison in Section 3.2.1 of how characters are processed in TeX and
>>Omega respectively also seems strange. In Omega case (b), column C, we see
>>that the LICR character \'e is converted to an 8-bit character "82 before
>>some OTP converts it to the Unicode character "00E9 in column D. Surely
>>this can't be right---whenever LICR is converted to anything it should be
>>to full Unicode, since we will otherwise end up in an encoding morass much
>>worse than that in current LaTeX.
>
>Surely it's right :-). Remember that È is not an active character in
>lambda and that ocp's are applied after expansion. Let's consider
>the input È\'eÈ. It's expanded to the character sequence "82 "82 "82,
>which is fine. If we define \'e as "00E9 the expansion is "82 "00 "E9
>"82, which is definitely wrong.

Aren't you completely confusing the text file format with the internal
format (OICR or whatever) here? If \'e is defined anything like it is today
(via \DeclareTextComposite) it should expand to a category 11 token with
character code "00E9, i.e., the token ^^^^00e9. I see no indication in the
docs that Omega would convert such a token to two tokens.
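
A minimal sketch of that distinction, written in Python rather than TeX or Omega, so it only models the idea and says nothing about how Omega is implemented. The name `e_acute` and the choice of UTF-16 big-endian as the "file format" are assumptions made for illustration. Internally there is one character with code "00E9; the two bytes "00 "E9 only appear once that character is serialised to a file.

```python
# One Unicode character, as an internal representation would hold it.
e_acute = chr(0x00E9)
assert len(e_acute) == 1  # a single unit, not two

# The same character serialised to an external 16-bit big-endian format.
as_bytes = e_acute.encode("utf-16-be")
print(as_bytes.hex(" "))  # 00 e9
```

The sequence "82 "00 "E9 "82 from the quoted paragraph mixes these two levels: "82 is treated as a character, while "00 "E9 is the byte-level spelling of a single character.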

>Further, converting the input to Unicode
>at the LICR level means that the auxiliary files use the Unicode encoding;

No it wouldn't. If \protect is not \@typeset@protect when \'e is expanded
then it will be written to a file as \'e.

>if the editor is not a Unicode one these files become unmanageable and messy.
>LICR should preserve, IMO, the current LaTeX conventions, and È\'eÈ
>should be written to these files in exactly that way.

Well, if you're not naughtily assuming that input encoding equals font
encoding, and hence use inputenc to interpret the non-ASCII-characters
(which I assume were és when you sent them), then the above text is
_currently_ written to the .aux file as \'e\'e\'e, not as é\'eé.

>Or in other words,
>any file to be read by LaTeX should follow the "external" LaTeX
>conventions and only transcoded in the mouth.
>
>>As I understand the Omega draft documentation, there can be no more than
>>one OTP (the \InputTranslation) acting on the input of LaTeX at any time
>>and that OTP is only meant to handle the basic conversion from the external
>>encoding (ASCII, latin-1, UTF-8, or whatever) to the internal 32-bit
>>Unicode. All this happens way before the input gets tokenized, so there is
>
>In fact, \InputEncoding was not intended for that, but only for
>"technical" translations which apply to the whole document,
>such as one byte -> two byte or little endian -> big endian.

Assuming that \InputEncoding is some alias for the \InputTranslation
primitive, that's roughly what I meant; maybe translation from latin-1 was
a bit off the target. OTOH you seem to assume below that \InputEncoding
should also handle translations which are just as untechnical!!?

>The main
>problem with it is that it doesn't translate macros:
>\def\myE{ }
>\InputEncoding <an encoding>
> \myE
>
>only the explicit   is transcoded.

Isn't that a bit like saying "the main problem is that changing the
\catcode of @ doesn't change the categories of @ tokens in macros"?
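
The analogy can be mimicked with a toy model in Python; this is only a sketch of the principle, not of Omega's actual behaviour. The class `Input`, its methods, and the byte "E9 standing in for the character lost from the quoted \def\myE example are all assumptions. Bytes are transcoded at the moment they are read, so a macro body stored earlier keeps whatever interpretation was current then, just as an @ token keeps the category it had when it was tokenized.

```python
class Input:
    """Toy model: bytes are transcoded when read, never afterwards."""

    def __init__(self, encoding):
        self.encoding = encoding
        self.macros = {}

    def define(self, name, raw):
        # The body is decoded now, with whatever encoding is current.
        self.macros[name] = raw.decode(self.encoding)

    def read(self, raw):
        # Explicit input is decoded with the encoding in force at read time.
        return raw.decode(self.encoding)


inp = Input("latin-1")
inp.define("myE", b"\xe9")      # stored as U+00E9
inp.encoding = "iso8859_8"      # later switch, in the role of \InputEncoding
explicit = inp.read(b"\xe9")    # the same byte is now read as U+05D9
stored = inp.macros["myE"]      # the macro body is unchanged
```

After the switch, `explicit` and `stored` differ although both came from the same byte, which is exactly the behaviour described in the quoted complaint.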

> However, that can be desirable
>under some circumstances, but you know in advance which encodings
>will be used.

That this is known sounds like a very dangerous assumption to me.

>More dangerous is the following:
>
>\comenzar{enumeraciÛn} % Spanish interface with, say, MacRoman
>                       % \comenzar means \begin
>\InputEncoding <iso hebrew>
>
>\terminar{enumeraciÛn} % <- that's transcoded using iso hebrew!

But such characters (the Spanish as well as the Hebrew) aren't allowed in
names in LaTeX!
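
For what the quoted scenario would amount to if such names were allowed, here is a small Python sketch. It assumes the environment name was meant to be "enumeración" typed in MacRoman; the variable names and the use of Python's `mac_roman` and `iso8859_8` codecs are illustrative choices, not anything taken from Omega.

```python
# The bytes an editor using MacRoman would store for the environment name.
raw = "enumeración".encode("mac_roman")

# Name as seen by \comenzar, read while MacRoman is the input encoding.
begin_name = raw.decode("mac_roman")

# Name as seen by \terminar, read after the switch to ISO Hebrew.
# errors="replace" keeps the sketch running if a byte has no mapping.
end_name = raw.decode("iso8859_8", errors="replace")

print(begin_name == end_name)  # False
```

The same bytes give two different names, so the environment opened by \comenzar would not be matched by the one closed by \terminar.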

Lars Hellström
