From: "Hans Aberg"
Sender: "Mailing list for the LaTeX3 project"
To: "Multiple recipients of list LATEX-L"
Reply-To: "Mailing list for the LaTeX3 project"
Subject: Re: LaTeX's internal char representation (UTF8 or Unicode?)
Date: Sat, 17 Feb 2001 17:24:45 +0100
In-Reply-To: <14990.31415.781462.208072@istrati.zdv.uni-mainz.de>

At 14:20 +0100 2001/02/17, Frank Mittelbach wrote:

There appear to be two variations: one based on the original TeX, and one
with TeX having some kind of extensions.

As for the second approach, it seems to me that the internal representation
should be 32-bit Unicode. As TeX does not seem well equipped to handle the
encoding issues, one should then hook up a preprocessor providing the
suitable translations. Thus
    whatever encoding -> preprocessor -> UTeX
This easy-to-write preprocessor can combine combining characters into single
Unicode characters, if possible, or otherwise write them in a form that
UTeX can easily handle, say by switching from postfix to prefix notation,
or whatever. With further tweaking of the TeX engine it could even combine
TeX combinations such as "--", "---" into single Unicode characters.
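
A minimal sketch of such a preprocessor, in Python. It assumes that Unicode
normalization form NFC is an acceptable stand-in for "combine combining
characters to single Unicode characters, if possible". The function names
and the choice of U+2013 and U+2014 as targets for "--" and "---" are
illustrative assumptions, not part of any existing UTeX.

```python
import unicodedata


def preprocess(raw, encoding):
    """Decode raw bytes and compose combining sequences where possible.

    NFC is used here as one possible reading of "combine combining
    characters"; sequences with no precomposed form are left as they are.
    """
    text = raw.decode(encoding)
    return unicodedata.normalize("NFC", text)


def combine_dashes(text):
    """Hypothetical engine-side step: map TeX dash ligatures to single
    Unicode characters. The longest match is replaced first. A real
    implementation would have to skip verbatim and math material.
    """
    return text.replace("---", "\u2014").replace("--", "\u2013")
```

For example, the two-character sequence "e" followed by U+0301 (combining
acute accent) would come out of preprocess() as the single character U+00E9.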

It is further easy to write such translators for reading/writing to files.
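
As a rough illustration of such file translators, the Python sketch below
keeps text as Unicode in memory and picks the external encoding only at the
file boundary. All function names are hypothetical, and "utf-8" and "utf-32"
merely stand in for the "utf8, Unicode" choice.

```python
import os
import tempfile


def write_internal(path, text, file_encoding):
    """Write Unicode text to a file in the chosen external encoding."""
    with open(path, "w", encoding=file_encoding, newline="") as f:
        f.write(text)


def read_internal(path, file_encoding):
    """Read a file in the given external encoding back into Unicode text."""
    with open(path, "r", encoding=file_encoding, newline="") as f:
        return f.read()


def roundtrip(text, file_encoding):
    """Write text to a temporary file and read it back again."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_internal(path, text, file_encoding)
        return read_internal(path, file_encoding)
    finally:
        os.remove(path)
```

The point of the design is that only these two functions know about the
external encoding; everything between them sees Unicode.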

Thus the picture becomes

                           - trans C (eg Uppercasing)
                           |      |
                           |      | Hardwired translations
                           V      |
 whatever --> decode --> Unicode ------->  trans B  --> ^^e9
                           ^      |
                           |      |
                           |      |
                        files: Choice of say utf8, Unicode

> > Omega can represent internally non ascii chars and hence
> > actual chars are used instead of macros (with a few exceptions).
> > Trivial as it can seem, this difference is in fact a HUGE
> > difference. For example, the path followed by é will be:
> >
> >  é --an encoding ocp-|           |-- T1 font ocp-->  ^^e9
> >                      +-> U+00E9 -+
> >  \'e -fontenc (!)----|           |- OT1 font ocp -> \OT1\'{e}

Thus, one no longer bothers to represent ASCII or any other such encoding
internally. Combinations such as \'e are no longer necessary,
except for backwards compatibility. It will probably not make much
difference what you do with those combinations, because documents that
need absolute compatibility for archival purposes can use the old TeX;
newer documents can be translated when the need arises. (That is, this
will work if the incompatibilities are not too big.)

> trans A = tokenising 8 bit numbers as the corresponding 16bit numbers
>          Example:
>          if é was in the cp437 code page (German DOS) it would
>          be the 8bit char "82; that would become the 16bit token with
>          number "0082 (which is NOT é in unicode = "00E9)
>
>          if on the other hand  é was in latin1 (where it is "E9) we
>          get "00E9

Such problems would not happen within UTeX, as the preprocessor would
provide the proper translation to \U000000E9.
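
The difference can be sketched in Python as follows. The function names are
made up for illustration, and Python's "cp437" and "latin-1" codecs stand in
for whatever decoding tables a real preprocessor would use.

```python
def zero_extend(raw):
    """'trans A' as quoted above: each 8-bit value becomes the token with
    the same number, regardless of the input code page."""
    return list(raw)


def decode_to_unicode(raw, encoding):
    """Preprocessor approach: decode by the declared input encoding, so the
    same character always yields the same Unicode code point."""
    return [ord(ch) for ch in raw.decode(encoding)]


def u_notation(code_point):
    """Format a code point in the \\UXXXXXXXX style used above."""
    return "\\U{:08X}".format(code_point)
```

With these, zero_extend(b"\x82") gives the number "82 unchanged, whereas
decode_to_unicode(b"\x82", "cp437") and decode_to_unicode(b"\xe9", "latin-1")
both give "E9, which u_notation() writes as \U000000E9.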

  Hans Aberg
