From: Marcel Oliver
Sender: Mailing list for the LaTeX3 project
To: Multiple recipients of list LATEX-L
Reply-To: Mailing list for the LaTeX3 project
Subject: LaTeX's internal char representation (UTF8 or Unicode?)
Date: Sun, 11 Feb 2001 17:45:44 +0100
Message-ID: <14982.49592.211375.235529@gargle.gargle.HOWL>
In-Reply-To: <14982.45082.150652.74719@istrati.zdv.uni-mainz.de>

Frank Mittelbach writes:
 > in 1992/3 when we worked on shaping the ideas of the LaTeX internal
 > representation we actually did discuss similar ideas but back then
 > abandoned them because of resource constraints (in the
 > software). Machines are nowadays bigger and faster so this isn't
 > really much of an argument there.
 >
 > So... time for another attempt?

Yes, yes, yes!

 > The LaTeX internal character representation is a 7bit
 > representation not an 8bit one as UTF8. As such it is far less
 > likely to be mangled by incorrect conversion if files are exchanged
 > between different platforms. I have yet to see that UTF8 text
 > (without taking precaution and externally announcing that a file is
 > in UTF8) is really properly handled by any OS platform. Is it?

Not at the moment.  But there is a strong movement pushing for UTF8 as
_the_ encoding standard.  Support in bleeding edge versions of a lot
of software is actually quite good.  As far as I can see, UTF8 is the
only standard that has a reasonable chance of becoming the one that
"works without taking precaution".

It would be a pity if LaTeX missed the boat.  For some info, see

  http://www.cl.cam.ac.uk/~mgk25/unicode.html

 > TeX is 7bit with a parser that accepts 8bit but doesn't by default
 > give it any meaning. On the other hand Omega is 16bit (or more
 > these days?) and could be viewed as internally using something like
 > Unicode for representation.

This is good because UTF8 is a proper superset of what TeX is
currently taking as input.  Does anybody know about the state of
Omega?  16-bit Unicode is not the whole game, and also not particularly
attractive as an input encoding.
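The superset claim can be checked mechanically.  A small Python sketch
(mine, not part of the original mail) showing both points: 7-bit ASCII
input is byte-for-byte unchanged under UTF8, and a single 16-bit unit
cannot cover all of Unicode:

```python
# Check of the two encoding claims above (illustration only): plain
# 7-bit ASCII text is byte-for-byte identical under UTF-8, and one
# 16-bit code unit cannot represent every Unicode character.

ascii_source = r"\documentclass{article} % any 7-bit TeX input"
assert ascii_source.encode("utf-8") == ascii_source.encode("ascii")

# A non-ASCII character becomes a multi-byte sequence in which no byte
# is a valid ASCII byte, so 7-bit-clean tools pass it through unharmed.
assert "\u00e4".encode("utf-8") == b"\xc3\xa4"   # umlaut-a

# A character beyond U+FFFF (MATHEMATICAL ITALIC SMALL A, U+1D44E)
# needs a surrogate pair in UTF-16: four bytes, two 16-bit units.
assert len("\U0001d44e".encode("utf-16-be")) == 4
```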

 >  wouldn't it be better if the internal LaTeX representation would
 >  be Unicode in one or the other flavor?

Yes, because:

- A LaTeX specific naming scheme will be essentially unmaintainable.

- LaTeX could eventually be made to output diagnostics and log files
  in UTF8.  For example, UTF8-enabled Xterms exist now, and will
  likely come as default on Linux distributions long before LaTeX3.
  Same for text editors.  So the infrastructure is getting to a state
  where it's possible to pull this off.

 >  - however, not clear is that the resulting names are easier to
 >    read, eg \unicode{00e4} viz \"a.

See remark about Xterms.

 >  - the current latex internal representation is richer than unicode
 >    for good or worse, eg \" is defined individually as
 >    representation for accenting the next char, which means that
 >    anything \"<base-char-in-the-internal-reps> is automatically
 >    also a member of it, eg \"g.

This is not necessarily a problem (cf. Roozbeh's remark about math
symbols which are not properly defined in unicode).  As long as there
are no hyphenation and nongeneric kerning issues involved (and those
seem only an issue for natural language scripts), one could still have
named symbols as they exist now, whether they are part of some special
font, a combination of glyphs, the drawing of some box, etc.
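As an aside (my illustration, not from the thread): Unicode's combining
characters give much the same open-ended "accent over any base"
generality that the quoted text attributes to \".  In Python:

```python
# Unicode normalization parallel to the \"a vs \"g discussion above
# (illustration only, not part of the original mail).
import unicodedata

# umlaut-a has an atomic (precomposed) code point, like \unicode{00e4}:
assert unicodedata.normalize("NFC", "a\u0308") == "\u00e4"

# umlaut-g has no precomposed code point; NFC leaves it as base char
# plus COMBINING DIAERESIS -- the analogue of \"g having no single form:
assert unicodedata.normalize("NFC", "g\u0308") == "g\u0308"
```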

 >  - the latter point could be considered bad since it allows to
 >    produce characters not in unicode but independently of what you
 >    feel about that, the fact itself has consequences when defining
 >    mappings for font encodings. right now, one specifies the
 >    accents, ie \DeclareTextAccent\" and for those glyphs that
 >    exist as composites one also specifies the composite, eg
 >    \DeclareTextComposite{\"}{T1}{a}{...}  With the unicode approach
 >    as the internal representation there would be an atomic form ie
 >    \unicode{00e4} describing umlaut-a so if that has no
 >    representation in a font, eg in OT1, then one would need to
 >    define for each combination the result.
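A toy model of the lookup described above (my sketch in Python, not
LaTeX code; the table name and render function are hypothetical): an
atomic character is first looked up in the encoding's composite table,
and falls back to accent-plus-base via decomposition otherwise:

```python
# Toy model of composite lookup with an accent fallback (illustration
# only; T1_COMPOSITES and render are invented names, not LaTeX APIs).
import unicodedata

T1_COMPOSITES = {"\u00e4": "precomposed T1 glyph for umlaut-a"}

def render(char: str, composites: dict) -> str:
    # Prefer an atomic glyph, as \DeclareTextComposite would provide.
    if char in composites:
        return composites[char]
    # Otherwise decompose into base + combining accent, as
    # \DeclareTextAccent-style construction would.
    decomp = unicodedata.normalize("NFD", char)
    if len(decomp) == 2:
        base, accent = decomp
        return f"accent U+{ord(accent):04X} over '{base}'"
    return f"plain '{char}'"

assert render("\u00e4", T1_COMPOSITES) == "precomposed T1 glyph for umlaut-a"
assert render("\u00e4", {}) == "accent U+0308 over 'a'"  # OT1-like fallback
```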

This seems to be the logical thing to do?!

--Marcel
