From: "David Carlisle"
Sender: "Mailing list for the LaTeX3 project"
To: "Multiple recipients of list LATEX-L"
Subject: Re: LaTeX's internal char prepresentation (UTF8 or Unicode?)
Date: Mon, 12 Feb 2001 11:25:21 +0100
Message-ID: <200102121025.KAA15302@nag.co.uk>
In-Reply-To: <14982.59968.683159.480912@istrati.zdv.uni-mainz.de> (message from Frank Mittelbach on Sun, 11 Feb 2001 20:38:40 +0100)

> > What about symbol fonts like TC? What about math characters that are
> > unified in Unicode (\rightarrow and \longrightarrow)? What about the
> > things that are not yet in Unicode?

> yes, what about them?

It may be worth noting that Unicode 3.1 and 3.2 will (assuming the
current plans go through) have a lot more math characters (~1500 more,
if I recall correctly) than Unicode 3.0. Actually, one of the main
things still missing is long arrows. We (the MathML working group)
are in touch with the Unicode folks to see if there's any chance of
those being added as well, although time is getting short for further
additions to 3.1 and 3.2.

There are always the private use areas, of course, for extra characters
that a TeX/Unicode system could use. (Although using private use characters
in a publicly distributed system is considered bad form, sometimes it
can't be avoided.)

Having built a TeX-based (rather than Omega-based) system (xmltex) that does
use UTF-8 as the internal form of all characters, I'd agree with the
comments made earlier that one of the hardest problems is Unicode
combining characters (xmltex doesn't deal with them at all by default).
xmltex of course makes almost no use of TeX's inbuilt csname parsing
for document files, as it reads XML syntax. It can run with all
characters active (which would be needed to handle combining
characters), but normally ASCII characters are non-active, which is a big
time saving for languages that mainly use the Latin alphabet.
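
To illustrate why combining characters are awkward for a reader like this: the combining mark follows its base letter in the character stream, so nothing can be typeset for a letter until the next character has been inspected. The following is only a rough Python sketch of that lookahead, not xmltex code; the function name `group_combining` is made up for the illustration.

```python
import unicodedata


def group_combining(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper for illustration only; it is not part of xmltex.
    """
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a non-zero canonical combining
        # class for combining marks and 0 for ordinary characters.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# "e" followed by U+0301 COMBINING ACUTE ACCENT stays together as one unit.
print(group_combining("cafe\u0301"))
```

Every character has to pass through a check like this, including plain ASCII letters, which is roughly why making all characters active costs so much time.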

In xmltex one is more or less forced to use UTF-8, as you are accepting
XML character streams with no explicit markup. However, for a system
using TeX-style markup it isn't at all clear that the benefits would
outweigh the costs. Changing to a UTF-8 internal form would make LaTeX
slower (a lot slower if it handled combining characters), and for the
majority of existing users it would have no advantage (so they would
keep using the old system and not update).

For specific uses (in particular typesetting sources derived from XML)
it is possible to layer UTF-8 support over the current base. There the
costs are worth it, as there is really no alternative to UTF-8 support:
either you code the UTF-8 support in TeX, as in xmltex (there is similar
code in the CJK package, and I think there's a utf8.sty on CTAN), or you
have an external program do the translation.
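
As a rough idea of what the "external program" route could look like, here is a minimal Python sketch that decodes UTF-8 input and replaces non-ASCII characters with ordinary TeX markup before TeX ever sees the file. The mapping table `TEX_MARKUP` and the function `utf8_to_tex` are assumptions made up for this example; a real translator would need a far larger table and some policy for characters it does not know.

```python
# Hypothetical, deliberately tiny mapping from Unicode characters to
# standard LaTeX markup; a real table would be much larger.
TEX_MARKUP = {
    "\u00e9": r"\'{e}",          # e with acute accent
    "\u00fc": r'\"{u}',          # u with diaeresis
    "\u00df": r"{\ss}",          # sharp s
    "\u2192": r"$\rightarrow$",  # rightwards arrow
}


def utf8_to_tex(data: bytes) -> str:
    """Translate UTF-8 encoded bytes into 7-bit TeX source (sketch only)."""
    text = data.decode("utf-8")
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)               # ASCII passes through unchanged
        elif ch in TEX_MARKUP:
            out.append(TEX_MARKUP[ch])   # known character: emit markup
        else:
            # This sketch simply refuses characters it has no markup for.
            raise ValueError(f"no TeX markup known for U+{ord(ch):04X}")
    return "".join(out)


print(utf8_to_tex("Stra\u00dfe \u2192 caf\u00e9".encode("utf-8")))
```

The point of doing it this way is that the TeX side stays exactly as it is today, so users who never need UTF-8 pay nothing for it.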

If the TeX engine changes to Omega (or an Omega-like system that is
similarly based on Unicode) then the rules change completely, and
Unicode as the internal form becomes a much more attractive prospect.
However, I'm not sure that we are quite ready to switch all TeX
distributions to Omega as the default engine for LaTeX, are we?


David
