MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C094DE.33DAC600"
In-Reply-To:  <14982.59968.683159.480912@istrati.zdv.uni-mainz.de> (message              from Frank Mittelbach on Sun, 11 Feb 2001 20:38:40 +0100)
References: <14982.45082.150652.74719@istrati.zdv.uni-mainz.de>            <Pine.LNX.4.10.10102111920000.11902-100000@Sina.sharif.ac.ir>            <14982.59968.683159.480912@istrati.zdv.uni-mainz.de>
Content-class: urn:content-classes:message
Subject:      Re: LaTeX's internal char prepresentation (UTF8 or Unicode?)
Date: Mon, 12 Feb 2001 11:25:21 +0100
Message-ID:  <200102121025.KAA15302@nag.co.uk>
From: "David Carlisle" <davidc@NAG.CO.UK>
Sender: "Mailing list for the LaTeX3 project" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
To: "Multiple recipients of list LATEX-L" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
Reply-To: "Mailing list for the LaTeX3 project" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
Status: R

This is a multi-part message in MIME format.

------_=_NextPart_001_01C094DE.33DAC600
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

>  > What about symbol fonts like TC? What about math characters that are
>  > unified in Unicode (\rightarrow and \longrightarrow)? What about the
>  > things that are not yet in Unicode?

> yes, what about them?

It may be worth noting that Unicode 3.1 and 3.2 will (assuming the
current plans go through) have a lot more math characters than Unicode
3.0 (~1500 more, if I recall correctly). Actually one of the main
things still missing is long arrows. We (the MathML working group) are
in touch with the Unicode folks to see if there's any chance of those
being added as well, although time is getting short for further
additions to 3.1 and 3.2.

There are always the private use areas of course for extra characters
that a TeX/Unicode system could use. (Although using private use
characters in a publicly distributed system is considered bad form,
sometimes it can't be avoided.)
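
As a small illustration of the private use point, here is a sketch
(in Python, not part of any TeX system) of how a tool could tell
whether a character it has been handed lies in a private use area.
The function name is made up for the example; the check relies on the
Unicode general category "Co".

```python
import unicodedata

def is_private_use(ch: str) -> bool:
    """Return True if ch has general category Co (private use)."""
    return unicodedata.category(ch) == "Co"

# U+E000 and U+F8FF are the ends of the BMP private use area;
# U+2192 is the ordinary RIGHTWARDS ARROW.
print(is_private_use("\ue000"))
print(is_private_use("\uf8ff"))
print(is_private_use("\u2192"))
```

A check like this only says that a code point is private; what the
character means still has to be agreed between sender and receiver,
which is why such characters are awkward in a public distribution.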

Having built a TeX (rather than Omega) based system (xmltex) that does
use UTF-8 as the internal form of all characters, I'd agree with the
comments made earlier that one of the hardest problems is Unicode
combining characters (xmltex doesn't deal with them at all by default).
xmltex of course makes almost no use of TeX's inbuilt csname parsing
for document files, as it reads XML syntax. It can run with all
characters being active (which would be needed to handle combining
characters) but normally ASCII characters are non-active, which is a big
time saving for languages that mainly use the Latin alphabet.
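
To make the byte-level picture concrete, here is a rough sketch in
Python (not xmltex code; the helper names are invented for the
example). It shows that ASCII stays single-byte in UTF-8 while other
characters become sequences of bytes >= 0x80, and that a combining
accent arrives after its base letter, which is itself a plain ASCII
byte.

```python
import unicodedata

def is_high_byte(byte: int) -> bool:
    """Bytes >= 0x80 occur only inside multi-byte UTF-8 sequences."""
    return byte >= 0x80

def has_combining_marks(text: str) -> bool:
    """True if any character has a non-zero combining class."""
    return any(unicodedata.combining(ch) != 0 for ch in text)

precomposed = "\u00e9"   # e-acute as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(list(precomposed.encode("utf-8")))  # [195, 169]
print(list(decomposed.encode("utf-8")))   # [101, 204, 129]
print(has_combining_marks(precomposed))   # False
print(has_combining_marks(decomposed))    # True
```

In the decomposed form the base letter is byte 101, an ordinary "e".
If only the high bytes are treated specially, the "e" has already
gone past by the time the accent bytes are seen, which is one way to
see why handling combining characters would need the ASCII characters
to be active too.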

In xmltex one is more or less forced to use UTF-8, as you are accepting
XML character streams with no explicit markup. However, for a system
using TeX-style markup it isn't at all clear that the benefits would
outweigh the costs. Changing to a UTF-8 internal form would make LaTeX
slower (a lot slower if it handled combining characters), and for the
majority of existing users it would have no advantage (so they would
keep using the old system, and not update).

For specific uses (in particular typesetting sources derived from XML)
it is possible to layer UTF-8 support over the current base. There the
costs are worth it, as there is really no alternative to UTF-8 support:
either you code the UTF-8 support in TeX as in xmltex (there is similar
code in the CJK package, and I think there's a utf8.sty on CTAN) or you
have an external program do the translation.
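
The external-program route could look roughly like the following
Python sketch. It is only an illustration: the mapping table and the
function name are hypothetical, the table has three entries where a
real one would need thousands, and it does not escape ASCII characters
that are special to TeX (%, &, and so on).

```python
# Hypothetical mapping from Unicode characters to TeX markup.
TEX_MARKUP = {
    "\u00e9": r"\'e",                        # e-acute
    "\u00fc": r'\"u',                        # u-umlaut
    "\u2192": r"\ensuremath{\rightarrow}",   # rightwards arrow
}

def to_tex(text: str) -> str:
    """Replace non-ASCII characters by TeX markup, pass ASCII through."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        elif ch in TEX_MARKUP:
            out.append(TEX_MARKUP[ch])
        else:
            raise ValueError(f"no TeX markup known for U+{ord(ch):04X}")
    return "".join(out)

print(to_tex("caf\u00e9 \u2192 M\u00fcnchen"))
```

Raising an error for unmapped characters is a deliberate choice in
this sketch: silently dropping them would hide exactly the characters
that have no TeX equivalent yet.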

If the TeX engine changes to Omega (or an Omega-like system that is
similarly based on Unicode) then the rules change completely, and
Unicode as the internal form becomes a much more attractive prospect.
However, I'm not sure that we are quite ready to switch all TeX
distributions to Omega as the default engine for LaTeX, are we?


David

------_=_NextPart_001_01C094DE.33DAC600--