From: "David Carlisle"
To: "Multiple recipients of list LATEX-L"
Sender: "Mailing list for the LaTeX3 project"
Reply-To: "Mailing list for the LaTeX3 project"
Subject: Re: Multilingual Encodings Summary
Date: Tue, 13 Feb 2001 17:55:07 +0100
Message-ID: <200102131655.QAA05656@penguin.nag.co.uk>
In-Reply-To: (message from Roozbeh Pournader on Tue, 13 Feb 2001 19:51:03 +0330)

> Every letter should be made active to look forward to find the combining
> character sequence after it, and then put that over its own head! I don't
> think this is impossible; you need to loop until a non-combining char is
> found.

That's the easy bit.

The hard bit is that, having made every character active, \begin no longer
parses as the single token \begin but as \ b e g i n, so you have to make
the active definition of \ look ahead and grab all the "letters", where
"letter" means those characters that were catcode 11 until you made them 13.
So you have to maintain a list of all those and check them one by one
against what is in the token stream. Similarly, matching { } no longer works
(unless you cheat and leave those at catcodes 1 and 2), so in the end you
have to write TeX's tokeniser in TeX.
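The lookahead scheme in the quote (scan forward after each base character until a non-combining character appears) is easy to illustrate outside TeX. Here is a minimal Python sketch, not from the thread; the helper name `cluster_split` is my own, and `unicodedata.combining` stands in for whatever combining-class test an implementation would actually use:

```python
import unicodedata

def cluster_split(text):
    """Group each base character with the combining marks that follow it,
    looping 'until a non-combining char is found', as in the quoted scheme."""
    clusters = []
    for ch in text:
        # unicodedata.combining() is nonzero for combining marks
        # such as U+0301 COMBINING ACUTE ACCENT
        if unicodedata.combining(ch) and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "e" + U+0301 stays with its base; "a" starts a new cluster
assert cluster_split("e\u0301a") == ["e\u0301", "a"]
```

The point of the paragraph above is that while this loop is trivial in a language with a real tokeniser, doing it with TeX active characters forces you to re-implement that tokeniser yourself.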
Writing such a tokeniser is possible, but it is not especially fast, and it
is hard to do without breaking some add-on LaTeX package somewhere.

> With math yes, but with other things no, the model is getting stable.

It's not just math. 40000 (I think) Chinese characters have just been added.
Unicode 2 was one plane of 2^16; Unicode 3 is 17 planes of 2^16. That's a
lot of new slots for people to suggest ways to fill, so it will grow.

> it because Unicode only uses code points less
> than U+10FFFF, there is a lot of space if we want additional internal
> glyphs.

Going above 10FFFF might be dangerous (if you ever wanted a feature to
output the internal state you'd have problems), but planes 15 and 16 are
set aside for private use, which gives 2^17 spare slots; that ought to be
enough.

But I think the main problem is that it doesn't really make sense to use
Unicode internally in standard TeX (which is a 7-bit system pretending to
be 8-bit).

If LaTeX switched to use Omega (only) then

a) this might require Omega to be more stable than Omega users would wish,
i.e. it might prematurely limit the addition of new features.

b) it would cut out people using TeX systems that don't include Omega.
You might say they should all switch to web2c TeX, but that's like saying
that everyone should use emacs on linux. Clearly it's true, but it doesn't
happen that way.

c) as a special case of (b), it would (at present, I think) cut out pdflatex.

d) it would require reasonably major surgery to LaTeX internals. It would
be possible to make documents and packages using "documented interfaces"
still work with a new internal character handling, but CTAN will reveal a
lot of heavily used packages that for good (or bad) reasons don't use
documented interfaces and just redefine arbitrary macros (often because
there isn't a documented interface). A lot of these would break.

So in the short to medium term it seems there have to be two versions,
latex/omega and latex/tex. How compatible they can be as latex/omega uses
more Omega features, I am not sure.
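The plane arithmetic behind those numbers is easy to check. A short Python sketch (the variable names are mine, not from the thread):

```python
PLANE_SIZE = 2 ** 16   # code points per plane
PLANES = 17            # Unicode 3.x and later: planes 0 through 16

# The last valid code point is U+10FFFF
assert PLANES * PLANE_SIZE - 1 == 0x10FFFF

# Unicode sets aside the last two planes (15 and 16) for private use,
# giving roughly 2^17 slots for internal-only "glyphs" (the last two
# code points of each plane are noncharacters, so slightly fewer in
# practice)
private_use_slots = 2 * PLANE_SIZE
assert private_use_slots == 2 ** 17
```

So the jump from Unicode 2 to Unicode 3 multiplied the code space by 17, and the two private-use planes alone hold more slots than all of Unicode 2 did.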
David

_____________________________________________________________________
This message has been checked for all known viruses by Star Internet,
delivered through the MessageLabs Virus Control Centre. For further
information visit http://www.star.net.uk/stats.asp