From: "David Carlisle"
To: "Multiple recipients of list LATEX-L"
Sender: "Mailing list for the LaTeX3 project"
Reply-To: "Mailing list for the LaTeX3 project"
Subject: Re: Multilingual Encodings Summary
Date: Tue, 13 Feb 2001 17:55:07 +0100
Message-ID: <200102131655.QAA05656@penguin.nag.co.uk>
In-Reply-To: (message from Roozbeh Pournader on Tue, 13 Feb 2001 19:51:03 +0330)

> Every letter should be made active to look forward to find the combining
> character sequence after it, and then put that over its own head! I don't
> think this is impossible; you need to loop until a non-combining char is
> found.

That's the easy bit.

The hard bit is that, having made every character active, \begin no longer
parses as the single token \begin but as \ b e g i n, so you have to make
the active definition of \ look ahead and grab all the "letters", where
"letter" means those characters that were catcode 11 until you made them 13.
So you have to maintain a list of all those and check them one by one
against what is in the token stream. Similarly, matching { } no longer works
(unless you cheat and leave those at catcodes 1 and 2), so in the end you
have to write TeX's tokeniser in TeX.
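The lookahead scheme in the quote (scan forward after each base character until a non-combining character appears) is easy to illustrate outside TeX. Here is a minimal Python sketch, not from the thread; the helper name `cluster_split` is my own, and `unicodedata.combining` stands in for whatever combining-class test an implementation would actually use:

```python
import unicodedata

def cluster_split(text):
    """Group each base character with the combining marks that follow it,
    looping 'until a non-combining char is found', as in the quoted scheme."""
    clusters = []
    for ch in text:
        # unicodedata.combining() is nonzero for combining marks
        # such as U+0301 COMBINING ACUTE ACCENT
        if unicodedata.combining(ch) and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "e" + U+0301 stays with its base; "a" starts a new cluster
assert cluster_split("e\u0301a") == ["e\u0301", "a"]
```

The point of the paragraph above is that while this loop is trivial in a language with a real tokeniser, doing it with TeX active characters forces you to re-implement that tokeniser yourself.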
Writing such a tokeniser is possible, but it is not especially fast, and it
is hard to do without breaking some add-on LaTeX package somewhere.

> With math yes, but with other things no, the model is getting stable.

It's not just math. 40000 (I think) Chinese characters have just been added.
Unicode 2 was one plane of 2^16; Unicode 3 is 17 planes of 2^16. That's a
lot of new slots for people to suggest ways to fill, so it will grow.

> it because Unicode only uses code points less
> than U+10FFFF, there is a lot of space if we want additional internal
> glyphs.

Going above 10FFFF might be dangerous (if you ever wanted a feature to
output the internal state you'd have problems), but planes 15 and 16 are
set aside for private use, which gives 2^17 spare slots; that ought to be
enough.

But I think the main problem is that it doesn't really make sense to use
Unicode internally in standard TeX (which is a 7-bit system pretending to
be 8-bit).

If LaTeX switched to use Omega (only) then

a) this might require Omega to be more stable than Omega users would wish,
i.e. it might prematurely limit the addition of new features.

b) it would cut out people using TeX systems that don't include Omega.
You might say they should all switch to web2c TeX, but that's like saying
that everyone should use emacs on linux. Clearly it's true, but it doesn't
happen that way.

c) as a special case of (b), it would (at present, I think) cut out pdflatex.

d) it would require reasonably major surgery to LaTeX internals. It would
be possible to make documents and packages using "documented interfaces"
still work with a new internal character handling, but CTAN will reveal a
lot of heavily used packages that for good (or bad) reasons don't use
documented interfaces and just redefine arbitrary macros (often because
there isn't a documented interface). A lot of these would break.

So in the short to medium term it seems there have to be two versions,
latex/omega and latex/tex. How compatible they can be as latex/omega uses
more Omega features, I am not sure.
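The plane arithmetic behind those numbers is easy to check. A short Python sketch (the variable names are mine, not from the thread):

```python
PLANE_SIZE = 2 ** 16   # code points per plane
PLANES = 17            # Unicode 3.x and later: planes 0 through 16

# The last valid code point is U+10FFFF
assert PLANES * PLANE_SIZE - 1 == 0x10FFFF

# Unicode sets aside the last two planes (15 and 16) for private use,
# giving roughly 2^17 slots for internal-only "glyphs" (the last two
# code points of each plane are noncharacters, so slightly fewer in
# practice)
private_use_slots = 2 * PLANE_SIZE
assert private_use_slots == 2 ** 17
```

So the jump from Unicode 2 to Unicode 3 multiplied the code space by 17, and the two private-use planes alone hold more slots than all of Unicode 2 did.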
David

_____________________________________________________________________
This message has been checked for all known viruses by Star Internet,
delivered through the MessageLabs Virus Control Centre. For further
information visit http://www.star.net.uk/stats.asp