Subject: Re: LaTeX's internal char representation (UTF-8 or Unicode?)
Date: Mon, 12 Feb 2001 21:45:58 +0100
From: "Javier Bezos"
Sender: "Mailing list for the LaTeX3 project"
To: "Multiple recipients of list LATEX-L"

Some random quick remarks. I'm still trying to read through the huge
amount of messages.

What is the purpose of the LICR? Apparently, it's only an
intermediate step before creating the final output. That
may be true in TeX, but not in Omega, because there the LICR can
be processed by external tools (spelling, syntax, etc.).
There are lots of tools using Unicode, and very likely there
will be more in the future. However, there are only a handful
of tools that understand the current LICR, and it's unlikely
there will be more (LICR macros are eventually expanded and therefore
cannot be processed anyway; the very fact that Unicode chars
are actual `letter' chars is critical). So, having true
Unicode text (perhaps with tags, which can be removed if
necessary) at some point of the internal processing is, in my opinion,
an essential feature for future extensions of TeX. And indeed
Omega is an extension which can cope with that; I wouldn't like
to renounce that.

Another aim of Omega is handling language-specific typographical
features without explicit markup. For instance: German "ck,
Spanish "rr, Portuguese f{}i, Arabic ligatures, etc. Of course,
vf can handle that, but must I create several hundred
vf files only to remove the fi ligature? Omega translation
processes can handle that very easily.
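For contrast, here is a minimal sketch (the example words are mine, not from the discussion) of the explicit markup that standard LaTeX needs for the same effects, which is exactly what Omega's translation processes make unnecessary:

```latex
% Manual ligature handling, repeated at every occurrence:
shelf{}ful   % an empty group suppresses the would-be ff ligature
f\/i         % an italic correction likewise prevents the fi ligature
% With babel's german option, the shorthand "ck selects the ck
% that hyphenates as k-k (old orthography), e.g. Zu"cker
```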

[Marcel:]
>  > Anyway, Frank, I just got your last mail in my inbox (need to read the
>  > details more carefully), and I think we agree that it's worth
>  > exploring if there would be a substantial advantage for having some
>  > engine with Unicode internal representation.
> [Frank:]
> it surely is, though i'm not convinced that the time has come, given that the
> current LICR actually is as powerful (or more powerful in fact) than unicode
> ever can be.

Please, could you explain why?

[Roozbeh:]
>  > Please note that with different scripts, we have different font
>  > classifications also. I'm not sure if the NFSS model is suitable for
>  > scripts other than Latin, Cyrillic, and Greek (ok, there are some others
>  > here, like Armenian).
> [Frank:]
> i grant you that the way I developed the model was by looking at fonts and
> their concepts available for languages close to Latin and so it is quite
> likely that it is not suitable for scripts which are quite different.
>
> However to be able to sensibly argue this I beg you to give us some insight
> about these classifications and why you think NFSS would be unable to model
> them (or say not really suitable)

I think that Roozbeh refers to the fact that the Arabic script does
not follow the occidental classification of fonts (serif, sans serif,
typewriter).

The draft I've written for Lambda will allow one to say:

\scriptproperties{latin}{rmfamily = ptmr, sffamily = phvr}
\scriptproperties{greek}{rmfamily = grtimes, sffamily = grhelv}

(names are invented) but as you can see, it still uses the rm/sf/tt
model. If I switch from Latin to Greek and the
current font is sf (i.e., phvr), then the Greek text is written using
grhelv; but what is the sf equivalent in the Arabic script?
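For comparison, standard NFSS fixes these defaults once for the whole document, with no notion of script; a minimal sketch using the stock LaTeX interfaces (the family codes are the usual Times/Helvetica/Courier ones):

```latex
% Standard LaTeX: one rm/sf/tt default for everything,
% regardless of the script being typeset:
\renewcommand{\rmdefault}{ptm}  % roman      -> Times
\renewcommand{\sfdefault}{phv}  % sans serif -> Helvetica
\renewcommand{\ttdefault}{pcr}  % typewriter -> Courier
% \scriptproperties (draft) would instead select the family
% according to the current script.
```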

Javier
_________________________________________________________________
Javier Bezos                    | TeX and typography
jbezos at wanadoo dot es        | http://perso.wanadoo.es/jbezos/




PS. I would also like to apologize for discussing a set of macros which
has not been made public yet; but remember it's only a
draft and many things are liable to change (and maybe
the final code will be quite different; as we Spaniards say,
perhaps "no lo reconocerá ni la madre que lo parió" -- roughly,
"not even the mother who bore it will recognize it"). Anyway,
I'm going to reproduce part of a small text I sent to the Omega
list some time ago. I would like to note that I didn't intend to
move the discussion from the Omega-dev list to this one -- it just
happened.

==========
Let's now explain how TeX handles non-ASCII characters. TeX
can read Unicode files, as xmltex demonstrates, but non-ASCII
chars cannot be represented internally by TeX this way. Instead,
it uses macros which are generated by inputenc, and which are
expanded in turn into a true character (or a TeX macro) by
fontenc:

  é --- inputenc --> \'{e}  --- fontenc --> ^^e9

That's true even for Cyrillic, Arabic, etc. characters!
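For the curious, this chain is implemented by real declarations in the standard encoding files; for é they read (as in latin1.def and t1enc.def):

```latex
% inputenc (latin1.def): input byte 0xE9 produces the LICR form \'{e}
\DeclareInputText{233}{\'e}
% fontenc (t1enc.def): in T1-encoded fonts, \'{e} maps to slot 0xE9
\DeclareTextComposite{\'}{T1}{e}{233}
% In OT1 there is no precomposed slot, so \'{e} remains an
% accent-over-letter construction built with \accent.
```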

Omega can represent non-ASCII chars internally, and hence
actual chars are used instead of macros (with a few exceptions).
Trivial as it may seem, this difference is in fact a HUGE
difference. For example, the path followed by é will be:

  é  --an encoding ocp--|            |-- T1 font ocp  --> ^^e9
                        +-> U+00E9 --+
 \'e --fontenc (!)------|            |-- OT1 font ocp --> \OT1\'{e}


It's interesting to note that fontenc is used as a sort of
input method! (Very likely, a package with the same
functionality but a different name will be used.)

For that to be accomplished using ocp's, we must note that we
can divide them into two groups: those generating Unicode from
an arbitrary input, and those rendering the resulting Unicode
using suitable (or maybe just available :-) ) fonts. The
Unicode text can thus be analyzed and transformed by external
ocp's at the right place. Lambda further divides these two
groups into four (to repeat, these proposals are liable to
change):

1a) encoding: converts the source text to Unicode.
1b) input: sets input conventions. Keyboards have a limited
   number of keys, and hands a limited number of fingers.
   The goal of this group is to provide an easy way to enter
   Unicode chars using the most basic keys of keyboards
   (which means ASCII chars on Latin ones). Examples could
   be:
    *  --- => em-dash  (a well-known TeX input convention).
    *  ij  => U+0133 (in Dutch).
    *  no  => U+306E [the corresponding hiragana char]

Now we have the Unicode (with TeX tags) memory representation
which has to be rendered:

2a) writing: contextual analysis, ligatures, spaced punctuation
   marks, and so on.
2b) font: conversion from Unicode to the local font encoding or
   the appropriate TeX macros (if the character is not available in
   the font).
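Under these proposals, the four groups would presumably be chained as an Omega ocp list, along these lines (\ocp, \ocplist, \addbeforeocplist and \pushocplist are real Omega primitives, but the ocp names below are invented for illustration):

```latex
\ocp\OCPenc=inutf8          % 1a) encoding: source bytes -> Unicode
\ocp\OCPinput=nl-input      % 1b) input conventions: ij -> U+0133, etc.
\ocp\OCPwriting=latin-lig   % 2a) writing: ligatures, contextual analysis
\ocp\OCPfont=uni2t1         % 2b) font: Unicode -> T1 slots or macros
\ocplist\ScriptPipeline=
  \addbeforeocplist 100 \OCPenc
  \addbeforeocplist 200 \OCPinput
  \addbeforeocplist 300 \OCPwriting
  \addbeforeocplist 400 \OCPfont
\nullocplist
\pushocplist\ScriptPipeline
```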

This scheme fits well with the Unicode Design Principles,
which state that Unicode deals with memory representation
and not with text rendering or fonts (which is left to "appropriate
standards"). Hence, most so-called Unicode fonts cannot
properly render text in many scripts because they lack the
required glyphs.

There are some additional processes for "shape" changes (case,
script variants, etc.).
