Subject: Re: LaTeX's internal char representation (UTF-8 or Unicode?)
Date: Mon, 12 Feb 2001 21:45:58 +0100
From: "Javier Bezos"
Sender: "Mailing list for the LaTeX3 project"
To: "Multiple recipients of list LATEX-L"

Some random quick remarks. I'm still trying to read through the huge
amount of messages.

What is the purpose of the LICR? Apparently, it's only an
intermediate step before creating the final output. That
may be true in TeX, but not in Omega, because there the LICR can
be processed by external tools (spelling, syntax, etc.).
There are lots of tools using Unicode, and very likely there
will be more in the future. However, there are only a handful
of tools that understand the current LICR, and it's unlikely
there will be more (LICR macros are eventually expanded and therefore
cannot be processed anyway; the very fact that Unicode chars
are actual `letter' chars is critical). So, having true
Unicode text (perhaps with tags, which can be removed if
necessary) at some point of the internal processing is, in my opinion,
an essential feature for future extensions of TeX. And indeed
Omega is an extension which can cope with that; I wouldn't like
to renounce that.

Another aim of Omega is handling language-specific typographical
features without explicit markup. For instance: German "ck,
Spanish "rr, Portuguese f{}i, Arabic ligatures, etc. Of course,
vf can handle that, but must I create several hundred
vf files only to remove the fi ligature? Omega translation
processes can handle that very easily.
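For contrast, here is a minimal sketch (the example words are mine, not from the discussion) of the explicit markup that standard LaTeX needs for the same effects, which is exactly what Omega's translation processes make unnecessary:

```latex
% Manual ligature handling, repeated at every occurrence:
shelf{}ful   % an empty group suppresses the would-be ff ligature
f\/i         % an italic correction likewise prevents the fi ligature
% With babel's german option, the shorthand "ck selects the ck
% that hyphenates as k-k (old orthography), e.g. Zu"cker
```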

[Marcel:]
>  > Anyway, Frank, I just got your last mail in my inbox (need to read the
>  > details more carefully), and I think we agree that it's worth
>  > exploring if there would be a substantial advantage for having some
>  > engine with Unicode internal representation.
> [Frank:]
> it surely is, though i'm not convinced that the time has come, given that the
> current LICR actually is as powerful (or more powerful in fact) than unicode
> ever can be.

Please, could you explain why?

[Roozbeh:]
>  > Please note that with different scripts, we have different font
>  > classifications also. I'm not sure if the NFSS model is suitable for
>  > scripts other than Latin, Cyrillic, and Greek (ok, there are some others
>  > here, like Armenian).
> [Frank:]
> i grant you that the way I developed the model was by looking at fonts and
> their concepts available for languages close to Latin and so it is quite
> likely that it is not suitable for scripts which are quite different.
>
> However to be able to sensibly argue this I beg you to give us some insight
> about these classifications and why you think NFSS would be unable to model
> them (or say not really suitable)

I think that Roozbeh refers to the fact that the Arabic script does
not follow the occidental classification of fonts (serif, sans serif,
typewriter).

The draft I've written for Lambda will allow one to say:

\scriptproperties{latin}{rmfamily = ptmr, sffamily = phvr}
\scriptproperties{greek}{rmfamily = grtimes, sffamily = grhelv}

(names are invented) but as you can see, it still uses the rm/sf/tt
model. If I switch from Latin to Greek and the
current font is sf (i.e., phvr), then the Greek text is written using
grhelv; but what is the sf equivalent in the Arabic script?
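For comparison, standard NFSS fixes these defaults once for the whole document, with no notion of script; a minimal sketch using the stock LaTeX interfaces (the family codes are the usual Times/Helvetica/Courier ones):

```latex
% Standard LaTeX: one rm/sf/tt default for everything,
% regardless of the script being typeset:
\renewcommand{\rmdefault}{ptm}  % roman      -> Times
\renewcommand{\sfdefault}{phv}  % sans serif -> Helvetica
\renewcommand{\ttdefault}{pcr}  % typewriter -> Courier
% \scriptproperties (draft) would instead select the family
% according to the current script.
```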

Javier
_________________________________________________________________
Javier Bezos                    | TeX and typography
jbezos at wanadoo dot es        | http://perso.wanadoo.es/jbezos/




PS. I would also like to apologize for discussing a set of macros which
has not been made public yet; but remember it's only a
draft and many things are liable to change (and maybe
the final code will be quite different; as we Spaniards say,
perhaps "no lo reconocerá ni la madre que lo parió" -- roughly,
"not even the mother who bore it will recognize it"). Anyway,
I'm going to reproduce part of a small text I sent to the Omega
list some time ago. I would like to note that I didn't intend to
move the discussion from the Omega-dev list to this one -- it just
happened.

==========
Let's now explain how TeX handles non-ASCII characters. TeX
can read Unicode files, as xmltex demonstrates, but non-ASCII
chars cannot be represented internally by TeX this way. Instead,
it uses macros which are generated by inputenc, and which are
expanded in turn into a true character (or a TeX macro) by
fontenc:

  é --- inputenc --> \'{e}  --- fontenc --> ^^e9

That's true even for Cyrillic, Arabic, etc. characters!
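For the curious, this chain is implemented by real declarations in the standard encoding files; for é they read (as in latin1.def and t1enc.def):

```latex
% inputenc (latin1.def): input byte 0xE9 produces the LICR form \'{e}
\DeclareInputText{233}{\'e}
% fontenc (t1enc.def): in T1-encoded fonts, \'{e} maps to slot 0xE9
\DeclareTextComposite{\'}{T1}{e}{233}
% In OT1 there is no precomposed slot, so \'{e} remains an
% accent-over-letter construction built with \accent.
```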

Omega can represent non-ASCII chars internally, and hence
actual chars are used instead of macros (with a few exceptions).
Trivial as it may seem, this difference is in fact a HUGE
difference. For example, the path followed by é will be:

  é  --an encoding ocp--|            |-- T1 font ocp  --> ^^e9
                        +-> U+00E9 --+
 \'e --fontenc (!)------|            |-- OT1 font ocp --> \OT1\'{e}


It's interesting to note that fontenc is used as a sort of
input method! (Very likely, a package with the same
functionality but a different name will be used.)

For that to be accomplished using ocp's, we must note that we
can divide them into two groups: those generating Unicode from
an arbitrary input, and those rendering the resulting Unicode
using suitable (or maybe just available :-) ) fonts. The
Unicode text can thus be analyzed and transformed by external
ocp's at the right place. Lambda further divides these two
groups into four (to repeat, these proposals are liable to
change):

1a) encoding: converts the source text to Unicode.
1b) input: sets input conventions. Keyboards have a limited
   number of keys, and hands a limited number of fingers.
   The goal of this group is to provide an easy way to enter
   Unicode chars using the most basic keys of keyboards
   (which means ASCII chars on Latin ones). Examples could
   be:
    *  --- => em-dash  (a well-known TeX input convention).
    *  ij  => U+0133 (in Dutch).
    *  no  => U+306E [the corresponding hiragana char]

Now we have the Unicode (with TeX tags) memory representation
which has to be rendered:

2a) writing: contextual analysis, ligatures, spaced punctuation
   marks, and so on.
2b) font: conversion from Unicode to the local font encoding or
   the appropriate TeX macros (if the character is not available in
   the font).
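Under these proposals, the four groups would presumably be chained as an Omega ocp list, along these lines (\ocp, \ocplist, \addbeforeocplist and \pushocplist are real Omega primitives, but the ocp names below are invented for illustration):

```latex
\ocp\OCPenc=inutf8          % 1a) encoding: source bytes -> Unicode
\ocp\OCPinput=nl-input      % 1b) input conventions: ij -> U+0133, etc.
\ocp\OCPwriting=latin-lig   % 2a) writing: ligatures, contextual analysis
\ocp\OCPfont=uni2t1         % 2b) font: Unicode -> T1 slots or macros
\ocplist\ScriptPipeline=
  \addbeforeocplist 100 \OCPenc
  \addbeforeocplist 200 \OCPinput
  \addbeforeocplist 300 \OCPwriting
  \addbeforeocplist 400 \OCPfont
\nullocplist
\pushocplist\ScriptPipeline
```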

This scheme fits well with the Unicode Design Principles,
which state that Unicode deals with memory representation
and not with text rendering or fonts (which is left to "appropriate
standards"). Hence, most so-called Unicode fonts cannot
properly render text in many scripts because they lack the
required glyphs.

There are some additional processes for "shape" changes (case,
script variants, etc.).
