From: "Javier Bezos"
Sender: "Mailing list for the LaTeX3 project"
To: "Multiple recipients of list LATEX-L"
Subject: Re: Multilingual Encodings Summary 2.2
Date: Sun, 13 May 2001 19:55:58 +0100
Message-ID: <200105131759.f4DHxx723926@smtp.wanadoo.es>

>> And regarding font transformations, they should be handled by fonts, but
>> the main problem is that metric information (i.e., tfm) cannot be
>> modified from within TeX, except for a few parameters; I really wonder
>> if allowing more changes, mainly ligatures, is feasible (that
>> solution would be better than font ocp's and vf's, I think).
>
> I don't understand this. What kind of font transformations are you
> referring to?

For example, removing the fi ligature in Turkish. Or using an alternate
orthography in languages with contextual analysis.

>> Semantically or visually?
>
> I suspect Frank considers meaning to be a semantic concept, not a visual one.

I also suspect that, but then if we pick a char it will be
undefined visually, and its rendering (and TeX is essentially about
rendering) will _always_ need additional information about the context
(example: traditional ideograms in Japanese vs. simplified ones in
Chinese).
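As an aside, the fi-ligature case mentioned above can be illustrated at the
(La)TeX level; this is only a sketch of suppressing a ligature locally by
hand, not of how a font-level (tfm) solution would look:

```latex
% Sketch: the fi ligature comes from the font's ligature program in the
% tfm, so from within TeX it can only be broken locally. In Turkish the
% ligature is undesirable because it hides the dot that distinguishes
% dotted i from dotless i.
\documentclass{article}
\begin{document}
fi      % ligature applied automatically by the font

f{}i    % an empty group between the letters suppresses it

f\/i    % the italic correction also breaks the ligature
\end{document}
```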
> I believe one of the main problems for multilinguality in LaTeX today is
> that there is no way of recording (or maybe even of determining) the
> current context so that this information can be moved around with every
> piece of code affected by it. Hence most current commands strive instead to
> convert the code to a context-free representation (the LICR) by use of
> protected expansion.

In that case, we must find a way. Without it, proper rendering is
impossible. Of course, we may write an ideogram to the aux file as
a macro; for example, ai (love) could be written as \japaneseai or
\chineseai depending on the context in which it appears, but that means
the resulting code is not very different from the current mess
with Russian, where we have \cyrA, \cyrB, etc. That's exactly what I want
to avoid. Of course, that also means that changing things depending
on the target will become more difficult.

Further, by doing so we are again creating a closed system
using its own conventions, with no links to external tools adapted
to Unicode. I will be able to process a file and extract information
from it with, say, Python very easily if it uses a known representation
(ISO encodings or Unicode), but if we have to parse things like \japaneseai
or similar, things become more difficult. I think it's a lot easier
to move information with blocks of text than with single chars.

I don't understand why we cannot determine the current language
context--either I'm missing something or I'm very optimistic about
the capabilities of TeX. Please, could you give an example where
the current language cannot be determined and/or moved?

>>> But such characters (the Spanish as well as the Hebrew) aren't allowed in
>>> names in LaTeX!
>>
>> But they should be allowed in the future if we want a true
>> multilingual environment.
>
> Why? They are not part of any text, but part of the markup!
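To make the external-tools point concrete, here is a minimal Python sketch
(\japaneseai is the hypothetical macro from the discussion above): when the
source is stored as Unicode, a generic code-point range test finds the
ideographs with no knowledge of LaTeX at all, whereas the macro-encoded
source would need a TeX-aware parser.

```python
# A Unicode source is self-describing: a plain code-point range test
# (here, the CJK Unified Ideographs block U+4E00..U+9FFF) finds the
# ideographs without any LaTeX knowledge.
unicode_src = "愛 is love"            # source stored as Unicode text
macro_src = r"\japaneseai{} is love"  # macro-encoded source (cf. \cyrA etc.)

cjk = [c for c in unicode_src if "\u4e00" <= c <= "\u9fff"]
print(cjk)  # → ['愛']

# The macro form hides the character entirely from such generic tools:
print([c for c in macro_src if "\u4e00" <= c <= "\u9fff"])  # → []
```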
Are you suggesting that Japanese, Chinese, Tibetan, Arab, Persian,
Greek, Russian, etc. writers must *always* use the Latin alphabet?
That's not truly multilingual--perhaps of interest to Occidental
scholars, but not to people actually using these scripts and
keyboards with these scripts. (Mixing right-to-left scripts with
Latin is particularly messy.)

> Isn't the \char primitive in Omega able to produce arbitrary characters
> (at least arbitrary characters in the Basic Multilingual Plane)?

Not exactly. The \char primitive produces a char, but not an intrinsically
Unicode one--ocp's are also applied to \char (and therefore it gets
transcoded).

> It looks quite reasonable to me, and it is certainly much better than the
> processing depicted in the example. Does this mean that the example should
> rather be
>
>     A     B        C          D        E
>    \'e   \'e   e^^^^0301   ^^^^00e9   ^^e9

As currently implemented, yes, it should. I'm still not sure whether
normalizing in this way is the best solution. However, I find the
arguments in the Unicode book in favour of it quite convincing.

Regards
Javier
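For readers unfamiliar with the notation above: ^^^^0301 is TeX's way of
writing code point U+0301, so columns C-E are the decomposed and precomposed
spellings of e-acute. A small Python sketch of the normalization being
discussed, using the standard unicodedata module:

```python
import unicodedata

# ^^^^00e9 / ^^e9 denote the precomposed U+00E9 (é as one code point),
# while e^^^^0301 is 'e' followed by COMBINING ACUTE ACCENT (U+0301).
precomposed = "\u00e9"   # é, one code point
decomposed = "e\u0301"   # base letter + combining mark

# The two spellings render alike but compare unequal until normalized:
print(precomposed == decomposed)                                # → False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # → True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # → True
```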