From: Lars Hellström <lars@abel.math.umu.se>
Subject: Re: Multilingual Encodings Summary 2.2
Date: Sun, 13 May 2001 22:46:03 +0100
To: Mailing list for the LaTeX3 project <LATEX-L@RELAY.URZ.UNI-HEIDELBERG.DE>
In-Reply-To: <200105131759.f4DHxx723926@smtp.wanadoo.es>

At 20.55 +0200 2001-05-13, Javier Bezos wrote:
>>>And
>>>regarding font transformations: they should be handled by fonts, but
>>>the main problem is that metric information (i.e., tfm) cannot be
>>>modified from within TeX, except for a few parameters; I really wonder
>>>whether allowing more changes, mainly to ligatures, is feasible (that
>>>solution would be better than font OCPs and VFs, I think).
>>
>> I don't understand this. What kind of font transformations are you
>> referring to?
>
>For example, removing the fi ligature in Turkish. Or using an alternate
>orthography in languages with contextual analysis.

That doesn't seem like a metric transformation to me, but more like
exchanging some sequences of slots for others. I thought Omega employed
OCPs (which could be selected by the document) for this?

>>>Semantically or visually?
>>
>> I suspect Frank considers meaning to be a semantic concept, not a visual one.
>
>I also suspect that, but then if we pick a char it will be
>undefined visually, and its rendering (and TeX is essentially about
>rendering) will _always_ need additional information about the context
>(example: traditional ideograms in Japanese vs. simplified ones in
>Chinese).

I think the following quote from the Unicode standard (p. 261) answers that:

  There is some concern that unifying Han characters may lead to confusion
  because they are sometimes used differently by the various East Asian
  languages. Computationally, Han character unification presents no more
  difficulty than employing a single Latin character set that is used to
  write languages as different as English and French.

If they are not different in Unicode then there is probably no reason to
make them different in LaTeX either.

>Further, by doing so we are again creating a closed system
>using its own conventions, with no links to external tools adapted
>to Unicode. I will be able to process a file and extract information
>from it with, say, Python very easily if it uses a known representation
>(ISO encodings or Unicode), but if we have to parse things like \japaneseai
>or similar, things become more difficult.

Agreed.

> I think it's a lot easier
>moving information with blocks of text than with single chars.

Depends on what type of information it is. For information specifying the
language, almost certainly yes. If you want to move around information
saying "the 8-bit characters in this piece of text should be interpreted
according to the following input encoding", then I would say no (amongst
other things because it would constitute a representation not known to
other programs).

>I don't understand why we cannot determine the current language
>context--either I'm missing something or I'm very optimistic about
>the capabilities of TeX. Please, could you give an example where
>the current language cannot be determined and/or moved?
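(To illustrate the point about known representations with a small sketch --
the \japaneseai control sequence above is of course hypothetical, and I'm
just using Python's standard unicodedata module:)

```python
import unicodedata

# If the source uses a known representation (here: Unicode), an external
# tool can extract information without knowing anything about TeX.
text = "caf\u00e9 \u611b"  # "café" plus the CJK ideograph U+611B

for ch in text:
    if not ch.isspace():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# With a representation like \japaneseai, by contrast, every external tool
# would first have to implement (part of) TeX's tokenizer just to find out
# which character is meant.
```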
It's not my area of expertise, so I may be wrong, but I suspect there are
well-known examples. The problem is mainly that the current context is a
rather fuzzy concept; there are other aspects to it than the language.
Thoroughly thinking things through might, however, well produce a model where
the current context is easy to determine and pass around.

>>>> But such characters (the Spanish as well as the Hebrew) aren't allowed in
>>>> names in LaTeX!
>>>
>>>But they should be allowed in the future if we want a true
>>>multilingual environment.
>>
>> Why? They are not part of any text, but part of the markup!
>
>Are you suggesting that the Japanese, Chinese, Tibetans, Arabs,
>Persians, Greeks, Russians, etc. must *always* use the Latin alphabet?
>That's not truly multilingual--maybe of interest for Occidental
>scholars, but not for people actually using these scripts and
>keyboards with these scripts.

Good point! I hadn't thought of that.

>(Particularly messy is mixing
>right-to-left scripts with Latin.)

Because of limitations in the editors, or because of something else?

>> Isn't the \char primitive in Omega able to produce arbitrary characters
>> (at least arbitrary characters in the basic multilingual plane)?
>
>Not exactly. The \char primitive is a char, but not intrinsically
>Unicode--OCPs are also applied to \char (and therefore they are
>transcoded).

Why should there exist characters which are not encoded using Unicode en
route from the mouth to the stomach, if we're using Unicode anyway for e.g.
hyphenation?

>> It looks quite reasonable to me, and it is certainly much better than the
>> processing depicted in the example. Does this mean that the example should
>> rather be
>>
>>     A     B        C          D        E
>>    \'e   \'e   e^^^^0301   ^^^^00e9   ^^e9
>
>As currently implemented, yes, it should.

Good, then we've straightened that out! Now what about the other example
line (the explicit "82 from column A)?

>I'm still not sure if normalizing
>in this way is the best solution. However, I find the arguments in the
>Unicode book in favour of it quite convincing.

Exactly in what way normalization should be applied, and when, clearly needs
further study.

Lars Hellström
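P.S. For anyone who wants to experiment with the normalization forms
discussed above: a small sketch using Python's standard unicodedata module,
with the precomposed and decomposed spellings of é corresponding to columns
D and C of the example.

```python
import unicodedata

composed = "\u00e9"     # é as one precomposed character (column D above)
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT (column C)

# The two spellings are canonically equivalent but not codepoint-identical:
assert composed != decomposed

# NFC normalizes to the precomposed form, NFD to the decomposed one:
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```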
