From: "Javier Bezos"
Sender: "Mailing list for the LaTeX3 project"
To: "Multiple recipients of list LATEX-L"
Subject: Re: Multilingual Encodings Summary 2.2
Date: Sun, 13 May 2001 19:55:58 +0100
Message-ID: <200105131759.f4DHxx723926@smtp.wanadoo.es>

>> And regarding font transformations, they should be handled by fonts, but
>> the main problem is that metric information (i.e., tfm) cannot be
>> modified from within TeX, except for a few parameters; I really wonder
>> if allowing more changes, mainly ligatures, is feasible (that
>> solution would be better than font ocp's and vf's, I think).
>
> I don't understand this. What kind of font transformations are you
> referring to?

For example, removing the fi ligature in Turkish. Or using an alternate
orthography in languages with contextual analysis.

>> Semantically or visually?
>
> I suspect Frank considers meaning to be a semantic concept, not a visual one.

I also suspect that, but then if we pick a char it will be
undefined visually, and its rendering (and TeX is essentially about
rendering) will _always_ need additional information about the context
(example: traditional ideograms in Japanese vs. simplified ones in
Chinese).
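As an aside, the fi-ligature case mentioned above can be illustrated at the
(La)TeX level; this is only a sketch of suppressing a ligature locally by
hand, not of how a font-level (tfm) solution would look:

```latex
% Sketch: the fi ligature comes from the font's ligature program in the
% tfm, so from within TeX it can only be broken locally. In Turkish the
% ligature is undesirable because it hides the dot that distinguishes
% dotted i from dotless i.
\documentclass{article}
\begin{document}
fi      % ligature applied automatically by the font

f{}i    % an empty group between the letters suppresses it

f\/i    % the italic correction also breaks the ligature
\end{document}
```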
> I believe one of the main problems for multilinguality in LaTeX today is
> that there is no way of recording (or maybe even of determining) the
> current context so that this information can be moved around with every
> piece of code affected by it. Hence most current commands strive instead to
> convert the code to a context-free representation (the LICR) by use of
> protected expansion.

In that case, we must find a way. Without it, proper rendering is
impossible. Of course, we may write an ideogram to the aux file as
a macro; for example, ai (love) could be written as \japaneseai or
\chineseai depending on the context in which it appears, but that means
the resulting code is not very different from the current mess
with Russian, where we have \cyrA, \cyrB, etc. That's exactly what I want
to avoid. Of course, that also means that changing things depending
on the target will become more difficult.

Further, by doing so we are again creating a closed system
using its own conventions, with no links to external tools adapted
to Unicode. I will be able to process a file and extract information
from it with, say, Python very easily if it uses a known representation
(ISO encodings or Unicode), but if we have to parse things like \japaneseai
or similar, things become more difficult. I think it's a lot easier
to move information with blocks of text than with single chars.

I don't understand why we cannot determine the current language
context--either I'm missing something or I'm very optimistic about
the capabilities of TeX. Please, could you give an example where
the current language cannot be determined and/or moved?

>>> But such characters (the Spanish as well as the Hebrew) aren't allowed in
>>> names in LaTeX!
>>
>> But they should be allowed in the future if we want a true
>> multilingual environment.
>
> Why? They are not part of any text, but part of the markup!
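To make the external-tools point concrete, here is a minimal Python sketch
(\japaneseai is the hypothetical macro from the discussion above): when the
source is stored as Unicode, a generic code-point range test finds the
ideographs with no knowledge of LaTeX at all, whereas the macro-encoded
source would need a TeX-aware parser.

```python
# A Unicode source is self-describing: a plain code-point range test
# (here, the CJK Unified Ideographs block U+4E00..U+9FFF) finds the
# ideographs without any LaTeX knowledge.
unicode_src = "愛 is love"            # source stored as Unicode text
macro_src = r"\japaneseai{} is love"  # macro-encoded source (cf. \cyrA etc.)

cjk = [c for c in unicode_src if "\u4e00" <= c <= "\u9fff"]
print(cjk)  # → ['愛']

# The macro form hides the character entirely from such generic tools:
print([c for c in macro_src if "\u4e00" <= c <= "\u9fff"])  # → []
```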
Are you suggesting that Japanese, Chinese, Tibetan, Arab, Persian,
Greek, Russian, etc. writers must *always* use the Latin alphabet?
That's not truly multilingual--perhaps of interest to Occidental
scholars, but not to people actually using these scripts and
keyboards with these scripts. (Mixing right-to-left scripts with
Latin is particularly messy.)

> Isn't the \char primitive in Omega able to produce arbitrary characters
> (at least arbitrary characters in the Basic Multilingual Plane)?

Not exactly. The \char primitive produces a char, but not an intrinsically
Unicode one--ocp's are also applied to \char (and therefore it gets
transcoded).

> It looks quite reasonable to me, and it is certainly much better than the
> processing depicted in the example. Does this mean that the example should
> rather be
>
>     A     B        C          D        E
>    \'e   \'e   e^^^^0301   ^^^^00e9   ^^e9

As currently implemented, yes, it should. I'm still not sure whether
normalizing in this way is the best solution. However, I find the
arguments in the Unicode book in favour of it quite convincing.

Regards
Javier
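For readers unfamiliar with the notation above: ^^^^0301 is TeX's way of
writing code point U+0301, so columns C-E are the decomposed and precomposed
spellings of e-acute. A small Python sketch of the normalization being
discussed, using the standard unicodedata module:

```python
import unicodedata

# ^^^^00e9 / ^^e9 denote the precomposed U+00E9 (é as one code point),
# while e^^^^0301 is 'e' followed by COMBINING ACUTE ACCENT (U+0301).
precomposed = "\u00e9"   # é, one code point
decomposed = "e\u0301"   # base letter + combining mark

# The two spellings render alike but compare unequal until normalized:
print(precomposed == decomposed)                                # → False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # → True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # → True
```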