From: Lars Hellström <lars@abel.math.umu.se>
Subject: Re: Multilingual Encodings Summary 2.2
Date: Sun, 13 May 2001 22:46:03 +0100
To: Mailing list for the LaTeX3 project <LATEX-L@RELAY.URZ.UNI-HEIDELBERG.DE>
In-Reply-To: <200105131759.f4DHxx723926@smtp.wanadoo.es>

At 20.55 +0200 2001-05-13, Javier Bezos wrote:
>>>And
>>>regarding font transformations: they should be handled by fonts, but
>>>the main problem is that metric information (i.e., tfm) cannot be
>>>modified from within TeX, except for a few parameters; I really wonder
>>>whether allowing more changes, mainly to ligatures, is feasible (that
>>>solution would be better than font OCPs and VFs, I think).
>>
>> I don't understand this. What kind of font transformations are you
>> referring to?
>
>For example, removing the fi ligature in Turkish. Or using an alternate
>orthography in languages with contextual analysis.

That doesn't seem like a metric transformation to me, but more like
exchanging some sequences of slots for others. I thought Omega employed
OCPs (which could be selected by the document) for this?

>>>Semantically or visually?
>>
>> I suspect Frank considers meaning to be a semantic concept, not a visual one.
>
>I also suspect that, but then if we pick a char it will be
>undefined visually, and its rendering (and TeX is essentially about
>rendering) will _always_ need additional information about the context
>(example: traditional ideograms in Japanese vs. simplified ones in
>Chinese).

I think the following quote from the Unicode standard (p. 261) answers that:

  There is some concern that unifying Han characters may lead to confusion
  because they are sometimes used differently by the various East Asian
  languages. Computationally, Han character unification presents no more
  difficulty than employing a single Latin character set that is used to
  write languages as different as English and French.

If they are not different in Unicode then there is probably no reason to
make them different in LaTeX either.

>Further, by doing so we are again creating a closed system
>using its own conventions, with no links to external tools adapted
>to Unicode. I will be able to process a file and extract information
>from it with, say, Python very easily if it uses a known representation
>(ISO encodings or Unicode), but if we have to parse things like \japaneseai
>or similar, things become more difficult.

Agreed.

> I think it's a lot easier
>moving information with blocks of text than with single chars.

Depends on what type of information it is. For information specifying the
language, almost certainly yes. If you want to move around information
saying "the 8-bit characters in this piece of text should be interpreted
according to the following input encoding", then I would say no (amongst
other things because it would constitute a representation not known to
other programs).

>I don't understand why we cannot determine the current language
>context--either I'm missing something or I'm very optimistic about
>the capabilities of TeX. Please, could you give an example where
>the current language cannot be determined and/or moved?
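(To illustrate the point about known representations with a small sketch --
the \japaneseai control sequence above is of course hypothetical, and I'm
just using Python's standard unicodedata module:)

```python
import unicodedata

# If the source uses a known representation (here: Unicode), an external
# tool can extract information without knowing anything about TeX.
text = "caf\u00e9 \u611b"  # "café" plus the CJK ideograph U+611B

for ch in text:
    if not ch.isspace():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# With a representation like \japaneseai, by contrast, every external tool
# would first have to implement (part of) TeX's tokenizer just to find out
# which character is meant.
```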
It's not my area of expertise, so I may be wrong, but I suspect there are
well-known examples. The problem is mainly that the current context is a
rather fuzzy concept; there are other aspects to it than the language.
Thoroughly thinking things through might, however, well produce a model where
the current context is easy to determine and pass around.

>>>> But such characters (the Spanish as well as the Hebrew) aren't allowed in
>>>> names in LaTeX!
>>>
>>>But they should be allowed in the future if we want a true
>>>multilingual environment.
>>
>> Why? They are not part of any text, but part of the markup!
>
>Are you suggesting that the Japanese, Chinese, Tibetans, Arabs,
>Persians, Greeks, Russians, etc. must *always* use the Latin alphabet?
>That's not truly multilingual--maybe of interest for Occidental
>scholars, but not for people actually using these scripts and
>keyboards with these scripts.

Good point! I hadn't thought of that.

>(Particularly messy is mixing
>right-to-left scripts with Latin.)

Because of limitations in the editors, or because of something else?

>> Isn't the \char primitive in Omega able to produce arbitrary characters
>> (at least arbitrary characters in the basic multilingual plane)?
>
>Not exactly. The \char primitive is a char, but not intrinsically
>Unicode--OCPs are also applied to \char (and therefore they are
>transcoded).

Why should there exist characters which are not encoded using Unicode en
route from the mouth to the stomach, if we're using Unicode anyway for e.g.
hyphenation?

>> It looks quite reasonable to me, and it is certainly much better than the
>> processing depicted in the example. Does this mean that the example should
>> rather be
>>
>>     A     B        C          D        E
>>    \'e   \'e   e^^^^0301   ^^^^00e9   ^^e9
>
>As currently implemented, yes, it should.

Good, then we've straightened that out! Now what about the other example
line (the explicit "82 from column A)?

>I'm still not sure if normalizing
>in this way is the best solution. However, I find the arguments in the
>Unicode book in favour of it quite convincing.

Exactly in what way normalization should be applied, and when, clearly needs
further study.

Lars Hellström
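P.S. For anyone who wants to experiment with the normalization forms
discussed above: a small sketch using Python's standard unicodedata module,
with the precomposed and decomposed spellings of é corresponding to columns
D and C of the example.

```python
import unicodedata

composed = "\u00e9"     # é as one precomposed character (column D above)
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT (column C)

# The two spellings are canonically equivalent but not codepoint-identical:
assert composed != decomposed

# NFC normalizes to the precomposed form, NFD to the decomposed one:
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```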
