Subject: Re: \InputTranslation
Date: Tue, 5 Jun 2001 16:21:20 +0100
From: "Hans Aberg"
Sender: "Mailing list for the LaTeX3 project"
To: "Multiple recipients of list LATEX-L"
Reply-To: "Mailing list for the LaTeX3 project"
In-Reply-To: <15132.53407.696622.160198@fell.open.ac.uk>
References: <15119.62808.151690.192812@gargle.gargle.HOWL>

At 13:29 +0100 2001/06/05, Chris Rowley wrote:
>... rather than attempting to categorise the necessary
>information and devise suitable ways to provide it, Frank and I came
>up with the idea of simply supplying a single logical label for every
>ICR string.  Since the first, and still the overwhelmingly most
>diverse, parts of this information came from the needs of multi-lingual
>documents, we called this label the `language' (maybe not a good
>choice).  Our thesis is that `every text string must have a
>language-label'.

So this is what I arrived at when playing around with these ideas in my mind:

A "language" is a set of parameters on which the typesetting procedures
depend when typesetting what is considered a human textual language.

For example, US and UK English differ in whether the outer quotes should be
``...'' or `...', so if one is supposed to enter quotes as, say, \quote{...}
and let the language sort it out, then US and UK English are different
languages. So first, one may classify some such common languages.
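
To make this concrete, here is a minimal sketch (hypothetical Python, not
LaTeX3 code; the names QUOTE_STYLE and render_quote are my own inventions)
of quote rendering driven by a language label, so that US and UK English
behave as distinct "languages":

```python
# Hypothetical sketch: the quote marks are one of the parameters making up
# a "language"; \quote{...} would consult the current language's entry.
QUOTE_STYLE = {
    "en-US": ("``", "''"),   # doubled quotes outermost in US English
    "en-GB": ("`", "'"),     # single quotes outermost in UK English
}

def render_quote(text, language):
    """Pick the outer quote marks from the language's parameter set."""
    open_q, close_q = QUOTE_STYLE[language]
    return open_q + text + close_q

print(render_quote("hello", "en-US"))  # ``hello''
print(render_quote("hello", "en-GB"))  # `hello'
```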

But then it might be possible to customize: for example, in Swedish one
writes dates as 2001-06-05, but I happen to prefer the format 2001/06/05
(even though I rarely write anything in Swedish). So if dates are entered,
say, by \date{2001}{06}{05} and the rendering is sorted out by the choice of
language, then I have created a new "language" by customizing the Swedish
date format.
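
In the same sketch style (again hypothetical names, not any existing LaTeX
interface), a customized language is just a parameter set derived from an
existing one with a single parameter overridden:

```python
# Hypothetical sketch: a "language" as a parameter dictionary; overriding
# one parameter (the date format) yields a new derived "language".
SWEDISH = {"date_format": "{y:04d}-{m:02d}-{d:02d}"}

# A personal variant that differs from Swedish only in the date separator.
MY_SWEDISH = {**SWEDISH, "date_format": "{y:04d}/{m:02d}/{d:02d}"}

def render_date(y, m, d, language):
    """Render \\date{y}{m}{d} according to the language's date format."""
    return language["date_format"].format(y=y, m=m, d=d)

print(render_date(2001, 6, 5, SWEDISH))     # 2001-06-05
print(render_date(2001, 6, 5, MY_SWEDISH))  # 2001/06/05
```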

The exact details are really a question of implementation (which I describe
merely to focus the topic a little): for the sake of efficiency, one could
keep a lookup table of the languages in use, keyed by, say, a 32-bit
number. The key 0 could mean old-TeX compatibility, keys 1-65536 could be
user-defined languages (varying from document to document), 65537 US
English, 65538 UK English, and so on with other classified languages.

This makes it easy to stamp the language context everywhere in a
light-weight fashion, and also to add more language parameters by expanding
the language lookup table.
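
The key scheme just described could be sketched like this (hypothetical
code; the registry and allocator names are mine, and the classified keys
shown are only the two examples from the text):

```python
# Hypothetical sketch of the 32-bit key scheme: key 0 is old-TeX
# compatibility, keys 1..65536 are per-document user languages, and keys
# from 65537 upward are globally classified languages.
USER_MIN, USER_MAX = 1, 65536
CLASSIFIED_BASE = 65537

language_table = {
    0: {"name": "old-TeX compatibility"},
    CLASSIFIED_BASE: {"name": "US English"},
    CLASSIFIED_BASE + 1: {"name": "UK English"},
}

_next_user_key = USER_MIN

def define_user_language(params):
    """Allocate the next free per-document key and register the parameters."""
    global _next_user_key
    if _next_user_key > USER_MAX:
        raise RuntimeError("no free user-language keys in this document")
    key = _next_user_key
    _next_user_key += 1
    language_table[key] = params
    return key

key = define_user_language({"name": "my Swedish", "date_sep": "/"})
print(key, language_table[key]["name"])  # 1 my Swedish
```

Stamping a string then costs only one small integer, and new parameters can
be added by growing the table entries rather than the stamps.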

>...In order to distinguish these logical language-labels from anything
>else in the TeX world let us call them LLLs.
...
>-- whenever a character token list (in an ICR) is constructed or
>   moved, then its LLL must go with it;

In addition to merely stamping the language label on a string, I think that
one may possibly have to stack it; that is, if there is a French quotation
within English, then one can, from within the French quote, know that it is
inside an English quote.
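
The stacking idea can be sketched in a few lines (hypothetical code; a real
implementation would attach the stack to the token lists themselves):

```python
# Hypothetical sketch: stacking language labels instead of only stamping
# them, so text inside a French quote still "knows" it sits inside English.
context = []

def push_language(lang):
    context.append(lang)

def pop_language():
    return context.pop()

push_language("English")
push_language("French")   # a French quotation opens inside English text
print(context[-1])        # French  (current language)
print(context[-2])        # English (enclosing language is still visible)
pop_language()
print(context[-1])        # English again after the quote closes
```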

>But if you want \foo to be exclusively a bit of Mandarin text then you
>could (or even should) define something like (syntax is probably
>dreadful):
>
>  \newcommand{\foo}{\languageIC{manadrin}{\unichar{<Unicode code>}}}
>
>How clever the expansion of \languageIC needs to be will depend on how
>such input will be used.

My guess is that you are saying here that a language can also restrict the
characters available in it. For example, if somebody tries to use Greek
letters in an English text, something is wrong.

One can also think of having a dictionary that checks words and, whenever
possible, compares them with the given language context. Then one would
get a warning if a French word that is not in the English dictionary
appears in an English context.
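
A toy version of such a check (hypothetical code; the word lists are tiny
stand-ins for real dictionaries):

```python
# Hypothetical sketch: flag words absent from the dictionary of the
# current language context.
DICTIONARIES = {
    "English": {"the", "cat", "sat"},
    "French": {"le", "chat"},
}

def check_words(words, language):
    """Return the words not found in the language's dictionary."""
    return [w for w in words if w not in DICTIONARIES[language]]

print(check_words(["the", "chat"], "English"))  # ['chat'] -- French word in English context
print(check_words(["le", "chat"], "French"))    # []
```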

>All of the above is completely independent of what input scheme is
>used,

I can think of hybrids: one could have the option to indicate a default
language when opening a file. If this language context differs from the one
from which the file was opened, the language contexts merely stack up.

But perhaps this effect can easily be achieved by some other commands.
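
Sketched out (hypothetical code; input_file stands in for whatever \input
variant would carry the per-file default):

```python
# Hypothetical sketch: opening a file with a default language pushes onto
# the surrounding context, so nested inputs simply stack.
context = ["English"]
trace = []

def input_file(lines, default_language=None):
    pushed = default_language is not None and default_language != context[-1]
    if pushed:
        context.append(default_language)   # contexts merely stack up
    for line in lines:
        trace.append((context[-1], line))  # each line stamped as it is read
    if pushed:
        context.pop()                      # restore on end-of-file

input_file(["Bonjour"], default_language="French")
input_file(["Hello"])
print(trace)  # [('French', 'Bonjour'), ('English', 'Hello')]
```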

>IMPORTANT: After that first time input conversion the input encoding
>that was used is unknown and not needed; this is a vital property of
>our ICR model.

This is also the model I have in mind: once the input has been read and
translated into Unicode plus possible other parameters, the input encoding
becomes a non-issue as far as further TeX/LaTeX processing goes.
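
In other words, the same decode-once discipline one sees in any Unicode-
internal program (a plain Python illustration, not TeX code):

```python
# Decode each input file exactly once, at read time; everything downstream
# sees only the internal (Unicode) representation.
data = "Göteborg".encode("iso-8859-1")  # bytes in some input encoding
text = data.decode("iso-8859-1")        # the one-time input translation
# From here on, the input encoding is a non-issue:
print(text == "Göteborg")  # True
```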

>1.  How should LICR strings be written out to files used only by LaTeX
>    itself?
>
>2.  How should LICR strings be written out to files read by other
>    applications?
>
>My feeling is that the answer to 1. should, if possible, be something
>independent of any input schemes in use.
>
>It is not so clear that this is possible for 2. and there may be good
>reasons why these two outputs should be the same.

Oops, I did not think about these. But I think that the ideal would be for
1 & 2 to be the same.

If the language context is stamped everywhere (at least on text), and one
should be able to pick it up again, I see two possible solutions:

One is to define an encoding specifying the hierarchy of languages. For
example (pseudo-code):
  begin_English English text \quote{begin_French French text end_French}
    ... end_English
Here begin_English, end_English, begin_French, end_French are file markers
of the chosen encoding scheme (which could be something compact, like
special Unicode characters not used for anything else).
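
A sketch of such a marker-based serialization (hypothetical code; printable
begin_/end_ strings stand in for whatever reserved code points would be
chosen):

```python
# Hypothetical sketch: write the language hierarchy with explicit
# begin/end markers around each language region.
def serialize(node):
    """node = (language, children); each child is a str or a nested node."""
    lang, children = node
    parts = [f"begin_{lang}"]
    for child in children:
        parts.append(serialize(child) if isinstance(child, tuple) else child)
    parts.append(f"end_{lang}")
    return " ".join(parts)

doc = ("English", ["English text", ("French", ["French text"]), "..."])
print(serialize(doc))
# begin_English English text begin_French French text end_French ... end_English
```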

The other method that comes to my mind is to write two files: one with the
characters unstamped by language context, and another indicating the
language contexts and where they start and end in the first file. This is a
common way to handle, say, styled text, at least in the old pre-X Mac OS
(where the additional information goes into the so-called resource fork),
but it makes it virtually impossible for humans to edit by hand.
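
The two-file method amounts to splitting stamped text into a character
stream plus a run table, much like style runs in a resource fork (again a
hypothetical sketch):

```python
# Hypothetical sketch of the two-file method: one plain character stream,
# plus a separate table of (start, end, language) runs over that stream.
runs = [("English", "English text "),
        ("French", "French text"),
        ("English", " ...")]

text = "".join(s for _, s in runs)   # file 1: unstamped characters
spans = []                           # file 2: where each context starts/ends
pos = 0
for lang, s in runs:
    spans.append((pos, pos + len(s), lang))
    pos += len(s)

print(text)   # English text French text ...
print(spans)  # [(0, 13, 'English'), (13, 24, 'French'), (24, 28, 'English')]
```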

>So have I removed the question: "do we need to record the input
>encoding?"?  Or merely cleverly hidden it?

This is the picture I have in mind too: once the input has been processed
properly, the input encoding is no longer present anywhere. The original
TeX (judging from your discussions here) does not really seem to be built
to handle multiple input encodings at once.

But also in other programming, such as C/C++, I think it would be difficult
to handle more than one internal character encoding in the same program. So
from the point of view of efficiency, I think it is safest to stick to just
one internal character encoding, if possible.

  Hans Aberg
