Subject: Re: \InputTranslation
Date: Tue, 5 Jun 2001 16:21:20 +0100
From: "Hans Aberg"
Sender: "Mailing list for the LaTeX3 project"
To: "Multiple recipients of list LATEX-L"
Reply-To: "Mailing list for the LaTeX3 project"
In-Reply-To: <15132.53407.696622.160198@fell.open.ac.uk>
References: <15119.62808.151690.192812@gargle.gargle.HOWL>

At 13:29 +0100 2001/06/05, Chris Rowley wrote:
>... rather than attempting to categorise the necessary
>information and devise suitable ways to provide it, Frank and I came
>up with the idea of simply supplying a single logical label for every
>ICR string.  Since the first, and still the overwhelmingly most
>diverse, parts of this information came from the needs of multi-lingual
>documents, we called this label the `language' (maybe not a good
>choice).  Our thesis is that `every text string must have a
>language-label'.

So this is what I arrived at when playing around with these ideas in my mind:

A "language" is a set of parameters on which the typesetting procedures
depend when typesetting what is considered a human textual language.

For example, US and UK English differ in whether the outer quotes should be
``...'' or `...', so if one is supposed to enter quotes as, say, \quote{...}
and let the language sort it out, then US and UK English are different
languages. So first, one may classify some such common languages.
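
To make this concrete, here is a minimal sketch (hypothetical Python, not
LaTeX3 code; the names QUOTE_STYLE and render_quote are my own inventions)
of quote rendering driven by a language label, so that US and UK English
behave as distinct "languages":

```python
# Hypothetical sketch: the quote marks are one of the parameters making up
# a "language"; \quote{...} would consult the current language's entry.
QUOTE_STYLE = {
    "en-US": ("``", "''"),   # doubled quotes outermost in US English
    "en-GB": ("`", "'"),     # single quotes outermost in UK English
}

def render_quote(text, language):
    """Pick the outer quote marks from the language's parameter set."""
    open_q, close_q = QUOTE_STYLE[language]
    return open_q + text + close_q

print(render_quote("hello", "en-US"))  # ``hello''
print(render_quote("hello", "en-GB"))  # `hello'
```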

But then it might be possible to customize: for example, in Swedish one
writes dates as 2001-06-05, but I happen to prefer the format 2001/06/05
(even though I rarely write anything in Swedish). So if dates are entered,
say, by \date{2001}{06}{05} and the rendering is sorted out by the choice of
language, then I have created a new "language" by customizing the Swedish
date format.
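
In the same sketch style (again hypothetical names, not any existing LaTeX
interface), a customized language is just a parameter set derived from an
existing one with a single parameter overridden:

```python
# Hypothetical sketch: a "language" as a parameter dictionary; overriding
# one parameter (the date format) yields a new derived "language".
SWEDISH = {"date_format": "{y:04d}-{m:02d}-{d:02d}"}

# A personal variant that differs from Swedish only in the date separator.
MY_SWEDISH = {**SWEDISH, "date_format": "{y:04d}/{m:02d}/{d:02d}"}

def render_date(y, m, d, language):
    """Render \\date{y}{m}{d} according to the language's date format."""
    return language["date_format"].format(y=y, m=m, d=d)

print(render_date(2001, 6, 5, SWEDISH))     # 2001-06-05
print(render_date(2001, 6, 5, MY_SWEDISH))  # 2001/06/05
```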

The exact details are really a question of implementation (which I describe
merely to focus the topic a little): for the sake of efficiency, one could
keep a lookup table of the languages in use, keyed by, say, a 32-bit
number. The key 0 could mean old-TeX compatibility, keys 1-65536 could be
user-defined languages (varying from document to document), 65537 US
English, 65538 UK English, and so on with other classified languages.

This makes it easy to stamp the language context everywhere in a
light-weight fashion, and also to add more language parameters by expanding
the language lookup table.
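
The key scheme just described could be sketched like this (hypothetical
code; the registry and allocator names are mine, and the classified keys
shown are only the two examples from the text):

```python
# Hypothetical sketch of the 32-bit key scheme: key 0 is old-TeX
# compatibility, keys 1..65536 are per-document user languages, and keys
# from 65537 upward are globally classified languages.
USER_MIN, USER_MAX = 1, 65536
CLASSIFIED_BASE = 65537

language_table = {
    0: {"name": "old-TeX compatibility"},
    CLASSIFIED_BASE: {"name": "US English"},
    CLASSIFIED_BASE + 1: {"name": "UK English"},
}

_next_user_key = USER_MIN

def define_user_language(params):
    """Allocate the next free per-document key and register the parameters."""
    global _next_user_key
    if _next_user_key > USER_MAX:
        raise RuntimeError("no free user-language keys in this document")
    key = _next_user_key
    _next_user_key += 1
    language_table[key] = params
    return key

key = define_user_language({"name": "my Swedish", "date_sep": "/"})
print(key, language_table[key]["name"])  # 1 my Swedish
```

Stamping a string then costs only one small integer, and new parameters can
be added by growing the table entries rather than the stamps.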

>...In order to distinguish these logical language-labels from anything
>else in the TeX world let us call them LLLs.
...
>-- whenever a character token list (in an ICR) is constructed or
>   moved, then its LLL must go with it;

In addition to merely stamping the language label on a string, I think that
one may possibly have to stack it; that is, if there is a French quotation
within English, then one can, from within the French quote, know that it is
inside an English quote.
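
The stacking idea can be sketched in a few lines (hypothetical code; a real
implementation would attach the stack to the token lists themselves):

```python
# Hypothetical sketch: stacking language labels instead of only stamping
# them, so text inside a French quote still "knows" it sits inside English.
context = []

def push_language(lang):
    context.append(lang)

def pop_language():
    return context.pop()

push_language("English")
push_language("French")   # a French quotation opens inside English text
print(context[-1])        # French  (current language)
print(context[-2])        # English (enclosing language is still visible)
pop_language()
print(context[-1])        # English again after the quote closes
```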

>But if you want \foo to be exclusively a bit of Mandarin text then you
>could (or even should) define something like (syntax is probably
>dreadful):
>
>  \newcommand{\foo}{\languageIC{manadrin}{\unichar{<Unicode code>}}}
>
>How clever the expansion of \languageIC needs to be will depend on how
>such input will be used.

My guess is that you are saying here that a language can also restrict the
characters available in it. For example, if somebody tries to use Greek
letters in an English text, something is wrong.

One can also think of having a dictionary that checks words and, whenever
possible, compares them with the given language context. Then one would
get a warning if a French word that is not in the English dictionary
appears in an English context.
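
A toy version of such a check (hypothetical code; the word lists are tiny
stand-ins for real dictionaries):

```python
# Hypothetical sketch: flag words absent from the dictionary of the
# current language context.
DICTIONARIES = {
    "English": {"the", "cat", "sat"},
    "French": {"le", "chat"},
}

def check_words(words, language):
    """Return the words not found in the language's dictionary."""
    return [w for w in words if w not in DICTIONARIES[language]]

print(check_words(["the", "chat"], "English"))  # ['chat'] -- French word in English context
print(check_words(["le", "chat"], "French"))    # []
```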

>All of the above is completely independent of what input scheme is
>used,

I can think of hybrids: one could have the option to indicate a default
language when opening a file. If this language context differs from the one
from which the file was opened, the language contexts merely stack up.

But perhaps this effect can easily be achieved by some other commands.
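
Sketched out (hypothetical code; input_file stands in for whatever \input
variant would carry the per-file default):

```python
# Hypothetical sketch: opening a file with a default language pushes onto
# the surrounding context, so nested inputs simply stack.
context = ["English"]
trace = []

def input_file(lines, default_language=None):
    pushed = default_language is not None and default_language != context[-1]
    if pushed:
        context.append(default_language)   # contexts merely stack up
    for line in lines:
        trace.append((context[-1], line))  # each line stamped as it is read
    if pushed:
        context.pop()                      # restore on end-of-file

input_file(["Bonjour"], default_language="French")
input_file(["Hello"])
print(trace)  # [('French', 'Bonjour'), ('English', 'Hello')]
```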

>IMPORTANT: After that first time input conversion the input encoding
>that was used is unknown and not needed; this is a vital property of
>our ICR model.

This is also the model I have in mind: once the input has been read and
translated into Unicode plus possible other parameters, the input encoding
becomes a non-issue as far as further TeX/LaTeX processing goes.
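
In other words, the same decode-once discipline one sees in any Unicode-
internal program (a plain Python illustration, not TeX code):

```python
# Decode each input file exactly once, at read time; everything downstream
# sees only the internal (Unicode) representation.
data = "Göteborg".encode("iso-8859-1")  # bytes in some input encoding
text = data.decode("iso-8859-1")        # the one-time input translation
# From here on, the input encoding is a non-issue:
print(text == "Göteborg")  # True
```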

>1.  How should LICR strings be written out to files used only by LaTeX
>    itself?
>
>2.  How should LICR strings be written out to files read by other
>    applications?
>
>My feeling is that the answer to 1. should, if possible, be something
>independent of any input schemes in use.
>
>It is not so clear that this is possible for 2. and there may be good
>reasons why these two outputs should be the same.

Oops, I did not think about these. But I think that the ideal would be for
1 & 2 to be the same.

If the language context is stamped everywhere (at least on text), and one
should be able to pick it up again, I see two possible solutions:

One is to define an encoding specifying the hierarchy of languages. For
example (pseudo-code):
  begin_English English text \quote{begin_French French text end_French}
    ... end_English
Here begin_English, end_English, begin_French, end_French are file markers
of the chosen encoding scheme (which could be something compact, like
special Unicode characters not used for anything else).
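
A sketch of such a marker-based serialization (hypothetical code; printable
begin_/end_ strings stand in for whatever reserved code points would be
chosen):

```python
# Hypothetical sketch: write the language hierarchy with explicit
# begin/end markers around each language region.
def serialize(node):
    """node = (language, children); each child is a str or a nested node."""
    lang, children = node
    parts = [f"begin_{lang}"]
    for child in children:
        parts.append(serialize(child) if isinstance(child, tuple) else child)
    parts.append(f"end_{lang}")
    return " ".join(parts)

doc = ("English", ["English text", ("French", ["French text"]), "..."])
print(serialize(doc))
# begin_English English text begin_French French text end_French ... end_English
```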

The other method that comes to my mind is to write two files: one with the
characters unstamped by language context, and another indicating the
language contexts and where they start and end in the first file. This is a
common way to handle, say, styled text, at least in the old pre-X Mac OS
(where the additional information goes into the so-called resource fork),
but it makes it virtually impossible for humans to edit by hand.
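
The two-file method amounts to splitting stamped text into a character
stream plus a run table, much like style runs in a resource fork (again a
hypothetical sketch):

```python
# Hypothetical sketch of the two-file method: one plain character stream,
# plus a separate table of (start, end, language) runs over that stream.
runs = [("English", "English text "),
        ("French", "French text"),
        ("English", " ...")]

text = "".join(s for _, s in runs)   # file 1: unstamped characters
spans = []                           # file 2: where each context starts/ends
pos = 0
for lang, s in runs:
    spans.append((pos, pos + len(s), lang))
    pos += len(s)

print(text)   # English text French text ...
print(spans)  # [(0, 13, 'English'), (13, 24, 'French'), (24, 28, 'English')]
```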

>So have I removed the question: "do we need to record the input
>encoding?"?  Or merely cleverly hidden it?

This is the picture I have in mind too: once the input has been processed
properly, the input encoding is no longer present anywhere. The original
TeX (judging from your discussions here) does not really seem to be built
to handle multiple input encodings at once.

But also in other programming, such as C/C++, I think it would be difficult
to handle more than one internal character encoding in the same program. So
from the point of view of efficiency, I think it is safest to stick to just
one internal character encoding, if possible.

  Hans Aberg
