MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C0E6E1.00B0F100"
In-Reply-To:  <15120.57621.256542.391864@gargle.gargle.HOWL>
References: <200105270954.f4R9sBI23611@smtp.wanadoo.es>            <200105270954.f4R9sBI23611@smtp.wanadoo.es>
Content-class: urn:content-classes:message
Subject:      Re: \InputTranslation
Date: Sun, 27 May 2001 20:10:33 +0100
Message-ID:  <v03110700b736f1dd6f0c@[195.100.226.133]>
From: "Hans Aberg" <haberg@MATEMATIK.SU.SE>
Sender: "Mailing list for the LaTeX3 project" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
To: "Multiple recipients of list LATEX-L" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
Reply-To: "Mailing list for the LaTeX3 project" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
Status: R

This is a multi-part message in MIME format.

------_=_NextPart_001_01C0E6E1.00B0F100
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

At 13:12 +0200 2001/05/27, Marcel Oliver wrote:
>...it looks like there are a couple of strategies:
>
>1. Store the full language context with every character token sequence
>   along the lines that Javier suggests.

I think that this might turn out to be a no-no for the simple sake of
speed: Characters are at such a fundamental level that they should be
computationally as simple as possible.

I got the impression that the current Omega makes use of only 16-bit
characters (right?). It is, however, possible in C/C++ to guarantee an
integral type with at least 32 bits, if one stays away from wchar_t. :-)

>2. Treat input encoding completely separate from language context.
>   Input encoding just determines how to get from an arbitrary
>   encoding to the Unicode(-like) ICR.  Thus, switches in the language
>   context have to be tagged explicitly by the user.
...
>3. Extreme version of 2 (the only strategy that seems to be cleanly
>   implementable on current Omega):
>
>   We simply define the \InputTranslation to be fixed on a per-file
>   basis.

I am thinking of a hybrid of these two:

One advantage of the last one, 3, is that formats become independent of
IO encodings: If there is a mechanism external to the file for selecting
the encoding, it will be possible to choose the encoding of .aux files
etc., and then get Omega to read them back without changing any
pre-compiled format. If the only mechanism is selecting the encoding from
within a file that is compiled, this will not be possible.

> In other words, we acknowledge that it does not make any
>   sense in terms of usability to mix input encodings, as such files
>   simply cannot (and should not) be displayed cleanly in any editor.

This does not follow: One can easily define a translation that can handle
different input encodings in the same file.

The requirement is instead that the translator, reading the file byte by
byte, must know when and how to switch. If you integrate these switches
with TeX's macro system, then the switches can be hard to predict, but
that is all.
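
To make the point concrete, here is a minimal sketch in Python of a
translator that handles several encodings in one file. The in-band
convention (ESC, an ASCII codec name, ESC) and the name translate() are
assumptions made up for this illustration; nothing here is an Omega
mechanism. It only works for encodings in which the byte 0x1B never
occurs inside a multi-byte character (Latin-1 and UTF-8 qualify).

```python
# Hypothetical marker: ESC <codec name> ESC switches the encoding that
# applies to the bytes following it. This convention is invented here.
MARK = b"\x1b"

def translate(data, initial="ascii"):
    """Decode `data`, switching codecs wherever a marker pair appears."""
    out = []
    codec = initial
    pos = 0
    while pos < len(data):
        start = data.find(MARK, pos)
        if start == -1:
            # No further switch: decode the rest with the current codec.
            out.append(data[pos:].decode(codec))
            break
        # Decode everything up to the marker with the current codec.
        out.append(data[pos:start].decode(codec))
        end = data.find(MARK, start + 1)
        if end == -1:
            raise ValueError("unterminated encoding switch")
        # The bytes between the two markers name the next codec.
        codec = data[start + 1:end].decode("ascii")
        pos = end + 1
    return "".join(out)

# A Latin-1 run followed by a UTF-8 run in the same byte stream.
sample = b"caf\xe9 " + MARK + b"utf-8" + MARK + "na\u00efve".encode("utf-8")
result = translate(sample, initial="latin-1")
```

Because the switch points are visible in the byte stream itself, the
translator never has to guess. If the switches were instead produced by
macro expansion, they would not be known until the macros had run.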

On the other hand, Robin Fairbairns didn't like approach 3, because the
directory might become littered with files indicating the encoding.

So why not do this: When Omega starts, one indicates the encoding in the
first file that Omega reads. This would be a mode (cf. Omega draft, ch.
12), plus an OTP (loc. cit., ch. 8). There can be some simplifying
defaults corresponding to formats that editors can handle (like ASCII and
Unicode).

Then other files can be opened using information about mode + OTP, as I
figure is the case now.

But in addition, one can provide external encoding information about a
file that overrides the translation information in the command opening
the file.

This way, even though a format is compiled to write and read .aux files
in, say, Unicode, one may override it and get Omega to write and read
.aux files in, say, UTF-8.
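
The order of precedence I have in mind can be sketched as follows, in
Python. The function resolve_encoding(), its parameters, and the file
name paper.aux are hypothetical names chosen for this example, and
"utf-16" merely stands in for what the format was compiled to use.

```python
def resolve_encoding(path, command_encoding=None, external=None,
                     format_default="utf-16"):
    """Pick the encoding for `path`, most specific source first."""
    if external is not None and path in external:
        return external[path]      # external information about the file
    if command_encoding is not None:
        return command_encoding    # given by the command opening the file
    return format_default          # what the format was compiled with

# External information overrides the compiled-in default for one file.
overrides = {"paper.aux": "utf-8"}
enc = resolve_encoding("paper.aux", external=overrides)

# Write and read back with the same resolved encoding.
raw = "\\newlabel{sec:intro}".encode(enc)
text = raw.decode(enc)
```

Since writing and reading go through the same lookup, the .aux file
round-trips in the overriding encoding and the format itself is left
untouched.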

The question though, when playing around with these ideas, is how people
will use the features implemented.

  Hans Aberg

------_=_NextPart_001_01C0E6E1.00B0F100--