MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C0966B.1EDF6D80"
In-Reply-To:  <14985.13977.836075.844694@gargle.gargle.HOWL>
Content-class: urn:content-classes:message
Subject:      Re: Multilingual Encodings Summary
Date: Wed, 14 Feb 2001 10:45:51 +0100
Message-ID:  <v03110700b6affdb302e7@[195.100.226.152]>
From: "Hans Aberg" <haberg@MATEMATIK.SU.SE>
Sender: "Mailing list for the LaTeX3 project" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
To: "Multiple recipients of list LATEX-L" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
Reply-To: "Mailing list for the LaTeX3 project" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
Status: R

This is a multi-part message in MIME format.

------_=_NextPart_001_01C0966B.1EDF6D80
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

At 14:28 +0100 2001/02/13, Marcel Oliver wrote:
>2. Internal Representation and Output Encoding:
>
>2.1. Problems with Current TeX:
...
>This leads to a number of problems.
>
>- A sufficiently general internal multilingual representation may be
>  impossible to maintain, unless it is Unicode in disguise.

If we are speaking about tweaking TeX's internals, what is needed is a
stream of characters, where the characters can be subjected to various
operations, such as comparisons. The exact internal representation is
irrelevant.

If the implementation uses, say, C++, one could easily implement
characters that polymorphically change their internal representation. It
could then be a mixture of 1-4 byte formats.
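
As a rough sketch of what I mean, and nothing more: the class below hides
whether a character is stored in 1, 2 or 4 bytes, and only exposes its
numeric code and comparisons. The names Char, pack and check_char are made
up for illustration, and std::variant (C++17) is just one assumed way to
switch representation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <variant>

// Hypothetical character type: the stored width (1, 2 or 4 bytes) is an
// implementation detail; callers only see the numeric code.
class Char {
public:
    explicit Char(std::uint32_t code) : rep_(pack(code)) {}

    // Numeric code, independent of the stored width.
    std::uint32_t code() const {
        return std::visit(
            [](auto v) { return static_cast<std::uint32_t>(v); }, rep_);
    }

    // Number of bytes used by the current representation.
    std::size_t width() const {
        return std::visit([](auto v) { return sizeof(v); }, rep_);
    }

    // Comparisons look only at the code, never at the representation.
    friend bool operator==(const Char& a, const Char& b) {
        return a.code() == b.code();
    }
    friend bool operator<(const Char& a, const Char& b) {
        return a.code() < b.code();
    }

private:
    using Rep = std::variant<std::uint8_t, std::uint16_t, std::uint32_t>;

    // Pick the narrowest format that can hold the code.
    static Rep pack(std::uint32_t code) {
        if (code <= 0xFF) return static_cast<std::uint8_t>(code);
        if (code <= 0xFFFF) return static_cast<std::uint16_t>(code);
        return code;
    }

    Rep rep_;
};

// Small self-check: narrow and wide characters compare by code alone.
inline void check_char() {
    assert(Char(0x41).width() == 1);
    assert(Char(0x41) < Char(0x1F600));
}
```

Note that the variant itself carries a tag, so this saves no space per
character; it only illustrates that the representation can vary behind a
fixed interface.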

However, if one decided to implement such polymorphism by allocating
each character separately in the free store, it would be slow. Each
character would typically take up 1 (computer-)word to indicate the size
of the allocation, plus the character itself with word round-off, which
is another word. That is 2 words, or at least 8 bytes, for each
character. And the latest Macs with G3 & G4 use 64 and 128 bit words.
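
The arithmetic can be written out as follows. This is only the cost model
described above (one bookkeeping word plus the payload rounded up to whole
words), not a measurement of any real allocator, and the function name
bytes_per_heap_char is made up.

```cpp
#include <cassert>
#include <cstddef>

// Assumed cost model: one word of allocator bookkeeping, plus the
// character payload rounded up to a whole number of words.
inline std::size_t bytes_per_heap_char(std::size_t word_bytes,
                                       std::size_t char_bytes) {
    assert(word_bytes > 0 && char_bytes > 0);
    // Round the payload up to whole words.
    const std::size_t payload_words =
        (char_bytes + word_bytes - 1) / word_bytes;
    return (1 + payload_words) * word_bytes;
}
```

With 4-byte words, bytes_per_heap_char(4, 1) and bytes_per_heap_char(4, 4)
both give 8 bytes; with 8-byte or 16-byte words the same 4-byte character
would cost 16 or 32 bytes under this model.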

So this suggests that what one should use for the internal representation
are 32-bit characters, encoded in some way that makes each character
unique in the semantic sense. (That is, if a group of input characters is
to be regarded as a single semantic entity, it should be replaced with a
unique 32-bit code.) Space will be sufficient with today's computers, and
using more compact formats will not be faster, as the CPU's internals
will probably compute in larger words anyway. (That is, if one uses
16-bit characters, they will probably first be translated into 32-bit
words or larger, the CPU operations will then be performed, and after
that the results will be translated back into 16-bit characters. It will
be just as fast to work with 32-bit characters directly, or perhaps even
faster if it decreases the need for round-offs. Strictly speaking, which
one is faster can only be determined by using a profiler, but 32-bit
characters seem to be OK.)
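
To make the "unique 32-bit code" idea concrete, here is a minimal sketch
of a table that replaces each group of input characters with one internal
code. Everything in it is an assumption for illustration: the name
CharTable, the use of std::map, and in particular the choice to hand out
codes from 0x80000000 upward, which presumes that single input characters
never use that range.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical interning table: every distinct group of input characters
// is mapped to exactly one 32-bit internal code.
class CharTable {
public:
    // Returns the internal code for a group, assigning a fresh code the
    // first time a multi-character group is seen.
    char32_t intern(const std::u32string& group) {
        assert(!group.empty());
        if (group.size() == 1) return group[0];  // single chars keep their code
        auto it = codes_.find(group);
        if (it != codes_.end()) return it->second;
        const char32_t code = next_++;
        codes_.emplace(group, code);
        return code;
    }

private:
    // Assumption: codes from 0x80000000 upward are free for composites.
    char32_t next_ = 0x80000000u;
    std::map<std::u32string, char32_t> codes_;
};
```

After interning, the character stream is a plain sequence of 32-bit
values, so comparing two semantic characters is a single integer
comparison whatever the input looked like.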

  Hans Aberg

------_=_NextPart_001_01C0966B.1EDF6D80--