Subject: Re: Multilingual Encodings Summary 2.2
Date: Mon, 14 May 2001 11:19:00 +0100
From: "Hans Aberg"
Sender: "Mailing list for the LaTeX3 project"
To: "Multiple recipients of list LATEX-L"
Reply-To: "Mailing list for the LaTeX3 project"

At 10:10 +0100 2001/05/14, Robin Fairbairns wrote:
>in practice, most people know what encodings their files are in.  and
>if they're into unicode, and encoding in utf-8 or utf-16, the chance
>that they'll also be using another encoding is likely rather small;

The chance of encountering mixed sets of encodings is quite high if one
downloads files from the Internet or maintains an archive.

In addition one might want to allow the use of mixed encodings in the same
file (like Cyrillic plus Latin).

So whatever scheme you come up with, I think it must fulfill the requirement
that the encodings used can be easily identified. The manual version that
you suggest may, I think, end up becoming a pain in the something.

> if
>they're using latin-1 in parallel, it'll be consumed quite happily by
>a utf-8 decoder.  imposing a schema file on *everything* is wild
>overkill.

So allow setting a default encoding other than 32-bit Unicode, then.

>>(If Omega
>> uses C++ for IO, one can use something called a codecvt. Or use pipes,
>> where available.)
>
>no.  omega does (shame) use clunky old c++ for some parts of its
>operation,

One should not use old C++, but the current C++ standard
(ISO/IEC 14882:1998).

> but it uses its own ocp mechanism for transforming
>encodings.  macro coding to switch ocps at input time is trivial, but
>not attractive for the normal case of using the same encoding all the
>time.

At 23:58 +0200 2001/05/13, Lars Hellström wrote:
>Read Sections 8--12 (Section 12 in particular) of the Omega draft
>documentation---that will answer your question more thoroughly than I
>bother to do right now. Marcel's summary contains a reference to it. But
>in short the equivalent functionality is already implemented (without
>resorting to language or platform specific mechanisms such as those you
>mention).

One reason that Omega is not using C/C++ for code conversions might be that
it apparently is quite difficult:

Most people do not realize that there is no way in C/C++ to ensure that one
writes a byte of 8 bits -- this is in fact platform (or rather, compiler)
dependent. All one knows is that a C/C++ byte has at least 8 bits, even
though most (but not all) compilers use 8-bit C/C++ bytes.

Also, there is no way to ensure that there is an integral type with exactly
32 bits. And it is (currently) supposedly hell trying to write Unicode
programs on many platforms.

However, I got the following suggestion on the C++ newsgroups for a C++
implementation:

On each compiler, get hold of an integral type with at least 32 bits, and
use that as your character type in the program. Then, if one knows that the
C/C++ byte is 8 bits (for example by looking at the CHAR_BIT macro), one
also knows that file IO takes place in 8-bit chunks.

Then one can evidently write something called a codecvt (code converter)
that makes the file IO translations transparent to the one writing the
C/C++ program.

One advantage could be speed. And if it is relatively easy to write those
codecvts (perhaps there are libraries), it becomes easy to add
translations for many different formats without making the C/C++ program
itself any more complicated.

One interesting possibility, which I do not think has been discussed here,
is the ability to read compressed files without unpacking them. One way to
compress a file is to make a statistical character frequency analysis and
then build a variable-bit character translation table. If the compression
scheme is right, it might be easy to write a codecvt for that too. (Java
uses .zip files that way, I think, which also allows bundling files.)

  Hans Aberg
