Subject: Re: Multilingual Encodings Summary 2.2
Date: Mon, 14 May 2001 11:19:00 +0100
From: "Hans Aberg"
Sender: "Mailing list for the LaTeX3 project"
To: "Multiple recipients of list LATEX-L"
Reply-To: "Mailing list for the LaTeX3 project"

At 10:10 +0100 2001/05/14, Robin Fairbairns wrote:
>in practice, most people know what encodings their files are in.  and
>if they're into unicode, and encoding in utf-8 or utf-16, the chance
>that they'll also be using another encoding is likely rather small;

The chance of encountering mixed sets of encodings is quite high if one
downloads files from the Internet or maintains an archive.

In addition one might want to allow the use of mixed encodings in the same
file (like Cyrillic plus Latin).

So whatever scheme you come up with, I think it must fulfill the requirement
that the encodings used can be easily identified. The manual version that
you suggest may, I think, end up becoming a pain in the something.

> if
>they're using latin-1 in parallel, it'll be consumed quite happily by
>a utf-8 decoder.  imposing a schema file on *everything* is wild
>overkill.

So allow setting a default encoding other than 32-bit Unicode, then.

>>(If Omega
>> uses C++ for IO, one can use something called a codecvt. Or use pipes,
>> where available.)
>
>no.  omega does (shame) use clunky old c++ for some parts of its
>operation,

One should not use old C++, but the current C++ standard
(ISO/IEC 14882:1998).

> but it uses its own ocp mechanism for transforming
>encodings.  macro coding to switch ocps at input time is trivial, but
>not attractive for the normal case of using the same encoding all the
>time.

At 23:58 +0200 2001/05/13, Lars Hellström wrote:
>Read Sections 8--12 (Section 12 in particular) of the Omega draft
>documentation---that will answer your question more thoroughly than I
>bother to do right now. Marcel's summary contains a reference to it. But
>in short the equivalent functionality is already implemented (without
>resorting to language or platform specific mechanisms such as those you
>mention).

One reason that Omega is not using C/C++ for code conversions might be that
it apparently is quite difficult:

Most people do not realize that there is no way in C/C++ to ensure that one
writes a byte of 8 bits -- this is in fact platform (or rather, compiler)
dependent. All one knows is that a C/C++ byte has at least 8 bits, even
though most (but not all) compilers use 8-bit C/C++ bytes.

Also, there is no way to ensure that there is an integral type with exactly
32 bits. And it is (currently) supposedly hell trying to write Unicode
programs on many platforms.

However, I got the following suggestion on the C++ newsgroups for a C++
implementation:

On each compiler, get hold of an integral type with at least 32 bits, and
use that as your character type in the program. Then, if one knows that the
C/C++ byte is 8 bits (for example by looking at the CHAR_BIT macro), one
also knows that file IO takes place in 8-bit chunks.

Then one can evidently write something called a codecvt (code converter)
that makes the file IO translations transparent to the one writing the
C/C++ program.

One advantage could be speed. And if it is relatively easy to write those
codecvts (perhaps there are libraries), it becomes easy to add
translations for many different formats without making the C/C++ program
itself any more complicated.

One interesting possibility, which I do not think has been discussed here,
is the ability to read compressed files without unpacking them. One way to
compress a file is to make a statistical character frequency analysis and
then build a variable-bit character translation table. If the compression
scheme is right, it might be easy to write a codecvt for that too. (Java
uses .zip files that way, I think, which also allows bundling files.)

  Hans Aberg
