From: "Hans Aberg"
Sender: "Mailing list for the LaTeX3 project"
To: "Multiple recipients of list LATEX-L"
Subject: Re: Multilingual Encodings Summary
Date: Wed, 14 Feb 2001 10:45:51 +0100
In-Reply-To: <14985.13977.836075.844694@gargle.gargle.HOWL>

At 14:28 +0100 2001/02/13, Marcel Oliver wrote:
>2. Internal Representation and Output Encoding:
>
>2.1. Problems with Current TeX:
...
>This leads to a number of problems.
>
>- A sufficiently general internal multilingual representation may be
>  impossible to maintain, unless it is Unicode in disguise.

If we are speaking about tweaking TeX's internals, what is needed is a
stream of characters, where the characters can be subjected to various
operations, such as comparisons, etc. The exact internal representation is
irrelevant.

If the implementation uses, say, C++, one could easily implement such
characters, which can polymorphically change internal representation. It could
then be a mixture of 1-4 byte formats.
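
A minimal sketch of what such polymorphic characters might look like, assuming
C++14 or later. All names here (Char, CharOf, make_char) are hypothetical and
not taken from TeX or any library, and only 1-, 2- and 4-byte storage is shown.
Callers compare characters by code value and never see the storage width.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical interface: a character is known only by its code value.
class Char {
public:
    virtual ~Char() = default;
    virtual std::uint32_t code() const = 0;  // semantic identity
    virtual std::size_t width() const = 0;   // bytes used by this representation
};

// One concrete representation per storage width.
template <typename Storage>
class CharOf final : public Char {
public:
    explicit CharOf(std::uint32_t c) : value_(static_cast<Storage>(c)) {
        // The code must survive the narrowing to the chosen width.
        assert(static_cast<std::uint32_t>(value_) == c);
    }
    std::uint32_t code() const override { return value_; }
    std::size_t width() const override { return sizeof(Storage); }

private:
    Storage value_;
};

// Comparisons depend on the code only, not on the representation.
inline bool operator==(const Char& a, const Char& b) { return a.code() == b.code(); }
inline bool operator<(const Char& a, const Char& b) { return a.code() < b.code(); }

// Pick the narrowest of the 1-, 2- and 4-byte formats that can hold c.
// Returning a heap-allocated object is an assumption of this sketch.
inline std::unique_ptr<Char> make_char(std::uint32_t c) {
    if (c <= 0xFF) return std::make_unique<CharOf<std::uint8_t>>(c);
    if (c <= 0xFFFF) return std::make_unique<CharOf<std::uint16_t>>(c);
    return std::make_unique<CharOf<std::uint32_t>>(c);
}
```

With this, *make_char(0x41) compares equal to a CharOf<std::uint32_t>(0x41),
even though one is stored in 1 byte and the other in 4.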

However, if one decided to implement such polymorphism by allocating
each character separately in the free store, it would be slow, and each
character would typically take up 1 (computer-)word to indicate the size
of the allocation, plus the character itself with word round-off, which is
another word: that is 2 words, or at least 8 bytes for each character. And
the latest Macs with G3 & G4 use 64- and 128-bit words.
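
The arithmetic above can be written out as a small sketch. The allocator model
(one bookkeeping word plus the payload rounded up to a whole word) is the
assumption stated in the paragraph, not a measurement of any real allocator,
and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Round a byte count up to a whole number of words.
constexpr std::size_t round_up(std::size_t bytes, std::size_t word) {
    return (bytes + word - 1) / word * word;
}

// Assumed model: one word recording the allocation size,
// plus the character itself rounded up to a word boundary.
constexpr std::size_t heap_cost(std::size_t payload_bytes, std::size_t word_bytes) {
    return word_bytes + round_up(payload_bytes, word_bytes);
}

// Total bytes for a stream of n separately allocated characters.
inline std::size_t stream_cost(std::size_t n, std::size_t payload_bytes,
                               std::size_t word_bytes) {
    assert(word_bytes > 0);
    return n * heap_cost(payload_bytes, word_bytes);
}

// 4-byte words: even a 1-byte character costs 2 words, i.e. 8 bytes.
static_assert(heap_cost(1, 4) == 8, "size word + one rounded payload word");
```

Under the same model, 8-byte words give heap_cost(1, 8) == 16, whereas a flat
array of 32-bit characters costs 4 bytes per character.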

So this suggests that what one should use for the internal representation
is 32-bit characters, encoded in some way that makes each character
unique in the semantic sense. (That is, if a group of input characters is
to be regarded as a single semantic entity, it should be replaced with a
unique 32-bit code.) -- Space will be sufficient on today's computers, and
using more compact formats will not be faster, as the CPU's internals will
probably compute in larger words anyway. (That is, if one uses 16-bit
characters, they will probably first be translated into 32-bit words or
larger, the CPU operations will then be performed, and after that the
results are translated back into 16-bit characters. It will be just as fast
to work with 32-bit characters directly, or perhaps even faster if it
decreases the need for round-offs. Strictly speaking, which one is faster
can only be determined by using a profiler, but 32-bit characters seem to
be OK.)
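
One way the "unique 32-bit code per semantic entity" idea might be sketched is
a table that hands out a fresh code the first time a group of input characters
is seen. The class name CharTable, the method intern, and the starting value
0x80000000 for group codes are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical table mapping groups of input characters to single 32-bit codes.
class CharTable {
public:
    // Returns the same 32-bit code every time the same group is passed in.
    std::uint32_t intern(const std::u32string& group) {
        assert(!group.empty());
        if (group.size() == 1) return group[0];  // a single character keeps its own value
        auto it = codes_.find(group);
        if (it != codes_.end()) return it->second;
        std::uint32_t code = next_++;            // first sighting: allocate a new code
        codes_.emplace(group, code);
        return code;
    }

private:
    // Assumed to lie above every value that occurs as a single input character.
    static constexpr std::uint32_t kFirstGroupCode = 0x80000000u;
    std::map<std::u32string, std::uint32_t> codes_;
    std::uint32_t next_ = kFirstGroupCode;
};
```

After interning, the internal stream is a plain sequence of std::uint32_t
values, so comparing two semantic characters is a single integer comparison.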

  Hans Aberg
