From: "Hans Aberg"
Sender: "Mailing list for the LaTeX3 project"
To: "Multiple recipients of list LATEX-L"
Subject: Re: LaTeX's internal char prepresentation (UTF8 or Unicode?)
Date: Sun, 18 Feb 2001 21:18:11 +0100

Just as an input, here is a system for "chef" preprocessing (i.e., making
the input palatable before it reaches TeX's mouth, in order to avoid
internal indigestion) that comes to my mind:

Every file that UTeX reads is required to carry, in say its first 4 bytes,
information about its general encoding, interpretable as ASCII, say (padded
with spaces)
BYTE   -- eight-bit mixed encoding
UT8    -- UTF8
UT16   -- UTF16
U16    -- Unicode 16
U32    -- Unicode 32
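
As a rough sketch of how a preprocessor might read such a tag (Python is
used here only for illustration; the codec names on the right-hand side and
the big-endian byte order are assumptions, not part of the proposal, and
read_tag is a hypothetical helper):

```python
# Four-byte tags from the list above, padded with spaces to 4 bytes.
# The Python codec chosen for each tag is an assumption of this sketch.
TAGS = {
    b"BYTE": None,         # mixed 8-bit encodings, declared in later lines
    b"UT8 ": "utf-8",
    b"UT16": "utf-16-be",  # assumed byte order
    b"U16 ": "utf-16-be",  # assumed: fixed 16-bit units, approximated here
    b"U32 ": "utf-32-be",  # assumed byte order
}

def read_tag(data: bytes):
    """Split a file's bytes into (tag, codec, rest) using the first 4 bytes."""
    tag = data[:4]
    if tag not in TAGS:
        raise ValueError(f"unknown encoding tag: {tag!r}")
    return tag.decode("ascii").rstrip(), TAGS[tag], data[4:]
```

For the last four tags the rest of the file could then be decoded in one
go; only BYTE needs the further treatment described next.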

For the last four, no further information is needed, but in the first case,
BYTE, one then has a series of lines, each one indicating an encoding that
might be used and a start sequence. -- It might be difficult to foresee a
suitable start sequence for every possible file, so one could allow
individual choices for each file. It could look like:
ASCII   <as>
Latin-1 <we>
Russian <ru>
...
(Or whatever official names one decides to have for the different encodings.)
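
A minimal sketch of reading such declaration lines into a lookup table. How
the list of declarations ends is not specified above; this sketch assumes
that the first blank line ends it, and parse_declarations is a hypothetical
name:

```python
def parse_declarations(lines):
    """Map each start sequence (e.g. '<we>') to its declared encoding name."""
    table = {}
    for line in lines:
        if not line.strip():
            break  # assumed terminator: the first blank line
        name, start = line.split()  # e.g. "Latin-1 <we>"
        table[start] = name
    return table

table = parse_declarations(["ASCII   <as>", "Latin-1 <we>", "Russian <ru>", ""])
```

Keying the table by start sequence rather than by encoding name suits the
preprocessor, which meets the start sequences while scanning the file.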

The preprocessor then zips through the file, looking for the indicated
character combinations, in this case <as> (7 bit), <we> (Western European),
<ru> (Russian), etc., and applies the encodings to the characters that
follow. So for example, one would have to write
  <as> ...
  \def\bar{bla bla \french{<we>fôô<as>} bla bla}
as \french does not handle character encodings anymore, but only other
aspects that might relate to the use of French in TeX.
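
The scanning step could be sketched roughly as follows. The mapping from
start sequences to Python codecs is an assumption made for illustration (in
particular, "Russian" is taken to mean KOI8-R here), as are the name
to_unicode and the choice of ASCII as the encoding in force before any start
sequence is seen:

```python
import re

# Assumed codec for each start sequence; not fixed by the proposal.
CODECS = {b"<as>": "ascii", b"<we>": "latin-1", b"<ru>": "koi8_r"}

def to_unicode(data: bytes, codecs=CODECS, default="ascii") -> str:
    """Decode mixed 8-bit input, switching codec at each start sequence."""
    pattern = re.compile(b"(" + b"|".join(re.escape(m) for m in codecs) + b")")
    out, codec = [], default
    for piece in pattern.split(data):
        if piece in codecs:
            codec = codecs[piece]  # a start sequence: switch, emit nothing
        else:
            out.append(piece.decode(codec))
    return "".join(out)

source = b"<as>\\def\\bar{bla bla \\french{<we>f\xf4\xf4<as>} bla bla}"
print(to_unicode(source))  # \def\bar{bla bla \french{fôô} bla bla}
```

A byte above 127 that turns up while <as> is in force raises a
UnicodeDecodeError in this sketch, which is one way of reporting missing
markup.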

It is an inconvenience that one has to do this explicit markup, but it only
happens if one is mixing encodings. In the example above, ASCII and Latin-1
agree on the bottom 7 bits (i.e., those characters have the same Unicode
translation), so the markup would not have been needed if one used Latin-1
all the time.
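
The agreement on the bottom 7 bits is easy to check; a small sketch using
Python's built-in codecs (the variable names are only illustrative):

```python
# Every byte 0..127 decodes to the same character under ASCII and Latin-1.
agree = all(
    bytes([b]).decode("ascii") == bytes([b]).decode("latin-1")
    for b in range(128)
)

# Above 127 only Latin-1 assigns a character, e.g. 0xF4 is o-circumflex.
o_circumflex = bytes([0xF4]).decode("latin-1")
```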

One spin-off, though, is that it is fairly easy to convert such files to
Unicode, once Unicode editors become available. -- If, in the example
above, one requires that the <we>...<as> markup be a part of the \french
command, then it becomes more complicated to translate such a file to a
Unicode format.

  Hans Aberg
