From: "David Carlisle"
Sender: "Mailing list for the LaTeX3 project"
To: "Multiple recipients of list LATEX-L"
Reply-To: "Mailing list for the LaTeX3 project"
Subject: Re: LaTeX's internal char prepresentation (UTF8 or Unicode?)
Date: Mon, 12 Feb 2001 12:33:39 +0100
Message-ID: <200102121133.LAA18236@nag.co.uk>
In-Reply-To: <200102112146.QAA04933@hilbert.math.albany.edu> (hammond@CSC.ALBANY.EDU)
References: <200102112146.QAA04933@hilbert.math.albany.edu>

> Such use of the byte sequence "EF BB BF" is a hack.  It has
> probability $2^{-24}$ as the initial three byte sequence in a stream
> of random bytes.

The equivalent in UTF-16 isn't a hack as that mandates that the BOM
appearing as the first two bytes is always an encoding indicator.
(If you actually want to start with those characters you have to prepend
the byte order mark to the file.) I've a feeling that UTF-8 has recently
been changed to similarly indicate that it isn't legal UTF-8 to have
those characters (as character data) at the start of the file.
This makes it a lot safer to "recognise" a UTF-8 BOM as a BOM rather
than character data.
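
A minimal sketch of that kind of BOM recognition, in Python. The names
sniff_bom and BOMS are made up for illustration, and only UTF-8 and
UTF-16 marks are considered (a UTF-32 little-endian mark would be
misread as UTF-16 here):

```python
import codecs

# Known byte order marks and the encoding each one signals.
# Assumption: only UTF-8 and UTF-16 are of interest.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return (encoding, payload_without_bom), or (None, data) if no BOM."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            # Treat the mark as an encoding indicator, not character data.
            return encoding, data[len(bom):]
    return None, data
```

Stripping the mark before decoding reflects the reading above, where a
leading BOM is an indicator and never part of the text.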

> Under the rules non-conforming XHTML
If you know it's XML then you know the encoding anyway; in particular it
is UTF-8 unless you find an encoding declaration giving a different
encoding (or a UTF-16 byte order mark). (Ignoring complications like the
fact that the encoding can be specified in the transport, e.g. MIME
headers, rather than in the file.)
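
That rule can be sketched roughly as follows. This is not a real XML
parser: guess_xml_encoding is a hypothetical helper, it assumes any
declaration is written in an ASCII-compatible encoding, and it ignores
transport-level information such as MIME headers:

```python
import codecs
import re

# Matches an XML declaration carrying an encoding pseudo-attribute.
# Assumption: the declaration itself is readable as ASCII bytes.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def guess_xml_encoding(data: bytes) -> str:
    # A UTF-16 byte order mark overrides the default.
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    # Skip an optional UTF-8 BOM before looking for the declaration.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No declaration and no UTF-16 BOM: fall back to UTF-8.
    return "utf-8"
```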

> <?xml ... encoding="utf-8"?>
As I say, UTF-8 is the default encoding so that isn't necessary.
Also probably worth noting that XML (unlike LaTeX) does not enforce that
encodings have ASCII characters in ASCII positions, so it may be that
the above line will not be recognised at all (in cases where it is a
non-standard encoding rather than UTF-8). An XML system might have to read
byte by byte to see if it recognises the byte stream as the characters
<?xml
in any encoding that it knows about.
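
A rough illustration of that byte-level check, in Python. The candidate
list is an assumption chosen for the example; cp037 is an EBCDIC code
page, included because it does not put ASCII characters in ASCII
positions. The function name is hypothetical:

```python
# Encodings the hypothetical system "knows about" (illustrative only).
CANDIDATES = ["utf-8", "utf-16-be", "utf-16-le", "cp037"]


def encodings_matching_xml_decl(data: bytes, candidates=CANDIDATES):
    """Return the candidate encodings under which data starts with '<?xml'."""
    matches = []
    for name in candidates:
        # Encode the marker in each candidate and compare raw bytes.
        prefix = "<?xml".encode(name)
        if data.startswith(prefix):
            matches.append(name)
    return matches
```

A stream encoded in cp037 begins with bytes that look nothing like
ASCII "<?xml", which is why a plain ASCII search for the declaration
would miss it.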

> or (some day) \usepackage[utf8]{inputenc}
Typing utf8 inputenc latex into http://www.google.com indicates that you
can do that today, if you want. (Not tried it though.)

David
