From: "William F. Hammond"
Sender: "Mailing list for the LaTeX3 project"
To: "Multiple recipients of list LATEX-L"
Reply-To: "Mailing list for the LaTeX3 project"
Subject: Re: LaTeX's internal char prepresentation (UTF8 or Unicode?)
Date: Sun, 11 Feb 2001 22:46:57 +0100
Message-ID: <200102112146.QAA04933@hilbert.math.albany.edu>

Roozbeh Pournader <roozbeh@SHARIF.EDU> writes:

> Also, many applications shipped with Windows 2000 attach a signature
> to the start of file (U+FEFF, Zero-Width No-Break Space) when they
> want to save the file, so that will make the autodetection much
> easier. The Unicode Standard accepts this as an autodetection
> mechanism, and says that this sequence (EF BB BF in UTF-8) is really
> improbable anywhere other than a UTF-8 file.

Such use of the byte sequence "EF BB BF" is a hack.  It has
probability $2^{-24}$ as the initial three-byte sequence in a stream
of random bytes.  In many locales it is even printable and screen
representable, and who knows what it represents in someone else's
locale now or in the future.
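
As a rough illustration of both points, here is a minimal Python sketch.
The helper name starts_with_signature is hypothetical and used only for
this example; the iso-8859-1 reading is just one locale in which the
three bytes are printable.

```python
import codecs

# The UTF-8 signature: U+FEFF encoded as the three bytes EF BB BF.
BOM = codecs.BOM_UTF8

# Chance that three uniformly random bytes equal the signature:
# (1/256)^3 = 2^-24.
probability = (1 / 256) ** 3

# Read in an iso-8859-1 locale, the same bytes are three printable
# characters (i-diaeresis, right guillemet, inverted question mark).
as_latin1 = BOM.decode("iso-8859-1")


def starts_with_signature(data: bytes) -> bool:
    """Heuristic only: report whether data begins with EF BB BF."""
    return data.startswith(BOM)


# A Latin-1 file that legitimately starts with those three characters
# is indistinguishable from a "signed" UTF-8 file.
latin1_file = "\u00ef\u00bb\u00bf text".encode("iso-8859-1")
false_positive = starts_with_signature(latin1_file)
```

The last two lines show why the check is only a guess: the signature
test cannot tell a Latin-1 file that happens to begin with those
characters from a UTF-8 file.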

> Although, I do not have a good experience with that, I don't like my
> HTML files becoming non-conformant according to Unix checkers I have.

Under the rules, non-conforming XHTML (next-generation HTML) is supposed
to be rejected by a conforming XML processor.  Non-valid XHTML will have
a high probability of failing to convey the author's intent correctly.

The correct way to indicate utf-8 encoding is with something like

<?xml ... encoding="utf-8"?>

or in another context

Content-type: text/plain; charset="utf-8"

or (some day) \usepackage[utf8]{inputenc}

or ...

as appropriate in the context.
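
For the Content-type case, a minimal Python sketch of a reader that
trusts the declared charset instead of sniffing the first bytes might
look like the following.  The function name declared_charset and the
us-ascii fallback are assumptions made for this example; only the
standard library email package is used.

```python
from email.message import Message


def declared_charset(content_type: str, default: str = "us-ascii") -> str:
    """Return the charset parameter of a Content-type value.

    Falls back to `default` (an assumption of this sketch) when the
    header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


# Decode a body using the declared encoding, not a guess.
raw_body = "Gr\u00fc\u00dfe".encode("utf-8")
charset = declared_charset('text/plain; charset="utf-8"')
text = raw_body.decode(charset)
```

The decoding step depends only on what the sender declared, so it does
not matter what the first three bytes of the body happen to be.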

                                    -- Bill
