From: "Roozbeh Pournader"
To: "Multiple recipients of list LATEX-L"
Subject: Re: LaTeX's internal char prepresentation (UTF8 or Unicode?)
Date: Sun, 11 Feb 2001 17:17:44 +0100
In-Reply-To: <14982.45082.150652.74719@istrati.zdv.uni-mainz.de>

On Sun, 11 Feb 2001, Frank Mittelbach wrote:

> The LaTeX internal character representation is a 7bit representation not an
> 8bit one as UTF8. As such it is far less likely to be mangled by incorrect
> conversion if files are exchanged between different platforms. I have yet to
> see that UTF8 text (without taking precaution and externally announcing that a
> file is in UTF8) is really properly handled by any OS platform. Is it?

Windows 2000 autodetects such files. I'm not sure what proper handling
would mean on Linux; do you mean in a text editor?
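
To make "autodetection" concrete, here is a minimal sketch in Python of one
common heuristic: look for a UTF-8 byte order mark, and otherwise check
whether the bytes decode strictly as UTF-8. This is only an illustration; it
is not a description of what Windows 2000 actually does, and the function
name guess_encoding and its return labels are made up for this example.

```python
import codecs

def guess_encoding(data: bytes) -> str:
    """Heuristic guess: UTF-8 with BOM, plain UTF-8, or some other 8-bit charset."""
    # A leading byte order mark is the strongest hint that a file is UTF-8.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        # Strict decoding rejects byte sequences that are not well-formed UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown-8bit"
    # Note: pure 7-bit ASCII also lands here, since ASCII is valid UTF-8.
    return "utf-8"

print(guess_encoding("ä".encode("utf-8")))       # bytes C3 A4
print(guess_encoding("ä".encode("iso-8859-1")))  # single byte E4
```

The heuristic can be fooled: some short Latin-1 texts happen to be valid
UTF-8 by accident, which is why announcing the encoding externally remains
the safer route.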

> wouldn't it be better if the internal LaTeX representation would be Unicode
> in one or the other flavor?

What about symbol fonts like TC? What about math characters that are
unified in Unicode (\rightarrow and \longrightarrow)? What about the
things that are not yet in Unicode?

> - however, not clear is that the resulting names are easier to read, eg
>   \unicode{00e4} viz \"a.

They are worse than you may think. They are always hard to read. My real
work is on the Unicode Arabic script, and after two years of full
dedication, I can't recall more than a few codes. I always need a table at
hand. I have much less experience with the Knuthian names of math symbols,
but I'm sure I can recall the names of more than 95% of them without any
problem.
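
As an illustration of the "table at hand", the sketch below uses Python's
standard unicodedata module to turn a bare hex code such as the 00e4 in
\unicode{00e4} into its official character name. The helper name describe is
invented for this example.

```python
import unicodedata

def describe(hex_code: str) -> str:
    """Return 'U+XXXX NAME' for a hexadecimal code point such as '00e4'."""
    ch = chr(int(hex_code, 16))
    # Fall back to a placeholder for code points without an assigned name.
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{hex_code.upper()} {name}"

print(describe("00e4"))  # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
print(describe("0628"))  # U+0628 ARABIC LETTER BEH
```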

> - with intermediate forms like data written to files this could be a pain and
>   people in Russia, for example, already have this problem when they see
>   something like \cyr\CYRA\cyrn\cyrn\cyro\cyrt\cyra\cyrc\cyri\cyrya. In case
>   of unicode as the internal representation this would be true for all
>   languages (except English) while currently the Latin based ones are still
>   basically okay.

This is a place where UTF8 helps a lot. People can use Unicode text
editors to view the files, or use widely available converters like
iconv to convert to, in theory, any charset.
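
A rough Python equivalent of what iconv does is sketched below, using the
Russian word from the quoted example. The function name convert is only for
illustration, and KOI8-R is just one plausible target charset. The last part
shows why the conversion works only "in theory" for every charset: the
target has to contain the characters.

```python
def convert(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode bytes from charset src to charset dst, failing on unmappable characters."""
    return data.decode(src).encode(dst)

# The word spelled out by \cyr\CYRA\cyrn\cyrn\cyro\cyrt\cyra\cyrc\cyri\cyrya.
word = "Аннотация"
utf8 = word.encode("utf-8")               # two bytes per Cyrillic letter
koi8 = convert(utf8, "utf-8", "koi8-r")   # one byte per Cyrillic letter

print(len(utf8), len(koi8))

try:
    convert(utf8, "utf-8", "iso-8859-1")  # Latin-1 has no Cyrillic letters
except UnicodeEncodeError as err:
    print("cannot convert:", err.reason)
```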

> - the current latex internal representation is richer than unicode for good
>   or worse, eg \" is defined individually as representation for accenting the
>   next char, which means that anything \"<base-char-in-the-internal-reps> is
>   automatically also a member of it, eg \"g.
>
> - the latter point could be considered bad since it allows to produce
>   characters not in unicode but independently of what you feel about that the
>   fact itself has consequences when defining mappings for font
>   encodings. right now, one specifies the accents, ie \DeclareTextAccent\"
>   and for those glyphs that exist as composites one also specifies the
>   composite, eg \DeclareTextComposite{\"}{T1}{a}{...}
>   With the unicode approach as the internal representation there would be
>   an atomic form ie \unicode{00e4} describing umlaut-a so if that has no
>   representation in a font, eg in OT1, then one would need to define for each
>   combination the result.

Unicode also has an equivalent of \" (the combining diaeresis, U+0308); it
just appears after the letter. So an accented letter that has no precomposed
form in Unicode is not a real problem: such letters can also be built in
Unicode from a base letter plus a combining accent. But I don't know what
you are going to do with the combining accent appearing after the letter.
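
To illustrate both points, here is a small Python sketch using the standard
unicodedata module. It shows that umlaut-a decomposes into "a" followed by
U+0308, that "g" plus U+0308 has no precomposed form, and one possible way
to move a trailing combining accent in front of its base letter. The ACCENTS
table and the function to_latex_accents are hypothetical, written only for
this example; this is not how LaTeX or any existing package handles it.

```python
import unicodedata

# Hypothetical mapping from combining marks to LaTeX accent commands.
ACCENTS = {
    "\u0308": '\\"',  # combining diaeresis
    "\u0301": "\\'",  # combining acute accent
    "\u0300": "\\`",  # combining grave accent
}

def to_latex_accents(text: str) -> str:
    """Rewrite base+combining sequences as prefix accent commands, e.g. \\"{a}."""
    out = []
    # NFD splits precomposed characters into base letter plus combining marks.
    for ch in unicodedata.normalize("NFD", text):
        if ch in ACCENTS and out:
            # Wrap the previous item: the accent moves in front of its base.
            out[-1] = f"{ACCENTS[ch]}{{{out[-1]}}}"
        else:
            out.append(ch)
    return "".join(out)

# Umlaut-a is canonically equivalent to "a" followed by U+0308.
print(unicodedata.normalize("NFD", "\u00e4") == "a\u0308")
# There is no precomposed g-with-diaeresis, so NFC leaves two code points.
print(len(unicodedata.normalize("NFC", "g\u0308")))
print(to_latex_accents("\u00e4"), to_latex_accents("g\u0308"))
```

The sketch keeps one pending item per base letter, so the accent can only be
attached once the following code point has been seen. That lookahead is the
practical cost of the accent coming after the letter.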

--roozbeh
