From: "Frank Mittelbach"
Sender: "Mailing list for the LaTeX3 project"
To: "Multiple recipients of list LATEX-L"
Subject: Re: LaTeX's internal char prepresentation (UTF8 or Unicode?)
Date: Sun, 11 Feb 2001 20:38:40 +0100
Message-ID: <14982.59968.683159.480912@istrati.zdv.uni-mainz.de>
References: <14982.45082.150652.74719@istrati.zdv.uni-mainz.de>

I asked the question:

 > >  wouldn't it be better if the internal LaTeX representation would be Unicode
 > >  in one or the other flavor?

Roozbeh replied:

 > What about symbol fonts like TC? What about math characters that are
 > unified in Unicode (\rightarrow and \longrightarrow)? What about the
 > things that are not yet in Unicode?

yes, what about them?

as I outlined already in other replies I don't think that unicode or UTF8 is
the answer as far as LICR is concerned. it can only provide a partial answer

 - it clearly can't provide the answer for chars not existing in unicode
 - and it clearly can't provide the answer for math

however LICR (or the part I'm talking about) isn't really concerned with math
which needs a far richer, or let's say different handling anyway; and which
on the other hand doesn't need some of the mechanisms needed for text
representations, like being aware of certain types of font attribute changes
etc.

 > >  - however, not clear is that the resulting names are easier to read, eg
 > >    \unicode{00e4} viz \"a.
 >
 > They are worse than you may think. They are always hard to read. My real
 > work is related to Unicode Arabic script, and after two years of full
 > dedication, I can't recall more than a few codes. I always need a table at
 > hand. I have much less experience with Knuthian names of math symbols, but
 > I'm sure I can recall the names of more than 95% of them without any
 > problem.

so you agree with me, they aren't easy to read :-) but then, being "internal",
this only matters in some circumstances, and Oliver put some good arguments
forward for when something like UTF8 might actually be easier to read.
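
A minimal sketch of the readability point, assuming Python and its standard
unicodedata module (the helper name describe is made up for illustration):
a bare code point such as 00e4 says nothing by itself and has to be looked up
in a table, which is exactly the complaint above.

```python
import unicodedata


def describe(hex_code: str) -> str:
    """Return the official Unicode name for a hex code point such as '00e4'."""
    return unicodedata.name(chr(int(hex_code, 16)))


# The code point from the \unicode{00e4} example needs a lookup to be understood.
print(describe("00e4"))  # LATIN SMALL LETTER A WITH DIAERESIS
```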

 > >  - with intermediate forms like data written to files this could be a pain and
 > >    people in Russia, for example, already have this problem when they see
 > >    something like \cyr\CYRA\cyrn\cyrn\cyro\cyrt\cyra\cyrc\cyri\cyrya.  In case
 > >    of unicode as the internal representation this would be true for all
 > >    languages (except English) while currently the Latin based ones are still
 > >    basically okay.
 >
 > This is a place where UTF8 helps a lot. People can use Unicode text
 > editors to see the files, or use the widely available convertors like
 > iconv to convert to theoretically every charset.

yes and no. I tried to explain that there are limitations posed by the current
implementation of the major underlying formatter (i.e. TeX) which you can't
easily overcome, and even if you do, the result then needs a long time to
actually get deployed at sites that have not much use for anything other
than ASCII plus perhaps a few accents.
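
To make the two intermediate forms discussed above concrete, here is a rough
Python sketch. The lookup table LICR_TO_UNICODE and the function licr_to_text
are hypothetical and cover only the names in the quoted example; \cyr is
assumed to be a switch rather than a letter and is skipped. The last lines do
what an iconv-style converter would do with the UTF-8 form.

```python
# Hypothetical, deliberately tiny table: only the names from the quoted example.
LICR_TO_UNICODE = {
    "CYRA": "\u0410",   # CYRILLIC CAPITAL LETTER A
    "cyrn": "\u043d",   # CYRILLIC SMALL LETTER EN
    "cyro": "\u043e",   # CYRILLIC SMALL LETTER O
    "cyrt": "\u0442",   # CYRILLIC SMALL LETTER TE
    "cyra": "\u0430",   # CYRILLIC SMALL LETTER A
    "cyrc": "\u0446",   # CYRILLIC SMALL LETTER TSE
    "cyri": "\u0438",   # CYRILLIC SMALL LETTER I
    "cyrya": "\u044f",  # CYRILLIC SMALL LETTER YA
}


def licr_to_text(licr: str) -> str:
    """Turn a run of backslash-prefixed names into a Unicode string."""
    names = [name for name in licr.split("\\") if name]
    # \cyr is treated as an encoding switch here (an assumption), not a letter.
    return "".join(LICR_TO_UNICODE[name] for name in names if name != "cyr")


word = licr_to_text(r"\cyr\CYRA\cyrn\cyrn\cyro\cyrt\cyra\cyrc\cyri\cyrya")
utf8_bytes = word.encode("utf-8")                       # what a Unicode editor reads
koi8_bytes = utf8_bytes.decode("utf-8").encode("koi8_r")  # iconv-style conversion
print(word, len(utf8_bytes), len(koi8_bytes))
```

The sketch only shows that the file contents become readable and convertible;
it says nothing about the formatter-side limitations mentioned in the reply.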

 > Unicode also has the equivalent of \", it only appears after the letter.
 > So the problem of a accented letter not in Unicode is not a real problem,
 > these letters can also be made in Unicode. But I don't know what are you
 > going to do with the combining accent appearing after the letter.

ahh here is the remark i was searching for an hour ago:

nothing really, and that is a problem as long as i want to stick with TeX and a
bit of its parsing machinery. and that means i can't make use of this concept.

frank
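
The ordering problem with combining accents can be sketched in Python, under
the assumption that a converter runs outside TeX. The table COMBINING_TO_TEX
and the function to_tex_accents are hypothetical and handle only two accents.
In decomposed Unicode the accent follows its base letter, so the converter has
to reach back to a character it has already emitted, whereas a TeX accent
command such as \" comes first and reads its argument forward.

```python
import unicodedata

# Hypothetical table covering two combining marks only.
COMBINING_TO_TEX = {
    "\u0308": '\\"',  # COMBINING DIAERESIS
    "\u0301": "\\'",  # COMBINING ACUTE ACCENT
}


def to_tex_accents(text: str) -> str:
    """Rewrite accented letters as TeX-style accent commands."""
    out = []
    for ch in unicodedata.normalize("NFD", text):  # decompose: base, then accent
        if ch in COMBINING_TO_TEX and out:
            base = out.pop()  # look back: the accent arrives after its letter
            out.append(COMBINING_TO_TEX[ch] + "{" + base + "}")
        else:
            out.append(ch)
    return "".join(out)


print(to_tex_accents("\u00e4"))  # \"{a}
```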
