MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C09462.9DB58A00"
In-Reply-To:  <Pine.LNX.4.10.10102111920000.11902-100000@Sina.sharif.ac.ir>
References: <14982.45082.150652.74719@istrati.zdv.uni-mainz.de>            <Pine.LNX.4.10.10102111920000.11902-100000@Sina.sharif.ac.ir>
Content-class: urn:content-classes:message
Subject:      Re: LaTeX's internal char representation (UTF8 or Unicode?)
Date: Sun, 11 Feb 2001 20:38:40 +0100
Message-ID:  <14982.59968.683159.480912@istrati.zdv.uni-mainz.de>
From: "Frank Mittelbach" <frank.mittelbach@LATEX-PROJECT.ORG>
Sender: "Mailing list for the LaTeX3 project" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
To: "Multiple recipients of list LATEX-L" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
Reply-To: "Mailing list for the LaTeX3 project" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
Status: R

This is a multi-part message in MIME format.

------_=_NextPart_001_01C09462.9DB58A00
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

I asked the question:

 > >  wouldn't it be better if the internal LaTeX representation would
 > >  be Unicode in one or the other flavor?

Roozbeh replied:

 > What about symbol fonts like TC? What about math characters that are
 > unified in Unicode (\rightarrow and \longrightarrow)? What about the
 > things that are not yet in Unicode?

yes, what about them?

as I already outlined in other replies, I don't think that Unicode or
UTF8 is the answer as far as LICR is concerned. it can only provide a
partial answer:

 - it clearly can't provide the answer for chars not existing in Unicode
 - and it clearly can't provide the answer for math

however, LICR (or the part I'm talking about) isn't really concerned
with math, which needs a far richer, or let's say different, handling
anyway, and which on the other hand doesn't need some of the mechanisms
needed for text representations, like being aware of certain types of
font attribute changes etc.

 > >  - however, not clear is that the resulting names are easier to
 > >    read, eg \unicode{00e4} viz \"a.
 >
 > They are worse than you may think. They are always hard to read. My
 > real work is related to Unicode Arabic script, and after two years of
 > full dedication, I can't recall more than a few codes. I always need
 > a table at hand. I have much less experience with Knuthian names of
 > math symbols, but I'm sure I can recall the names of more than 95% of
 > them without any problem.

so you agree with me, they aren't easy to read :-) but then, being
"internal", this only matters in some circumstances, and Oliver put
forward some good arguments for when something like UTF8 might actually
be easier to read.
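
The readability point can be seen by spelling a word in the code-based
form. The following is only a minimal Python sketch: \unicode is the
hypothetical macro from the quoted example, and as_unicode_macros is a
made-up helper name, not anything that exists in LaTeX.

```python
def as_unicode_macros(text):
    """Write each non-ASCII char as a hypothetical \\unicode{xxxx}."""
    return "".join(
        # ASCII passes through; everything else becomes a hex code point.
        ch if ord(ch) < 128 else "\\unicode{%04x}" % ord(ch)
        for ch in text)
```

For the input "M\u00e4dchen" this gives M\unicode{00e4}dchen, where the
named form would read M\"adchen.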

 > >  - with intermediate forms like data written to files this could be
 > >    a pain and people in Russia, for example, already have this
 > >    problem when they see something like
 > >    \cyr\CYRA\cyrn\cyrn\cyro\cyrt\cyra\cyrc\cyri\cyrya.  In case of
 > >    unicode as the internal representation this would be true for
 > >    all languages (except English) while currently the Latin based
 > >    ones are still basically okay.
 >
 > This is a place where UTF8 helps a lot. People can use Unicode text
 > editors to see the files, or use the widely available convertors like
 > iconv to convert to theoretically every charset.

yes and no. I tried to explain that there are limitations posed by the
current implementation of the major underlying formatter (ie TeX) which
you can't easily overcome, and even if you do, the result then needs a
long time to actually get deployed at sites that have not much use for
anything other than ASCII plus perhaps a few accents.

 > Unicode also has the equivalent of \", it only appears after the
 > letter. So the problem of a accented letter not in Unicode is not a
 > real problem, these letters can also be made in Unicode. But I don't
 > know what are you going to do with the combining accent appearing
 > after the letter.

ahh, here is the remark I was searching for an hour ago:

nothing really, and that is a problem as long as I want to stick with
TeX and a bit of its parsing machinery. and that means I can't make use
of this concept.

frank

------_=_NextPart_001_01C09462.9DB58A00--