MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C09446.35F14600"
In-Reply-To:  <14982.45082.150652.74719@istrati.zdv.uni-mainz.de>
Content-class: urn:content-classes:message
Subject:      Re: LaTeX's internal char prepresentation (UTF8 or Unicode?)
Date: Sun, 11 Feb 2001 17:17:44 +0100
Message-ID:  <Pine.LNX.4.10.10102111920000.11902-100000@Sina.sharif.ac.ir>
From: "Roozbeh Pournader" <roozbeh@SHARIF.EDU>
Sender: "Mailing list for the LaTeX3 project" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
To: "Multiple recipients of list LATEX-L" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
Reply-To: "Mailing list for the LaTeX3 project" <LATEX-L@URZ.UNI-HEIDELBERG.DE>
Status: R

This is a multi-part message in MIME format.

------_=_NextPart_001_01C09446.35F14600
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

On Sun, 11 Feb 2001, Frank Mittelbach wrote:

> The LaTeX internal character representation is a 7bit representation =
not an
> 8bit one as UTF8. As such it is far less likely to be mangled by =
incorrect
> conversion if files are exchanged between different platforms. I have =
yet to
> see that UTF8 text (without taking precaution and externally =
announcing that a
> file is in UTF8) is really properly handled by any OS platform. Is it?

Windows 2000 autodetects them. I'm not sure what would count as proper
handling on Linux; do you mean in a text editor?
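
As a rough illustration of what autodetection could mean (I am not
claiming this is what Windows 2000 actually does), an editor might look
for a UTF-8 byte order mark and otherwise check whether the bytes decode
strictly as UTF-8. A minimal Python sketch; looks_like_utf8 is a made-up
name:

```python
import codecs

def looks_like_utf8(data):
    """Guess whether a byte string holds UTF-8 text (a heuristic only)."""
    # A leading UTF-8 byte order mark is taken as an explicit signal.
    if data.startswith(codecs.BOM_UTF8):
        return True
    # Otherwise accept the bytes only if they decode strictly as UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Pure 7bit text passes as well, since ASCII is a subset of UTF-8, while a
lone Latin-1 byte such as 0xE4 is rejected.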

>  wouldn't it be better if the internal LaTeX representation would be =
Unicode
>  in one or the other flavor?

What about symbol fonts like TC? What about math characters that are
unified in Unicode (\rightarrow and \longrightarrow)? What about the
things that are not yet in Unicode?

>  - however, not clear is that the resulting names are easier to read, =
eg
>    \unicode{00e4} viz \"a.

Worse than you may think: numeric codes are always hard to read. My real
work is on the Unicode Arabic script, and after two years of full
dedication I can't recall more than a few codes. I always need a table
at hand. I have much less experience with the Knuthian names of math
symbols, but I'm sure I can recall more than 95% of them without any
problem.

>  - with intermediate forms like data written to files this could be a =
pain and
>    people in Russia, for example, already have this problem when they =
see
>    something like \cyr\CYRA\cyrn\cyrn\cyro\cyrt\cyra\cyrc\cyri\cyrya.  =
In case
>    of unicode as the internal representation this would be true for =
all
>    languages (except English) while currently the Latin based ones are =
still
>    basically okay.

This is where UTF8 helps a lot. People can view the files in a
Unicode-aware text editor, or use a widely available converter such as
iconv to convert them to, in theory, any charset.
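
To make the conversion step concrete, here is a minimal Python sketch of
what such a converter does; the function name convert and the choice to
replace unmappable characters with "?" are my assumptions, not something
iconv prescribes:

```python
def convert(data, source, target):
    """Re-encode a byte string from one charset to another."""
    # Decode strictly, so malformed input raises an error instead of
    # slipping through; characters the target charset cannot represent
    # are replaced by "?" rather than aborting the conversion.
    return data.decode(source).encode(target, "replace")
```

For example, convert(utf8_bytes, "utf-8", "koi8-r") would give a Russian
reader a file in KOI8-R, provided the input really is UTF8.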

>  - the current latex internal representation is richer than unicode =
for good
>    or worse, eg \" is defined individually as representation for =
accenting the
>    next char, which means that anything =
\"<base-char-in-the-internal-reps> is
>    automatically also a member of it, eg \"g.
>
>  - the latter point could be considered bad since it allows one to
>    produce characters not in unicode, but independently of what you
>    feel about that, the fact itself has consequences when defining
>    mappings for font encodings. right now, one specifies the accents,
>    ie \DeclareTextAccent\" and for those glyphs that exist as
>    composites one also specifies the composite, eg
>    \DeclareTextComposite{\"}{T1}{a}{...}
>    With the unicode approach as the internal representation there
>    would be an atomic form, ie \unicode{00e4} describing umlaut-a, so
>    if that has no representation in a font, eg in OT1, then one would
>    need to define the result for each combination.

Unicode also has an equivalent of \": a combining accent, except that it
appears after the letter. So an accented letter that is missing from
Unicode is not a real problem; such letters can also be composed in
Unicode. But I don't know what you are going to do with the combining
accent appearing after the letter.
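
One conceivable answer, sketched in Python under my own assumptions:
decompose the text so that every accent becomes a separate combining
mark, then move each mark in front of its base letter as a LaTeX accent
command. The names accent_table and combining_to_latex are hypothetical,
the table covers only three accents, and a base letter carrying several
marks is not handled:

```python
import re
import unicodedata

def accent_table():
    # Hypothetical, deliberately tiny mapping from a Unicode combining
    # mark to the LaTeX accent command that precedes its base letter.
    return {"\u0308": '\\"', "\u0301": "\\'", "\u0300": "\\`"}

def combining_to_latex(text):
    """Rewrite base letter + combining accent as accent command + base."""
    # NFD splits a precomposed letter into base + combining mark, so
    # U+00E4 becomes "a" followed by U+0308 before the substitution runs.
    return re.sub(
        "(.)([\u0308\u0301\u0300])",
        lambda m: accent_table()[m.group(2)] + m.group(1),
        unicodedata.normalize("NFD", text),
    )
```

With this, both the precomposed U+00E4 and the sequence "a" + U+0308
come out as \"a, and "g" + U+0308, which has no precomposed form in
Unicode, comes out as \"g.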

--roozbeh

------_=_NextPart_001_01C09446.35F14600--