From: Marcel Oliver
Sender: Mailing list for the LaTeX3 project
To: Multiple recipients of list LATEX-L
Reply-To: Mailing list for the LaTeX3 project
Subject: LaTeX's internal char representation (UTF8 or Unicode?)
Date: Sun, 11 Feb 2001 17:45:44 +0100
Message-ID: <14982.49592.211375.235529@gargle.gargle.HOWL>
In-Reply-To: <14982.45082.150652.74719@istrati.zdv.uni-mainz.de>

Frank Mittelbach writes:
 > in 1992/3 when we worked on shaping the ideas of the LaTeX internal
 > representation we actually did discuss similar ideas but back then
 > abandoned them because of resource constraints (in the
 > software). Machines are nowadays bigger and faster so this isn't
 > really much of an argument there.
 >
 > So... time for another attempt?

Yes, yes, yes!

 > The LaTeX internal character representation is a 7bit
 > representation not an 8bit one as UTF8. As such it is far less
 > likely to be mangled by incorrect conversion if files are exchanged
 > between different platforms. I have yet to see that UTF8 text
 > (without taking precaution and externally announcing that a file is
 > in UTF8) is really properly handled by any OS platform. Is it?

Not at the moment.  But there is a strong movement pushing for UTF8 as
_the_ encoding standard.  Support in bleeding edge versions of a lot
of software is actually quite good.  As far as I can see, UTF8 is the
only standard that has a reasonable chance of becoming the one that
"works without taking precaution".

It would be a pity if LaTeX missed the boat.  For some info, see

  http://www.cl.cam.ac.uk/~mgk25/unicode.html

 > TeX is 7bit with a parser that accepts 8bit but doesn't by default
 > give it any meaning. On the other hand Omega is 16bit (or more
 > these days?) and could be viewed as internally using something like
 > Unicode for representation.

This is good because UTF8 is a proper superset of what TeX is
currently taking as input.  Does anybody know about the state of
Omega?  16-bit Unicode is not the whole game, and also not particularly
attractive as an input encoding.
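The superset claim can be checked mechanically.  A small Python sketch
(mine, not part of the original mail) showing both points: 7-bit ASCII
input is byte-for-byte unchanged under UTF8, and a single 16-bit unit
cannot cover all of Unicode:

```python
# Check of the two encoding claims above (illustration only): plain
# 7-bit ASCII text is byte-for-byte identical under UTF-8, and one
# 16-bit code unit cannot represent every Unicode character.

ascii_source = r"\documentclass{article} % any 7-bit TeX input"
assert ascii_source.encode("utf-8") == ascii_source.encode("ascii")

# A non-ASCII character becomes a multi-byte sequence in which no byte
# is a valid ASCII byte, so 7-bit-clean tools pass it through unharmed.
assert "\u00e4".encode("utf-8") == b"\xc3\xa4"   # umlaut-a

# A character beyond U+FFFF (MATHEMATICAL ITALIC SMALL A, U+1D44E)
# needs a surrogate pair in UTF-16: four bytes, two 16-bit units.
assert len("\U0001d44e".encode("utf-16-be")) == 4
```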

 >  wouldn't it be better if the internal LaTeX representation would
 >  be Unicode in one or the other flavor?

Yes, because:

- A LaTeX specific naming scheme will be essentially unmaintainable.

- LaTeX could eventually be made to output diagnostics and log files
  in UTF8.  For example, UTF8-enabled Xterms exist now, and will
  likely come as default on Linux distributions long before LaTeX3.
  Same for text editors.  So the infrastructure is getting to a state
  where it's possible to pull this off.

 >  - however, not clear is that the resulting names are easier to
 >    read, eg \unicode{00e4} viz \"a.

See remark about Xterms.

 >  - the current latex internal representation is richer than unicode
 >    for good or worse, eg \" is defined individually as
 >    representation for accenting the next char, which means that
 >    anything \"<base-char-in-the-internal-reps> is automatically
 >    also a member of it, eg \"g.

This is not necessarily a problem (cf. Roozbeh's remark about math
symbols which are not properly defined in unicode).  As long as there
are no hyphenation and nongeneric kerning issues involved (and those
seem only an issue for natural language scripts), one could still have
named symbols as they exist now, whether they are part of some special
font, a combination of glyphs, the drawing of some box, etc.
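As an aside (my illustration, not from the thread): Unicode's combining
characters give much the same open-ended "accent over any base"
generality that the quoted text attributes to \".  In Python:

```python
# Unicode normalization parallel to the \"a vs \"g discussion above
# (illustration only, not part of the original mail).
import unicodedata

# umlaut-a has an atomic (precomposed) code point, like \unicode{00e4}:
assert unicodedata.normalize("NFC", "a\u0308") == "\u00e4"

# umlaut-g has no precomposed code point; NFC leaves it as base char
# plus COMBINING DIAERESIS -- the analogue of \"g having no single form:
assert unicodedata.normalize("NFC", "g\u0308") == "g\u0308"
```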

 >  - the latter point could be considered bad since it allows to
 >    produce characters not in unicode but independently of what you
 >    feel about that, the fact itself has consequences when defining
 >    mappings for font encodings. right now, one specifies the
 >    accents, ie \DeclareTextAccent\" and for those glyphs that
 >    exist as composites one also specifies the composite, eg
 >    \DeclareTextComposite{\"}{T1}{a}{...}  With the unicode approach
 >    as the internal representation there would be an atomic form ie
 >    \unicode{00e4} describing umlaut-a so if that has no
 >    representation in a font, eg in OT1, then one would need to
 >    define for each combination the result.
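A toy model of the lookup described above (my sketch in Python, not
LaTeX code; the table name and render function are hypothetical): an
atomic character is first looked up in the encoding's composite table,
and falls back to accent-plus-base via decomposition otherwise:

```python
# Toy model of composite lookup with an accent fallback (illustration
# only; T1_COMPOSITES and render are invented names, not LaTeX APIs).
import unicodedata

T1_COMPOSITES = {"\u00e4": "precomposed T1 glyph for umlaut-a"}

def render(char: str, composites: dict) -> str:
    # Prefer an atomic glyph, as \DeclareTextComposite would provide.
    if char in composites:
        return composites[char]
    # Otherwise decompose into base + combining accent, as
    # \DeclareTextAccent-style construction would.
    decomp = unicodedata.normalize("NFD", char)
    if len(decomp) == 2:
        base, accent = decomp
        return f"accent U+{ord(accent):04X} over '{base}'"
    return f"plain '{char}'"

assert render("\u00e4", T1_COMPOSITES) == "precomposed T1 glyph for umlaut-a"
assert render("\u00e4", {}) == "accent U+0308 over 'a'"  # OT1-like fallback
```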

This seems to be the logical thing to do?!

--Marcel
