MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C2BE60.688CF000"
In-Reply-To:  <20030116114637.GA9844@g113.hadiko.de>
References: <200212031601.gB3G11cQ009558@sun.dante.de>            <15899.14827.804209.458595@istrati.mittelbach-online.de>            <20030116114637.GA9844@g113.hadiko.de>
Content-class: urn:content-classes:message
Subject:      Re: latex/3480: Support for UTF-8 missing in inputenc.sty
Date: Fri, 17 Jan 2003 20:31:16 +0100
Message-ID: A<15912.23044.419984.897093@istrati.mittelbach-online.de>
Thread-Topic:      Re: latex/3480: Support for UTF-8 missing in inputenc.sty
Thread-Index: AcK+YGiz0iCPihLSRKWmp1S2HiIuLQ==
From: "Frank Mittelbach" <frank.mittelbach@LATEX-PROJECT.ORG>
To: <LATEX-L@listserv.uni-heidelberg.de>
Reply-To: "Mailing list for the LaTeX3 project" <LATEX-L@listserv.uni-heidelberg.de>
Status: R

This is a multi-part message in MIME format.

------_=_NextPart_001_01C2BE60.688CF000
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Dominique wrote:

 > I want to add several comments to Frank and Chris's utf8.def:

good

 > === 1. The definition of the .dfu files.
 >
 > In the present model, we have the problem that the same Unicode
 > character is defined several times in several .dfu files. If all
 > definitions are identical, this is no problem, but this has to be
 > ensured. Take the following example: Fontencoding LGR has the command
 > \euro, to be assigned to U+20AC, while TS1 has \texteuro, same Unicode
 > character.

if LGR does that then LGR is at fault since \texteuro is the LaTeX
internal character representation (LICR) name for the euro character

[aside: where is that file? the one that i have here is very short and
doesn't contain \euro, but it doesn't look like a proper encoding file
either]

i agree that there is a potential problem here, sort of similar to the
potential problem that two inputenc files map the same abstract input to
different internal commands (of which only one should be a proper LICR)

being pragmatic, i believe that these get weeded out after a while. the
reason for suggesting a .dfu file approach is that this allows easy
extensions for locally developed encodings.

 > Therefore I propose the following policy:
 >
 > - Unicode to TeX mappings are done in a single, fontencoding
 > independent file, e.g. ucs.map:
 > [...]
 > 0x20AC   \texteuro
 > [...]
 >
 > - Fontencoding specific files contain a list of supported code
 > positions, e.g. lgr.ucr and ts1.ucr (UCR = Unicode Range) both
 > contain the number 0x20AC (but no more information).
 >
 > - A script then generates the .dfu files, the above example induces
 > the inclusion of
 >
 > \DeclareUnicodeCharacter{20AC}{\texteuro}
 >
 > into ts1.dfu and lgr.dfu (LGR has then to be updated to include the
 > macro \texteuro additionally to \euro). Note that only the final .dfu
 > files are seen by the latex executable, so this system does not
 > involve any changes in utf8.def.
 >
 > - The ucs.map file is managed by the LaTeX team. The .ucr files can be
 > created by the developers of the fontencodings, thus enabling the
 > development of fontencodings without the need of interaction with the
 > LaTeX team. Inclusion of new characters into the ucs.map file should
 > not be subject to some restrictive selection, since no resources are
 > wasted unless some fontencoding requests these characters.

in principle i agree with this kind of approach. however, i don't think
it is a very good idea to require a "script" (that then doesn't work on
all installations or is not available on all installations ...)

essentially, per encoding Xenc.def there will only be the need to produce
Xenc.dfu once (except for fixing it), and so people will distribute .def
and .dfu together anyway rather than distributing .def and .ucr and
relying on some process at the installation to generate the .dfu for them.

so i don't think this will work.

i would suggest something simpler:

a ucs.map file that contains the mappings Unicode->LICR in the form
directly usable in .dfu files, simply as a template for making a .dfu if
really necessary.

perhaps using docstrip to generate the standard dfu files from that file
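
a rough sketch of how that could look, assuming docstrip guards named
ts1 and lgr and output names ts1enc.dfu and lgrenc.dfu (all of these
names are illustrative assumptions, nothing is settled). ucs.map would
carry the guarded declarations, and a small install file would strip
them:

```latex
% ucs.map (hypothetical layout): one line per Unicode->LICR mapping,
% guarded by the encodings that provide the LICR object
%<ts1|lgr>\DeclareUnicodeCharacter{20AC}{\texteuro}
%<ts1>\DeclareUnicodeCharacter{00A5}{\textyen}

% ucs.ins (hypothetical): generate one .dfu per font encoding
\input docstrip
\keepsilent
\askforoverwritefalse
\generate{\file{ts1enc.dfu}{\from{ucs.map}{ts1}}
          \file{lgrenc.dfu}{\from{ucs.map}{lgr}}}
\endbatchfile
```

that way each mapping is stated once, and the .dfu files can still be
distributed ready-made without any script at the installation.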

[further ideas welcome]


 > === 2. \IeC
 >
 > Most characters must be enclosed in a call to \IeC, like it is also
 > done by \DeclareInputText. Otherwise the following fragment
 >
 > \tableofcontents
 > \section{Laß nach}  % La\ss  nach
 >
 > will give a TOC entry "Laßnach" (i.e. the space will go away).


criminal oversight. that certainly needs correction
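
for the record, a minimal sketch of the effect. the active | and the
macro \DemoIeC are stand-ins made up for this example (modelled on what
inputenc's \IeC does), not code from utf8.def:

```latex
\documentclass{article}
\makeatletter
% typesetting: just use the argument; writing to a file: stay
% unexpanded, so the entry is written as "La\DemoIeC {\ss } nach"
\def\DemoIeC{%
  \ifx\protect\@typeset@protect
    \expandafter\@firstofone
  \else
    \noexpand\DemoIeC
  \fi}
\catcode`\|\active
% unprotected variant: the entry is written as "La\ss  nach", and when
% the file is read back the spaces after the control word are skipped
%\def|{\ss}
% protected variant: the brace ends the group, so the space survives
\def|{\DemoIeC{\ss}}
\makeatother
\begin{document}
\tableofcontents
\section{La| nach}
\end{document}
```

switching between the two definitions of | and running LaTeX twice
should show the space in the TOC entry disappear and reappear.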


 > === 3. Unicode to LaTeX mappings.
 >
 > There are already extensive lists of character mappings available at:
 > http://www.unruh.de/DniQ/latex/unicode/content/config/

so there is, worth stealing from


 > === 4. The loading of the .dfu files.
 >
 > It has been mentioned that the late loading of the .dfu files (lines
 > 113--124) causes problems with saveboxes. For completeness I'd like to
 > add that \xdef's etc. also cause similar problems when used in the
 > preamble.
 >

there is no such thing as \xdef (on free input) in LaTeX; it should
always be \protected@edef or the like. having said that, it doesn't
really help, as the utf parsing expands straight up to the LICR (and that
is not yet defined at this point)

so yes. you can formulate it differently: with the current
implementation this supports utf8 only _after_ begin document

with the outlined implementation, however, that problem is going to vanish
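
to make the limitation concrete, a small example of what fails under the
draft as discussed here (assuming the .dfu files are only read at
\begin{document}; \greetingbox is a made-up name, and the file has to be
saved as UTF-8):

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}
\newsavebox\greetingbox
% in the preamble the draft has not yet loaded any .dfu file, so the
% UTF-8 sequence for "ß" has no LICR mapping and this line would fail:
%\sbox\greetingbox{Laß nach}
\begin{document}
% after \begin{document} the mappings exist and the same line is fine
\sbox\greetingbox{Laß nach}
\usebox\greetingbox
\end{document}
```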


 > === 5. Interoperability with ucs.sty
 >
 > There are some name clashes with my Unicode package.

i'm not totally ignorant of your work, but it is a while ago that i
looked at it in some more detail and ...

my impression was that it tries to provide much more than what we have
been after in the exercise for inputenc.

would it be possible for you to give us a ten-line bullet list of
comparison?

perhaps the best is simply to forget about what we did on lazy afternoons
during the Xmas holidays?


 > - utf8.def: I accept the fact, that this is the canonical name for
 > that file and will rename my inputencoding in favour of the kernel's
 > encoding.

negotiable, certainly. you are trying to hook into inputenc as well,
aren't you?

 > - \DeclareUnicodeCharacter: This command is named identically in my
 > system. I would appreciate it if another name could be chosen at this
 > early stage to avoid chaos.

what are your arguments?


 >  Some possible names would be
 >
 > \DeclareUnicodeGlyph (according to the nomenclature of the Unicode
 > standard)

no, we map to LICR; those are characters, not glyphs!

 > \DeclareUnicodeCommand (analogous to \DeclareTextCommand)

no again, Command in that context already has some semantics

maybe

 \DeclareUnicodeLaTeXMapping


frank

------_=_NextPart_001_01C2BE60.688CF000--