MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C46295.3F9B2D80"
In-Reply-To:  <20040705.073134.197330345.wl@gnu.org>
References: <20040705.073134.197330345.wl@gnu.org>
Content-class: urn:content-classes:message
Subject: Re: accents and inputenc
Date: Mon, 5 Jul 2004 14:30:06 +0100
Message-ID: A<20040705133005.GA3295@m0A02325D.vpn.uni-freiburg.de>
Thread-Topic: accents and inputenc
Thread-Index: AcRilT/sK83d4ypwR+ek8MmlciHcpQ==
From: "Heiko Oberdiek" <oberdiek@UNI-FREIBURG.DE>
Sender: "Mailing list for the LaTeX3 project" <LATEX-L@listserv.uni-heidelberg.de>
To: <LATEX-L@listserv.uni-heidelberg.de>
Reply-To: "Mailing list for the LaTeX3 project" <LATEX-L@listserv.uni-heidelberg.de>
Status: R

This is a multi-part message in MIME format.

------_=_NextPart_001_01C46295.3F9B2D80
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

On Mon, Jul 05, 2004 at 07:31:34AM +0200, Werner LEMBERG wrote:

> [LaTeX 2e 2003/12/01]
>
> Is the following a known limitation or a bug?  And if it is a
> limitation, where is it documented?
>
>   \documentclass{article}
>
>   \usepackage[latin3]{inputenc}
>
>   \begin{document}
>   \tableofcontents
>   \section{\'^^b9}
>   \end{document}
>
> ^^b9 is the dotless i in latin 3 -- in the TOC, the accent is
> formatted incorrectly.  BTW, it doesn't matter whether OT1 or T1 is
> used.

Package inputenc translates the input characters that it controls
into TeX code. For example, ^^b9 becomes:
  \show^^b9
  ->\IeC {\i }
That is four tokens (\IeC, {, \i, }) instead of the single ^^b9 token.

This expansion is what gets written to the .aux and .toc files:
  \contentsline {section}{\numberline {1}\'\IeC {\i }}{1}

The purpose of \IeC is to preserve a space that follows the
character when the text is read back from a file:
  ^^b9 foobar     --> space between
  \i foobar       --> no space
  \IeC{\i} foobar --> space between
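
As a rough sketch of the mechanism (quoted from memory, so treat the
exact code as an assumption rather than the text of inputenc.sty),
\IeC is defined along these lines:
  \def\IeC{%
    \ifx\protect\@typeset@protect
      \expandafter\@firstofone  % typesetting: drop \IeC and the braces
    \else
      \noexpand\IeC             % writing to a file: keep \IeC {...}
    \fi}
The braces are what matter when the .toc file is read back: a space
after the control word \i is skipped by TeX's tokenizer, whereas a
space after the closing brace is an ordinary space token.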

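Because the character is written as four tokens, the accent command
\' in the .toc file takes only the first token, \IeC, as its
argument. You therefore need braces around such characters: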
  \section{\'{^^b9}}
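
For completeness, a minimal sketch of the original test file with
this workaround applied (only the braces differ from the example
quoted above):
  \documentclass{article}
  \usepackage[latin3]{inputenc}
  \begin{document}
  \tableofcontents
  % The braces make \IeC {\i } a single argument of \' in the .toc
  \section{\'{^^b9}}
  \end{document}
The .toc file should then contain \'{\IeC {\i }} in place of
\'\IeC {\i }; two LaTeX runs are needed before the table of contents
shows the result.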

Of course, inputenc could be changed to behave differently: the
translation into TeX code could be deferred in protected contexts
(such as writes to auxiliary files), so that the 8-bit character
itself goes into the .aux and .toc files:
  \contentsline {section}{\numberline {1}\'^^b9}{1}

The disadvantage of this approach is that the \section command
and \tableofcontents are processed at different times, perhaps with
different input encodings. The wrong input encoding could then be
applied to the section title in the table of contents, so changes
of the input encoding would have to be recorded in the .toc file, too.
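
To illustrate the last point: under such a deferred scheme, every
encoding switch would need a matching entry in the .toc file. A
hedged sketch, using the existing \inputencoding and \addtocontents
commands (the choice of latin1 is only an example):
  \inputencoding{latin1}
  % Record the switch so the .toc is read with the same encoding
  \addtocontents{toc}{\protect\inputencoding{latin1}}
  \section{...}
With the current inputenc this is unnecessary, because the \IeC form
written to the file does not depend on the input encoding.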

Yours sincerely
  Heiko <oberdiek@uni-freiburg.de>

------_=_NextPart_001_01C46295.3F9B2D80--