MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C2B8EF.C913AE80"
In-Reply-To:  <l03130303ba437e2e7d70@[130.239.20.144]>
References: <200212031601.gB3G11cQ009558@sun.dante.de>            <l03130303ba437e2e7d70@[130.239.20.144]>
Content-class: urn:content-classes:message
Subject:      Re: latex/3480: Support for UTF-8 missing in inputenc.sty
Date: Fri, 10 Jan 2003 22:23:20 +0100
Message-ID: A<15903.14792.193451.96963@istrati.mittelbach-online.de>
Thread-Topic:      Re: latex/3480: Support for UTF-8 missing in inputenc.sty
Thread-Index: AcK478mQ8NihOiMDRK6hxft6++N1xQ==
From: "Frank Mittelbach" <frank.mittelbach@LATEX-PROJECT.ORG>
To: <LATEX-L@listserv.uni-heidelberg.de>
Reply-To: "Mailing list for the LaTeX3 project" <LATEX-L@listserv.uni-heidelberg.de>
Status: R

This is a multi-part message in MIME format.

------_=_NextPart_001_01C2B8EF.C913AE80
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Lars Hellström writes:

 > >but first i would like to hear comments/suggestions on the approach
 >
 > I like it -- it's a nice compromise between practical usefulness and the
 > ideal that input and output encodings should be independent.

you can explain that one day to me :-)

 > >\gdef\DeclareUnicodeCharacter#1#2{%
 > >   \count@"#1\relax
 > >   \typeout{ \space\space defining Unicode char #1 (decimal \the\count@)}%
 >
 > That should probably be better as
 >
 >   \typeout{ \space\space defining Unicode char U+#1 (decimal \the\count@)}%

yes.


 > >%</t1>
 > >%    \end{macrocode}
 > >%    The following definitions are in the encoding file but have no
 > >%    direct equivalent in Unicode or simply do not make sense in that
 > >%    context (or I couldn't find anything or \ldots :-).
 > >%\begin{verbatim}
 > >%\DeclareTextSymbol{\j}{OT1}{17}
 > >%\DeclareTextSymbol{\SS}{T1}{223}
 > >%\DeclareTextSymbol{\textcompwordmark}{T1}{23}
 >
 > I would say that the compwordmark is U+200C (ZERO WIDTH NON-JOINER); that
 > character is for example supposed to prevent ligaturing between characters.

yes
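
a minimal sketch of that mapping, assuming the \DeclareUnicodeCharacter
interface quoted above and a utf8 option for inputenc (both are still
proposals in this thread, so the names may change):

\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
% assumption: map U+200C to the T1 compound word mark
\DeclareUnicodeCharacter{200C}{\textcompwordmark}

with something like this, a U+200C typed between f and i should keep the
two letters from forming a ligature in a T1 font.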

 > Adobe has assigned dotlessj to U+F6BE (LATIN SMALL LETTER DOTLESS J), but
 > that is unofficial (and thus not universal) as it resides in the private
 > use area.

what we are trying to do is provide a utf8 input encoding. how likely is it
that some editor or application generates that Adobe thing? not very, i would
guess (at least not now), therefore i would not assign anything.

 > U+00DF (LATIN SMALL LETTER SHARP S) is uppercased as SS (two
 > U+0053), so there probably isn't any \SS.

that was my guess too.

 > >% But the following (and some others) might actually lurk in Unicode
 > >%    somewhere\ldots
 > >%\begin{verbatim}
 > >%\DeclareTextSymbol{\textasteriskcentered}{OMS}{3}   % "03
 >
 > How about U+2217 (ASTERISK OPERATOR)?

that would be wrong in my opinion. the internal LaTeX form
\textasteriskcentered is clearly a text character and U+2217 is a math
symbol. so if some application is requesting U+2217 it should get a * in math
mode, that is (probably, haven't checked the unicode page) a relation or a
binary operator.
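
as a rough sketch of that idea (not a settled mapping), U+2217 could be
declared so that it always lands in math mode. \ast is the binary-operator
asterisk in standard LaTeX; using \ensuremath to force math mode from text is
an assumption made for illustration only:

% assumption: U+2217 is typeset as the math binary operator \ast
\DeclareUnicodeCharacter{2217}{\ensuremath{\ast}}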

 > OTOH, \textasteriskcentered is probably a glyphic variation on the normal
 > asterisk rather than a separate character.

more like that, which means we should not map it unless there is a dedicated
character for this in unicode


 > >%\DeclareTextCommand{\textcircled}{OMS}
 >
 > U+24B6 (CIRCLED LATIN CAPITAL LETTER A) is described as being approximately
 > "<circle> A", where "<circle>" means "something expressed by a higher level
 > protocol" (such as LaTeX). Hence I don't think there is a character for
 > this.

well there is U+20DD but this is a combining char and although that is more or
less what we want we can't support combining in the way unicode uses it, so
yes, this is not an individual char. However, one could implement the unicode
chars U+24B6-U+24E9 if we wish to do that as they would be

\DeclareUnicodeCharacter{24B6}{\textcircled{A}}
...
\DeclareUnicodeCharacter{24E9}{\textcircled{z}}
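
spelled out a little further, as a sketch: in the Unicode charts the capitals
run U+24B6-U+24CF and the small letters U+24D0-U+24E9, so the declarations
around the case boundary would be

\DeclareUnicodeCharacter{24CF}{\textcircled{Z}}
\DeclareUnicodeCharacter{24D0}{\textcircled{a}}

whether \textcircled gives an acceptable glyph for every letter is only
assumed here, not checked.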

whether that is worth doing, I don't know. I guess as part of the exercise we
should perhaps build an extended list of all mappings from unicode to known
(and used) encoding-specific commands.

then people can pick additional mappings as necessary, or load the whole
thing if they have a big enough TeX. A good starting point would be
Sebastian's ucharacters.sty from which my lists were more or less stolen

frank

------_=_NextPart_001_01C2B8EF.C913AE80--