From: "Frank Mittelbach"
Reply-To: "Mailing list for the LaTeX3 project"
Subject: Re: latex/3480: Support for UTF-8 missing in inputenc.sty
Date: Fri, 10 Jan 2003 22:23:20 +0100
References: <200212031601.gB3G11cQ009558@sun.dante.de>
Message-ID: <15903.14792.193451.96963@istrati.mittelbach-online.de>

Lars Hellström writes:

 > >but first i would like to hear comments/suggestions on the approach
 >
 > I like it -- it's a nice compromise between practical usefulness and the
 > ideal that input and output encodings should be independent.

you can explain that one day to me :-)

 > >\gdef\DeclareUnicodeCharacter#1#2{%
 > >   \count@"#1\relax
 > >   \typeout{ \space\space defining Unicode char #1 (decimal \the\count@)}%
 >
 > That should probably be better as
 >
 >   \typeout{ \space\space defining Unicode char U+#1 (decimal \the\count@)}%

yes.
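
The hex-to-decimal step in the quoted macro can be tried in isolation. The following is only a sketch: \ShowUnicodeCodePoint is a hypothetical helper name, not part of the proposal, and it copies just the \count@ assignment and the \typeout line with the U+ prefix suggested above.

```latex
\documentclass{article}
\makeatletter
% Hypothetical helper (not from the proposed code): reads #1 as a
% hexadecimal number via the " prefix and reports it in decimal.
\newcommand*\ShowUnicodeCodePoint[1]{%
  \count@"#1\relax
  \typeout{ \space\space Unicode char U+#1 (decimal \the\count@)}%
}
\makeatother
\ShowUnicodeCodePoint{00DF}% log line should end in: (decimal 223)
\ShowUnicodeCodePoint{200C}% log line should end in: (decimal 8204)
\begin{document}
\end{document}
```

The hex digits have to be upper case here, since TeX's " notation does not accept a-f.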


 > >%</t1>
 > >%    \end{macrocode}
 > >%    The following definitions are in the encoding file but have no
 > >%    direct equivalent in Unicode or simply do not make sense in that
 > >%    context (or I couldn't find anything or \ldots :-).
 > >%\begin{verbatim}
 > >%\DeclareTextSymbol{\j}{OT1}{17}
 > >%\DeclareTextSymbol{\SS}{T1}{223}
 > >%\DeclareTextSymbol{\textcompwordmark}{T1}{23}
 >
 > I would say that the compwordmark is U+200C (ZERO WIDTH NON-JOINER); that
 > character is for example supposed to prevent ligaturing between characters.

yes
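
Under the interface quoted earlier in this thread, that assignment would presumably be a one-line declaration. This is a sketch of the mapping agreed on here, assuming \DeclareUnicodeCharacter takes the hex code point and the LaTeX internal form as its two arguments:

```latex
% Assumed mapping: U+200C (ZERO WIDTH NON-JOINER) -> \textcompwordmark
\DeclareUnicodeCharacter{200C}{\textcompwordmark}
% With this in place, a U+200C typed between "f" and "l" in the source
% should behave like the explicit form
%   shelf\textcompwordmark ful
% i.e. no fl ligature is formed.
```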

 > Adobe has assigned dotlessj to U+F6BE (LATIN SMALL LETTER DOTLESS J), but
 > that is unofficial (and thus not universal) as it resides in the private
 > use area.

what we are trying to do is provide a utf8 input encoding; how likely is it
that some editor or application generates that Adobe thing? not very, i would
guess (at least not now), therefore i would not assign anything.

 > U+00DF (LATIN SMALL LETTER SHARP S) is uppercased as SS (two
 > U+0053), so there probably isn't any \SS.

that was my guess too.

 > >% But the following (and some others) might actually lurk in Unicode
 > >%    somewhere\ldots
 > >%\begin{verbatim}
 > >%\DeclareTextSymbol{\textasteriskcentered}{OMS}{3}   % "03
 >
 > How about U+2217 (ASTERISK OPERATOR)?

that would be wrong in my opinion. the internal LaTeX form
\textasteriskcentered is clearly a text character and U+2217 is a math
symbol. so if some application is requesting U+2217 it should get a * in math
mode, i.e. (probably, haven't checked the unicode page) a relation or a
binary operator.

 > OTOH, \textasteriskcentered is probably a glyphic variation on the normal
 > asterisk rather than a separate character.

more like that, which means we should not map it unless there is a dedicated
character for this in unicode


 > >%\DeclareTextCommand{\textcircled}{OMS}
 >
 > U+24B6 (CIRCLED LATIN CAPITAL LETTER A) is described as being approximately
 > "<circle> A", where "<circle>" means "something expressed by a higher level
 > protocol" (such as LaTeX). Hence I don't think there is a character for
 > this.

well, there is U+20DD, but this is a combining char, and although that is more
or less what we want, we can't support combining in the way unicode uses it, so
yes, this is not an individual char. However, one could implement the unicode
chars U+24B6-U+24E9 if we wish to do that, as they would be

\DeclareUnicodeCharacter{24B6}{\textcircled{A}}
...
\DeclareUnicodeCharacter{24E9}{\textcircled{z}}

whether that is worth doing, I don't know. I guess as part of the exercise we
should perhaps build an extended list of all mappings from unicode to known
(and used) encoding-specific commands.

then people can pick additional mappings as necessary, or load the whole
thing if they have a big enough TeX. A good starting point would be
Sebastian's ucharacters.sty, from which my lists were more or less stolen.

frank
