Subject: Re: latex/3480: Support for UTF-8 missing in inputenc.sty
From: Dominique Unruh
Reply-To: Mailing list for the LaTeX3 project
Date: Thu, 5 Dec 2002 23:57:12 +0100
In-Reply-To: <15853.10032.823833.338602@istrati.mittelbach-online.de>

Short info on what this discussion is about:

We were discussing the possibility of adding UTF-8 inputenc support to
LaTeX. The existing package ucs.sty is deemed too big and too
resource-consuming for inclusion in the kernel. The discussion has now
moved to LATEX-L.

Frank wrote:
> it seems important to me to follow up the question Chris has posted
> about what are input and what are output (font) encodings.

Yes, I do understand this difference. But when adding UTF-8 support,
it is probably unwise to load all supported UTF-8 sequences. I
therefore proposed adding to each font encoding a piece of information
stating which Unicode range is to be loaded for it. To clarify this,
here is an example.

If we have code like the following:

\usepackage[utf8]{inputenc}
\usepackage[T2A]{fontenc}

the file t2aenc.def could contain a line like:

\FontencUnicodeRange{"400-"4FF}

and at \AtBeginDocument the UTF-8 sequences would be loaded only for
the ranges given by the font encodings, so the user no longer has to
decide which sequences to load. If UTF-8 is not in use, the
\FontencUnicodeRange declarations are ignored.

Of course, the fontencoding->Unicode-range mappings could also live in
a separate file, which would remove the need to change the existing
font encoding files.
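
A minimal sketch of how such a declaration could be collected and acted on, purely as an illustration. All internal names (\utf@ranges, \utf@do@range, \utf@load@range, \ifutf@input) are hypothetical and exist neither in the kernel nor in ucs.sty, and the loader is only a placeholder that reports the range:

```latex
\makeatletter
% Hypothetical names throughout; none of this exists in the kernel or ucs.sty.
\newif\ifutf@input   % assumed to be set true by a utf8 input encoding file
\def\utf@ranges{}    % accumulates the ranges requested so far

% \FontencUnicodeRange{"400-"4FF}: record one range for later processing.
\newcommand*\FontencUnicodeRange[1]{%
  \g@addto@macro\utf@ranges{\utf@do@range#1\@nil}}

% Split "<first>-<last>" at the hyphen.
\def\utf@do@range#1-#2\@nil{\utf@load@range{#1}{#2}}

% Placeholder: a real loader would define the UTF-8 sequences for all
% code points from #1 to #2; this one only reports the range.
\def\utf@load@range#1#2{\typeout{Would load Unicode range #1 to #2}}

% Act on the recorded ranges only if UTF-8 input was requested.
\AtBeginDocument{\ifutf@input \utf@ranges \fi}
\makeatother
```

Because the ranges are only stored at declaration time, an encoding file can declare them unconditionally; whether anything is loaded is decided once, at \begin{document}.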

> commands, eg instead of
> \DeclareInputText{164}{\textcurrency}
> we probably need something like
> [...]
>   \DeclareUTFeightInputText{<whatever-number-or-identification>}{\textcurrency}

Code for this can be extracted from the utf8.def that ships with
ucs.sty. Anyone interested could have a look at the following macros
in that file (unfortunately still mostly undocumented):

\utf@viii@map{number} constructs the control sequence \u8-n-BCD for a
UTF-8 sequence, where n is the first byte of the sequence (as a decimal
number) and BCD are the one, two or three following bytes (as
characters). At present the macro's content is just the number, but the
code can easily be changed to define it as anything given
(e.g. \textcurrency).
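
To make the naming scheme concrete, here is a hedged sketch of what one such definition could look like for Frank's \textcurrency example. It is not the code from utf8.def; it assumes an 8-bit engine and uses \string as a precaution that the original macros may handle differently:

```latex
% Hypothetical sketch, not the code from utf8.def; assumes an 8-bit engine.
% U+00A4 (currency sign) is the UTF-8 byte pair "C2 "A4, i.e. lead byte 194
% followed by the character ^^a4. \string keeps ^^a4 from being expanded
% inside \csname should it be active.
\expandafter\def\csname u8-194-\string^^a4\endcsname{\textcurrency}
```

Since control sequence names depend only on character codes, a lookup that builds the same name from the input bytes will find this definition regardless of the catcode the byte had when it was read.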

\utf@viii@undef{number}{char}{char}{char} calculates the Unicode
number for a UTF-8 sequence (given again as number, char, char,
char, with \@nil in place of the missing chars for shorter sequences).
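
For reference, the arithmetic behind such a calculation in the three-byte case is the standard UTF-8 decoding rule. The symbols n (lead byte) and c_1, c_2 (continuation bytes) are notation introduced here, and the formula is not necessarily how \utf@viii@undef is implemented. The worked line uses the sequence E3 81 82 as an example:

```latex
\[
  U = (n - 224)\cdot 4096 + (c_1 - 128)\cdot 64 + (c_2 - 128)
\]
\[
  U = (227 - 224)\cdot 4096 + (129 - 128)\cdot 64 + (130 - 128)
    = 12288 + 64 + 2 = 12354 \quad \mbox{(U+3042)}
\]
```

Two-byte sequences follow the same pattern with one term fewer: U = (n - 192) * 64 + (c_1 - 128).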

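A UTF-8 sequence starter would then have to be defined approximately
as follows (here for the sequence starter "E3 = 227):

% ^^e3 is byte "E3 in TeX's lowercase-hex notation; it is assumed to be
% active here (inputenc makes it so) and #1, #2 not to expand in \csname
\def^^e3#1#2{\expandafter\ifx\csname u8-227-#1#2\endcsname\relax
  \utf@viii@undef{227}#1#2\@nil\else
  \csname u8-227-#1#2\endcsname\fi}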

\utf@viii@make does the job of defining such macros (with some
additional code).

Chris wrote:
> I tried to understand Dominique's approach and to compare it with
> David's but both, as on CTAN, consist of undocumented code ... so
> I gave up.  Have you looked at David's code?

My code is documented (though only partly). The comments can be found
in utf8.dtx, or in the files in the CVS archive (see
http://www.unruh.de/DniQ/latex/unicode/). I don't know David's code;
could you give me a CTAN location?

DniQ.
