Date: Wed, 12 Oct 2011 13:19:24 -0400
From: Bruno Le Floch
Reply-To: Mailing list for the LaTeX3 project
Subject: Re: Strings, and regular expressions
To: LATEX-L@listserv.uni-heidelberg.de

> Most important for PDF strings:
>
> * PDFDocEncoding
> * UTF-16
>
> (hyperref also uses "ascii-print" in case of XeTeX because of
> encoding problems with \special.)

Can you elaborate on the encoding problems that XeTeX has? (A link to
an explanation would be great.)

>> I guess that most "iso-..." and "cp..." encodings are an overkill for
>> a kernel.
>
> They should be loadable as files similar to LaTeX's .def files
> for inputenc or fontenc. Then the kernel can provide a base set
> and others can be provided by other projects. But I don't see
> the disadvantage if such a base set is not minimal.

Putting non-basic encodings in .def files is indeed probably the best
approach. The only disadvantage is probably the time it takes me to
code all that, since it means postponing the floating point module.

> * String escaping, provided by \pdfescapestring.
> * Name escaping, provided by \pdfescapename.
> * Hex strings, provided by \pdfescapehex.

I've coded all three already, based on one of your previous mails to
LaTeX-L (when, two years ago, Joseph had mentioned strings here).
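For readers who want the three escapes spelled out, here is a minimal
sketch in Python rather than TeX, purely for illustration. It follows
the general PDF syntax rules for literal strings, names and hex strings.
The function names are hypothetical, and the output is not claimed to
match the pdfTeX primitives byte for byte.

```python
# Illustrative sketch only: Python stand-ins for the three escaping
# schemes. All names here are hypothetical, not part of any package.

PDF_DELIMITERS = b"()<>[]{}/%"


def escape_string(data: bytes) -> str:
    """Escape bytes for use inside a PDF literal string ( ... )."""
    out = []
    for byte in data:
        if byte in b"\\()":
            # Backslash and parentheses get a backslash prefix.
            out.append("\\" + chr(byte))
        elif 33 <= byte <= 126:
            out.append(chr(byte))
        else:
            # Everything else (including space) as a 3-digit octal escape.
            out.append("\\%03o" % byte)
    return "".join(out)


def escape_name(data: bytes) -> str:
    """Escape bytes for use in a PDF name (after the leading slash)."""
    out = []
    for byte in data:
        if byte == 0:
            # Assumption of this sketch: reject byte 0, which a PDF
            # name cannot contain.
            raise ValueError("byte 0 is not allowed in a PDF name")
        if 33 <= byte <= 126 and byte not in PDF_DELIMITERS and byte != ord("#"):
            out.append(chr(byte))
        else:
            out.append("#%02X" % byte)
    return "".join(out)


def escape_hex(data: bytes) -> str:
    """Two uppercase hex digits per byte, as in a PDF hex string < ... >."""
    return data.hex().upper()


def unescape_hex(text: str) -> bytes:
    """Inverse of escape_hex."""
    return bytes.fromhex(text)
```

The hex form is the only one of the three whose output alphabet is
restricted to digits and the letters A to F, which is what makes it
attractive for protecting arbitrary data.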
> The latter is also useful for other contexts, e.g. for protecting
> arbitrary string data in auxiliary files.

It is definitely a safe way of storing data, but is it the most
efficient? Decoding it requires setting the lccode of ^^@ to the number
found, then doing \lowercase{\edef\result{\result^^@}} for every
character. This is quadratic in the length of the string, since each
\edef expands the whole of \result again. Might a safe format that
leaves more characters as they are be faster? Also, this doesn't allow
storage of Unicode data (unless we use a UTF, but the overhead of
decoding the UTF may be large). Do you think we could devise a more
efficient method?

Are there cases where many ^^@ (byte 0) must be output in a row? If
not, we can lowercase characters other than ^^@ to produce the relevant
bytes, for instance

    \lccode0=... \lccode1=... [...] \lccode255=...
    \lowercase{\edef\result{\result^^@^^A...^^?}}

The most efficient decoding method depends on how long the string is.
What is the typical length of strings that we should optimize for?

>> Also, when you say "Unicode encoding", I presume that this means
>> native strings for XeTeX and LuaTeX, but what about pdfTeX? Do you use
>> "UTF-16" (if so, LE or BE?), or some other UTF?
>
> In the context of bookmarks and other PDF strings "Unicode"
> means UTF-16 (hyperref uses BE, but there is a byte order mark).
> And the strings are a sequence of bytes. The big chars of XeTeX or
> LuaTeX don't help, because they get written as UTF-8.

Thanks for the information (and thank you Lars for the clarification on
byte order). I presume that the cleanest approach is to store strings
internally as lists of numbers, then convert them to various formats.

--
Bruno
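
P.S. To make the "lists of numbers" idea concrete, here is a rough
sketch, in Python rather than TeX and only as an illustration. It takes
a string stored as a list of code points and converts it to UTF-16BE
bytes preceded by a byte order mark, the form described above for PDF
strings. The function name is hypothetical and nothing here is actual
kernel code.

```python
# Illustrative sketch: a string held internally as a list of code
# points, converted to UTF-16BE with a leading byte order mark.
# The function name is hypothetical.

def to_utf16be_with_bom(code_points):
    out = bytearray(b"\xfe\xff")  # byte order mark for big-endian
    for cp in code_points:
        if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("not a Unicode scalar value: %r" % (cp,))
        if cp < 0x10000:
            # Basic Multilingual Plane: one 16-bit unit.
            out += cp.to_bytes(2, "big")
        else:
            # Supplementary planes: a surrogate pair.
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)
            low = 0xDC00 | (v & 0x3FF)
            out += high.to_bytes(2, "big") + low.to_bytes(2, "big")
    return bytes(out)
```

The result is a plain sequence of bytes, so it can then be passed
through whichever escaping the output context requires.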