Date: Wed, 12 Oct 2011 20:01:01 +0200
From: Heiko Oberdiek
To: LATEX-L@listserv.uni-heidelberg.de
Subject: Re: Strings, and regular expressions
Message-ID: <20111012180101.GA17461@oberdiek.my-fqdn.de>
References: <4E945D77.6090309@morningstar2.co.uk>
  <20111011153219.GA3677@oberdiek.my-fqdn.de>
  <20111012093958.GA9734@oberdiek.my-fqdn.de>

On Wed, Oct 12, 2011 at 01:19:24PM -0400, Bruno Le Floch wrote:

> > Most important for PDF strings:
> >
> > * PDFDocEncoding
> > * UTF-16
> >
> > (hyperref also uses "ascii-print" in case of XeTeX because of
> > encoding problems with \special.)
>
> Can you elaborate on the encoding problems that XeTeX has? (a link to
> an explanation would be great)

As far as I know, that is exactly the problem: there is no explanation
(= documentation/specification) on the XeTeX side. In most cases the
contents of \special are expected to be bytes; converting 8-bit bytes
to UTF-8 destroys the contents.

> >> I guess that most "iso-..." and "cp..."
> >> encodings are an overkill for
> >> a kernel.
> >
> > They should be loadable as files similar to LaTeX's .def files
> > for inputenc or fontenc. Then the kernel can provide a base set
> > and others can be provided by other projects. But I don't see
> > the disadvantage if such a base set is not minimal.
>
> Putting non-basic encodings in .def files is probably the best
> approach indeed. The time it takes for me to code all that is probably
> the only disadvantage, since that means postponing the floating point
> module.

I wouldn't do it manually. There are mapping files for Unicode:
  http://unicode.org/Public/MAPPINGS/
In my own work I am using these mappings together with a Perl script
to generate the .def files.

> > * String escaping, provided by \pdfescapestring.
> > * Name escaping, provided by \pdfescapename.
> > * Hex strings, provided by \pdfescapehex.
>
> I've coded all three already, based on one of your previous mails to
> LaTeX-L (when, two years ago, Joseph had mentioned strings here).

Good.

> > The latter is also useful for other contexts, e.g. for protecting
> > arbitrary string data in auxiliary files.
>
> It is definitely a safe way of storing data, but is it the most
> efficient?

Of course not; the size is 2N.
* All safe characters could be used, which reduces the size
  (e.g. ASCII85, ...). But the problem is to find safe characters.
  In particular, this set might change.
* Some kind of compression could be applied.
Using hex strings is just simple, fast and easy to implement.

> Decoding it requires setting the lccode of ^^@ to the
> number found, then \lowercase{\edef\result{\result^^@}} for every
> character, quadratic in the length of the string.

It could even be made expandable in linear time with a large
lookup table (256 entries). And there is engine support
(\pdfunescapehex).

> A safe format where
> more characters are as is seems possibly faster?
> Also, this doesn't
> allow storage of Unicode data (unless we use an UTF, but the overhead
> of decoding the UTF may be large). Do you think we could devise a more
> efficient method?

I think that depends on the Unicode support of the engine.

> Are there cases where many ^^@ (byte 0) must be output in a row? If
> not, we can lowercase characters other than ^^@ to produce the
> relevant bytes, for instance
>
> \lccode0=... \lccode1=... [...] \lccode255=...
> \lowercase{\edef\result{\result^^@^^A...^^?}}
>
> The most efficient decoding method depends on how long the string is.
> What is the typical length of strings that we should optimize for?

Anything from very short (e.g. label/anchor names, ...) up to very
large (e.g. object stream data of images, ...).

Yours sincerely
  Heiko Oberdiek
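
The hex-string trade-off discussed in this message (output size 2N,
decoding in one linear pass with a 256-entry lookup table) can be
sketched outside TeX. The following is only an illustration in Python,
not the implementation of \pdfescapehex or \pdfunescapehex; the names
HEX_TABLE, hex_encode and hex_decode are hypothetical, and Ascii85 from
the Python standard library stands in for "a denser set of safe
characters".

```python
import base64

# Hypothetical helper names; this mirrors the idea only, not any TeX macro.
# 256-entry lookup table: two uppercase hex digits -> byte value.
HEX_TABLE = {format(b, "02X"): b for b in range(256)}


def hex_encode(data: bytes) -> str:
    """Encode every byte as two uppercase hex digits (output size is 2N)."""
    return "".join(format(b, "02X") for b in data)


def hex_decode(text: str) -> bytes:
    """Decode in a single pass: one table lookup per pair of hex digits."""
    if len(text) % 2:
        raise ValueError("hex string must have an even number of digits")
    pairs = (text[i:i + 2].upper() for i in range(0, len(text), 2))
    return bytes(HEX_TABLE[p] for p in pairs)


if __name__ == "__main__":
    data = bytes(range(256))            # every possible byte value once
    encoded = hex_encode(data)
    print(len(data), len(encoded))      # N and 2N
    print(len(base64.a85encode(data)))  # Ascii85 length, for comparison
    print(hex_decode(encoded) == data)  # round trip
```

The denser encoding is shorter, but it relies on a larger set of
characters being safe in the target context, which is exactly the
concern raised above; hex digits need only 0-9 and A-F.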