Date: Thu, 13 Oct 2011 05:56:14 -0400
From: Bruno Le Floch
Reply-To: Mailing list for the LaTeX3 project
To: LATEX-L@listserv.uni-heidelberg.de
Subject: Re: Strings, and regular expressions

> Afaik that is exactly the problem, that there is no explanation
> (=documentation/specification) on the XeTeX side. In most cases
> the contents of \special expects bytes, the conversion of bytes
> with 8 bits to UTF-8 destroys the contents.

Right, so I'll have to look at the precise usage you make of this in
your packages.

>> Putting non-basic encodings in .def files is probably the best
>> approach indeed. The time it takes for me to code all that is probably
>> the only disadvantage, since that means postponing the floating point
>> module.
>
> I wouldn't do it manually. There are mappings files for Unicode:
> http://unicode.org/Public/MAPPINGS/
> In project I am using these mappings together with a perl script
> to generate the .def files.

Thank you for the link. It seems that the simplest approach would be
to use the tables provided there directly as the .def files: set
\catcode`\#=14 (so that # starts a comment), set a few other default
catcodes, then input the file, looping over the lines.
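
As a rough illustration of that loop (in Python rather than TeX, and not the actual implementation), here is a minimal sketch. It assumes every data line has two hexadecimal fields followed by an optional # comment, and it simply skips lines with fewer than two fields. The function name parse_mapping and the sample excerpt are made up for the example.

```python
def parse_mapping(text):
    """Parse mapping-table text into a dict {byte value: Unicode code point}."""
    table = {}
    for line in text.splitlines():
        # Drop everything from '#' onwards, mirroring \catcode`\#=14 in TeX.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank line, comment-only line, or entry without a target
        byte_value = int(fields[0], 16)   # int() accepts the "0x" prefix in base 16
        code_point = int(fields[1], 16)
        table[byte_value] = code_point
    return table


# Hypothetical excerpt written in the style of the unicode.org tables.
sample = """\
# Name: example table (made up for this sketch)
0x41\t0x0041\t# LATIN CAPITAL LETTER A
0xE9\t0x00E9\t# LATIN SMALL LETTER E WITH ACUTE
0x81\t      \t#UNDEFINED
"""

table = parse_mapping(sample)
```

Whether skipping short lines is the right policy depends on how the real tables mark undefined or multi-byte entries, which is exactly what needs checking.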
Are all of the lines of the form
  0xHH 0xHHHH # comment
(or comment lines), with H = some hexadecimal digit? In other words,
are all those encodings 8-bit only, and with only Unicode points
<65536?

> Of course not, size is 2N.
> * All safe characters could be used, then the size decreases
>   (e.g. ASCII85, ...). But the problem is to find safe characters.
>   In especially this set might change.
> * Some kind of compression could be applied.

Right. I was mostly thinking of speed, with my comments on \lowercase.
Space is not an issue internally to a TeX run (unless you start
manipulating really massive strings). It can be a problem when writing
to the PDF file.

> It could be made even expandable in linear time
> with a large lookup table (256).

Right. I was thinking in terms of UTF-8 for some reason, and the
lookup table would be too big.

> And there is engine support (\pdfunescapehex).

Good.

>> A safe format where
>> more characters are as is seems possibly faster? Also, this doesn't
>> allow storage of Unicode data (unless we use an UTF, but the overhead
>> of decoding the UTF may be large). Do you think we could devise a more
>> efficient method?
>
> I think that depends on the Unicode support of the engine.

Right. I need to give some serious thought to optimization in all
those encoding translations and string storages. Give me a few weeks
to come up with a good idea.

> Very short (e.g. label/anchor names, ...) up to very huge (e.g. object
> stream data of images, ...).

That's tough, then :).

Regards,
Bruno
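
For reference, the hex storage discussed above (size 2N, one lookup in a 256-entry table per byte) can be sketched as follows. This is Python rather than TeX, so it only illustrates the data layout, not expandability or speed inside a TeX run. The names HEX_TABLE, escape_hex and unescape_hex are invented for the example.

```python
# 256-entry lookup table: byte value -> its two uppercase hex digits.
HEX_TABLE = [format(b, "02X") for b in range(256)]


def escape_hex(data: bytes) -> str:
    """Encode N bytes as 2N hex characters, one table lookup per byte."""
    return "".join(HEX_TABLE[b] for b in data)


def unescape_hex(text: str) -> bytes:
    """Decode a hex string back to bytes."""
    # pdfTeX provides \pdfunescapehex for this direction.
    return bytes.fromhex(text)


payload = b"a(b)\\"            # bytes that are awkward inside a PDF string
encoded = escape_hex(payload)  # only the characters 0-9 and A-F remain
decoded = unescape_hex(encoded)
```

The output uses only sixteen safe characters, which is why the size doubles. A denser scheme such as ASCII85 would need a larger set of characters that are known to be safe.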