MIME-Version: 1.0
References: <CANQYN6xiwms0LdCgW-6AQ9Pok9qbtc6k0ZgN40z6RK9+a=RdWg@mail.gmail.com> <4E945D77.6090309@morningstar2.co.uk>
            <20111011153219.GA3677@oberdiek.my-fqdn.de>
            <CANQYN6zkcHWNBUfKK_+nDc5r4hcBa7pHyUE0_1mKu+sHQbWC8A@mail.gmail.com>
            <20111012093958.GA9734@oberdiek.my-fqdn.de>
            <CANQYN6xfTa34wrshLWHmTOVg5U2ATzzK1ML=_qHZw=mGRRdxrw@mail.gmail.com>
            <20111012180101.GA17461@oberdiek.my-fqdn.de>
Content-Type: text/plain; charset=ISO-8859-1
Message-ID:  <CANQYN6wGugJKNBxKfiWBTuir7--GYEtWN1MKDe7PfwvJ5u6tug@mail.gmail.com>
Date:         Thu, 13 Oct 2011 05:56:14 -0400
Reply-To: Mailing list for the LaTeX3 project
              <LATEX-L@listserv.uni-heidelberg.de>
Sender: Mailing list for the LaTeX3 project <LATEX-L@listserv.uni-heidelberg.de>
From: Bruno Le Floch <blflatex@GMAIL.COM>
Subject: Re: Strings, and regular expressions
To: LATEX-L@listserv.uni-heidelberg.de
In-Reply-To:  <20111012180101.GA17461@oberdiek.my-fqdn.de>
Precedence: list
Status: R

> Afaik that is exactly the problem, that there is no explanation
> (=documentation/specification) on the XeTeX side. In most cases
> the contents of \special expects bytes, the conversion of bytes
> with 8 bits to UTF-8 destroys the contents.

Right, so I'll have to look at the precise usage you make of this in
your packages.

>> Putting non-basic encodings in .def files is probably the best
>> approach indeed. The time it takes for me to code all that is probably
>> the only disadvantage, since that means postponing the floating point
>> module.
>
> I wouldn't do it manually. There are mappings files for Unicode:
>   http://unicode.org/Public/MAPPINGS/
> In project I am using these mappings together with a perl script
> to generate the .def files.

Thank you for the link. It seems that the simplest approach would be
to use the tables provided there directly as the .def files: set
\catcode`\#=14 (so that # starts a comment), set a few other default
catcodes, then input the file, looping over the lines. Are all of the
lines of the form

0xHH    0xHHHH    # comment

(or comment lines), with H a hexadecimal digit? In other words, are
all of those encodings 8-bit only, with only Unicode code points
below 65536?
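
One way to answer that question before writing any TeX would be to
scan a mapping file with a small script. The following is only a
sketch in Python: the function name parse_mapping and the regular
expression are my own assumptions about the format, not something
taken from the Unicode files or from any package. It collects the
byte -> code point pairs that fit the pattern above and reports every
other non-comment line, so that exceptions show up.

```python
import re

# Assumed line shape: "0xHH <whitespace> 0xHHHH <optional # comment>".
LINE_RE = re.compile(r"^0x([0-9A-Fa-f]+)\s+0x([0-9A-Fa-f]+)\s*(?:#.*)?$")


def parse_mapping(text):
    """Return (table, oddities) for the contents of one mapping file.

    table maps a byte value (0..255) to a code point below 65536.
    oddities lists (line number, line) for every line that is neither
    blank, nor a comment, nor of the assumed 8-bit form.
    """
    table = {}
    oddities = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank or comment-only line
        match = LINE_RE.match(stripped)
        if match is None:
            oddities.append((number, line))  # does not fit the pattern
            continue
        byte = int(match.group(1), 16)
        code_point = int(match.group(2), 16)
        if byte > 0xFF or code_point > 0xFFFF:
            oddities.append((number, line))  # not 8-bit, or not < 65536
            continue
        table[byte] = code_point
    return table, oddities
```

If oddities comes back empty for a given file, that file could be
input more or less as is with the catcode setup described above.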


> Of course not, size is 2N.
> * All safe characters could be used, then the size decreases
>   (e.g. ASCII85, ...). But the problem is to find safe characters.
>   In especially this set might change.
> * Some kind of compression could be applied.

Right. With my comments on \lowercase I was mostly thinking of speed.
Space is not an issue within a TeX run (unless you start manipulating
really massive strings). It can be a problem when writing to the PDF
file.

> It could be made even expandable in linear time
> with a large lookup table (256).

Right. I was thinking in terms of UTF-8 for some reason, for which the
lookup table would be too big.
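
To make the byte-wise scheme concrete, here is a rough sketch in
Python rather than TeX. The names HEX_TABLE, escape_hex and
unescape_hex are hypothetical, chosen only for this illustration. It
shows the 256-entry table, one lookup per byte (hence linear time),
and an output of exactly 2N characters for N input bytes.

```python
# 256-entry lookup table: byte value -> two uppercase hex digits.
HEX_TABLE = ["%02X" % value for value in range(256)]


def escape_hex(data: bytes) -> str:
    """Encode bytes as hex digits with one table lookup per byte."""
    return "".join(HEX_TABLE[byte] for byte in data)


def unescape_hex(text: str) -> bytes:
    """Decode a string of hex digits back into the original bytes."""
    return bytes.fromhex(text)
```

A TeX version would presumably store the 256 entries as one control
sequence per byte, but that layout is my assumption, not a tested
design.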

> And there is engine support (\pdfunescapehex).

Good.

>> A safe format where
>> more characters are as is seems possibly faster? Also, this doesn't
>> allow storage of Unicode data (unless we use an UTF, but the overhead
>> of decoding the UTF may be large). Do you think we could devise a more
>> efficient method?
>
> I think that depends on the Unicode support of the engine.

Right. I need to give some serious thought to optimization in all
those encoding translations and string storage. Give me a few weeks
to come up with a good idea.

> Very short (e.g. label/anchor names, ...) up to very huge (e.g. object
> stream data of images, ...).

That's tough, then :).

Regards,
Bruno