MIME-Version: 1.0
References: <CANQYN6xiwms0LdCgW-6AQ9Pok9qbtc6k0ZgN40z6RK9+a=RdWg@mail.gmail.com> <4E945D77.6090309@morningstar2.co.uk>
            <20111011153219.GA3677@oberdiek.my-fqdn.de>
            <CANQYN6zkcHWNBUfKK_+nDc5r4hcBa7pHyUE0_1mKu+sHQbWC8A@mail.gmail.com>
            <20111012093958.GA9734@oberdiek.my-fqdn.de>
Content-Type: text/plain; charset=ISO-8859-1
Message-ID:  <CANQYN6xfTa34wrshLWHmTOVg5U2ATzzK1ML=_qHZw=mGRRdxrw@mail.gmail.com>
Date:         Wed, 12 Oct 2011 13:19:24 -0400
Reply-To: Mailing list for the LaTeX3 project
              <LATEX-L@listserv.uni-heidelberg.de>
Sender: Mailing list for the LaTeX3 project <LATEX-L@listserv.uni-heidelberg.de>
From: Bruno Le Floch <blflatex@GMAIL.COM>
Subject: Re: Strings, and regular expressions
To: LATEX-L@listserv.uni-heidelberg.de
In-Reply-To:  <20111012093958.GA9734@oberdiek.my-fqdn.de>

> Most important for PDF strings:
>
> * PDFDocEncoding
> * UTF-16
>
> (hyperref also uses "ascii-print" in case of XeTeX because of
> encoding problems with \special.)

Can you elaborate on the encoding problems that XeTeX has? (a link to
an explanation would be great)

>> I guess that most "iso-..." and "cp..." encodings are an overkill for
>> a kernel.
>
> They should be loadable as files similar to LaTeX's .def files
> for inputenc or fontenc. Then the kernel can provide a base set
> and others can be provided by other projects. But I don't see
> the disadvantage if such a base set is not minimal.

Putting non-basic encodings in .def files is probably the best
approach indeed. The time it takes for me to code all that is probably
the only disadvantage, since that means postponing the floating point
module.

> * String escaping, provided by \pdfescapestring.
> * Name escaping, provided by \pdfescapename.
> * Hex strings, provided by \pdfescapehex.

I've coded all three already, based on one of your previous mails to
LaTeX-L (when, two years ago, Joseph had mentioned strings here).
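
For readers who have not used them, here is roughly what the three
escapes do, shown with the pdfTeX primitives themselves rather than
with my own code. This is only a sketch and assumes pdfTeX; the macro
names \strA, \strB and \strC are made up for the illustration.

```tex
% Assumes pdfTeX, where all three primitives are expandable.
% \strA, \strB, \strC are hypothetical names used only here.
\edef\strA{\pdfescapestring{(a)}} % parentheses get a backslash: \(a\)
\edef\strB{\pdfescapename{a b}}   % the space becomes #20: a#20b
\edef\strC{\pdfescapehex{Hello}}  % two hex digits per byte: 48656C6C6F
\message{[\strA] [\strB] [\strC]}
```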

> The latter is also useful for other contexts, e.g. for protecting
> arbitrary string data in auxiliary files.

It is definitely a safe way of storing data, but is it the most
efficient? Decoding it requires setting the lccode of ^^@ to the
number found, then \lowercase{\edef\result{\result^^@}} for every
character. Each \edef re-expands the whole of \result, so this is
quadratic in the length of the string. A safe format that leaves more
characters as is might be faster. Also, hex strings cannot store
Unicode data (unless we use a UTF, but the overhead of decoding the
UTF may be large). Do you think we could devise a more efficient
method?
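
To make the per-character method concrete, here is a minimal plain TeX
sketch of it. It assumes an 8-bit engine and upper-case hex digits, as
\pdfescapehex produces. The names \hexdecode and \hexdecodeloop are
hypothetical, and this is not the code I actually have.

```tex
% Byte 0 must be an ordinary character while the loop is defined;
% plain TeX normally ignores it (catcode 9).
\catcode0=12
\def\hexdecodeloop#1#2{%
  \ifx#1\relax
  \else
    \lccode0="#1#2\relax % the next byte, read from two hex digits
    % Re-expanding \result on every step is the quadratic part.
    \lowercase{\edef\result{\result^^@}}%
    \expandafter\hexdecodeloop
  \fi
}
\catcode0=9 % back to the plain TeX setting
\def\hexdecode#1{%
  \begingroup % keeps the \lccode change local
  \def\result{}%
  \hexdecodeloop#1\relax\relax
  \expandafter\endgroup
  \expandafter\def\expandafter\result\expandafter{\result}%
}
% Usage: \result should then hold the five characters of "Hello",
% all with catcode 12.
\hexdecode{48656C6C6F}
\message{\result}
```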

Are there cases where many ^^@ (byte 0) must be output in a row? If
not, we can lowercase characters other than ^^@ to produce the
relevant bytes, for instance

\lccode0=... \lccode1=... [...] \lccode255=...
\lowercase{\edef\result{\result^^@^^A...^^?}}
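
As a small runnable instance of this idea, here is a sketch with only
three carrier characters instead of 256. The choice of carriers (^^A,
^^B, ^^C) and of target bytes is arbitrary and made only for the
example.

```tex
\begingroup
  % The carriers must be ordinary characters (plain TeX gives ^^A
  % catcode 8).
  \catcode1=12 \catcode2=12 \catcode3=12
  % Target bytes "54, "65, "58, that is "T", "e", "X".
  \lccode1="54 \lccode2="65 \lccode3="58
  % A single \lowercase converts all three carriers at once. The
  % \endgroup inside it restores catcodes and lccodes before \def runs.
  \lowercase{\endgroup\def\result{^^A^^B^^C}}
\message{\result}
```

Byte 0 cannot be produced this way, since a zero lccode means "leave
the character unchanged". Hence the question about runs of ^^@.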

The most efficient decoding method depends on how long the string is.
What is the typical length of strings that we should optimize for?

>> Also, when you say "Unicode encoding", I presume that this means
>> native strings for XeTeX and LuaTeX, but what about pdfTeX? Do you use
>> "UTF-16" (if so, LE or BE?), or some other UTF?
>
> In the context of bookmarks and other PDF strings "Unicode"
> means UTF-16 (hyperref uses BE, but there is a byte order mark).
> And the strings are a sequence of bytes. The big chars of XeTeX or
> LuaTeX don't help, because they get written as UTF-8.

Thanks for the information (and thank you Lars for the clarification
on byte order). I presume that the cleanest approach is to store
strings internally as lists of numbers, then convert them to various
formats.

--
Bruno