Message-ID:  <20111012180101.GA17461@oberdiek.my-fqdn.de>
Date:         Wed, 12 Oct 2011 20:01:01 +0200
Reply-To: Mailing list for the LaTeX3 project
              <LATEX-L@listserv.uni-heidelberg.de>
Sender: Mailing list for the LaTeX3 project <LATEX-L@listserv.uni-heidelberg.de>
From: Heiko Oberdiek <heiko.oberdiek@GOOGLEMAIL.COM>
Subject: Re: Strings, and regular expressions
To: LATEX-L@listserv.uni-heidelberg.de
In-Reply-To:  <CANQYN6xfTa34wrshLWHmTOVg5U2ATzzK1ML=_qHZw=mGRRdxrw@mail.gmail.com>

On Wed, Oct 12, 2011 at 01:19:24PM -0400, Bruno Le Floch wrote:

> > Most important for PDF strings:
> >
> > * PDFDocEncoding
> > * UTF-16
> >
> > (hyperref also uses "ascii-print" in case of XeTeX because of
> > encoding problems with \special.)
> 
> Can you elaborate on the encoding problems that XeTeX has? (a link to
> an explanation would be great)

Afaik that is exactly the problem: there is no explanation
(= documentation/specification) on the XeTeX side. In most cases
the contents of \special are expected to be bytes, and the conversion
of 8-bit bytes to UTF-8 destroys the contents.
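
A minimal sketch of that effect, in Python rather than TeX and purely as
an illustration. It assumes each 8-bit value is read as the Unicode code
point with the same number and then written out as UTF-8; whether XeTeX
does exactly this in every code path is not documented here.

```python
# Illustration only (assumed model, not XeTeX code): 8-bit bytes are
# treated as code points and re-encoded as UTF-8.
raw = bytes([0x41, 0xE4, 0xFF])       # payload the \special consumer expects

# Each byte value is read as the code point with the same number ...
as_text = raw.decode("latin-1")
# ... and then written out as UTF-8.
mangled = as_text.encode("utf-8")

print(raw.hex())      # 41e4ff
print(mangled.hex())  # 41c3a4c3bf
```

The ASCII byte survives, but every byte above 0x7F becomes two bytes, so
binary data no longer matches what was intended.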

> >> I guess that most "iso-..." and "cp..." encodings are an overkill for
> >> a kernel.
> >
> > They should be loadable as files similar to LaTeX's .def files
> > for inputenc or fontenc. Then the kernel can provide a base set
> > and others can be provided by other projects. But I don't see
> > the disadvantage if such a base set is not minimal.
> 
> Putting non-basic encodings in .def files is probably the best
> approach indeed. The time it takes for me to code all that is probably
> the only disadvantage, since that means postponing the floating point
> module.

I wouldn't do it manually. There are mappings files for Unicode:
  http://unicode.org/Public/MAPPINGS/
In my project I am using these mappings together with a Perl script
to generate the .def files.
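
A rough sketch of such a generator, written in Python here instead of
Perl. It assumes the usual layout of those tables (byte value in hex,
Unicode code point in hex, then a "#" comment). The macro name
\HypotheticalDeclareChar and the shape of the generated lines are
invented for illustration; they are not an existing .def format.

```python
def parse_mapping(text):
    """Return {byte: code point} from mapping-table text."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue                           # byte without a Unicode mapping
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def to_def_lines(table, macro=r"\HypotheticalDeclareChar"):
    """Format one (hypothetical) declaration per mapped byte."""
    return [f'{macro}{{"{byte:02X}}}{{"{cp:04X}}}'
            for byte, cp in sorted(table.items())]


# Small sample in the style of the unicode.org tables (not a real file).
SAMPLE = """\
# sample mapping table
0x41\t0x0041\t#\tLATIN CAPITAL LETTER A
0xE4\t0x00E4\t#\tLATIN SMALL LETTER A WITH DIAERESIS
0x81\t      \t#UNDEFINED
"""

for line in to_def_lines(parse_mapping(SAMPLE)):
    print(line)
```

In real use the text would be read from a downloaded mapping file and
the lines written to the .def file.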

> > * String escaping, provided by \pdfescapestring.
> > * Name escaping, provided by \pdfescapename.
> > * Hex strings, provided by \pdfescapehex.
> 
> I've coded all three already, based on one of your previous mails to
> LaTeX-L (when, two years ago, Joseph had mentioned strings here).

Good.

> > The latter is also useful for other contexts, e.g. for protecting
> > arbitrary string data in auxiliary files.
> 
> It is definitely a safe way of storing data, but is it the most
> efficient?

Of course not; the size is 2N for N bytes.
* All safe characters could be used, then the size decreases
  (e.g. ASCII85, ...). But the problem is to find safe characters.
  In particular, this set might change.
* Some kind of compression could be applied.

Using hex strings is just simple, fast and easy to implement.
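
To put numbers on the trade-off, here is a small Python sketch (an
illustration, not TeX code) comparing the hex form with Ascii85 for
N = 1024 arbitrary bytes. The sample data is an assumption chosen only
so that the sizes are easy to check.

```python
import base64
import binascii

data = bytes(range(256)) * 4          # N = 1024 arbitrary bytes

hex_form = binascii.hexlify(data)     # two hex digits per byte
a85_form = base64.a85encode(data)     # five characters per four bytes

print(len(data), len(hex_form), len(a85_form))   # 1024 2048 1280

# The Ascii85 alphabet runs from "!" to "u", so it contains characters
# that are not safe in a TeX file, e.g. "%" and the backslash.
alphabet = bytes(range(0x21, 0x76))
print(b"%" in alphabet, b"\\" in alphabet)       # True True
```

Ascii85 is 1.25N instead of 2N, but its alphabet is exactly the kind of
character set that is not safe by default.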

> Decoding it requires setting the lccode of ^^@ to the
> number found, then \lowercase{\edef\result{\result^^@}} for every
> character, quadratic in the length of the string.

It could even be made expandable, in linear time,
with a large lookup table (256 entries).
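
The lookup-table idea, sketched in Python as a stand-in for the TeX
version: one table entry per two-digit hex pair, and one lookup per
pair, so the work is linear in the length of the string. In TeX the
table would presumably be 256 control sequences; the names TABLE and
unhex below are made up for this sketch.

```python
# 256-entry table: every two-digit hex pair maps to its byte.
TABLE = {f"{b:02X}": bytes([b]) for b in range(256)}


def unhex(hex_string):
    """Decode a hex string with one table lookup per pair of digits."""
    hex_string = hex_string.upper()
    if len(hex_string) % 2:
        raise ValueError("odd number of hex digits")
    return b"".join(TABLE[hex_string[i:i + 2]]
                    for i in range(0, len(hex_string), 2))


print(unhex("48656C6C6F"))   # b'Hello'
```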

And there is engine support (\pdfunescapehex).

> A safe format where
> more characters are as is seems possibly faster? Also, this doesn't
> allow storage of Unicode data (unless we use an UTF, but the overhead
> of decoding the UTF may be large). Do you think we could devise a more
> efficient method?

I think that depends on the Unicode support of the engine.

> Are there cases where many ^^@ (byte 0) must be output in a row? If
> not, we can lowercase characters other than ^^@ to produce the
> relevant bytes, for instance
> 
> \lccode0=... \lccode1=... [...] \lccode255=...
> \lowercase{\edef\result{\result^^@^^A...^^?}}
> 
> The most efficient decoding method depends on how long the string is.
> What is the typical length of strings that we should optimize for?

From very short (e.g. label/anchor names, ...) up to very large (e.g. object
stream data of images, ...).

Yours sincerely
  Heiko Oberdiek