Importance: Normal
References: <4B727378.8060704@morningstar2.co.uk>,
            <4B729944.5050308@residenset.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Message-ID:  <OFD3BDE8BF.D25CA425-ON802576C6.0043FFAD-802576C6.0043FFAE@mcs-notes1.open.ac.uk>
Date:         Wed, 10 Feb 2010 12:22:43 +0000
Reply-To: Mailing list for the LaTeX3 project
              <LATEX-L@listserv.uni-heidelberg.de>
Sender: Mailing list for the LaTeX3 project <LATEX-L@listserv.uni-heidelberg.de>
From: Chris Rowley <c.a.rowley@OPEN.AC.UK>
Subject: Re: String module
To: LATEX-L@listserv.uni-heidelberg.de
In-Reply-To:  <4B729944.5050308@residenset.net>
Precedence: list
Status: R

Apologies for the brevity.

I have not had time to look at the details from Lars but his emphasis on analysing the many different TeXie things that look like a string is spot on.

Also note that (as maybe someone already pointed out) in general a 'string of Unicode characters' is itself a rather slippery beast.  Thus when you 'put Unicode inside TeX' (whatever meaning you give that phrase) strings could be even more underspecified than Lars' list shows.
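
One concrete way in which a Unicode string is slippery: the same visible text can be stored as different code point sequences. The sketch below uses Python's standard unicodedata module purely as an illustration; the helper name same_text is made up for this example and nothing here is TeX-specific.

```python
import unicodedata

# Two spellings of the same visible text "é".
composed = "\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after canonical normalisation to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)           # False: different code point sequences
print(len(composed), len(decomposed))   # 1 2
print(same_text(composed, decomposed))  # True
```

So even "length" and "equality" need a stated convention before they mean anything for such strings.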

Cheers,  chris


-----Lars Hellström <Lars.Hellstrom@RESIDENSET.NET> wrote: -----

To: LATEX-L@LISTSERV.UNI-HEIDELBERG.DE
From: Lars Hellström <Lars.Hellstrom@RESIDENSET.NET>
Date: 10/02/2010 11:32
Subject: Re: String module

Joseph Wright wrote:
> Hello all,
> 
> One of the questions that was raised recently on c.t.t concerning the 
> currently available LaTeX3 modules was the lack of "strings" 
> functionality.
[snip]
> The first "big" question is what exactly is a string in a TeX context.

Indeed, there are several possible interpretations:
(1) Sequences of character tokens
(2) Sequences of character tokens with normalised catcodes
(3) Sequences of characters from some alphabet (possibly large),
     representation not necessarily native
(4) Sequences of LICRs.

Which you want to use depends on what is being targeted, i.e., what 
strings are going to be used for. \write, \special, and \csname are 
probably the main consumers, and since these want (1), that's probably 
the main thing to support.

But it is also important to (eventually) provide conversions between 
different string-like concepts. One should not expect one size to fit all.

> You also have to worry about what happens about special characters (for 
> example, how do you get % into a string). If you escape things at the 
> input stage [say \% => % (catcode 12)] then a simple \detokenize will 
> not work.

For manual entering of string data, one might well find that (3) or (4) 
is most practical...

> On features, things that seem to be popular:
>  - Substring functions such as "x characters from one end", "first x 
> characters", etc.
>  - Search functions such as "where is string x in string y".

...whereas searching typically requires (2).

I would suggest that a core string module would primarily operate on the 
(1) kind of string, possibly requiring (2) for some operations, and 
providing the necessary conversion operation 1->2 (trusting the user to 
apply it where necessary, rather than building it into each and every 
operation just to be on the safe side).
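
To illustrate why searching wants (2), here is a minimal sketch in Python rather than TeX. It models a type (1) string as a list of (character, catcode) pairs; the names Token, normalise and find are hypothetical and not part of any existing module. The normalisation rule assumed here is the one \string and \detokenize use: spaces get catcode 10, everything else catcode 12.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (character, category code)

SPACE, LETTER, OTHER = 10, 11, 12


def normalise(tokens: List[Token]) -> List[Token]:
    """Conversion 1->2: space becomes catcode 10, all other characters catcode 12."""
    return [(ch, SPACE if ch == " " else OTHER) for ch, _ in tokens]


def find(needle: List[Token], haystack: List[Token]) -> int:
    """Index of the first occurrence of needle in haystack, or -1 if absent."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


# "abc" typed with letters at catcode 11, versus "bc" obtained at catcode 12.
haystack = [("a", LETTER), ("b", LETTER), ("c", LETTER)]
needle = [("b", OTHER), ("c", OTHER)]

print(find(needle, haystack))                        # -1: same characters, different catcodes
print(find(normalise(needle), normalise(haystack)))  # 1
```

The caller applies normalise explicitly before searching, which mirrors the suggestion above of trusting the user to convert where necessary.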


Heiko Oberdiek wrote:
> * Encoding conversions, see package `stringenc'.
>   Application: PDF (outlines and other text fields).

This is, at least for the input, rather (3) or (4). Or are you 
anticipating character sets larger than ^^@--^^ff for the underlying 
engine? Then one conversely needs an "octet string" concept, for 
\special and the like.
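
A small sketch of the "octet string" idea, again in Python and only as an illustration: one abstract character string of kind (3) yields different octet sequences depending on the target encoding. The function names are invented for this example. The PDF case assumes the convention that a PDF text string may be written as UTF-16BE preceded by the byte order mark FE FF.

```python
def to_octets(text: str, encoding: str) -> bytes:
    """Encode an abstract character string as an octet string."""
    return text.encode(encoding)


def pdf_text_string(text: str) -> bytes:
    """UTF-16BE with a leading byte order mark, as used for PDF text strings."""
    return b"\xfe\xff" + text.encode("utf-16-be")


name = "Hellstr\u00f6m"  # nine characters, one of them outside ASCII

print(len(name))                        # 9
print(len(to_octets(name, "latin-1")))  # 9
print(len(to_octets(name, "utf-8")))    # 10
print(len(pdf_text_string(name)))       # 20
```

The character count and the octet count agree only for the single-byte encoding, which is why the two concepts need separate treatment once the engine's character set grows.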

> * Matching (replacing) using regular expressions,
>   see \pdfmatch and luaTeX.
>   Matching is useful for extracting information pieces or
>   validating option values, ...

Be aware that matching is one thing, extracting information pieces a 
somewhat trickier concept, and replacing even more so (from the CS 
theory point of view).
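
A minimal illustration of the difference between matching and extracting, using Python's re module (whose syntax and semantics differ from both \pdfmatch and Lua, so this is only an analogy). The two patterns below accept exactly the same strings, yet report different pieces; the helper names matches and pieces are made up for this sketch.

```python
import re

greedy = re.compile(r"(a*)(a*)")
lazy = re.compile(r"(a*?)(a*)")


def matches(pattern: "re.Pattern[str]", s: str) -> bool:
    """Matching: a yes/no answer for the whole string."""
    return pattern.fullmatch(s) is not None


def pieces(pattern: "re.Pattern[str]", s: str):
    """Extraction: which substring each group captured, or None if no match."""
    m = pattern.fullmatch(s)
    return m.groups() if m else None


print(matches(greedy, "aaa"), matches(lazy, "aaa"))  # True True
print(pieces(greedy, "aaa"))                         # ('aaa', '')
print(pieces(lazy, "aaa"))                           # ('', 'aaa')
```

Whether a string matches depends only on the language described; what gets extracted also depends on how the engine resolves ambiguity, which has to be specified separately.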

>   Unhappily \pdfmatch has still the status "experimental"
>   and the regular expression language differs from Lua's.

The last time I looked, Lua's "regular expressions" were not regular 
expressions[*] at all, but rather a kind of beefed-up glob pattern 
(with a regexp-like syntax), so I wouldn't be sad if LaTeX was to 
deviate from Lua in that respect. I would be sad if something is called 
regular expression that really isn't.

Lars Hellström

[*] There are several equivalent and perfectly formal definitions of 
what it means to be "regular" as in "regular expression", the most 
familiar of which is probably that a regular language is one that can 
be recognised by a finite automaton. POSIX regexps are very close to 
this (the only irregular feature being backreferences), whereas Perl's 
"regexps" are way out in context-free-land. Lua's matching engine, 
OTOH, is too weak to recognise arbitrary regular languages.
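
As a sketch of the finite automaton definition, here is a hand-written deterministic automaton in Python for the example language (ab|c)*. The language and all names are chosen for illustration only. As far as I know, Lua patterns have no alternation operator and apply repetition only to single character classes, so a language of this shape is the kind of thing a single Lua pattern cannot express.

```python
from typing import Dict, Tuple

# States of a deterministic finite automaton for the language (ab|c)*.
START, AFTER_A, DEAD = 0, 1, 2

# Missing entries mean "go to DEAD".
TRANSITIONS: Dict[Tuple[int, str], int] = {
    (START, "a"): AFTER_A,   # an "a" must be followed by "b"
    (START, "c"): START,     # a lone "c" completes one repetition
    (AFTER_A, "b"): START,   # "ab" completes one repetition
}

ACCEPTING = {START}


def accepts(s: str) -> bool:
    """Run the automaton over s; accept if it ends in an accepting state."""
    state = START
    for ch in s:
        state = TRANSITIONS.get((state, ch), DEAD)
        if state == DEAD:
            return False
    return state in ACCEPTING


print(accepts("abcab"))  # True
print(accepts("cc"))     # True
print(accepts("a"))      # False
print(accepts("ba"))     # False
```

The automaton needs only a fixed, finite amount of memory (the current state), which is what makes the language regular in the formal sense.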
