User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; sv-SE; rv:1.9.2.22)
            Gecko/20110902 Thunderbird/3.1.14
MIME-Version: 1.0
References: <CANQYN6xiwms0LdCgW-6AQ9Pok9qbtc6k0ZgN40z6RK9+a=RdWg@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Message-ID:  <4E93664D.7090105@residenset.net>
Date:         Mon, 10 Oct 2011 23:40:29 +0200
Reply-To: Mailing list for the LaTeX3 project
              <LATEX-L@listserv.uni-heidelberg.de>
Sender: Mailing list for the LaTeX3 project <LATEX-L@listserv.uni-heidelberg.de>
From: =?ISO-8859-1?Q?Lars_Hellstr=F6m?= <Lars.Hellstrom@RESIDENSET.NET>
Subject: Re: Strings, and regular expressions
To: LATEX-L@listserv.uni-heidelberg.de
In-Reply-To:  <CANQYN6xiwms0LdCgW-6AQ9Pok9qbtc6k0ZgN40z6RK9+a=RdWg@mail.gmail.com>
Precedence: list
Status: R

Bruno Le Floch wrote 2011-10-10 17.07:
> Hello all,
>
> We just added on CTAN two related modules:

Did you by any chance mean SVN rather than CTAN? (Or maybe there is a delay 
in changes becoming visible.)

[snip]
> The l3regex module allows for testing if a string matches a given
> regular expression, counting matches, extracting submatches, splitting
> at occurrences of a regular expression,

Sometimes one goes the other way and returns all the matches found in a
string. I wouldn't go so far as to say that this is more useful than
splitting at things that match, however.
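
To make the distinction concrete, here is a minimal sketch using Python's
re module as a stand-in (it is not l3regex, and the sample string and
pattern are made up for illustration). Splitting keeps what lies between
the matches; "returning all matches" keeps the matches themselves.

```python
import re

# Hypothetical sample data, chosen only for illustration.
text = "a1b22c333"
pattern = re.compile(r"[0-9]+")

# Return every match of the pattern in the string.
matches = pattern.findall(text)

# Split the string at every occurrence of the pattern.
pieces = pattern.split(text)

print(matches)  # the digit runs
print(pieces)   # the text between the digit runs
```

The two operations are complementary: interleaving the pieces with the
matches reconstructs the original string.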

> and doing replacement (see
> documentation for function names).
>
> Speed requirements forbid a back-tracking approach, hence
> back-references cannot be supported. Only "truly regular" features are
> implemented.

Having browsed the code now, I see that you execute the automaton in general 
NFA form. Have you considered executing it in epsilon-free NFA form instead? 
Getting rid of epsilons does not increase the number of states (it can even 
reduce the number of reachable states), but I'm not sure how it would 
interact with match path priority...
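
As a rough illustration of the epsilon-free form, here is a Python sketch
of the textbook construction (not the l3regex code; the data layout and
all function names are assumptions made for this example). Each state
inherits the labelled transitions of its epsilon-closure, so the state set
is unchanged and some states may become unreachable. The sketch works with
plain sets of states, so it says nothing about match path priority.

```python
from collections import deque

def epsilon_closure(eps, state):
    """States reachable from `state` by epsilon moves only (itself included)."""
    seen = {state}
    queue = deque([state])
    while queue:
        s = queue.popleft()
        for t in eps.get(s, ()):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen

def remove_epsilons(states, delta, eps, accepting):
    """Build an equivalent epsilon-free NFA over the same set of states."""
    new_delta, new_accepting = {}, set()
    for s in states:
        closure = epsilon_closure(eps, s)
        if closure & accepting:
            new_accepting.add(s)
        for c in closure:
            for symbol, targets in delta.get(c, {}).items():
                new_delta.setdefault(s, {}).setdefault(symbol, set()).update(targets)
    return new_delta, new_accepting

def reachable(start, delta):
    """States reachable from `start` through labelled transitions."""
    seen = {start}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for targets in delta.get(s, {}).values():
            for t in targets:
                if t not in seen:
                    seen.add(t)
                    queue.append(t)
    return seen

def accepts(start, delta, accepting, word):
    """Run the epsilon-free NFA on `word`, tracking the set of live states."""
    current = {start}
    for ch in word:
        current = {t for s in current for t in delta.get(s, {}).get(ch, ())}
    return bool(current & accepting)

# Thompson-style NFA for a*b with five states; 0 is the start state.
states = {0, 1, 2, 3, 4}
eps = {0: {1, 3}, 2: {0}}                   # epsilon moves
delta = {1: {"a": {2}}, 3: {"b": {4}}}      # labelled moves
accepting = {4}

new_delta, new_accepting = remove_epsilons(states, delta, eps, accepting)
live = reachable(0, new_delta)
print(sorted(live))
print(accepts(0, new_delta, new_accepting, "aab"))
```

In this small example only states 0, 2 and 4 remain reachable once the
epsilons are gone, so the automaton that actually has to be executed is
smaller than the original.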

[snip]
> - Newlines. Currently, "." matches every character; in perl and PCRE
> it should not match new lines. As you know, the situation with new
> lines in TeX is a little bit odd, since they are converted to the
> \endlinechar upon reading, and normally not tokenized, simply giving
> rise to a space or a \par. Should we still decide that "." does not
> match the CR nor LF characters? Or should it simply not match the
> \endlinechar?

The way I'm used to regexps, you can use "metasyntax" to control whether .
matches newline or not, and likewise how ^ behaves with respect to
beginning-of-line, so I wouldn't find it wrong if ^ and \A were synonyms.
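
For readers who have not met this kind of metasyntax, here is a short
sketch using Python's re module, where the inline flags (?s) and (?m)
play that role. This only illustrates the convention in one existing
engine; it is not a proposal for l3regex syntax, and the sample text is
made up.

```python
import re

# Hypothetical two-line sample text.
text = "one\ntwo"

# By default "." does not match a newline; the (?s) flag makes it do so.
dot_default = re.search(r"one.two", text)
dot_all = re.search(r"(?s)one.two", text)

# By default "^" only matches at the start of the string, like \A;
# the (?m) flag makes it match at the start of every line as well.
caret_default = re.findall(r"^\w+", text)
caret_multiline = re.findall(r"(?m)^\w+", text)
anchor_A = re.findall(r"(?m)\A\w+", text)

print(dot_default is None, dot_all is not None)
print(caret_default, caret_multiline, anchor_A)
```

Without the multiline flag, ^ and \A pick out exactly the same position,
which is the "synonyms" behaviour mentioned above.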

> - Same question for caseless matching, and for look-ahead/look-behind
> assertions.

Can you even do assertions without exploding the number of states? (Whenever 
I've thought about it, I tend to end up with the conclusion that one cannot, 
but I might be missing something.)

Lars Hellström