User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; sv-SE; rv:1.9.2.22)
            Gecko/20110902 Thunderbird/3.1.14
MIME-Version: 1.0
References: <4E93664D.7090105@residenset.net> <7225.1318285652@cl.cam.ac.uk>   
            <CANQYN6z5qgHti7AXdRh9r+C4S2pR=qu9cQGnDp+cr1xFcNTptA@mail.gmail.com>           
            <4E945FF9.1060803@residenset.net>
            <CANQYN6yfd4MW9tmjJfOyyfsT1j-W18bUDnGvvBfN6-zpCsdv-g@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Message-ID:  <4E9702EF.7050802@residenset.net>
Date:         Thu, 13 Oct 2011 17:25:35 +0200
Reply-To: Mailing list for the LaTeX3 project
              <LATEX-L@listserv.uni-heidelberg.de>
Sender: Mailing list for the LaTeX3 project <LATEX-L@listserv.uni-heidelberg.de>
From: Lars Hellström <Lars.Hellstrom@RESIDENSET.NET>
Subject: Re: Strings, and regular expressions
To: LATEX-L@listserv.uni-heidelberg.de
In-Reply-To:  <CANQYN6yfd4MW9tmjJfOyyfsT1j-W18bUDnGvvBfN6-zpCsdv-g@mail.gmail.com>
Precedence: list
Status: R

Bruno Le Floch wrote 2011-10-12 04.59:
>>> I looked at that this afternoon. Would that be the right framework for
>>> code pretty-printing similar to listings/minted (but hopefully more
>>> powerful)?
>>
>> Strongly yes. As Ford mentions in his POPL paper on PEGs, the traditional
>> context-free grammars suck at expressing things like "an identifier is a
>> _longest_ sequence of consecutive alphanumeric characters", which is why
>> there is typically a separate tokenising step before the parser proper.
>> PEGs, on the other hand, have no problem with parsing from the character
>> level and up in one go. For pretty-printing, I'd expect it to be a great
>> convenience to not have to do it in two steps.
>
> Do you have any idea on the natural resulting data structure? It's not
> clear how trees should be implemented in TeX.

I would suggest just returning a token sequence, into which the TeX commands 
the user specified have been inserted to reflect the aspects of the structure 
that he's interested in. A silly example might be something like

\begin{Sentence}\begin{NounPhrase}My 
\begin{Noun}hovercraft\end{Noun}\end{NounPhrase} 
\begin{VerbPhrase}\begin{Verb}is\end{Verb} full 
\begin{PrepositionalPhrase}of 
\begin{Noun}eels\end{Noun}\end{PrepositionalPhrase}\end{VerbPhrase}.\end{Sentence}

It seems to be a fairly standard extension that something inside braces in 
a PEG means "When passing through here upon matching, then insert this 
piece of material." An awkward point, however, is that you'd often want to 
insert an opening brace at one point and a closing brace a bit later -- for 
example to make that \Noun{hovercraft} rather than the less flexible 
\begin{Noun}hovercraft\end{Noun} -- and it is not obvious how to express 
that.
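
To make the "insert this piece of material" idea concrete, here is a minimal 
sketch in Python rather than TeX. All names in it (lit, emit, seq, choice, 
env, mark_up) and the toy grammar are assumptions made up for illustration, 
not part of any existing package. Each parser returns the new position plus 
the output built so far, and emit consumes nothing and only inserts material.

```python
# Minimal PEG-style sketch: a parser takes (text, pos) and returns
# (new_pos, output) on success or None on failure.
# All names here are hypothetical, chosen for illustration only.

def lit(s):
    # Match the literal string s and copy it to the output.
    def parse(text, pos):
        if text.startswith(s, pos):
            return pos + len(s), s
        return None
    return parse

def emit(material):
    # Consume nothing; only insert material into the output.
    def parse(text, pos):
        return pos, material
    return parse

def seq(*parsers):
    # Match all parsers one after another, concatenating their output.
    def parse(text, pos):
        pieces = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            pos, piece = result
            pieces.append(piece)
        return pos, "".join(pieces)
    return parse

def choice(*parsers):
    # Ordered choice: the first alternative that matches wins.
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse

def env(name, parser):
    # Wrap whatever parser matches in \begin{name} ... \end{name}.
    return seq(emit("\\begin{" + name + "}"), parser,
               emit("\\end{" + name + "}"))

# Toy grammar for the silly example.
noun = env("Noun", choice(lit("hovercraft"), lit("eels")))
verb = env("Verb", lit("is"))
noun_phrase = env("NounPhrase", seq(lit("My "), noun))
prep_phrase = env("PrepositionalPhrase", seq(lit("of "), noun))
verb_phrase = env("VerbPhrase", seq(verb, lit(" full "), prep_phrase))
sentence = env("Sentence",
               seq(noun_phrase, lit(" "), verb_phrase, lit(".")))

def mark_up(text):
    # Return the marked-up text, or None unless the whole input matches.
    result = sentence(text, 0)
    if result is None or result[0] != len(text):
        return None
    return result[1]
```

Calling mark_up("My hovercraft is full of eels.") gives the marked-up 
sentence shown earlier, on a single line. In this string-based sketch the 
opening and closing material are just two separate emit steps, so it says 
nothing about how to keep braces balanced in a TeX token list.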

> Also, PEGs seem to lead
> to endless trouble with left recursion. I'm a little bit afraid of
> fighting an open problem :).

I'd be inclined to consider a grammar with a left recursion loop a 
programming error, just like it is for ordinary TeX macros. There is a 
transformation of PEGs that turns left-recursion into runtime match failures 
rather than infinite loops, but I don't know how complicated it is, and I 
suspect it's mainly interesting when the PEG effectively gets compiled to 
machine code. LaTeX already runs inside the TeX virtual machine, so an 
infinite loop is not that fatal.
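
For illustration, one simple way to get a match failure instead of an 
infinite loop is a runtime guard: remember which (rule, position) pairs are 
currently being expanded, and fail if one is entered again. This is only a 
sketch of that guard, not the transformation mentioned above, and the 
grammar encoding and the names make_matcher and run are assumptions made up 
for the example.

```python
# Sketch of a PEG matcher with a runtime guard against left recursion.
# Grammar encoding (hypothetical): nested tuples
#   ("lit", s), ("seq", e1, e2, ...), ("choice", e1, e2, ...), ("ref", name)

def make_matcher(grammar):
    active = set()  # (rule name, position) pairs currently being expanded

    def match(expr, text, pos):
        kind = expr[0]
        if kind == "lit":
            s = expr[1]
            return pos + len(s) if text.startswith(s, pos) else None
        if kind == "seq":
            for sub in expr[1:]:
                pos = match(sub, text, pos)
                if pos is None:
                    return None
            return pos
        if kind == "choice":
            # Ordered choice: first alternative that matches wins.
            for sub in expr[1:]:
                end = match(sub, text, pos)
                if end is not None:
                    return end
            return None
        if kind == "ref":
            key = (expr[1], pos)
            if key in active:
                # Same rule re-entered at the same position without
                # consuming input: fail instead of recursing forever.
                return None
            active.add(key)
            try:
                return match(grammar[expr[1]], text, pos)
            finally:
                active.discard(key)
        raise ValueError("unknown expression kind: %r" % (kind,))

    def run(start, text):
        # Return the end position of the match, or None on failure.
        active.clear()
        return match(("ref", start), text, 0)

    return run

NUM = ("choice", ("lit", "1"), ("lit", "2"))

# expr <- expr "+" num / num     (left recursive)
left = make_matcher({
    "expr": ("choice",
             ("seq", ("ref", "expr"), ("lit", "+"), ("ref", "num")),
             ("ref", "num")),
    "num": NUM,
})

# expr <- num "+" expr / num     (rewritten without left recursion)
right = make_matcher({
    "expr": ("choice",
             ("seq", ("ref", "num"), ("lit", "+"), ("ref", "expr")),
             ("ref", "num")),
    "num": NUM,
})
```

With the guard, left("expr", "1+2") terminates but only matches the leading 
"1", because the left-recursive alternative fails, whereas 
right("expr", "1+2") matches the whole string. So the guard avoids the loop, 
but the grammar still has to be rewritten to mean what was intended.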

[snip]
> I haven't thought about doing it as a pure NFA. I think I'll go for
> the most economical solution of running the assertion automaton on the
> string, since that feature will probably not be used too much. I still
> need to think of a good way of matching backwards, though (I believe
> it is simply reversing all the transitions in the automaton,

And exchanging the sets of initial and final states. Yes, that's all there 
is to it. (I assume you don't worry about submatch capturing there anyway.)

> but the
> representation I have makes that non-trivial).

It is often there that the crux lies.
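
As a small illustration of the reversal described above, here is a sketch in 
Python for an NFA without epsilon transitions, stored as a plain set of 
(source, symbol, target) triples. The representation and the names NFA, 
reverse and accepts are assumptions for the example; they say nothing about 
the representation used in the actual regex code.

```python
from typing import NamedTuple

class NFA(NamedTuple):
    # Hypothetical representation, no epsilon transitions.
    transitions: frozenset  # of (source, symbol, target) triples
    initial: frozenset      # set of initial states
    final: frozenset        # set of final states

def reverse(nfa):
    # Flip every transition and exchange the initial and final state sets.
    flipped = frozenset((dst, sym, src)
                        for (src, sym, dst) in nfa.transitions)
    return NFA(flipped, nfa.final, nfa.initial)

def accepts(nfa, word):
    # Standard subset simulation: track all states reachable so far.
    current = set(nfa.initial)
    for ch in word:
        current = {dst for (src, sym, dst) in nfa.transitions
                   if src in current and sym == ch}
    return bool(current & nfa.final)

# NFA for the language a b* c
forward = NFA(
    transitions=frozenset({(0, "a", 1), (1, "b", 1), (1, "c", 2)}),
    initial=frozenset({0}),
    final=frozenset({2}),
)
backward = reverse(forward)
```

Here accepts(forward, "abbc") and accepts(backward, "cbba") both hold, while 
accepts(backward, "abbc") does not: the reversed automaton accepts exactly 
the reversed strings, which is what is needed to run an assertion leftwards 
from the current position.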

[other mail]
>> > As Will says, an alternative is simply to save all regexes
>> > automatically, and check for the existence of the regex before building
>> > it. That of course costs in terms of macros, so the question is how many
>> > regexes are likely to be used. (We are talking about a typesetting
>> > system, so really this should not normally be 100s.)
> You guys are right. I'll add this automatic storage this weekend, and
> remove the N variants (since they will be done automatically).

I'm not so fond of that idea. I'd expect a bunch of regexps to be one-timers 
used during package initialisation, and keeping all of those around forever 
feels like it will be a lot of bloat. I'd prefer having both N and n 
variants, as in the initial version.

Lars Hellström