MIME-Version: 1.0
References: <CACKZBPX-udskZ1JNVDwBzgFOfS8wMr3cBw9JwMbZwuLodfS5=A@mail.gmail.com> <5268D3D0.1050209@residenset.net>
            <CACKZBPX-UXh+s5nXbTuXPDi4rCMFbs_ebjjDPufu5RSJ+xD2FA@mail.gmail.com>
            <52693F2A.80902@residenset.net>
Content-Type: text/plain; charset=ISO-8859-1
Message-ID:  <CACKZBPUrw6GfV8a45FrRnHYrNQPQwLniqTHC5BsYrWJXwQ4vxQ@mail.gmail.com>
Date:         Fri, 25 Oct 2013 00:09:38 +0200
Reply-To: Mailing list for the LaTeX3 project
              <LATEX-L@LISTSERV.UNI-HEIDELBERG.DE>
Sender: Mailing list for the LaTeX3 project <LATEX-L@LISTSERV.UNI-HEIDELBERG.DE>
From: Michiel Helvensteijn <mhelvens@GMAIL.COM>
Subject: Re: l3regex feature request + a question about the implementation
To: LATEX-L@LISTSERV.UNI-HEIDELBERG.DE
In-Reply-To:  <52693F2A.80902@residenset.net>
Precedence: list
Content-Transfer-Encoding: quoted-printable
Envelope-To: <rainer.schoepf@GMX.NET>
Status: R

On Thu, Oct 24, 2013 at 5:39 PM, Lars Hellstr=F6m
<Lars.Hellstrom@residenset.net> wrote:

>> Interesting, I haven't come across the idea of a lazy powerset
>> construction before. But I don't think I fully understand what you are
>> proposing. Using "a family of csnames" suggests to me the general
>> technique of maintaining an input-output table for a pure function.
>
> I was thinking one csname per state of the DFA (subset of states of the
> NFA), that in turn acted as a function from input character to new stat=
e of
> DFA. This should work perfectly well in your one-token-at-a-time settin=
g.

I was trying to determine what was 'lazy' about it. But I think I get
it now. You're only building states as you need them and leaving those
transition functions partially defined. You always try the DFA first,
but if you run into a dead end (no transition to fit the input), you
have to fall back to the NFA. Yes, that might work.

>> (2) I would also need the ability to merge new (N|D)FA's into an
>> existing one and mark the corresponding accepting states with an
>> identifier. I could then query this identifier when a match was found,
>> so I could know which original pattern(s) matched.
>
> Doable, but whichever way you do it, you effectively end up executing a=
ll
> the component automata in parallel. With DFAs, the new state space is t=
he
> cartesian product of the component state spaces. With NFAs, you have a =
set
> of active branches for each component automaton. So if you're anyway fe=
eding
> data to your automaton one token at a time, then you might as well feed=
 each
> token to a sequence of automata instead.

Properly merging DFA's will gain you plenty if some rudimentary
optimizations are performed. But not so much for NFA's, as you say,
which is a shame. I've defined near 200 separate math-mode lexemes in
my thesis so far. I wouldn't want to start up 200 automata for every
potential occurrence. My current implementation uses a trie, i.e., a
very primitive DFA.

--=20
www.mhelvens.net