Received: from mout.gmx.net (mout.gmx.net [212.227.15.19]) by h1439878.stratoserver.net (8.14.2/8.14.2/Debian-2build1) with ESMTP id r9OBlMMw013003 for ; Thu, 24 Oct 2013 13:47:24 +0200 Received: from relay.uni-heidelberg.de ([129.206.100.212]) by mx-ha.gmx.net (mxgmx101) with ESMTPS (Nemesis) id 0LufXA-1ViLp714WO-00zk89 for ; Thu, 24 Oct 2013 13:47:17 +0200 Received: from listserv.uni-heidelberg.de (listserv.uni-heidelberg.de [129.206.100.94]) by relay.uni-heidelberg.de (8.14.1/8.14.1) with ESMTP id r9OBiVAT031263 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 24 Oct 2013 13:44:31 +0200 Received: from listserv.uni-heidelberg.de (listserv.uni-heidelberg.de [127.0.0.1]) by listserv.uni-heidelberg.de (8.13.8/8.13.8) with ESMTP id r9NM16Z0007510; Thu, 24 Oct 2013 13:44:31 +0200 Received: by LISTSERV.UNI-HEIDELBERG.DE (LISTSERV-TCP/IP release 16.0) with spool id 10486005 for LATEX-L@LISTSERV.UNI-HEIDELBERG.DE; Thu, 24 Oct 2013 13:44:31 +0200 Received: from relay.uni-heidelberg.de (relay.uni-heidelberg.de [129.206.100.212]) by listserv.uni-heidelberg.de (8.13.8/8.13.8) with ESMTP id r9OBiUXv017416 for ; Thu, 24 Oct 2013 13:44:30 +0200 Received: from mail-oa0-f53.google.com (mail-oa0-f53.google.com [209.85.219.53]) by relay.uni-heidelberg.de (8.14.1/8.14.1) with ESMTP id r9OBhwV3030814 (version=TLSv1/SSLv3 cipher=RC4-SHA bits=128 verify=FAIL) for ; Thu, 24 Oct 2013 13:44:01 +0200 Received: by mail-oa0-f53.google.com with SMTP id n12so2192350oag.26 for ; Thu, 24 Oct 2013 04:43:57 -0700 (PDT) X-Received: by 10.60.63.36 with SMTP id d4mr1807780oes.29.1382615037690; Thu, 24 Oct 2013 04:43:57 -0700 (PDT) MIME-Version: 1.0 Received: by 10.76.171.5 with HTTP; Thu, 24 Oct 2013 04:43:37 -0700 (PDT) References: <5268D3D0.1050209@residenset.net> Content-Type: text/plain; charset=ISO-8859-1 X-MIME-Autoconverted: from quoted-printable to 8bit by listserv.uni-heidelberg.de id r9OBiUXv017417 Message-ID: Date: Thu, 24 Oct 2013 13:43:37 +0200 Reply-To: Mailing list for the LaTeX3 project Sender: Mailing list for the LaTeX3 project From: Michiel Helvensteijn Subject: Re: l3regex feature request + a question about the implementation To: LATEX-L@LISTSERV.UNI-HEIDELBERG.DE In-Reply-To: <5268D3D0.1050209@residenset.net> Precedence: list List-Help: , List-Unsubscribe: List-Subscribe: List-Owner: List-Archive: Content-Transfer-Encoding: quoted-printable X-MIME-Autoconverted: from 8bit to quoted-printable by relay.uni-heidelberg.de id r9OBiVAT031263 Envelope-To: X-GMX-Antispam: 0 (Mail was not recognized as spam); Detail=V3; X-GMX-Antivirus: 0 (no virus found) X-UI-Filterresults: notjunk:1;V01:K0:QwtzSu68pkU=:mEFAfYIeTt1s0FPurvmrdkRdnR n6N4viO4UOXmlXMMVfxuGvQ2xH3Utg4Ve6eZL/6fpd1dJK3CVbOYwNWGdrUCucKMtdBms8c1/ fxLcTlqhDY2yqJdGeNeyr3vxLd0jKmp592AObXHqWPfjglx64E4CFJa3efRtA8LgKJMc2+1zZ gf7sfrnR9YppOpWh2p9oZCqJgJvMIdx45zUSXwa6eQFjBAZCETP/RBUJgg1DF45mSzZZr3csc H2WtQFvknvez6w/Rmszg48byKMevC5dbWtno+MTQYKdO0LZoAZxIR++auvdfeaENt5QIeNBLt q7XQBV2ox6WIoOhZJCLMZXJnUc1FZoVJBBSWSEa/l84OGAtWbOFq7THSP7gIJdZxu3RZoa+qB hYBOIWJaCm50fiP0XdIV6koFIxg8Q5dbjwSGKctqCtxzSM61zzaSRMsZ6jx6XgFT9ztrcQOHP lKUUa+2MuEIYxFSPDdrqTGp8gWTdGTa2sPoK/86/nKgqdvc3f6E8C0LOLxJ7KPLgYk01ekPGi cORdMxq9HisvmbM4FCUY7ukD06lmZUAeQ1znVUT55u02VGx1qt2shNi0eG6HrgyoILfMqr0MI ynS5v55xohhktSgqJjfpQPHY3LGQ9IisKjwAhVxvxBmBfMSOUoL2bnpmkqIqdlzxqN6bBfCHz qreNxoKEEPUIdYas32Kv0NtsVY2Hq91hLs428foXVxF62LG9zC0PJfYmSrNjXVlWicXxEHz86 CGumKZB5NMtuKXbVV3eLPhw8Hfr7jnho/1DxW5X7+djLpjJnyf1Ec+41pyeOkYW410hGdedue 5E9WuXSz+ATX4I1Ow1iOO98WBXcIrIHy6BS26YC/K8JlaG6ilkemHT0GJNkh+cOkeWh/H3sRQ 0OP/42NWTxfepuTbsVv+3gxlEGV4/DXa0HAZoNAlUqoxHC+2m22xsO5TE2Zc8RcFhHZUG8LUE N3UqB0D56QDhBaAED7eopMwLonSlrI0Xp/jasj8b37Ndgb1RqQzGhZ0v3vHLoz44E6tKnATke VQxw/7nROsXWNmS/hINTDiEKMosDapNmFxw3eEZsNmBx/nIW0/9X4TwAi+Qvw0+SL+jonzBh6 emEBh+l7mQRY76ieS+xqd/3bq2jbsqzHtW8z4wik4z5JbQ98uym+nlx9GXHS0a5f636c15R/y r/qwd89bXdW6HEDcC13Bq+CZFgVPPSd8ubNvezeE6qo0epjahilJvRc+FddfOPOVIA3dll3ye fXc4ehdBoS66weNo6ZcQ9x8ITbD9RDAYu3js7XxubprTvFdcrpM2WwVu6htEL1UbGqgczsZsb Hqp1g+mgVhVh7oVDjTLoz4J1ez8BVyUheEbiF+gecH2sg/gpD83kBX9oxNkDoYHOyowOHt4Wz CXLx5KT1dknAT675Q58rjd6sVR1+35CymhBRo45lvE+PelG6KiuF8l1wK67yNOWJ8WZiAuIrZ Wr2v1ZrVc0xU3XiF6gNDjZbJ7kdALHdqJvust1KYH4Ast2EX+Nam+hEvoGGDDOoLX4uBImSo9 xhX2C5C73Rcfc/B+qRsOxMGLIWgDGPkhMHW2Av6yC X-UI-Loop:V01:w9/G3cgm9fI=:H8QI5jGPH063RgArIa3HodzZPYETzbVRhxRD30+D9JE= Status: R X-Status: X-Keywords: X-UID: 7294 On Thu, Oct 24, 2013 at 10:01 AM, Lars Hellstr=F6m wrote: > So unless you plan to run a single regexp against a very large amount o= f > text, it's in all likelyhood not worth it. You may be right. I find it hard to be certain without testing, but that would take quite some effort. > An optimal approach for your setting would perhaps be a lazy powerset > construction: don't carry it out for a particular state in the DFA unti= l you > actually reach that state, but then record the result; matching typical= ly > doesn't reach all the states of the DFA, since you don't match against > arbitrary data. This, however, assumes that you have some efficient met= hod > of memoising the results from lazy computations. (I suppose a family of > control sequence names could do the trick.) Interesting, I haven't come across the idea of a lazy powerset construction before. But I don't think I fully understand what you are proposing. Using "a family of csnames" suggests to me the general technique of maintaining an input-output table for a pure function. But that would only memoize exact inputs; variations on the same pattern wouldn't get any benefit, and it wouldn't work in my setting anyway, where I get one token at a time. My current implementation uses a "trie" datastructure, which might be one step better than a simple table. But still not good enough by far, I think. When I first read the term "lazy powerset construction" in your post, I imagined restructuring the automaton on-the-fly somehow, which sounded pretty cool. My second thought, though: the powerset construction seems like an all-or-nothing deal. You can't get rid of nondeterministic branches in the automaton until you're sure you've captured all possible paths that lead from them into the DFA part, and this can't be done with a single input. Ah, and about my feature request: It's not quite as simple as I stated before (but not that far from it): (1) It's not enough to go 'one step back' when a match is no longer possible. I'd need to go back to the last time an accepting state was reached. Moreover, I'd need to get back the tokens that weren't part of the match, so I can feed them back into the input stream. (Although this is something I already implemented on top of my "trie", and could be easily transferred outside of the l3regex interface.) (2) I would also need the ability to merge new (N|D)FA's into an existing one and mark the corresponding accepting states with an identifier. I could then query this identifier when a match was found, so I could know which original pattern(s) matched. Hm.. I'm starting to suspect that a DFA is the only way to make this in any way efficient. --=20 www.mhelvens.net