Received: from mout.gmx.net (mout.gmx.net [212.227.15.15]) by h1439878.stratoserver.net (8.14.2/8.14.2/Debian-2build1) with ESMTP id r9O84LGD011982 for ; Thu, 24 Oct 2013 10:04:22 +0200 Received: from relay.uni-heidelberg.de ([129.206.100.212]) by mx-ha.gmx.net (mxgmx110) with ESMTPS (Nemesis) id 0MCxbT-1VRJnx0RVr-009kd3 for ; Thu, 24 Oct 2013 10:04:16 +0200 Received: from listserv.uni-heidelberg.de (listserv.uni-heidelberg.de [129.206.100.94]) by relay.uni-heidelberg.de (8.14.1/8.14.1) with ESMTP id r9O81EOw010753 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 24 Oct 2013 10:01:14 +0200 Received: from listserv.uni-heidelberg.de (listserv.uni-heidelberg.de [127.0.0.1]) by listserv.uni-heidelberg.de (8.13.8/8.13.8) with ESMTP id r9NM16Qn007508; Thu, 24 Oct 2013 10:01:13 +0200 Received: by LISTSERV.UNI-HEIDELBERG.DE (LISTSERV-TCP/IP release 16.0) with spool id 10484950 for LATEX-L@LISTSERV.UNI-HEIDELBERG.DE; Thu, 24 Oct 2013 10:01:13 +0200 Received: from relay.uni-heidelberg.de (relay.uni-heidelberg.de [129.206.100.212]) by listserv.uni-heidelberg.de (8.13.8/8.13.8) with ESMTP id r9O81DKZ018025 for ; Thu, 24 Oct 2013 10:01:13 +0200 Received: from csep02.cliche.se (csep02.cliche.se [195.249.40.184]) by relay.uni-heidelberg.de (8.14.1/8.14.1) with ESMTP id r9O810N8010556 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO) for ; Thu, 24 Oct 2013 10:01:03 +0200 Received: from nova.mydomain.example (37.250.70.5.bredband.tre.se [37.250.70.5]) by csep02.cliche.se (Postfix) with ESMTP id 772A618668B for ; Thu, 24 Oct 2013 10:00:57 +0200 (CEST) User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; sv-SE; rv:1.9.2.28) Gecko/20120306 Thunderbird/3.1.20 MIME-Version: 1.0 References: Content-Type: text/plain; charset=ISO-8859-1; format=flowed X-MIME-Autoconverted: from quoted-printable to 8bit by listserv.uni-heidelberg.de id r9O81DKZ018026 Message-ID: <5268D3D0.1050209@residenset.net> Date: Thu, 24 Oct 2013 10:01:20 +0200 Reply-To: Mailing list for the LaTeX3 project Sender: Mailing list for the LaTeX3 project From: =?ISO-8859-1?Q?Lars_Hellstr=F6m?= Subject: Re: l3regex feature request + a question about the implementation To: LATEX-L@LISTSERV.UNI-HEIDELBERG.DE In-Reply-To: Precedence: list List-Help: , List-Unsubscribe: List-Subscribe: List-Owner: List-Archive: Content-Transfer-Encoding: quoted-printable X-MIME-Autoconverted: from 8bit to quoted-printable by relay.uni-heidelberg.de id r9O81EOw010753 Envelope-To: X-GMX-Antispam: 0 (Mail was not recognized as spam); Detail=V3; X-GMX-Antivirus: 0 (no virus found) X-UI-Filterresults: notjunk:1;V01:K0:mI5Lnd30fxE=:QLFy6sr34Lay3ESAp3Fr7Mx8af 3D5XxX6VF7EbjEmJzz/oQazCQCicNvTycNb+d4ipnEtXLowNjowPXgGAkvqprQEz/zwCtu9Fz epdSXChZ5Hqf5R/z0MfecyZVdYllfqfEhfy4zKSMIbLc59oH6oD6gdca7sLOtJLFlBinKIgrW COK0QdKTFv9N6yv5CI3XzyIBG2t8UMLhXYuuOs5vgY99AvNAmm/eKfgfjvqdS/Ow1CJbJukFk FVh/sMH102MjLYRdIn7BU6LOy5kYBnFGE2YQG2cZbyMIjFLDfPVEkGniDBgWKg8ukHaofjt04 b80q4upaLHQOWRiYyY8y+bmxXFYPqeiRoLSLzudI0rvjCYeXvVmx/am44mMS9ELOWIR6pF0qg KpGGmlg0vl6BVGWM6mwOSFng2qvjaFVpIsbP6T6igaoK3CGZCcGV+C2ow5nBwQMGun6ASbPtZ DCM1YJC3wIgSsgL6qoL8iu2uP2J6IwnIkNr0mAoFM9qDvVLryak4zHzwN9IyglDo3DM87e0lq nPP1MYcUJoGrlOgpW57zf2AogM5rSq+14K8bQGbRMCAIV4gOM6TgUuSpuD61JURnFFqDdHUAb sRhK50U4fmCMy5ZVSddKkwMUMQePB70Nt9SlfKpe3WxnbNcUuFMtONYypwLd8IZBNgcUNfjoK DLj3030sslnvhNTPBJ5wzg8bUckbB2ljDSK0xdlv6dDQapeT65J1r3LigOckNhQPtH5OHFbuK auawAdy22GFSrVocG2+LyZQBE8qaHoj8orbjZIHQiUoHstt/1ISWB3YGtDf5R0/hG5f96jpf0 symOJjCz5lDv7TZB3ejLyyd2ggXbm3EkW06pnS3oWL3KRn4X5fSt8PkEcBmLdILqlmRy/mCXY lWtcyTbxjE4JVEcHZYBTZm4pWQu553BDgqo4qlqKKPZ4lOEuPsQBU1NHFi39/hD6d88W13+3i l3yEDVZAMl8qqz6ODZe4qjlRNLp/TKZlzj+Nut1mZz7OTb7FFBsGh4lPWr2gLyz5IiMamsq2V caoNtYVNo25XKWI+lcemOV4lD5HIBjgwJ8s0QteZ/lYmjblizEpD9dKKPIUzJzOepUSczFPOD +ifKByAU0he+GDg9KZSupX6SoDT2FxcRCjeB0XD1VMes4gJu/9xa7VpbNFLQI0vMMqlFow6mC 2eyA8HUK0+A30C1fGJZOUNoVsAJ0FvyYuySDWg+2uU9Bbd5C1F6DV4t/fzJ0s5K1gPAfOUuWC Y78lqFHgfErg9dEK39jD/YgP2MXzlprUP3GDJyMWp0nHxjkrLUi6WVQa8Lv5TkppXz/od363b ikG8VDeDNMxEAYTZuXIfXPNpOgSa8k5Et1GYsMuzsYIsv/1gMKnVJW/Pp0RfrrdhRPBVWpGaq /TZRIxerys62uoxZtpuffwOdJP1SEYRB1/BHHC+zKqRP+Bt1mWZjGOBkK9D8DTDVxx+yQa02x 3IXoIT2OwzCnSmw/wQXRGwFMKaFhmYVMvLDOSkGXCctEWt0OJ30WKZysY2dQMzQayXJfjVNla 0WE2zWARh3oigz+XYSAMWBCIusheS6KwY2I8pwvQh X-UI-Loop:V01:WzetZB6MUSk=:/970Hu9Ac9T8/nn65WV5F/UeAVA7l3J4ex0U3YmMREk= Status: R X-Status: X-Keywords: X-UID: 7293 Michiel Helvensteijn skrev 2013-10-24 02.12: > Secondly, an implementation question (which might lead to a > discussion): You're now running a regex as a NFA (Nondeterministic > Finite Automaton), keeping track of all active branches while matching > against an input. Is there a particular reason you're not translating > it into a DFA (Deterministic Finite Automaton) during the compilation > phase, i.e., applying the powerset construction? Not knowing the details of the implementation, I anyway hazard the guess=20 that this is because running the NFA tends to be faster in practice. Sure= ,=20 when executing an NFA you need to keep track of all active branches for t= his=20 particular input, but in carrying out the powerset construction you need = to=20 consider any possible set of active branches for any possible input! You=20 need to have run the match against a rather long string before NFA matchi= ng=20 work starts to catch up with the work required by the powerset constructi= on.=20 So unless you plan to run a single regexp against a very large amount of=20 text, it's in all likelyhood not worth it. Even in cases such as yours, where you actually do mean to run the same=20 regexp against lots of text, it's not entirely clear that it's worth it. = The=20 reason is that once the machinery is there, it is easy to underestimate h= ow=20 large the NFA is, and the exponential computational complexity of the=20 powerset construction really penalises even small NFA size increases. At = the=20 same time, it's easy to overestimate the amount of text a matching on=20 average gets applied to. An optimal approach for your setting would perhaps be a lazy powerset=20 construction: don't carry it out for a particular state in the DFA until = you=20 actually reach that state, but then record the result; matching typically= =20 doesn't reach all the states of the DFA, since you don't match against=20 arbitrary data. This, however, assumes that you have some efficient metho= d=20 of memoising the results from lazy computations. (I suppose a family of=20 control sequence names could do the trick.) Lars Hellstr=F6m