Message-ID: <4E952E50.3020004@morningstar2.co.uk>
Date: Wed, 12 Oct 2011 07:06:08 +0100
Sender: Mailing list for the LaTeX3 project
From: Joseph Wright
Subject: Re: Strings, and regular expressions
To: LATEX-L@listserv.uni-heidelberg.de
References: <4E93664D.7090105@residenset.net> <7225.1318285652@cl.cam.ac.uk> <4E945FF9.1060803@residenset.net>

On 12/10/2011 03:59, Bruno Le Floch wrote:
>> I'm not really clear/keen on the 'save the regex' stuff. The result
>> seems to be we have 'N' argument functions which need a pre-compiled
>> regex, and 'n' ones which need a normal regex. I don't really like this,
>> and am really not sure it's necessary to provide optimisation in this
>> way. In the absence of use cases, I'm not sure about needing this type
>> of additional complexity.
>
> For short strings (e.g., matching \d\d\d\d-\d\d-\d\d on 2011-10-11),
> one third of the time is spent on building the automaton from the
> regular expression, and two thirds on running the automaton. I don't
> know how important that is in practice. Two aspects:
>
> - providing it requires more code --- true
>
> - the N arguments may be confusing (e.g., some people may think that
> it expects the regex as a string variable) --- not such a problem
> because the variable is checked to indeed be a proper compiled regex.
>
> If the feeling is that it should go, then I'll remove that this weekend.

At this stage, I'm simply raising the issue and hoping others have some
input.
For any reused regex, there will be a speed gain even for small
strings, I guess. Perhaps adding 'fast', 'precom' or 'prebuilt' to the
names would be sensible, as this will keep the speed gains but avoid
name ambiguity.

(On use cases, I guess most strings will be small. I can see regexes
being useful for things like validating input such as numbers and
dates. A more complex application might be syntax highlighting in code
samples, but even then we should be looking at relatively small blocks.)

>> since
>> \regex_split:nnN { (\w+) } { Hello,~world! } \l_result_seq
>> gives
>> {} {Hello} {,~} {world} {!}
>> if I understand you correctly. Maybe all that is needed is an example in the
>> documentation.
>
> I'll add \regex_extract_all:nnN this weekend.
>
>> Rather than \regex_extract_once:nnN, I would probably call it
>> \regex_extract_first:nnN.
>
> Not sure about that. We already have \tl_replace_(once|all):Nnn, which
> is what prompted me to use once there. @Joseph (and others), would
> \tl_replace_first:Nnn make more sense than _once?

For the moment, go with 'once', and we can discuss separately whether
it should be once/all or first/all in general. (We've only just changed
this from having 'in' in the name, which has caused confusion enough!)
-- 
Joseph Wright
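
For readers following the 'N' versus 'n' distinction in this thread, here
is a minimal sketch of the two calling styles. It assumes the function
names found in the l3regex documentation as distributed with expl3
(\regex_new:N, \regex_set:Nn, \regex_match:NnTF, \regex_match:nnTF); these
may not match the draft interface being discussed in the message. The
variable name \l_demo_date_regex is made up for illustration.

```latex
\documentclass{article}
\usepackage{expl3}
\begin{document}
\ExplSyntaxOn
% Compile the date pattern once and keep it in a regex variable.
% (\l_demo_date_regex is a hypothetical name used only in this sketch.)
\regex_new:N \l_demo_date_regex
\regex_set:Nn \l_demo_date_regex { \d\d\d\d-\d\d-\d\d }

% 'N' form: takes the stored, already-compiled regex.
\regex_match:NnTF \l_demo_date_regex { 2011-10-11 }
  { match } { no~match }
\par
% 'n' form: takes the pattern itself, which is compiled on each call.
\regex_match:nnTF { \d\d\d\d-\d\d-\d\d } { 2011-10-11 }
  { match } { no~match }
\ExplSyntaxOff
\end{document}
```

The first form pays the cost of building the automaton only in
\regex_set:Nn, so it is the one that benefits when the same pattern is
applied to many short strings, as in the date-validation use case
mentioned in the message.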