Date: Sun, 16 Oct 2011 21:56:05 -0400
From: Bruno Le Floch
Subject: Re: Strings, and regular expressions
To: LATEX-L@listserv.uni-heidelberg.de

> I wonder if \regex_set:Nn would be better as \regex_save:Nn. My
> reasoning is that \_set:Nn functions are used with a variable of
> name \l_..._, while here we are (currently) naming as \l_..._tl.

Thinking about this, I feel \regex_const:Nn would make most sense,
since a regular expression would typically not be a "dynamic" variable.

> I'm not sure about the approach on submatches. You say
>
> % Submatches with numbers higher than $10$ are accessed in the same way,
> % namely |\10|, |\11|, \emph{etc}. To insert in the replacement text
> % a submatch followed by a digit, the digit must be entered using the
> % |\x| escape sequence: for instance, to get the first submatch followed
> % by the digit $7$, use |\1\x37|, because $7$ has character code |37|
> % (in hexadecimal).
>
> I wonder how likely it is that we'll need more than 9 submatches in the
> sort of scenario that l3regex is likely to be applied in. TeX programmers
> are already used to the idea that we have up to 9 numbered parameters,
> so why not limit to nine submatches and avoid the need to use "\x" syntax?

I agree that \x37 is pretty awkward. But unlike macro arguments, the
number of submatches can grow pretty quickly.
One case where there can be more than 9 submatches is recognizing a date:

    \regex_const:Nn \c_date_regex
      { ((Jan)(uary)?|(Feb)(ruary)?|...) \ (\d\d?) }

Then "\2\4\6\8\10\12\14\16\18\20\22\24" gives the first three letters of
the month that was found, and "\26" is the day of the month.

I guess that would be done better with

    \regex_const:Nn \c_date_regex
      { (Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|...) \ (\d\d?) }

then extracting the first three letters of \1 for the month, and \2 now
holds the day. Note that I had to use non-capturing groups (?: ... ),
otherwise the last submatch would be \14, again too large.

The best solution for that particular example may be for me to finally
implement (?| ... ) groups, namely non-capturing groups where the
submatch number is reset for each alternative:

    \regex_const:Nn \c_date_regex
      { ( (?|Jan(uary)?|Feb(ruary)?|...) ) \ (\d\d?) }

Then the interesting submatches are \1 and \3.

All in all, I don't know what's best. Perhaps provide a \g{...} (or
whatever is standard) for submatches > 9?

--
Bruno
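
The group numbering in the date examples above can be checked with a small sketch. It uses Python's `re` module rather than l3regex, so it only illustrates how submatches are counted; it says nothing about l3regex's actual behaviour. The patterns are cut down to two months, and `day_group_number` and `append_digit` are hypothetical helper names made up for this illustration. Python's `re` has no (?| ... ) groups, so that variant is not shown.

```python
import re

# Two months only, to keep the sketch short. Each further month written as
# (Abc)(rest)? would add two more capturing groups.
capturing = re.compile(r"((Jan)(uary)?|(Feb)(ruary)?) (\d\d?)")
non_capturing = re.compile(r"(Jan(?:uary)?|Feb(?:ruary)?) (\d\d?)")


def day_group_number(months):
    """Number of the day group when every month is written (Abc)(rest)?:
    one outer group, two groups per month, then the day group itself."""
    return 1 + 2 * months + 1


def append_digit(text):
    r"""Insert submatch 1 followed by a literal 7. Python spells this
    \g<1>7, because \17 would be read as a reference to group 17."""
    return re.sub(r"(\w+)", r"\g<1>7", text)


m = capturing.match("February 17")
n = non_capturing.match("February 17")

print(capturing.groups, m.group(4), m.group(6))
print(non_capturing.groups, n.group(1)[:3], n.group(2))
print(day_group_number(12))
print(append_digit("ab"))
```

With all twelve months in the fully capturing form, the same count puts the day in group 26, as stated above. The `\g<1>` spelling is Python's way of separating a group reference from a following digit, which is the problem a \g{...} syntax would address.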