From: Lars Hellström
Date: Thu, 13 Oct 2011 17:25:35 +0200
Subject: Re: Strings, and regular expressions
To: LATEX-L@listserv.uni-heidelberg.de
Reply-To: Mailing list for the LaTeX3 project

Bruno Le Floch wrote on 2011-10-12 04:59:
>>> I looked at that this afternoon. Would that be the right framework for
>>> code pretty-printing similar to listings/minted (but hopefully more
>>> powerful)?
>>
>> Strongly yes. As Ford mentions in his POPL paper on PEGs, the traditional
>> context-free grammars suck at expressing things like "an identifier is a
>> _longest_ sequence of consecutive alphanumeric characters", which is why
>> there is typically a separate tokenising step before the parser proper.
>> PEGs, on the other hand, have no problem with parsing from the character
>> level and up in one go. For pretty-printing, I'd expect it to be a great
>> convenience to not have to do it in two steps.
>
> Do you have any idea on the natural resulting data structure? It's not
> clear how trees should be implemented in TeX.

I would suggest just returning a token sequence, into which the TeX
commands the user specified have been inserted to reflect the aspects of
the structure that he's interested in.
A silly example might be something like

  \begin{Sentence}\begin{NounPhrase}My
  \begin{Noun}hovercraft\end{Noun}\end{NounPhrase}
  \begin{VerbPhrase}\begin{Verb}is\end{Verb} full
  \begin{PrepositionalPhrase}of
  \begin{Noun}eels\end{Noun}\end{PrepositionalPhrase}\end{VerbPhrase}.\end{Sentence}

It seems to be a fairly standard extension that something inside braces
in a PEG means "When passing through here upon matching, then insert this
piece of material." An awkward point, however, is that you'd often want to
insert an opening brace at one point and a closing brace a bit later --
for example to make that \Noun{hovercraft} rather than the less flexible
\begin{Noun}hovercraft\end{Noun} -- and it is not obvious how to express
that.

> Also, PEGs seem to lead
> to endless trouble with left recursion. I'm a little bit afraid of
> fighting an open problem :).

I'd be inclined to consider a grammar with a left recursion loop a
programming error, just like it is for ordinary TeX macros. There is a
transformation of PEGs that turns left recursion into runtime match
failures rather than infinite loops, but I don't know how complicated it
is, and I suspect it's mainly interesting when the PEG effectively gets
compiled to machine code. LaTeX already runs inside the TeX virtual
machine, so an infinite loop is not that fatal.

[snip]

> I haven't thought about doing it as a pure NFA. I think I'll go for
> the most economical solution of running the assertion automaton on the
> string, since that feature will probably not be used too much. I still
> need to think of a good way of matching backwards, though (I believe
> it is simply reversing all the transitions in the automaton,

And exchanging the sets of initial and final states. Yes, that's all
there is to it. (I assume you don't worry about submatch capturing there
anyway.)

> but the
> representation I have makes that non-trivial).

It is often there that the crux lies.
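
To make the reversal concrete, here is a minimal Python sketch. It is only
an illustration under stated assumptions: the dict-of-transitions layout,
the function names reverse_nfa and accepts, and the sample automaton for
ab*c are all hypothetical and have nothing to do with the representation
actually used in the regex code. Submatch capturing is ignored.

```python
from collections import defaultdict


def reverse_nfa(transitions, initial, final):
    """Return (transitions, initial, final) of the reversed automaton.

    Assumed layout: transitions maps (state, symbol) to a set of target
    states; the symbol None stands for an epsilon transition.
    """
    reversed_transitions = defaultdict(set)
    for (source, symbol), targets in transitions.items():
        for target in targets:
            # Flip every arrow: target --symbol--> source.
            reversed_transitions[(target, symbol)].add(source)
    # The sets of initial and final states exchange roles.
    return dict(reversed_transitions), set(final), set(initial)


def accepts(transitions, initial, final, text):
    """Simulate the NFA on text by tracking the set of active states."""

    def closure(states):
        # Follow epsilon transitions until no new state is reached.
        stack, seen = list(states), set(states)
        while stack:
            state = stack.pop()
            for nxt in transitions.get((state, None), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    active = closure(initial)
    for char in text:
        moved = set()
        for state in active:
            moved |= transitions.get((state, char), set())
        active = closure(moved)
    return bool(active & set(final))


# Hypothetical automaton for the regular expression ab*c.
forward = {
    (0, "a"): {1},
    (1, "b"): {1},
    (1, "c"): {2},
}
backward = reverse_nfa(forward, initial={0}, final={2})
```

With this layout the reversed automaton accepts exactly the reversed
strings, so matching backwards from a position amounts to feeding it the
preceding characters in reverse order. The cost is one pass over all
transitions, which is cheap here only because the table is indexed by
source state and symbol; a representation that stores transitions
differently may well make the flip non-trivial, as noted above.
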

[other mail]

>> > As Will says, an alternative is simply to save all regexes
>> > automatically, and check for the existence of the regex before building
>> > it. That of course costs in terms of macros, so the question is how many
>> > regexes are likely to be used. (We are talking about a typesetting
>> > system, so really this should not normally be 100s.)
> You guys are right. I'll add this automatic storage this weekend, and
> remove the N variants (since they will be done automatically).

I'm not so fond of that idea. I'd expect a bunch of regexps to be
one-timers used during package initialisation, and keeping all of those
around forever feels like it will be a lot of bloat. I'd prefer having
both N and n variants, as in the initial version.

Lars Hellström