Date: Wed, 10 Feb 2010 12:22:43 +0000
From: Chris Rowley
Reply-To: Mailing list for the LaTeX3 project
Subject: Re: String module
To: LATEX-L@listserv.uni-heidelberg.de
In-Reply-To: <4B729944.5050308@residenset.net>
References: <4B727378.8060704@morningstar2.co.uk>, <4B729944.5050308@residenset.net>

Apologies for the brevity.

I have not had time to look at the details from Lars but his emphasis
on analysing the many different TeXie things that look like a string is
spot on.

Also note that (as maybe someone already pointed out) in general a
'string of Unicode characters' is itself a rather slippery beast. Thus
when you 'put Unicode inside TeX' (whatever meaning you give that
phrase) strings could be even more underspecified than Lars' list shows.

Cheers,

chris

-----Lars Hellström wrote: -----
To: LATEX-L@LISTSERV.UNI-HEIDELBERG.DE
From: Lars Hellström
Date: 10/02/2010 11:32
Subject: Re: String module

Joseph Wright wrote:
> Hello all,
>
> One of the questions that was raised recently on c.t.t concerning the
> currently available LaTeX3 modules was the lack of "strings"
> functionality.
[snip]
> The first "big" question is what exactly is a string in a TeX context.

Indeed, there are several possible interpretations:

(1) Sequences of character tokens
(2) Sequences of character tokens with normalised catcodes
(3) Sequences of characters from some alphabet (possibly large),
    representation not necessarily native
(4) Sequences of LICRs.

Which you want to use depends on what is being targeted, i.e., what
strings are going to be used for. \write, \special, and \csname are
probably the main consumers, and since these want (1), that's probably
the main thing to support. But it is also important to (eventually)
provide conversions between different string-like concepts. One should
not expect one size to fit all.

> You also have to worry about what happens about special characters (for
> example, how do you get % into a string). If you escape things at the
> input stage [say \% => % (catcode 12)] then a simple \detokenize will
> not work.

For manual entering of string data, one might well find that (3) or (4)
is most practical...

> On features, things that seem to be popular:
>  - Substring functions such as "x characters from one end", "first x
> characters", etc.
>  - Search functions such as "where is string x in string y".

...whereas searching typically requires (2).

I would suggest that the core string module would primarily operate on
the (1) kind of string, possibly requiring (2) for some operations, and
providing the necessary conversion operation 1->2 (trusting the user to
apply it where necessary, rather than building it into each and every
operation just to be on the safe side).

Heiko Oberdiek wrote:
> * Encoding conversions, see package `stringenc'.
>   Application: PDF (outlines and other text fields).

This is, at least for the input, rather (3) or (4). Or are you
anticipating character sets larger than ^^@--^^ff for the underlying
engine? Then one conversely needs an "octet string" concept, for
\special and the like.
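
As a rough illustration of why searching wants (2) rather than (1),
here is a minimal Python sketch. It is not TeX code and not anything
proposed in this thread: it simply models a token as a
(character, catcode) pair. The names Token, typed, normalise and find
are made up for the illustration, and the normalisation rule (catcode
12 for everything except the space, which gets catcode 10) is assumed
to mirror what \detokenize produces.

```python
from typing import List, NamedTuple

class Token(NamedTuple):
    """Hypothetical model of a TeX character token."""
    char: str
    catcode: int

SPACE, LETTER, OTHER = 10, 11, 12

def typed(text: str) -> List[Token]:
    # Kind (1): tokens as they might arrive from ordinary input,
    # with letters at catcode 11 and everything else at catcode 12.
    return [
        Token(c, SPACE if c == " " else LETTER if c.isalpha() else OTHER)
        for c in text
    ]

def normalise(tokens: List[Token]) -> List[Token]:
    # The 1->2 conversion: every character becomes catcode 12,
    # except the space, which becomes catcode 10.
    return [Token(t.char, SPACE if t.char == " " else OTHER) for t in tokens]

def find(needle: List[Token], haystack: List[Token]) -> int:
    # Position of the first occurrence of needle in haystack, or -1.
    # Tokens are compared on both character and catcode.
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1

haystack = typed("foo bar")        # letters carry catcode 11
needle = normalise(typed("bar"))   # catcode 12, as after detokenizing

print(find(needle, haystack))             # -1: catcodes differ
print(find(needle, normalise(haystack)))  # 4: both sides normalised
```

Leaving the call to normalise to the caller, as in the last two lines,
corresponds to the suggestion above of trusting the user to apply the
1->2 conversion where necessary.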

> * Matching (replacing) using regular expressions,
>   see \pdfmatch and luaTeX.
>   Matching is useful for extracting information pieces or
>   validating option values, ...

Be aware that matching is one thing, extracting information pieces a
somewhat trickier concept, and replacing even more so (from the CS
theory point of view).

>   Unhappily \pdfmatch has still the status "experimental"
>   and the regular expression language differs from Lua's.

The last time I looked, Lua's "regular expressions" were not regular
expressions[*] at all, but rather a kind of beefed-up glob pattern (with
a regexp-like syntax), so I wouldn't be sad if LaTeX was to deviate from
Lua in that respect. I would be sad if something is called regular
expression that really isn't.

Lars Hellström

[*] There are several equivalent and perfectly formal definitions of
what it means to be "regular" as in "regular expression", the most
familiar of which is probably that a regular language is one that can
be recognised by a finite automaton. POSIX regexps are very close to
this (the only irregular feature being backreferences), whereas Perl's
"regexps" are way out in context-free-land. Lua's matching engine,
OTOH, is too weak to recognise arbitrary regular languages.
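
To make the footnote's distinction concrete, here is a small Python
sketch; it is only an illustration, with Python's re module standing in
for a regexp engine, and the names TRANSITIONS, dfa_accepts, PATTERN
and BACKREF are invented for it. The first part recognises the regular
language (ab|c)* with a hand-written finite automaton and checks that
it agrees with the corresponding pattern. This language needs
alternation and repetition of a group, which Lua patterns do not offer.
The second part uses a backreference to match words of the form
a^n b a^n, a language that no finite automaton can recognise.

```python
import re

# Finite automaton for (ab|c)*.
# State 0: start state, also the only accepting state.
# State 1: an "a" has been read and a "b" must follow.
TRANSITIONS = {(0, "a"): 1, (0, "c"): 0, (1, "b"): 0}

def dfa_accepts(word: str) -> bool:
    state = 0
    for ch in word:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition defined: reject
            return False
    return state == 0

PATTERN = re.compile(r"(?:ab|c)*")

# The automaton and the pattern agree on every sample word.
for word in ["", "ab", "cabc", "abab", "a", "ba", "abca"]:
    assert dfa_accepts(word) == (PATTERN.fullmatch(word) is not None)

# A backreference goes beyond regular languages: \1 must repeat
# exactly what the group (a+) matched, so this accepts a^n b a^n.
BACKREF = re.compile(r"(a+)b\1")

print(BACKREF.fullmatch("aaabaaa") is not None)  # True
print(BACKREF.fullmatch("aaabaa") is not None)   # False
```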