MIME-Version: 1.0
References: <CANQYN6xiwms0LdCgW-6AQ9Pok9qbtc6k0ZgN40z6RK9+a=RdWg@mail.gmail.com>
            <4E9AF462.1010401@morningstar2.co.uk>
Content-Type: text/plain; charset=ISO-8859-1
Message-ID:  <CANQYN6xZfGL=S+gZFkK74=38DHV_hsiOEHQtqeyqh4VRTT18Lg@mail.gmail.com>
Date:         Sun, 16 Oct 2011 21:36:47 -0400
Reply-To: Mailing list for the LaTeX3 project
              <LATEX-L@listserv.uni-heidelberg.de>
Sender: Mailing list for the LaTeX3 project <LATEX-L@listserv.uni-heidelberg.de>
From: Bruno Le Floch <blflatex@GMAIL.COM>
Subject: Re: Strings, and regular expressions
To: LATEX-L@listserv.uni-heidelberg.de
In-Reply-To:  <4E9AF462.1010401@morningstar2.co.uk>
Precedence: list
Status: R

On 10/16/11, Joseph Wright <joseph.wright@morningstar2.co.uk> wrote:
> On 10/10/2011 16:07, Bruno Le Floch wrote:
>> The l3str module provides functions to get the length of a string,
>> extract substrings or individual characters, and test for string
>> equality (the current \str_if_eq:nnTF). Some support for encodings is
>> provided: percent encoding, conversion from utf-8 to a string of
>> bytes, and most functions of Heiko Oberdiek's pdfescape package.
>> [...] highly welcome.
>
> Some comments having read the code and documentation.

Thank you Joseph for the cleanup.

> I don't like the name in \str_from_to:nnn - it sounds like a copy
> function. What's wrong with \str_substr:nnn or just \str_sub:nnn?

I couldn't think of an unambiguous name. \str_substr:nnn is fine.

> In the same function, the indexing is described as "\meta{start index}
> (inclusive) and \meta{end index} (exclusive)". This seems very odd to me
> - I'd expect
>
>   \str_from_to:nnn { abcdef } { 1 } { 4 }
>
> to leave "bcde" in the input stream.

I followed the Python convention, in which you think of the index as
lying between pairs of characters:

(0)a(1)b(2)c(3)d(4)e(5)f(6)

Hence, extracting from 1 to 4 gives "bcd". The advantage of doing it
that way is that the length of what you get is 4 - 1 = 3. Another
advantage is that getting the first <n> characters is easy:

  \str_substr:nnn { <string> } { 0 } { <n> }

A drawback is that getting all characters from a given point to the
end is

  \str_substr:nnn { <string> } { <n> } { \c_max_int }

rather than

  \str_substr:nnn { <string> } { <n> } { -1 }

Does that make sense?
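
For comparison, the Python behaviour this convention is modelled on can
be checked directly. This is only a minimal sketch: the helper name
substr is hypothetical, chosen for illustration, and is not the l3str
interface.

```python
def substr(s: str, start: int, end: int) -> str:
    """Hypothetical helper: start index inclusive, end index exclusive."""
    return s[start:end]


s = "abcdef"

# Indices lie between characters: (0)a(1)b(2)c(3)d(4)e(5)f(6)
middle = substr(s, 1, 4)            # the characters between positions 1 and 4
first_three = substr(s, 0, 3)       # the first <n> characters, with n = 3
tail = substr(s, 2, len(s))         # from position 2 to the end
tail_clamped = substr(s, 2, 10**9)  # Python clamps an oversized end index
drop_last = substr(s, 2, -1)        # in Python, -1 counts back from the end

print(middle, first_three, tail, tail_clamped, drop_last)
```

Note that Python itself reads an end index of -1 as "one before the
last character", so that slice drops the final character rather than
running to the end of the string. A very large end index plays the role
that \c_max_int plays above.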


> What's the reasoning for "\str_if_contains_char:NN" rather than just
> "\str_if_in:NN"?

The second N argument is not enough to know whether you expect a char
or a string variable.

Should I code an expandable \str_if_in:nn?

> I see you have a number of "UTF_viii" functions. I can see that you are
> covering any confusion with UTF-16, but would simply "UTF" be better?

No, although I do agree that "UTF_viii" is long :(. We will need
utf-16 to deal with PDF, as Heiko pointed out in a previous email.
Perhaps we should drop support for utf-8 and instead only support
utf-16?
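
To make the difference between the two encodings concrete, here is a
small Python sketch showing the bytes produced for one character. It
assumes the big-endian form of utf-16 preceded by a byte order mark,
which is how PDF text strings mark Unicode content as far as I
understand the format; the variable names are illustrative only.

```python
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

utf8_bytes = text.encode("utf-8")       # two bytes: 0xC3 0xA9
utf16_bytes = text.encode("utf-16-be")  # two bytes: 0x00 0xE9

# Assumption: big-endian utf-16 preceded by the byte order mark FE FF.
pdf_style = b"\xfe\xff" + utf16_bytes

print(utf8_bytes.hex(), utf16_bytes.hex(), pdf_style.hex())
```

The same character thus gives different byte strings in the two
encodings, so a conversion function has to know which one is wanted.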

> I also saw that the docs mentioned "\str_if_UTF_viii:N", which does not
> exist. I've removed it, as I think the docs and the code should match as
> much as possible.

Yes. I never got to implementing it :).

Should we lower-case "utf" in function names?

--
Bruno