User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:31.0)
            Gecko/20100101 Thunderbird/31.4.0
MIME-Version: 1.0
References: <54D75835.1070804@morningstar2.co.uk>
Content-Type: text/plain; charset=utf-8
Message-ID:  <54D7E011.2080604@morningstar2.co.uk>
Date:         Sun, 8 Feb 2015 22:15:45 +0000
Reply-To: Mailing list for the LaTeX3 project
              <LATEX-L@LISTSERV.UNI-HEIDELBERG.DE>
Sender: Mailing list for the LaTeX3 project <LATEX-L@LISTSERV.UNI-HEIDELBERG.DE>
From: Joseph Wright <joseph.wright@MORNINGSTAR2.CO.UK>
Subject: Re: expl3 case changing functions
To: LATEX-L@LISTSERV.UNI-HEIDELBERG.DE
In-Reply-To:  <54D75835.1070804@morningstar2.co.uk>
Precedence: list
Envelope-To: <rainer.schoepf@GMX.NET>
Content-Transfer-Encoding: 8bit
Status: R

On 08/02/2015 12:36, Joseph Wright wrote:
> Hello all,
> 
> A few months ago now we added various expandable case changing functions
> to expl3 with clearly 'experimental' status. I've recently had some
> useful feedback on aspects of the behaviour and have revised some of the
> code. I've now got some more questions, so thought it would be useful to
> raise those here. (Note: I've updated the SVN code but this has yet to
> go to CTAN. I can arrange a release if people want to test but not grab
> via GitHub.)
> 
> *Background*
> 
> The current implementation has six functions
> 
>   \tl_upper_case:n
>   \tl_lower_case:n
>   \tl_mixed_case:n
>   \tl_upper_case:nn
>   \tl_lower_case:nn
>   \tl_mixed_case:nn
> 
> where the two-argument versions deal with language-specific case
> changing. The functions are x-type expandable. 'Letters' can be case
> changed from the full Unicode range when using XeTeX/LuaTeX and the
> mappings do not have to be 1-1 (cf. \uppercase/\lowercase).
> 
> There is also \str_fold_case:n which does folding for programmatic
> applications. That function has a different set of use cases and is not
> considered further here.
> 
> *Escaping from case changing*
> 
> The current implementation follows a BibTeX-like convention for
> preventing case changing: braced content is not changed. In the original
> approach there was no mechanism to do case changing inside the argument
> to a command as a result. I have now altered this to include a list of
> commands where case changing should be applied, so for example it would
> be possible to arrange that
> 
>     \tl_upper_case:n { Hello~\emph{world} }
> 
> will case change the argument to \emph. At present, this functionality
> is designed to work with commands taking one argument (i.e. a second or
> subsequent argument will be unaffected).
> 
> The alternative to such an approach is to case change everything and
> provide an escape mechanism (cf. the textcase package and
> \NoCaseChange). As a user, I can see advantages to both approaches.
> 
> One thing that is not currently covered is dealing automatically with
> math mode content. That is doable but would require some consistent
> interface. In particular, while dealing with "$ ... $" and "\( ... \)"
> is straight-forward (single-token delimiters), it would be more
> challenging to cover "\begin{math} ... \end{math}" or similar. Some of
> this has a relationship to expandability: see the next area.
> 
> *Expandability*
> 
> The current implementation is expandable as this allows the 'natural' usage
> 
>     \tl_set:Nx \l_tmpa_tl
>       { \tl_upper_case:n { foo } }
>     \tl_show:N \l_tmpa_tl % => "FOO"
> 
> Expandability imposes some restrictions on the code and does have a
> performance knock-on. The need to deal with changes that are not 1-1 or
> have other context-dependence means that the performance aspect is not
> so important: a full solution using \uppercase/\lowercase would still
> require a mapping or similar to deal with all of the possibilities.
> 
> One area that is more tricky in this regard is input which is not fully
> expanded. For example
> 
>     \def\myname{Joseph Wright}
>     \MakeUppercase{Written by \myname}
> 
> will yield "WRITTEN BY JOSEPH WRIGHT" as there is an \edef inside the
> LaTeX2e command before case changing. In contrast, the expl3 functions
> currently do no expansion so
> 
>     \tl_upper_case:n { Written~by~\myname }
> 
> gives "WRITTEN BY Joseph Wright". Notably, if used in setting a token
> list the content would be "WRITTEN BY \myname", i.e. further expansion
> is inhibited.
> 
> It is not clear to me what the 'expected' outcome might be. It would be
> possible to use f-type expansion to deal with stored tokens before case
> changing, but for input such as
> 
>     \tl_upper_case:n { Written~by \\ Joseph~Wright }
> 
> that could break outcomes with LaTeX2e: \\ would be 'lost' and this
> could be problematic if the text was used later in, for example, a
> center environment. A non-expandable implementation could use the same
> logic as \MakeUppercase but at the cost that case changing for storage
> would then need dedicated functions for example
> 
>     \tl_set_upper_case:Nn
>     \tl_set_lower_case:Nnn
> 
> This loses the 'natural' approach to case changing inside a tl setting
> and requires separate 'set a tl with case changing' and 'typeset case
> changed text' functions.
> 
> *LICR/Non-native input*
> 
> The original implementation for the expl3 functions only case changes
> letters. Adding an 'escape' to cover e.g. \emph also allows coverage of
> things like "\'{e}" and so it was natural to consider LICR input. I have
> therefore extended the code to allow coverage of everything handled by
> \MakeUppercase when T1/T2A/T2B/T2C/T4/T5/LGR encodings are in use. There
> is of course a performance hit, but this should be comparable to that
> for processing letters.
> 
> That then leaves the question of input outside of the ASCII range when
> using pdfTeX. It would I think be possible to do this using an approach
> detecting inputenc active chars, but I am reluctant to go this way (in
> the longer term it will be increasingly hard to justify using an 8-bit
> program as the world standardises on Unicode). With inputenc loaded case
> changing does work if the input goes via LICR
> 
>     \documentclass{article}
>     \usepackage[utf8]{inputenc}
>     \usepackage{expl3}
>     \makeatletter
>     \ExplSyntaxOn
>     \cs_generate_variant:Nn \tl_upper_case:n { V }
>     \cs_new_protected:Npn \MakeExplUpperCase #1
>       {
>         \group_begin:
>           \protected@edef \l_tmpa_tl {#1}
>           \tl_upper_case:V \l_tmpa_tl
>         \group_end:
>       }
>     \ExplSyntaxOff
>     \makeatother
>     \begin{document}
>     \MakeExplUpperCase{Héllo}
>     \end{document}
> 
> Again, this has a link to expandability.
> 
> *Naming*
> 
> As noted in previous mails on this topic, the naming here (\tl_...) at
> least in part reflects the fact that this code is difficult to name.
> Any better naming schemes welcome!
> 
> *Conclusions*
> 
> The current code works but there are open questions. What I am hoping
> for is feedback on the ideas and in particular what issues come up with
> real use cases. Ideas about all or any of the above, or indeed other
> aspects, most welcome.

I've had some feedback via other channels and will summarise here 'for
the record'. (Sources: transcript
http://chat.stackexchange.com/transcript/message/19958526#19958526
onward and direct mail.)

*Escaping from case changing*

David Carlisle points out that using the BibTeX-like approach leaves a
problem with ligatures. Whilst input such as

    {Text}

rather than

    {T}ext

does help, the alternative route taken by textcase

    \NoCaseChange{Text}

allows for the 'escape' mechanism to be entirely transparent at the
typesetting stage (as the appropriate commands can be equivalent to \use:n).
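
For reference, a minimal sketch of the textcase route, using the
MacArthur case discussed next. Note that textcase itself spells the
escape command \NoCaseChange. The document below is purely illustrative
and is not part of the expl3 code:

    \documentclass{article}
    \usepackage{textcase}
    \begin{document}
    % Only the marked 'ac' is protected from upper casing; no bare
    % brace group is left inside the word at the typesetting stage.
    \MakeTextUppercase{M\NoCaseChange{ac}Arthur}
    \end{document}

This should typeset as MacARTHUR.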

Barbara Beeton provides a useful example where a brace group is
'trapped' inside a word with the BibTeX-like scheme. For example

    MacArthur => MacARTHUR

requires input

    M{ac}Arthur

with the current set up, and there is no way to write this that avoids
a ligature break.

I am therefore minded to alter the approach in this area to follow
textcase. Such a change, if made, will include adding a sensible set of
standard commands to the 'ignore list' (\label, \ref, ...).

Adopting a textcase-like approach also suggests that automatically
handling math mode might be desirable. A first pass for that might well
be based on matching single-token delimiters ($...$/\(...\) as standard
settings), on the logic that more complex arrangements will be best
covered by the \NoCaseChange concept.
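
As an illustration of the behaviour being borrowed, this is roughly what
textcase already does for math and for cross-reference arguments. It is
a sketch of the existing LaTeX2e package only; the label name is made up
for the example, and nothing here reflects a settled expl3 interface:

    \documentclass{article}
    \usepackage{textcase}
    \begin{document}
    \section{Energy}\label{sec:energy}
    % The text is upper cased; the argument of \ref and the content of
    % $...$ and \(...\) should be left unchanged.
    \MakeTextUppercase{see section~\ref{sec:energy},
      where $E = mc^2$ and \(p = mv\)}
    \end{document}

An environment such as \begin{math} ... \end{math} is not covered by
single-token matching, which is where an explicit escape command comes
in.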

*Expandability*

One approach suggested (again by David C.) to this area is to start with
an assumption of e-TeX (\robustify from the etoolbox package, for
example, can be used to make existing commands e-TeX protected). With that
assumption, it is relatively straight-forward to expand 'variable-like'
macros and leave 'command-like' ones alone. (I already have code that
does much the same in siunitx.)
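
A minimal sketch of that distinction, assuming etoolbox is available.
The names \myname, \myemph and \result are invented for the example, and
a plain \edef stands in for whatever expansion step the case changing
code would actually use:

    \documentclass{article}
    \usepackage{etoolbox}
    \newcommand*\myname{Joseph Wright}  % 'variable-like': stores text
    \newcommand*\myemph[1]{\emph{#1}}   % 'command-like': does something
    \robustify\myemph                   % now e-TeX \protected
    \begin{document}
    % \myname is expanded to its text, whereas the protected \myemph
    % is left in place together with its argument.
    \edef\result{Written by \myname, \myemph{honestly}}
    \typeout{\meaning\result}
    \end{document}

The log should show \myname replaced by the stored name and \myemph
still present as a command, so a later case change can treat the two
differently.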

Retaining an expandable approach does seem sensible as it allows what
many other languages do: case changing in a 'functional' sense (or
rather as a macro language in an x-type expansion sense). As already
noted, the need for contextual case mappings means that using the TeX
primitives directly still requires a separate mapping phase and so
performance issues are not so significant.

*LICR/Non-native input*

As the code here is being developed primarily for use to support future
work, and that will increasingly mean Unicode-native engines, comments
here suggest sticking to the 'ASCII/Unicode' line taken to date. As
such, pdfTeX use with non-ASCII input will need pre-processing via
\protected@edef as suggested to produce LICR data which can be handled
correctly.

Depending on other feedback, I will likely implement the above changes
over the coming days and then look to update the release code.
--
Joseph Wright