User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101
            Thunderbird/24.5.0
MIME-Version: 1.0
References: <CA+jHFwStUS3Pwp8tYOm2kCvk+ZhaiJ2k4PEuw9g7dzbVnRuKGQ@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Message-ID:  <537C728F.8000604@latex-project.org>
Date:         Wed, 21 May 2014 11:31:59 +0200
Reply-To: Mailing list for the LaTeX3 project
              <LATEX-L@LISTSERV.UNI-HEIDELBERG.DE>
Sender: Mailing list for the LaTeX3 project <LATEX-L@LISTSERV.UNI-HEIDELBERG.DE>
From: Frank Mittelbach <frank.mittelbach@LATEX-PROJECT.ORG>
Subject: Re: Unicode math
To: LATEX-L@LISTSERV.UNI-HEIDELBERG.DE
In-Reply-To:  <CA+jHFwStUS3Pwp8tYOm2kCvk+ZhaiJ2k4PEuw9g7dzbVnRuKGQ@mail.gmail.com>
Precedence: list
Envelope-To: <rainer.schoepf@GMX.NET>
Status: R

In my opinion the Unicode consortium has not screwed up (backspace 
backspace backspace ...) has not found the best possible solution for 
math, and there is no way to *properly* reconcile the two worlds.

Unicode started out as an attempt to codify plain text letters of all 
languages. One of the most important axioms in that respect was the idea 
that a "letter" is an abstract entity, e.g., Latin-small-a, and that 
different glyphs in fonts all represent that single entity "a" 
regardless of the shape or form it takes. So attributes like bold or 
serif/sans etc. are all outside the scope of Unicode encoding.

That makes sense if you try to convey textual meaning, as "word" has a 
meaning regardless of being in italics or bold or both. (Of course such 
attributes extend the semantics, e.g., bold may indicate a heading or 
italic some emphasis, but underlying that, "word" still has a meaning of 
its own in a language.)

The problem with math, though, is that symbols in math are traditionally 
not just defined by an abstracted shape; the mathematical community 
early on used additional attributes of glyphs to convey semantics. So 
bold-lowercase-latin-letters may denote vectors, and in one formula an 
integral symbol and a bold-integral may have totally different 
semantics. On top of that, the semantics may change from field to field 
or even from paper to paper (so other than calling it a bold-integral 
there is no way to describe such symbols semantically).

The problem with this is that mathematicians have come up with using 
effectively any kind of symbol/letter to denote specific semantics and 
long ago started to use all kinds of attributes (that Unicode, on the 
level of plain text, regards as irrelevant) to indicate semantics too. 
The main point here is that the moment that happens, the attributes 
become frozen and symbol+attribute combinations become relevant symbols 
in their own right.

As a result, to express the language of mathematics Unicode would have 
needed to codify all kinds of letter/symbol+attribute(s) combinations as 
individual code points, which is a difficult if not impossible task.

Nevertheless, they went for this approach to some extent by codifying 
mathematical alphabets (mainly digits+a-z+A-Z plus some Greek) and of 
course a large number of symbols.

In the Unicode book it says:

The alphabets in this block encode only semantic distinction, but not 
which font will be used to supply the actual plain, script, Fraktur 
[...] Characters from the Mathematical Alphanumeric Symbol block are not 
to be used for nonmathematical styled text.

All mathematical alphanumeric symbols have compatibility decompositions 
to the base Latin and Greek letters. This does not imply that the use of 
these characters (I guess the base ones - Frank) is discouraged for 
mathematical use. Folding away such distinctions [..] is usually not 
desirable, however, as it loses the semantic distinction for which these 
characters are encoded.

That is all true and sensible. Explicitly encoding that something is a 
math-calligraphic S and not just a Latin-S (that happens to be in some 
calligraphic font) is desirable when passing data from one application 
to the next, as the font information is likely to be lost and with it 
the semantics.
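
A minimal sketch of the folding the quote warns about, assuming Python 
and its standard unicodedata module (my choice for illustration, not 
something from the Unicode book). NFKC normalization applies the 
compatibility decomposition, so the math script S collapses to a plain 
Latin S and the distinction is gone:

```python
import unicodedata

# U+1D4AE sits in the Mathematical Alphanumeric Symbols block.
script_s = "\U0001D4AE"
script_s_name = unicodedata.name(script_s)

# NFKC applies compatibility decompositions ("folding").
folded = unicodedata.normalize("NFKC", script_s)

print(script_s_name)  # the character's formal Unicode name
print(folded)         # what is left after folding
```

Any application in the chain that normalizes this way hands on a plain 
letter, which is exactly the loss described above.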

However, it is by no means offering a full codification of mathematical 
semantics, so at the end of the day you may end up with a mixture of 
"properly" encoded material + stuff that lost the semantic distinction.

The good part is that it covers a lot, but it is not comprehensive by 
any means and can't be, due to the approach chosen.

It reminds me a bit of a talk I heard recently where somebody was 
advocating the use of Unicode subscript/superscript digits to avoid 
having to type _2 or ^3, arguing that this is easier, nicer and more 
readable. Well, to me it isn't: the moment you get to real math it gets 
inconsistent and you end up with mixed syntax.
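
To illustrate the mixed syntax, here is a rough sketch in Python. The 
helper name exponent and the hand-written digit table are made up for 
this example. Purely numeric exponents can use the superscript digits, 
anything else has to fall back to the ^{...} notation:

```python
# Superscript digits 0-9 (hand-written table, for illustration only).
SUPERSCRIPT_DIGITS = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)

def exponent(base: str, exp: str) -> str:
    """Use superscript digits if exp is all ASCII digits, else ^{...}."""
    if exp and all(c in "0123456789" for c in exp):
        return base + exp.translate(SUPERSCRIPT_DIGITS)
    return base + "^{" + exp + "}"

numeric = exponent("x", "3")     # superscript-digit form
general = exponent("x", "a/b")   # falls back to the ^{...} form
print(numeric, general)
```

So the same document ends up with two notations for the same operation, 
depending only on what happens to be in the exponent.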

For the same reason I believe that it would have been better to approach 
math alphabets differently in Unicode and, instead of codifying a few 
(with limited letter sets), acknowledge the fact that this "language" 
has a meta level where symbol+attribute encode semantics and not just 
the symbol as such.

Anyway, this is neither here nor there, as this is what Unicode offers nowadays.

So where does it fail?

  - in case of attributed mathematical symbols, most prominently using 
bold as offered by the bm package, resulting in new symbols as far as 
the semantics are concerned

  - in case of multi-letter symbols (that require a fixed font (i.e., 
frozen attributes) but with kerning for aesthetic reasons)

  - in case of using alphabets which have not been considered (like two 
distinct calligraphic alphabets in parallel, or Old German \neq 
Fraktur (as my algebra prof did), or Cyrillic, or ...)

  - in not supporting diacritics for those alphabets (a minor case, 
though)
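
The first and third points can be checked roughly against the character 
names known to Python's unicodedata module. This is only a sketch: the 
helper math_bold_names is made up here, and the result depends on the 
Unicode version that ships with the Python in use.

```python
import sys
import unicodedata

def math_bold_names():
    """Collect the names of all characters named 'MATHEMATICAL BOLD ...'."""
    names = []
    for cp in range(sys.maxunicode + 1):
        name = unicodedata.name(chr(cp), "")  # "" for unnamed code points
        if name.startswith("MATHEMATICAL BOLD "):
            names.append(name)
    return names

bold = math_bold_names()
has_bold_a = "MATHEMATICAL BOLD SMALL A" in bold
has_bold_integral = any("INTEGRAL" in n for n in bold)
has_bold_cyrillic = any("CYRILLIC" in n for n in bold)
print(has_bold_a, has_bold_integral, has_bold_cyrillic)
```

A bold Latin letter has a code point of its own; a bold integral or a 
bold Cyrillic letter can only be expressed as symbol plus font attribute.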

LaTeX2e's math support codified most of the needs of the mathematics 
language, albeit only within its domain (that is, within the LaTeX 
syntax), i.e., it did not support any Unicode code points for math (as 
they didn't exist). So something like \mathbf was defining individual 
bold math letters (for which Unicode now has its own code points as long 
as they are basic Latin), but it was also offering this for word-like 
symbols such as \mathbf{Set}.

So if one now maps that to a full-fledged text font that supports 
kerning, you lose the code point semantic distinction outside LaTeX, and 
if you map it to the Unicode plane then you have to manually deal with 
kerning for multi-letter sequences (which is non-trivial and can't be 
perfect) or live with horrible spacing.
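
A small sketch of the second option, again in Python, with a made-up 
helper to_math_bold that maps basic Latin letters onto the bold math 
alphabet by offset. It shows what happens to \mathbf{Set} on the code 
point level; the kerning itself is a font matter and cannot be shown 
this way.

```python
import unicodedata

def to_math_bold(text: str) -> str:
    """Map A-Z and a-z to the bold math alphabet; leave the rest alone."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)

bold_set = to_math_bold("Set")
names = [unicodedata.name(c) for c in bold_set]
round_trip = unicodedata.normalize("NFKC", bold_set)
print(names)
print(round_trip)
```

The word-like symbol becomes three independent single-letter math 
symbols, so nothing in the encoding says they belong together as one 
name that ought to be typeset like a word.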

Or you need to change the interface in LaTeX and offer different 
commands, or you change the internals and distinguish between 
single-letter and multi-letter arguments. Or ...

frank