Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID:  <199710211026.MAA05364@sponsor.iti.informatik.tu-darmstadt.de>
Reply-To: Mailing list for the LaTeX3 project
              <LATEX-L@RELAY.URZ.UNI-HEIDELBERG.DE>
Date:         Tue, 21 Oct 1997 12:26:11 +0200
From: Roger Kehr <kehr@ITI.INFORMATIK.TU-DARMSTADT.DE>
Sender: Mailing list for the LaTeX3 project
              <LATEX-L@RELAY.URZ.UNI-HEIDELBERG.DE>
To: Multiple recipients of list LATEX-L
              <LATEX-L@RELAY.URZ.UNI-HEIDELBERG.DE>
Subject:      Re: von v. van der & other problems
Status: R

Phillip Helbig writes:
 > Maarten Gelderman <mgelderman@ECON.VU.NL>:
 >
 > This is something which needs to be considered.  There are many ways to
 > treat things like \"o (this is the same as the letter marked above if it
 > gets garbled by some email software along the way).
 >
 >    o  treat it as a separate letter (in Swedish it's the last letter of
 >       the alphabet)
 >
 >    o  treat it as o (common practice in English, mixing it with things
 >       really written with o)
 >
 >    o  treat it as oe, mixing it with things REALLY written with oe
 >       (this is done in German telephone books)
 >
 >    o  put it immediately before o
 >
 >    o  put it immediately after o
 >
 >    o  put it immediately before oe
 >
 >    o  put it immediately after oe
 >
 > I've definitely seen the first three in use.  The rest are thinkable. In
 > German, sometimes (such as in address books, file collections etc) Th,
 > Ph, Chr, Sch, St etc, either as the beginnings of words or as initials
 > of names (one sees both), are treated as essentially separate letters,
 > usually coming after the first letter of the group (corresponding to
 > example 5 above).

The problem with these is that in German there are surnames such as
M\"uller as well as Mueller, so both spellings occur. If umlauts are
simply treated as their two-letter forms, the relative order of such
entries is still undefined, which might confuse the reader if several
people have names like these, especially in telephone books with dozens
of entries. We then need a rule that says: Phase 1: treat every \"o as
if it were oe. Phase 2: among entries that are still equal, put \"o in
front of oe (or vice versa). Hence, specifying rules like this is not
trivial.
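
A minimal sketch of such a two-phase rule in Python, to make the idea
concrete. It assumes names are written in the TeX notation used in this
mail (\"a, \"o, \"u) and arbitrarily puts the umlaut spelling first in
phase 2; the table UMLAUTS and the function german_key are illustrative
names only, not part of any existing tool.

```python
# Assumed input notation: TeX-style umlauts such as \"o.
UMLAUTS = {r'\"a': "ae", r'\"o': "oe", r'\"u': "ue"}


def german_key(name):
    """Return a (phase 1, phase 2) sort key for one name."""
    primary = []    # phase 1: every umlaut is replaced by its two-letter form
    secondary = []  # phase 2: 0 marks letters that came from an umlaut
    s = name.lower()
    i = 0
    while i < len(s):
        chunk = s[i:i + 3]
        if chunk in UMLAUTS:
            primary.append(UMLAUTS[chunk])
            secondary.extend([0, 0])
            i += 3
        else:
            primary.append(s[i])
            secondary.append(1)
            i += 1
    # Tuples compare element by element, so phase 2 only matters
    # when the phase 1 strings are identical.
    return ("".join(primary), tuple(secondary))


names = ["Mueller", r'M\"uller', "Muller", r'M\"oller']
print(sorted(names, key=german_key))
```

With this key M\"uller and Mueller are equal in phase 1 and are only
separated in phase 2. Swapping the 0 and 1 markers gives the "vice
versa" variant.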

The French sorting rules, for example, also require more than one
sorting phase: at first diacritical marks are not considered, so e and
\'e are equal in the first phase. If words such as cote and cot\'e are
then still tied, the diacritical marks define the exact order, but they
are compared lexicographically from right to left. This is rather
complicated to implement.
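
The two French phases can be sketched in the same way. This is only an
illustration under stated assumptions: accents are written in TeX
notation (\'e, \`e, \^o), and the ranking in ACCENT_RANK is an assumed
order of the marks, not taken from any standard. The name french_key is
hypothetical as well.

```python
# Assumed ranking of accent marks; 0 is used for "no accent".
ACCENT_RANK = {"'": 1, "`": 2, "^": 3, '"': 4}


def french_key(word):
    """Return a (phase 1, phase 2) sort key for one word."""
    letters = []  # phase 1: base letters with all accents removed
    accents = []  # one accent rank per base letter
    i = 0
    while i < len(word):
        if word[i] == "\\" and i + 2 < len(word) and word[i + 1] in ACCENT_RANK:
            accents.append(ACCENT_RANK[word[i + 1]])
            letters.append(word[i + 2].lower())
            i += 3
        else:
            accents.append(0)
            letters.append(word[i].lower())
            i += 1
    # Phase 2 compares the accents starting from the END of the word,
    # hence the reversal.
    return ("".join(letters), tuple(reversed(accents)))


words = [r"cot\'e", r"c\^ot\'e", "cote", r"c\^ote"]
print(sorted(words, key=french_key))
```

All four words are equal in phase 1. The reversed accent list then
yields the order cote, c\^ote, cot\'e, c\^ot\'e, because an accent near
the end of the word outweighs one near the beginning.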

I have done some work on the xindy index processor, which to a large
extent offers mechanisms to solve these problems. The current
implementation uses a string rewriting mechanism that operates in
several stages. What I learned from that project is that it
requires a lot of effort to obtain a complete and consistent
specification of sorting rules.

There exists an ISO Standard "ISO/IEC CD 14651 - International String
Ordering - Method for comparing Character Strings and Description of a
Default Tailorable Ordering" on this topic. But it does not offer
solutions for the PhD. stuff Maarten mentioned above.

I hope I haven't frustrated you.

Cheers
--Roger

P.S: The ISO standard is available at http://www.dkuug.dk/JTC1/SC22/WG20.
--
======================================================================
Roger Kehr                         kehr@iti.informatik.tu-darmstadt.de
Computer Science Department         Darmstadt University of Technology