From: "Frank Mittelbach"
Sender: "Mailing list for the LaTeX3 project"
To: "Multiple recipients of list LATEX-L"
Subject: hyphenation morass
Date: Fri, 9 Feb 2001 16:38:29 +0100
Message-ID: <14980.3829.670026.478733@istrati.zdv.uni-mainz.de>

I've just looked again at the hyphenation patterns available for TeX, and
every time I do that I'm shocked by what I find (in several respects).

One of the big problems I see is that for most patterns out there it is
not at all clear for which font encoding they were produced.

With very few notable exceptions, all of them are encoded using some sort
of hard-coded table, i.e. ^^ notation, so that they are valid only for a
single font encoding.

This seems very unfortunate: if they were stored in a different format, it
would be possible to apply them to different font encodings.
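To make the ^^ notation concrete, here is a minimal, hypothetical sketch (the slot value shown assumes a T1-encoded font; the pattern itself is for illustration only):

```latex
% ^^xy stands for the raw character in slot 0xXY of whatever font
% encoding happens to be current when the patterns are read.
% In T1, slot 0xFC holds u-umlaut, so
\patterns{.r^^fc8stet}
% encodes a German pattern -- but only for T1.  In a 7-bit encoding
% such as OT1 no such slot exists at all, and in another 8-bit
% encoding the same slot may hold a completely different glyph.
```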
Take, for example, T1 and LY1, both of which contain all the glyphs needed
for a number of languages. A pattern set for French, or German, or Polish,
or ... should therefore be usable with any fonts in either encoding. But
unfortunately it is not, because the patterns refer to things like ^^b9,
meaning \'z (this is an example from the plhyph.tex file).

If we replaced such patterns by patterns looking like .\'c\'z8, we would
be able to reuse them for several font encodings, provided the internal
LaTeX representations \'c etc. do the right thing within the \patterns
command.

Analysing the behaviour of the font-encoding-specific commands, we can see
the following:

 \DeclareTextComposite and \DeclareTextSymbol are fine, as they expand in
 the mouth (using old TeX terminology) to a font position, which is what
 we want.

 But a top-level \DeclareTextCommand (such as \L in OT1) is likely a
 problem, and so, most likely, is anything done via
 \DeclareTextCompositeCommand.

 Finally, there is \DeclareTextAccent, which is also not suitable by
 default in a \patterns declaration, since it results in a call to the
 \accent primitive.

Before discussing what could be done here, let me first explain what is
currently being done with some of the hyphenation files.

A convention found in several files is to surround potentially problematic
patterns with a command \n which, depending on the encoding used, is
defined either as \def\n#1{} (i.e. ignore) or as \def\n#1{#1} (i.e. use).

In other words, you have a hyphenation file, say for German, which can be
used with the T1 encoding but also with the OT1 encoding, by simply
removing all patterns which contain references to umlauts or sharp s. I'm
not sure whether the resulting pattern file is the optimum possible for an
encoding like OT1 (actually I doubt it), but it is certainly accurate
enough to be usable. Perhaps Bernd Raichle can comment on this.
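The \n convention just described can be sketched as follows (a minimal illustration, not taken from any actual pattern file; the hex slot assumes T1):

```latex
% Chosen once, before the pattern file is \input:
\def\n#1{}        % OT1: drop every pattern passed through \n
%\def\n#1{#1}     % T1:  keep them all
\patterns{%
  .ab4aus         % plain ASCII pattern, usable in any encoding
  \n{.r^^fc8stet} % contains u-umlaut, valid only under T1
}
```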
The problem with this approach, in my eyes, is that you have to know
beforehand which of the patterns are impossible to use in a certain
encoding. This is trivial if you design the file for a fixed (preferably
small) number of encodings, but in practice impossible if you do not know
which encodings it will be applied to.

So what would the alternatives be?

========================

Here is my idea, which most likely needs some further refinement.

Suppose we have the pattern \patterns{.r\"u8stet} (taken from the German
hyphenation file).

Suppose further that for each encoding we have defined a code point which
is not a letter, say, the position of `!' (the latter might be a bad
choice, I don't know). For encodings which do not encode 256 characters we
should be able to choose a code point outside the encoding itself. Let's
call this character X.

During pattern reading we then map the \add@accent command (which is what
is finally used in case of an internal representation for an accent that
is not also a composite) to

  \def\add@accent#1#2{X}

so what we get is the pattern \patterns{.rX8stet}, which is an impossible
combination (especially if X lies outside the encoding range).

\DeclareTextCommand and \DeclareTextCompositeCommand would need to be
handled in a similar fashion (which would require some small changes to
the internals of NFSS, since at the moment \DeclareTextSymbol internally
calls \DeclareTextCommand etc., which would then no longer be
appropriate).

The downside of this approach is that for encodings which would invalidate
a large number of patterns this way, we unnecessarily store spurious
patterns.
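The remapping step could be sketched roughly like this (an outline only; the file name is hypothetical, and the real solution would have to hook into the NFSS internals rather than redefine \add@accent wholesale):

```latex
\begingroup\makeatletter
  % While the patterns are read, any accent that the current encoding
  % cannot resolve to a single font slot (and that therefore ends up
  % in \add@accent) collapses to the impossible character X; the
  % whole pattern containing it then simply never matches:
  \def\add@accent#1#2{X}
  \fontencoding{OT1}\selectfont
  \input german-hyphenation-patterns % hypothetical file name
  % under OT1, \patterns{.r\"u8stet} is thus read as \patterns{.rX8stet}
\endgroup
```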
On the positive side, I can go and say

  \fontencoding{LY1}\selectfont
  \input german-hyphenation-patterns

and automatically get a set of patterns suitable for LY1.

The reason I bring this up is that if one extends the notion of "current
encoding" to something like "current encodings suitable for a language",
then one needs to have hyphenation patterns for all language/font
combinations (at least that would be desirable).

The technical support for this approach doesn't seem very difficult to
provide, but I don't know whether there would be enough people willing to
actually look at the hyphenation files out there and bring them into a
suitable source form. The latter would be necessary, in my opinion, to
some extent anyway; there are a number of such files which would result in
a very strangely behaving LaTeX format if one actually added them.

frank
