From: "Frank Mittelbach"
Sender: "Mailing list for the LaTeX3 project"
To: "Multiple recipients of list LATEX-L"
Subject: hyphenation morass
Date: Fri, 9 Feb 2001 16:38:29 +0100
Message-ID: <14980.3829.670026.478733@istrati.zdv.uni-mainz.de>

I've just looked again at the hyphenation patterns available for TeX, and
every time I do that I'm shocked by what I find (in several respects).

One of the big problems I see is that for most patterns out there it is
not at all clear for which font encoding they were produced.

With very few notable exceptions, all of them are encoded using some sort
of hard-coded table, i.e. ^^ notation, so that they are valid only for a
single font encoding.

This seems very unfortunate: if they were stored in a different format, it
would be possible to apply them to different font encodings.
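To make the ^^ notation concrete, here is a minimal, hypothetical sketch (the slot value shown assumes a T1-encoded font; the pattern itself is for illustration only):

```latex
% ^^xy stands for the raw character in slot 0xXY of whatever font
% encoding happens to be current when the patterns are read.
% In T1, slot 0xFC holds u-umlaut, so
\patterns{.r^^fc8stet}
% encodes a German pattern -- but only for T1.  In a 7-bit encoding
% such as OT1 no such slot exists at all, and in another 8-bit
% encoding the same slot may hold a completely different glyph.
```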
Take, for example, T1 and LY1, both of which contain all the glyphs needed
for a number of languages. A pattern set for French, or German, or Polish,
or ... should therefore be usable with any fonts in either encoding. But
unfortunately it is not, because the patterns refer to things like ^^b9,
meaning \'z (this is an example from the plhyph.tex file).

If we replaced such patterns by patterns looking like .\'c\'z8, we would
be able to reuse them for several font encodings, provided the internal
LaTeX representations \'c etc. do the right thing within the \patterns
command.

Analysing the behaviour of the font-encoding-specific commands, we can see
the following:

 \DeclareTextComposite and \DeclareTextSymbol are fine, as they expand in
 the mouth (using old TeX terminology) to a font position, which is what
 we want.

 But a top-level \DeclareTextCommand (such as \L in OT1) is likely a
 problem, and so, most likely, is anything done via
 \DeclareTextCompositeCommand.

 Finally, there is \DeclareTextAccent, which is also not suitable by
 default in a \patterns declaration, since it results in a call to the
 \accent primitive.

Before discussing what could be done here, let me first explain what is
currently being done with some of the hyphenation files.

A convention found in several files is to surround potentially problematic
patterns with a command \n which, depending on the encoding used, is
defined either as \def\n#1{} (i.e. ignore) or as \def\n#1{#1} (i.e. use).

In other words, you have a hyphenation file, say for German, which can be
used with the T1 encoding but also with the OT1 encoding, by simply
removing all patterns which contain references to umlauts or sharp s. I'm
not sure whether the resulting pattern file is the optimum possible for an
encoding like OT1 (actually I doubt it), but it is certainly accurate
enough to be usable. Perhaps Bernd Raichle can comment on this.
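The \n convention just described can be sketched as follows (a minimal illustration, not taken from any actual pattern file; the hex slot assumes T1):

```latex
% Chosen once, before the pattern file is \input:
\def\n#1{}        % OT1: drop every pattern passed through \n
%\def\n#1{#1}     % T1:  keep them all
\patterns{%
  .ab4aus         % plain ASCII pattern, usable in any encoding
  \n{.r^^fc8stet} % contains u-umlaut, valid only under T1
}
```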
The problem with this approach, in my eyes, is that you have to know
beforehand which of the patterns are impossible to use in a certain
encoding. This is trivial if you design the file for a fixed (preferably
small) number of encodings, but in practice impossible if you do not know
which encodings it will be applied to.

So what would the alternatives be?

========================

Here is my idea, which most likely needs some further refinement.

Suppose we have the pattern \patterns{.r\"u8stet} (taken from the German
hyphenation file).

Suppose further that for each encoding we have defined a code point which
is not a letter, say, the position of `!' (the latter might be a bad
choice, I don't know). For encodings which do not encode 256 characters we
should be able to choose a code point outside the encoding itself. Let's
call this character X.

During pattern reading we then map the \add@accent command (which is what
is finally used in case of an internal representation for an accent that
is not also a composite) to

  \def\add@accent#1#2{X}

so what we get is the pattern \patterns{.rX8stet}, which is an impossible
combination (especially if X lies outside the encoding range).

\DeclareTextCommand and \DeclareTextCompositeCommand would need to be
handled in a similar fashion (which would require some small changes to
the internals of NFSS, since at the moment \DeclareTextSymbol internally
calls \DeclareTextCommand etc., which would then no longer be
appropriate).

The downside of this approach is that for encodings which would invalidate
a large number of patterns this way, we unnecessarily store spurious
patterns.
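The remapping step could be sketched roughly like this (an outline only; the file name is hypothetical, and the real solution would have to hook into the NFSS internals rather than redefine \add@accent wholesale):

```latex
\begingroup\makeatletter
  % While the patterns are read, any accent that the current encoding
  % cannot resolve to a single font slot (and that therefore ends up
  % in \add@accent) collapses to the impossible character X; the
  % whole pattern containing it then simply never matches:
  \def\add@accent#1#2{X}
  \fontencoding{OT1}\selectfont
  \input german-hyphenation-patterns % hypothetical file name
  % under OT1, \patterns{.r\"u8stet} is thus read as \patterns{.rX8stet}
\endgroup
```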
On the positive side, I can go and say

  \fontencoding{LY1}\selectfont
  \input german-hyphenation-patterns

and automatically get a set of patterns suitable for LY1.

The reason I bring this up is that if one extends the notion of "current
encoding" to something like "current encodings suitable for a language",
then one needs to have hyphenation patterns for all language/font
combinations (at least that would be desirable).

The technical support for this approach doesn't seem very difficult to
provide, but I don't know whether there would be enough people willing to
actually look at the hyphenation files out there and bring them into a
suitable source form. The latter would be necessary, in my opinion, to
some extent anyway; there are a number of such files which would result in
a very strangely behaving LaTeX format if one actually added them.

frank
