Babel

The multilingual framework to localize LaTeX, LuaLaTeX, XeLaTeX

Non-standard hyphenation with luatex

Hyphenation in LaTeX is accomplished by means of so-called discretionaries. You can find a brief description here.

This article describes an extension which can serve several purposes, particularly dealing with non-standard hyphenation rules, including changes in letters and weighted hyphenation points. (Note that luatex currently provides built-in ways to deal with some frequent cases, too. Please refer to its manual for further information.)

Here is a simple example of a declaration, which tells LaTeX to change the group ‘ck’ to ‘kk’ if the hyphenation point falls inside this group (it’s not meant as a full rule for German, just a starting point).

\babelposthyphenation{german}{ck}{
  { no = c, pre = k- },
  {}
}

It consists of three arguments: the language, the pattern, and the replacement list. The language here refers to a set of hyphenation rules, i.e., to \language. The first letter in the pattern is replaced with the first item in the list, the second letter with the second item, and so on. (This is not strictly true, because the replacement list is padded with nils if it is shorter than the pattern.)
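
To try the declaration above, it can be placed in a small test document. The following is only a sketch: the sample words and the narrow measure (set with \rightskip) are assumptions chosen to provoke line breaks, and whether a particular word actually breaks inside ‘ck’ depends on the paragraph and on the loaded patterns.

% Compile with lualatex
\documentclass{article}

\usepackage[german]{babel}

\babelposthyphenation{german}{ck}{
  { no = c, pre = k- }, % c: kept if there is no break, 'k-' before a break
  {}                    % k: left untouched
}

\begin{document}

\rightskip5cm % narrow measure, to force many line breaks

Zucker Decke backen drucken Zucker Decke backen drucken
Zucker Decke backen drucken Zucker Decke backen drucken

\end{document}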

Rules

‘Regular’ hyphenation points, as inserted automatically by the hyphenation algorithm, are entered in the pattern as vertical bars (|). Explicit hyphens are entered as =. Spaces are allowed for clarity, and they are discarded.

The items in the replacement list are of five kinds:

  1. An empty group {} leaves the corresponding item untouched.
  2. A list like { no = c, pre = k-, post = } replaces the letter by the corresponding discretionary. Only one of the keys is necessary, and the rest default to empty. By default the penalty is \hyphenpenalty or \exhyphenpenalty (TeXbook, p. 96), but a different value can be set with the key penalty. A further field is data: automatic hyphens contain no information about the font and the like, and with this key you can set which element in the list (as captured) that information is taken from.
  3. The key string replaces the character with the string. If empty, the char node is removed; to insert chars, just use a multi-character string. The nodes created are literal copies of the original, but with the new characters.
  4. With remove the node is, well, removed (i.e., it is like an empty string=).
  5. Spaces are declared with something like space =.2 .1 0. The values are in em units, and they are the natural width, the plus, and the minus. Here, you may need data, too. With spacefactor the unit is the font size of the current font (if the node is a glyph; you may need a data= pointing to a specific glyph).
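
Some of the item kinds above can be sketched as follows. These rules are hypothetical illustrations, not linguistically meaningful rules: the penalty value 1000, the letters ‘a’ and ‘b’, and the replacement of ‘ß’ are arbitrary choices, and each declaration is meant to be read on its own.

% A discretionary with its own penalty instead of the default one
\babelposthyphenation{german}{ck}{
  { no = c, pre = k-, penalty = 1000 },
  {}
}

% Remove the automatic hyphenation point between 'a' and 'b'
\babelposthyphenation{german}{ a | b }{
  {},     % a: left untouched
  remove, % the discretionary is removed
  {}      % b: left untouched
}

% Replace one character with a multi-character string
\babelposthyphenation{german}{ß}{
  string = ss
}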

A few keys can be used in conjunction with insert, which must be the very first one in the replacement.

The pattern is matched with Lua empty captures, which are automatically added before and after the string. You may set different empty captures to reduce the number of items in the replacement list:

\babelposthyphenation{ngerman}{very()long()pattern}{
  string = L,
  string = OOO,
  string = N,
  string = G
}

Dots, character classes (with %) and char-sets (with [], including complementing and ranges) are allowed, too. When using the dot, be aware that it matches | and =, too. A matched | or = can be replaced with the hex value (at least 4 digits): {007C} and {003D}. +, -, ? and * are allowed outside the ()() block, but not inside. So, {a}|?()Á() is a letter followed optionally by a discretionary, but only Á is actually transformed (in these cases, you may want to go back with the key step).
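
As a sketch of how a class and a quantifier outside the ()() block can narrow a rule, the following hypothetical variant of the ‘ck’ declaration is meant to apply only after a letter, optionally followed by a discretionary. It is an untested assumption built from the syntax described above, not a rule taken from babel.

\babelposthyphenation{german}{ {a} |? () ck () }{
  { no = c, pre = k- }, % first item inside ()(): the c
  {}                    % second item inside ()(): the k
}

Because {a} and |? lie outside the empty captures, they take part in the match but need no entry in the replacement list.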

Ordinary captures are allowed inside the empty captures (they must resolve to exactly one character). In the pattern, the syntax {n} is a backreference matching the n-th capture inside the empty captures. This syntax can be used in the replacement strings, with the corresponding capture:

\babelposthyphenation{ngerman}{([fmtrp]) | {1}}{
  { no = {1}, pre = {1}{1}- },
  remove,
  {}
}
\babelposthyphenation{ngerman}{ ([cC]) ([kK]) }{
  { no = {1}, pre = {2}- },
  {}
}

Since the percent sign has quite different meanings in Lua and TeX, as a convenience the {} syntax can be used to enter character classes in the pattern, too (i.e., {d} becomes %d, but note that {1} is not internally the same as %1).

And here is a complete example:

\documentclass{article}

\usepackage[ngerman]{babel}

\babelposthyphenation{ngerman}{([fmtrp]) | {1}}{
  { no = {1}, pre = {1}{1}- },
  remove,
  {}
}

\begin{document}

\rightskip5cm

Auffrisierende Auffrisierendem Auffrisierenden Auffrisierender
Auffrisierendes Auffrisierst Auffrisiert Auffrisierte Auffrisiertem
Auffrisierten Auffrisierter Auffrisiertes Auffrisiertest Auffrisiertet
Auffrisst Auffuhr Aufführbar Aufführbare Aufführbarem Aufführbaren
Aufführbarer Aufführbares Aufführe Auffuhren Aufführen Aufführend
Aufführende Aufführendem Aufführenden Aufführender Aufführendes

\end{document}

In the replacement list, an extended syntax allows mapping the captured characters. For example, {2|ΐΰῒῢ|ίύὶὺ} means: if the second captured char is ΐ, replace it with ί; if it is ΰ, replace it with ύ; and so on. This feature is particularly useful when a letter changes if there is a hyphen, and also when transliterating. Here is a partial example of the latter (the full example is here, with digraphs and trigraphs):

\babelprehyphenation{transrussian}
  {([ABVGDEËZIJKLMNOPRSTUFHÈY"abvgdeëzijklmnoprstufhèy'])}{
  string = {1|ABVGDEËZIJKLMNOPRSTUFHÈY"abvgdeëzijklmnoprstufhèy'%
             |АБВГДЕЁЗИЙКЛМНОПРСТУФХЭЫЬабвгдеёзийклмнопрстуфхэыь}
}
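
A digraph can be handled by combining the mapping syntax with remove. The rule below is a hypothetical sketch: it assumes that ‘sh’ and ‘zh’ should become ш and ж, which is not necessarily the scheme used in the full example, and it reuses the transrussian name from the declaration above. Such a rule would most likely have to be declared before the single-letter rule, so that the digraph is matched first.

\babelprehyphenation{transrussian}{([sz])h}{
  string = {1|sz|шж}, % map the captured s or z
  remove              % drop the h
}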

Short examples