From: "Marcel Oliver"
Sender: "Mailing list for the LaTeX3 project"
To: "Multiple recipients of list LATEX-L"
Reply-To: "Mailing list for the LaTeX3 project"
Subject: Re: \InputTranslation
Date: Sun, 27 May 2001 12:12:21 +0100
Message-ID: <15120.57621.256542.391864@gargle.gargle.HOWL>
In-Reply-To: <200105270954.f4R9sBI23611@smtp.wanadoo.es>

Javier Bezos writes:
 > If I say
 >
 >   \begin{mandarin}
 >     \newcommand{\foo}{<Unicode char corresponding to Chinese ai>}
 >   \end{mandarin}
 >
 > how does TeX know that \foo\ was defined in a Mandarin context (including
 > perhaps input encoding information)? And what is expected by the user:
 > that the Chinese char should be considered "conceptual" (thus rendered
 > differently in Japanese and Mandarin), or that the Chinese char must be
 > rendered with the simplified ideogram (i.e., Mandarin vs. Japanese)?
 > What makes that different from, say,
 >   \newcommand{\foo}{\unichar{<Unicode code>}}
 > (without specifying the language)?

Oh, looks like I fell into the eurocentric mind-trap that
"character = glyph"...

So it looks like there are a couple of strategies:

1. Store the full language context with every character token sequence
   along the lines that Javier suggests.  In other words, treat the
   language context as part of the input encoding.  It would seem that
   if Frank's requirement for an ICR ("a single item must have a
   unique and well-defined meaning") is to be met, it would
   essentially mean that every character needs to be tagged for
   language context.
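   A hypothetical sketch of what such per-character tagging might look
   like in source form (\langchar is an invented name, not an existing
   macro, and the argument convention is made up for illustration):

```latex
% Hypothetical: each character token carries its language context,
% so the ICR can distinguish otherwise identical code points.
\newcommand{\foo}{\langchar{mandarin}{<Unicode code>}}
\newcommand{\bar}{\langchar{japanese}{<Unicode code>}}
```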

2. Treat input encoding completely separately from language context.
   Input encoding just determines how to get from an arbitrary
   encoding to the Unicode(-like) ICR.  Thus, switches in the language
   context have to be tagged explicitly by the user.  So the example
   would become

     \begin{utf8-encoding}
       \newcommand{\foo}{<Unicode char corresponding to Chinese ai>}
     \end{utf8-encoding}

   Now I have to say something like \mandarin{\foo} or
   \japanese{\foo}.  Of course, putting the language switch into the
   definition of \foo would be legal, too.

  The main restriction of this approach is that we cannot (easily) do
  something like

     \begin{mandarin}
       \section{<...>}
     \end{mandarin}
     \begin{japanese}
       \section{<...>}
     \end{japanese}

  and expect that the language context is properly preserved in the
  TOC.

  a) Is it reasonable and necessary at all for this example to work,
     i.e. that a TOC or index should mix languages "automatically"?

  b) If the "japanese" in the second example were "english", one
     could simply "stack" language context globally.  I.e., below the
     primary language we can have an arbitrary number of working
     languages which only determine features that languages higher in
     the hierarchy have not explicitly defined (such as rendering of
     glyphs in certain Unicode regions).  So only in cases where there
     are conflicting choices (japanese vs. mandarin, for example) do we
     need local mark-up:

     \section{\japanese{<...>}}
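   The "stacking" in (b) amounts to an ordered fallback lookup; here is
   a minimal sketch in Python (all names and the feature table are
   invented purely for illustration):

```python
# Minimal sketch of the language-context "stack" from (b): each
# language defines only some features; lookup falls back to languages
# lower in the stack (the primary language sits at the bottom).
def resolve(feature, stack):
    """Return the innermost definition of `feature`, searching from
    the top of the stack downward."""
    for lang in reversed(stack):
        if feature in lang["features"]:
            return lang["features"][feature]
    raise KeyError(feature)

# Invented example data: english is the primary language; japanese is
# a working language that overrides only CJK glyph rendering.
english = {"name": "english", "features": {"hyphenation": "english"}}
japanese = {"name": "japanese", "features": {"cjk-glyphs": "japanese-forms"}}

stack = [english, japanese]          # bottom ... top
resolve("cjk-glyphs", stack)         # defined by japanese
resolve("hyphenation", stack)        # falls back to english
```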

3. Extreme version of 2 (the only strategy that seems to be cleanly
   implementable on current Omega):

   We simply define the \InputTranslation to be fixed on a per-file
   basis.  In other words, we acknowledge that it does not make any
   sense in terms of usability to mix input encodings, as such files
   simply cannot (and should not) be displayed cleanly in any editor.
   So preparing multiencoded text must proceed along one of the
   following options:

   a) Split text into several files.  (Useful for blocks of original
      source which are not subject to frequent modification.)

   b) Use UTF-8 and rely on the editor for encoding translation during
      import.  (For example, the Emacs command insert-file-contents
      can do coding translation; we should also expect that
      drag-and-drop protocols of various windowing systems will
      eventually be able to do this properly.)

   c) For legacy source, the functionality of current inputenc could
      be provided independent of the particular ICR.
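   The translation step common to (b) and (c), from a legacy byte
   encoding into a Unicode(-like) code-point sequence, is in essence
   just the following (a sketch, using Latin-1 as the legacy encoding;
   the function name is invented):

```python
# Sketch of the translation in (b)/(c): map legacy source bytes
# (here Latin-1) to a sequence of Unicode code points, independent
# of any particular ICR.
def to_icr(raw: bytes, encoding: str = "latin-1") -> list:
    """Decode legacy bytes and return their Unicode code points."""
    return [ord(ch) for ch in raw.decode(encoding)]

to_icr(b"na\xefve")  # Latin-1 byte 0xEF (i-diaeresis) becomes U+00EF
```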

--Marcel
