Using UTF-8 locales in LFS

qing · posted on 2003-11-14 13:38:46

[PHP] AUTHOR: Alexander E. Patrakov <semzx@newmail.ru>
DATE: 2003-11-06
LICENSE: Public Domain
SYNOPSIS: Using UTF-8 locales in LFS
DESCRIPTION:
This hint explains what should be changed in the LFS instructions current at the
time of this writing in order to use locales such as ru_RU.UTF-8. Future
versions (if any) will also deal with BLFS.

PREREQUISITES: LFS 4.1 or later (because of bash-2.05b)

CHANGELOG:
2003-11-06: Initial submission

HINT:

1. Single-byte and double-byte encodings and UTF-8: what's wrong

Most European languages have a relatively short alphabet (fewer than 40
characters). This makes it possible to represent the characters of that
alphabet (both upper-case and lower-case), the English alphabet, digits and
punctuation with a single byte each. The result is known as a single-byte
encoding. An example of such an encoding is KOI8-R, commonly used in Russia.
All single-byte encodings are ASCII-compatible in the sense that characters
representable in ASCII are also representable in these encodings and have the
same code. They are also reverse-ASCII-compatible in the sense that every byte
with a value less than 0x80 represents the same character as it does in
ASCII. Current LFS and BLFS work well with such encodings.

This approach doesn't work with Asian languages such as Chinese, Japanese and
Korean (referred to collectively as CJK in the rest of this hint). They have
more than 256 different characters, because single characters represent
syllables and even words. So-called double-byte encodings are used with these
languages. They represent English letters, digits and punctuation with single
bytes equal to the ASCII representation of those characters, and native CJK
characters with two-byte sequences. An example is GB2312, used in China. Since
CJK characters are twice as wide as English ones in a monospaced font, the
"on-screen" width of a string encoded with such methods is directly
proportional to the number of bytes in it (there is one exception: any
two-byte sequence starting with the 0x8e byte in EUC-JP takes as much space as
an English letter). LFS and BLFS don't work well with Asian languages and
double-byte encodings for two reasons:

1) It is impossible to display double-width characters on a Linux console (even
on a framebuffer console) without additional programs that are not in the book.
Installation of e.g. zhcon corrects this.

2) Some assumptions that work with single-byte encodings fail with double-byte
ones. First, some double-byte encodings are not reverse-ASCII-compatible: a
byte with a value less than 0x80 can be either an ASCII-representable character
or the second byte of a two-byte sequence. Second, correctly finding the n-th
character in a string is a complex task because some characters occupy one
byte, and some characters are represented by two-byte sequences. Software that
makes bad assumptions needs to be either patched or not installed at all.
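
To illustrate the second point, here is a minimal sketch in POSIX shell. It
assumes an EUC-style double-byte encoding such as GB2312, simplified so that
any byte of 0x80 or above starts a two-byte sequence. The helper name
count_dbcs_chars is hypothetical, and the string is passed as a list of byte
values rather than read from a file.

```shell
#!/bin/sh
# count_dbcs_chars BYTE...: count the characters in a string given as a list
# of byte values (e.g. 0x41 0xd6 0xd0). Simplifying assumption: any byte of
# 0x80 or above is the first byte of a two-byte sequence.
count_dbcs_chars() {
    count=0
    while [ "$#" -gt 0 ]; do
        if [ "$(( $1 ))" -ge 128 ] && [ "$#" -ge 2 ]; then
            shift 2      # lead byte plus trail byte form one character
        else
            shift        # one ASCII byte (or a truncated lead byte at the end)
        fi
        count=$(( count + 1 ))
    done
    echo "$count"
}

# "A", the two-byte GB2312 sequence 0xd6 0xd0, then "B": 4 bytes, 3 characters.
count_dbcs_chars 0x41 0xd6 0xd0 0x42
```

The point of the sketch is that the n-th character cannot be found by indexing
byte n: the string has to be scanned from its beginning.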

Today there is a need to encode multilingual texts. E.g., foreign clients of
companies don't want their names to be distorted beyond recognition by
a chain of multiple transliterations. Since all single-byte and double-byte
encodings are capable of representing characters of at most two alphabets
(English + national), there is a need for a new character set to encode
multilingual texts. Such a character set exists, and it is named Unicode.

UTF-8 is a method of representing Unicode text with a stream of
8-bit bytes. The resulting stream is both ASCII-compatible and
reverse-ASCII-compatible. A single character can occupy from 1 to 4 bytes. Many
current distributions of Linux configure locales using the UTF-8 character
encoding by default. This doesn't work with (B)LFS for the same reasons as with
double-byte encodings. However, the details differ:

1) There is no framebuffer-based terminal that is capable of displaying the
full range of Unicode characters. Fortunately, it is not needed in most cases.
The Linux console is capable of displaying Latin (including accented), Greek,
Arabic and Cyrillic characters together even without a framebuffer. Also, xterm
compiled separately from the XFree86 distribution works.

2) There is one more assumption that breaks with UTF-8. The relation of
on-screen width of a string to the number of bytes in it is very complex.
That's why e.g. Midnight Commander works with double-byte encodings, but
doesn't work with UTF-8.
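
The "from 1 to 4 bytes" property can be made concrete with a small sketch in
POSIX shell. The first byte of a UTF-8 sequence determines how long the
sequence is. The function name utf8_seq_len is hypothetical, and the check is
deliberately simplified (see the comments).

```shell
#!/bin/sh
# utf8_seq_len BYTE: print how many bytes a UTF-8 sequence occupies, judging
# only by its first byte (given as a number, e.g. 0xd0). Prints 0 for a byte
# that cannot start a sequence. Simplification: 0xc0, 0xc1 and 0xf5-0xf7 are
# treated as lead bytes here although they never occur in valid UTF-8.
utf8_seq_len() {
    b=$(( $1 ))
    if   [ "$b" -lt 128 ]; then echo 1    # 0xxxxxxx: same as ASCII
    elif [ "$b" -lt 192 ]; then echo 0    # 10xxxxxx: continuation byte
    elif [ "$b" -lt 224 ]; then echo 2    # 110xxxxx
    elif [ "$b" -lt 240 ]; then echo 3    # 1110xxxx
    elif [ "$b" -lt 248 ]; then echo 4    # 11110xxx
    else echo 0                           # 0xf8 and above: never used
    fi
}

utf8_seq_len 0x41    # an ASCII letter
utf8_seq_len 0xd0    # first byte of many Cyrillic letters
```

Because bytes below 0x80 never appear inside a multi-byte sequence, the stream
stays reverse-ASCII-compatible, unlike some double-byte encodings.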

2. Suggested changes to the LFS book

The following packages should be configured differently in Chapter 6:
- ncurses (optionally)
- vim
- man

Modified Ncurses installation instructions (optional):

Ncurses has experimental support for wide characters. According to the
output of ./configure --help, it is activated by passing the --enable-widec
argument to ./configure. I don't know what this support means in practice: Vim
works fine even with the non-wide-character version. The resulting libraries
are binary-incompatible with "normal" ncurses and therefore the letter "w" is
appended automatically to their names: libncursesw.so.5.3. For compatibility,
we will install two versions of ncurses.
First install the normal version:

patch -Np1 -i ../ncurses-5.3-etip-2.patch
patch -Np1 -i ../ncurses-5.3-vsscanf.patch
./configure --prefix=/usr --with-shared --without-debug
make
make install

This installs /usr/lib/libncurses.so.5.3. We will move it to /lib later.
Then (optionally) install a wide-character-enabled version on top of it:

make distclean
./configure --prefix=/usr --with-shared --without-debug --enable-widec
make
make install

This installs /usr/lib/libncursesw.so.5.3.

Move important libraries to /lib and correct permissions:

chmod 755 /usr/lib/*.5.3
chmod 644 /usr/lib/libncurses++*.a
mv /usr/lib/libncurses.so.5* /lib
mv /usr/lib/libncursesw.so.5* /lib
ln -sf ../../lib/libncurses.so.5 /usr/lib/libncurses.so
ln -sf libncurses.so /usr/lib/libcurses.so
ln -sf ../../lib/libncursesw.so.5 /usr/lib/libncursesw.so
ln -sf libncursesw.so /usr/lib/libcursesw.so

The installation of the wide-character-enabled version of ncurses is considered
optional because, as far as I know, nothing in LFS or BLFS currently links
against it, although I have not checked carefully. Debian provides the
following packages linked against libncursesw:

- centericq-utf8: A text-mode multi-protocol instant messenger client. The
utf-8 version may be buggy
- latrine: LaTrine is a curses-based LAnguage TRaINEr
- mutt-utf8: Text-based mailreader. The utf-8 version may be buggy
- dialog: Displays user-friendly dialog boxes from shell scripts
- screen: A terminal multiplexor with VT100/ANSI terminal emulation
- tin: A full-screen easy to use Usenet newsreader

Dialog has a ./configure option to link with -lncursesw. This results in a
dialog executable that has bugs when used with the ru_RU.koi8r locale on the
Linux console. I have not tested the other packages, since I don't use them.

Modified Vim instructions:

For Vim to work correctly in double-byte encodings and in UTF-8, the
--enable-multibyte switch has to be added to the ./configure command line. Note
that it is not necessary in BLFS since --with-features= (more than normal)
implies this.

echo '#define SYS_VIMRC_FILE "/etc/vimrc"' >> src/feature.h
echo '#define SYS_GVIMRC_FILE "/etc/gvimrc"' >> src/feature.h
./configure --prefix=/usr --enable-multibyte
make
make install
ln -s vim /usr/bin/vi

Vim is able to edit files in arbitrary encodings if you use a UTF-8-based locale.
E.g. to read the file price.txt that is known to be in CP1251 encoding, type:

:e ++enc=cp1251 price.txt

It will be automatically converted. To save the file in KOI8-R encoding under
the name price.koi, type:

:w ++enc=koi8-r price.koi

Vim is even able to automatically detect the character set of the file
being read under some conditions. This works because real texts in most
single-byte and double-byte encodings contain sequences of bytes that are not
valid in UTF-8.

This capability needs to be configured. To do so, create the file /etc/vimrc
with the following contents (replace koi8-r with the name of a single-byte or
double-byte encoding that is most often used in your country):

" Begin /etc/vimrc

set nocompatible
set bs=2
set fileencodings=ucs-bom,utf-8,koi8-r

" End /etc/vimrc

For more information, read /usr/share/vim/vim62/doc/mbyte.txt
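
The fallback order in fileencodings relies on non-UTF-8 text usually failing
UTF-8 validation. The following shell sketch mimics only that one step and is
not how Vim itself is implemented: it asks iconv to decode a file as UTF-8 and
falls back to a named encoding when decoding fails. The function names
is_valid_utf8 and guess_encoding are hypothetical, and the ucs-bom check is
left out.

```shell
#!/bin/sh
# is_valid_utf8 FILE: succeed if FILE decodes cleanly as UTF-8.
# iconv exits with a non-zero status when it meets an invalid sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# guess_encoding FILE FALLBACK: print "utf-8" if FILE is valid UTF-8,
# otherwise print the fallback encoding name (e.g. koi8-r).
guess_encoding() {
    if is_valid_utf8 "$1"; then
        echo utf-8
    else
        echo "$2"
    fi
}

# Usage (price.txt is the example file name used earlier in this hint):
# guess_encoding price.txt koi8-r
```

A file containing only ASCII passes the UTF-8 check too, which is harmless
because ASCII text reads the same in either encoding.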

Modified Man instructions:

Since Man internationalization does not work at all in UTF-8 locales (the
messages are still output in single-byte or double-byte encodings, appearing
as lines of unreadable squares on the screen) and because Russian messages are
improperly translated (and offensive!) we will disable NLS. This will not
prevent you from viewing manual pages in your native language. It just means
that messages like "What manual page do you want?" will remain untranslated.

patch -Np1 -i ../man-1.5m2-manpath.patch
patch -Np1 -i ../man-1.5m2-80cols.patch

We skipped the "-pager" patch because we will do the same manually and
differently. Creating a modified patch is not an option because it will be
different for each country.

DEFS="-DNONLS" ./configure -default -confdir=/etc +lang all
make
make install

After installation of man, search for the line in /etc/man.conf that starts
with "PAGER". Replace it with something like the following:

PAGER          /usr/bin/iconv -c -f koi8-r | /usr/bin/less -isR

(replace koi8-r with your 8-bit or double-byte encoding). Note that this change
does not hurt you if you later switch back to the usual encoding: iconv will
be a no-op.

3. Actually setting up UTF-8 based locale

Some UTF-8 locales (e.g. se_NO.UTF-8) are installed during the

make localedata/install-locales

step while installing glibc. But most UTF-8 locales must be created
manually, e.g.:

localedef -c -i ru_RU -f UTF-8 ru_RU.UTF-8

The role of the -c switch is to continue the creation of the locale even though
warnings are issued. Once the locale has been created, applications must be
told to use it. All that is needed is to set some environment
variables. Add this to your /etc/profile:

export LC_ALL=ru_RU.UTF-8

Of course, you will have to replace ru_RU with something more appropriate. If
you are using X, you will also have to include that string in
/etc/X11/xinit/xinitrc.
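
Before relying on the new setting, it may be worth checking that the locale
really exists. The sketch below is one way to do that in POSIX shell. It
assumes that locale -a prints the codeset in a normalized form (typically
ru_RU.utf8 rather than ru_RU.UTF-8 with glibc), so both spellings are
normalized before comparing. The names normalize_locale and locale_listed are
hypothetical.

```shell
#!/bin/sh
# normalize_locale NAME: lower-case the part after the first dot and drop
# dashes from it, so that ru_RU.UTF-8 and ru_RU.utf8 compare equal.
normalize_locale() {
    case $1 in
        *.*) printf '%s.%s\n' "${1%%.*}" \
                 "$(printf '%s' "${1#*.}" | tr '[:upper:]' '[:lower:]' | tr -d '-')" ;;
        *)   printf '%s\n' "$1" ;;
    esac
}

# locale_listed NAME: read a list of locale names on standard input (one per
# line) and succeed if NAME is among them after normalization.
locale_listed() {
    want=$(normalize_locale "$1")
    while IFS= read -r line; do
        [ "$(normalize_locale "$line")" = "$want" ] && return 0
    done
    return 1
}

# Usage:
# locale -a | locale_listed ru_RU.UTF-8 && echo "locale is installed"
```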

Then, we will modify the /etc/rc.d/init.d/loadkeys script.

#!/bin/bash
# Begin $rc_base/init.d/loadkeys - Loadkeys Script

# Based on loadkeys script from LFS-3.1 and earlier.
# Rewritten by Gerard Beekmans  - gerard@linuxfromscratch.org
# Modified for UTF-8 locales by Alexander E. Patrakov - semzx@newmail.ru

source /etc/sysconfig/rc
source $rc_functions

echo "Loading keymap..."
kbd_mode -u &&
loadkeys ru 2>/dev/null &&
dumpkeys -ckoi8-r | loadkeys --unicode 2>/dev/null
evaluate_retval

echo "Setting screen font..."
setfont LatArCyrHeb-16
evaluate_retval
# End $rc_base/init.d/loadkeys

Some comments about this script.
1) The kbd package does not provide ready-to-use keymaps for UTF-8 locales,
except for the Ukrainian one. First, we load the now-wrong ru keymap (it contains
numbers valid only in koi8-r character set), then we dump it replacing numbers
with human-readable descriptions of characters (e.g.
"cyrillic_small_letter_e"). The resulting keymap is usable in UTF-8 mode, so
we load it with loadkeys --unicode.
2) We don't switch the console output to UTF-8 here. We will do that in
/etc/issue (the idea is stolen from "redhat-style-logon" hint). This is
necessary because otherwise this switching will affect only the first console.
As an alternative, you can write a "for" loop here sending <ESC>%G to all
virtual consoles.
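
The "for" loop alternative mentioned in comment 2 could look roughly like the
sketch below. The function names are hypothetical, and the list of six
consoles in the usage comment is an assumption; adjust it to the number of
virtual consoles actually configured.

```shell
#!/bin/sh
# utf8_mode_seq: print <ESC>%G, the sequence that puts a console into
# UTF-8 mode. In a printf format string, %% stands for a literal percent sign.
utf8_mode_seq() {
    printf '\033%%G'
}

# switch_consoles_to_utf8 DEVICE...: send the sequence to each listed device.
switch_consoles_to_utf8() {
    for dev in "$@"; do
        utf8_mode_seq > "$dev"
    done
}

# Possible call from the loadkeys script (device list is an assumption):
# switch_consoles_to_utf8 /dev/tty1 /dev/tty2 /dev/tty3 \
#                         /dev/tty4 /dev/tty5 /dev/tty6
```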

Let's create /etc/issue:

echo -e '\033[2J\033[f\033%GWelcome to Linux From Scratch\n' >/etc/issue

The meaning of the escape sequences:
<ESC>[2J = clear entire screen
<ESC>[f = move the cursor to the corner of the screen
<ESC>%G = put the console into UTF-8 mode

Set up screen font and keyboard now, if you don't want to reboot:

/etc/rc.d/init.d/loadkeys

Then kill all agetty processes so that the respawned ones reread /etc/issue:

killall agetty

4. Conclusion

From your next login, you will use a UTF-8 based locale, with all its benefits
and drawbacks.

5. Known bugs

The relevant package is given in brackets.

- The BLFS modifications are not in the hint yet
- The Caps Lock key does not work on the Linux console for national characters
  [kbd]
- Applications cannot display line drawing characters on the Linux console,
  although the font contains them; xterm works [ncurses]
[/PHP]

http://www.linuxfromscratch.org/hints/downloads/files/utf-8.txt

晨想 · posted on 2003-11-14 19:04:24

The content of your post appears twice. :)

Thanks. A very good repost.

qing · posted on 2003-11-14 19:54:06

[P H P][/P H P]--bug

晨想 · posted on 2003-11-15 09:08:01

[PHP]

Heh heh

[/PHP]

晨想 · posted on 2003-11-15 09:08:44

There's no problem here. Didn't you just paste the same content twice in your post? :)
