NKF.python2, the Python 2 binding for nkf (Network Kanji Filter), converts Japanese text between encodings without requiring you to specify the input encoding. This article explains how to install and use it.
First, install NKF.python2 on your Linux system:
# download nkf using git
$ git clone git://git.sourceforge.jp/gitroot/nkf/nkf.git
Cloning into 'nkf'...
remote: Counting objects: 1378, done.
remote: Compressing objects: 100% (432/432), done.
remote: Total 1378 (delta 945), reused 1378 (delta 945)
Receiving objects: 100% (1378/1378), 503.08 KiB, done.
Resolving deltas: 100% (945/945), done.
$ cd nkf
$ sudo make install  # optional: also installs the nkf command-line tool
$ cd NKF.python2
$ python setup.py build
$ python setup.py install
Second, check that NKF.python2 is installed correctly by importing the module; if no error is raised, the installation succeeded:
$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nkf
>>>
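A clean import only shows that the module loads. A slightly stronger check is to convert a known sample, as in the rough sketch below. The helper name nkf_smoke_test is made up for this illustration, and the sketch assumes that nkf.nkf(options, data) takes and returns byte strings, which is how the call is used later in this article.

```python
def nkf_smoke_test():
    """Return True if nkf converts a Shift_JIS sample to UTF-8,
    False if the result differs, and None if nkf cannot be imported."""
    try:
        import nkf
    except ImportError:
        return None
    sample = u'\u65e5\u672c\u8a9e'  # the word "Nihongo" (Japanese language)
    # The input encoding is deliberately not given; nkf has to guess it.
    converted = nkf.nkf('-w', sample.encode('shift_jis'))
    return converted == sample.encode('utf-8')


print(nkf_smoke_test())
```

A result of None means the module is not importable from the Python interpreter you are running, which usually points to setup.py having been run with a different interpreter.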
Finally, test NKF.python2 (path_to_file points to a Japanese-encoded text file):
import nkf

with open(path_to_file, 'r') as F:
    # -w: output UTF-8; -Lu: use Unix (LF) line endings
    data = [nkf.nkf('-w -Lu -d', row) for row in F]
The code above converts each row to UTF-8 *without* specifying the encoding of the input file. This is very useful when you do not know which encoding an input file uses, or when you need to read several input files that each use a different encoding.
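The multi-file case mentioned above could be sketched as follows. The names to_utf8_rows and nkf_convert are hypothetical helpers, not part of NKF.python2; the converter is passed in as an argument so that the file-reading loop can be tried out even on a machine where nkf is not installed.

```python
def to_utf8_rows(paths, convert):
    """Read every file in `paths` and return all rows after applying `convert`."""
    rows = []
    for path in paths:
        # Binary mode keeps the raw bytes intact, so the converter
        # sees each row exactly as it is stored on disk.
        with open(path, 'rb') as f:
            rows.extend(convert(row) for row in f)
    return rows


def nkf_convert(row):
    """Convert one row to UTF-8 with LF line endings, as in the article."""
    import nkf  # imported lazily so to_utf8_rows works without nkf
    return nkf.nkf('-w -Lu -d', row)


# Hypothetical usage: the two files may use different encodings.
# rows = to_utf8_rows(['sjis.txt', 'eucjp.txt'], nkf_convert)
```

Because the encoding is guessed separately for each row, very short rows give the guesser little to work with; if results look wrong, passing the whole file content to the converter in a single call may be worth trying.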