I have also made available a Java version of the segmenter that works with Big5, GB, and UTF-8 encoded text files.
Usage: java -jar segmenter.jar [-b|-g|-8] inputfile.txt
Options: -b for Big5, -g for GB2312, -8 for UTF-8.
The segmented text is saved to inputfile.txt.seg.
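
If you want to read the .seg output back into your own Java code, something like the following may be a reasonable starting point. It is only a sketch and is not part of segmenter.jar: the class name SegOutput and its methods are hypothetical, and it assumes that the output file is written in the same encoding as the input and that words are separated by ordinary whitespace. I have not verified either assumption against the segmenter itself, so check a sample output file first.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for reading the segmenter's output; not shipped with segmenter.jar.
final class SegOutput {

    // Map the segmenter's encoding flag to the matching Java charset.
    static Charset charsetFor(String flag) {
        switch (flag) {
            case "-b": return Charset.forName("Big5");
            case "-g": return Charset.forName("GB2312");
            case "-8": return StandardCharsets.UTF_8;
            default:
                throw new IllegalArgumentException("Unknown encoding flag: " + flag);
        }
    }

    // The segmenter saves its result to <inputfile>.seg.
    static Path outputPathFor(String inputFile) {
        return Paths.get(inputFile + ".seg");
    }

    // ASSUMPTION: words in the .seg file are separated by whitespace
    // (spaces, tabs, newlines) and the file uses the charset given here.
    // Note that \s does not match the ideographic space U+3000.
    static List<String> readTokens(Path segFile, Charset charset) throws IOException {
        String text = new String(Files.readAllBytes(segFile), charset);
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: java SegOutput.java [-b|-g|-8] inputfile.txt");
            return;
        }
        List<String> tokens = readTokens(outputPathFor(args[1]), charsetFor(args[0]));
        System.out.println(tokens.size() + " tokens read");
    }
}
```

Run it after the segmenter has finished, with the same flag and input file name you passed to segmenter.jar. The Big5 and GB2312 charsets are looked up by name only when their flag is used, so a Java runtime without the extended charsets can still read UTF-8 output.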