Welcome to the HFU AI Lab Weblog


Protected: Chinese Word Segmentation Program

June 18th, 2010

This post is password protected. Enter your password to read it.



Chinese Segmenter and Annotation Tool (Perl and Java)

June 17th, 2010

URL: http://www.mandarintools.com/segmenter.html

You can download the zip file, which contains four files. The first is the Perl script segment.pl, which takes one argument: the name of the source file to segment. It expects the file name to end with ".txt". It needs the library file segmenter.pl, which holds all the actual segmentation code. The program also expects to find the lexicon file wordlist.txt in the directory it is run from (though this is easily modified). It writes a new segmented file with ".txt" replaced by ".seg". Right now it only works on GB-encoded files, but a Big5 version (converting to GB, segmenting, and using the segmented file to segment the original Big5 file) would not be hard. Also included is a convenience file, segment.bat, for people working in Windows. It runs perl on segment.pl and expects a file name as an argument.
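The Big5 idea mentioned above (segment a GB conversion, then use that result to cut the original) can be sketched roughly as follows. This is only an illustration, not code from the download: the function name project_boundaries is hypothetical, and it assumes the Big5-to-GB conversion maps each character to exactly one character, so that word lengths carry over unchanged.

```python
def project_boundaries(original, segmented_words):
    """Cut `original` into pieces with the same lengths as `segmented_words`.

    Assumption: the converted text has exactly one character for every
    character of the original, so only the word lengths matter.
    """
    if sum(len(w) for w in segmented_words) != len(original):
        raise ValueError("texts differ in length; conversion was not one-to-one")
    pieces = []
    start = 0
    for word in segmented_words:
        # Take the same number of characters from the original text.
        pieces.append(original[start:start + len(word)])
        start += len(word)
    return pieces


# Words produced by segmenting the simplified (GB) conversion of the text.
simplified_words = ["中文", "断词", "程式"]
traditional = "中文斷詞程式"
print(" ".join(project_boundaries(traditional, simplified_words)))
```

If the conversion ever merges or splits characters, the length check fails and the sketch refuses to guess where the boundaries fall.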

The segmenter requires Perl to run. Perl is free and easy to download.

I have also made available a Java version of the segmenter that works with Big5, GB, and UTF-8 encoded text files.

Usage: java -jar segmenter.jar [-b|-g|-8] inputfile.txt
-b Big5, -g GB2312, -8 UTF-8
Segmented text will be saved to inputfile.txt.seg

Words can be added or deleted directly from the lexicon file. The segmenter has algorithms for grouping together the characters in a name, especially for Chinese and Western names, but Japanese and South-east Asian names may not work well yet.

The segmentation process is also a perfect time to identify interesting "entities" in the text. These could include dates, times, person names, locations, money amounts, organization names, and percentages. This collection of interesting nouns is often referred to as "named entities" and the process of identifying them as "named entity extraction". There is already code to identify person names and number amounts in the segmenter, and I will be adding more code to find the rest in the future.
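To make the idea concrete, here is a minimal pattern-based sketch that tags a few of the entity types listed above. It is not the segmenter's own code: the patterns, the labels, and the function name find_entities are assumptions chosen for illustration, and they only cover dates, money amounts, and percentages written with digits.

```python
import re

# Hypothetical patterns; a real extractor would need many more variants
# (Chinese numerals, other currencies, times, and so on).
PATTERNS = [
    ("DATE", re.compile(r"\d{4}年\d{1,2}月\d{1,2}日")),
    ("PERCENT", re.compile(r"\d+(?:\.\d+)?%")),
    ("MONEY", re.compile(r"\d+(?:\.\d+)?元")),
]


def find_entities(text):
    """Return (label, matched_text) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    # Sort by start offset so results follow the order of the text.
    return [(label, value) for _, label, value in sorted(found)]


for label, value in find_entities("2008年5月16日售價130元,成長5.5%"):
    print(label, value)
```

Person names, locations, and organization names do not follow fixed patterns like these and generally need a lexicon or a statistical model.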

The segmenter works with a version of the maximal matching algorithm. When looking for words, it attempts to match the longest word possible. This simple algorithm is surprisingly effective, given a large and diverse lexicon, but there also need to be ways of dealing with ambiguous word divisions, unknown proper names, and other words not in the lexicon. I currently have algorithms for finding names, and am researching ways to better handle ambiguous word boundaries and unknown words. One useful piece of additional knowledge would be a list of characters marked as bound or unbound; a segmentation that left a bound character by itself would not be allowed. A statistical way of choosing among ambiguous segmentations would also be useful.
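A minimal sketch of forward maximal matching is shown below. It illustrates the general technique rather than the segmenter's actual code: the function name segment, the tiny lexicon, and the fallback of emitting an unmatched character as a one-character word are assumptions made for the example.

```python
def segment(text, lexicon, max_len=None):
    """Greedy forward maximal matching over a set of known words."""
    if max_len is None:
        # Longest word in the lexicon bounds how far ahead we need to look.
        max_len = max((len(w) for w in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted, so unknown characters
            # still advance the scan instead of looping forever.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


lexicon = {"中文", "斷詞", "程式", "中文斷詞"}
print(" ".join(segment("中文斷詞程式", lexicon)))
```

Because the match is greedy, the scan picks "中文斷詞" over "中文" here. The same greediness is what produces wrong splits on ambiguous boundaries, which is why the bound-character list and statistical ranking mentioned above would help.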

More information on segmenting Chinese text can be found at ChineseComputing.com.

Contact Erik Peterson at this contact page with questions or comments. Please visit Online Chinese Tools for many more useful Chinese-related software tools.


Techno Trade CGI Archive (PERL cgi scripts)

June 17th, 2010

Techno Trade CGI Archive

URL: http://www.technotrade.com/cgi/

Following is a list of some Perl CGI scripts that we've created. Feel free to browse through them and see some examples of the scripts in action. Some are free; others require a license fee. For more information on installing these scripts on your server, please check the FAQ page.

  • This script manages multiple usernames and passwords for .htaccess/.htpasswd directory protection. It works with the Apache web server (tested on Unix systems) and can handle multiple password-protected directories.
  • URL Search Engine: Feel like starting your own little Yahoo? Well, this is basically what this script does. It searches a text database file for one or more keywords entered by the user and displays all the URLs that match the search query. It can also be changed to search any type of text database file.
  • Password Protector: This script is a very simple form of web site password protection. Instead of directing people to your main HTML file, you use this script as a filter that checks each user's password before granting access. If the password is accepted, the script automatically takes the browser to the "protected" web page.
  • Search Engine Redirector: Lets your visitors search the web from your site by selecting the search engine they want to use. It can also log what they are searching for on each engine.
  • Web Based Message Board: The Message Board script lets users who come to your site join a discussion group by reading and posting messages. Users get a friendly graphical interface, with icons for navigation, posting, and so on, that is very simple to use. This board doesn't make a mess on your server by creating tons of directories and files each time a posting is made. Instead it uses one text file for all the postings in a discussion group.
  • URL Jumper: Do you hate having to clog up or redesign your page each time you add a new URL link to another site or another one of your web pages? The URL Jumper fixes that problem by organizing your links in a selection box where the user just clicks on one and hits Go.

Looking for more scripts? Be sure to visit The CGI Resource Index.



UCI Machine Learning Repository

May 19th, 2010

UCI Machine Learning Repository

Welcome to the UC Irvine Machine Learning Repository!

We currently maintain 189 data sets as a service to the machine learning community. You may view all data sets through our searchable interface. Our old web site is still available for those who prefer the old format. For a general overview of the Repository, please visit our About page. For information about citing data sets in publications, please read our citation policy. If you wish to donate a data set, please consult our donation policy. For any other questions, feel free to contact the Repository librarians. We have also set up a mirror site for the Repository.


Data Mining Tools See5 and C5.0

May 19th, 2010

Data Mining Tools See5 and C5.0


Weka 3: Data Mining Software in Java

May 19th, 2010
Weka
  • http://www.cs.waikato.ac.nz/ml/weka/
  • http://sourceforge.net/projects/weka/
DataSet
  • http://www.cs.waikato.ac.nz/~ml/weka/index_datasets.html
  • http://www.public.asu.edu/~sji03/resources/index.html
  • http://datam.i2r.a-star.edu.sg/datasets/krbd/
Repository for Epitope Datasets (RED)
  • http://ailab.cs.iastate.edu/red/
Tutorial

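Six Laundry Detergents That Passed the Consumers' Foundation Tests

January 3rd, 2010

On May 16, 2008, the Consumers' Foundation (消基會) released a survey report on laundry detergents that contains a good deal of useful information. The foundation tested 20 laundry detergents in total, and the findings are summarized below:

1. Laundry detergents that tested with no problems (safe to buy):

  • 一匙靈 bright-color liquid laundry detergent (NT$130)
  • 藍寶 low-suds concentrated liquid laundry detergent (NT$109)
  • 白蘭 phosphate-free ultra-concentrated liquid laundry detergent (NT$119)
  • 妙管家 concentrated laundry lotion (NT$109)
  • 白鴿 anti-dust-mite natural concentrated liquid laundry detergent (NT$159)
  • 台麗 laundry powder (NT$34)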

cf: http://www.sharecool.org/archives/410


Meeting Materials

January 3rd, 2010

Meeting Materials

  • Title: 2010'0102 paper title
  • Uploads: electronic copy of the paper (doc/pdf), slides (ppt), …
  • Link: URL

Hello world!

January 3rd, 2010

Welcome to WordPress. This is your first post. Edit or delete it, then start blogging!
