Training data:
        TXT      HTML
RMRB    406M     7.56G
XWLB    4.82M    257M
XW30    5.77M    352M
Since 2008-06-12 the pages have been published in UTF-8, and my program cannot handle that format. I will rewrite u8-ansi, but first I have decided to train my language model.
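
For reference, here is a minimal sketch of what a UTF-8 to ANSI conversion step could look like in Python. It is not the original u8-ansi program: the function name `utf8_to_ansi` is hypothetical, and it assumes that "ANSI" here means the GBK code page, which may not match the real setup.

```python
def utf8_to_ansi(data: bytes, ansi_codec: str = "gbk") -> bytes:
    """Re-encode UTF-8 bytes into a legacy code page.

    ansi_codec="gbk" is an assumption; pass another codec name if the
    target code page is different.
    """
    # "utf-8-sig" decodes UTF-8 and also drops a leading BOM if one is present.
    text = data.decode("utf-8-sig")
    # Characters the target code page cannot represent are replaced with "?".
    return text.encode(ansi_codec, errors="replace")


if __name__ == "__main__":
    sample = "新闻联播".encode("utf-8")
    print(utf8_to_ansi(sample))
```

Replacing unencodable characters keeps the conversion from failing on a stray symbol, at the cost of losing that character; `errors="strict"` would be the choice if silent loss is not acceptable.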
-Katrina