1. Convert html/shtml/htm pages into txt. ----shtml2txt
Although I wrote this program, it is not well suited to RMRB. For RMRB, I have written a new one specifically.
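For illustration, here is a minimal sketch of HTML-to-text conversion using only Python's standard `html.parser` module. It is not the shtml2txt program itself: the names `TextExtractor` and `html_to_text` are hypothetical, and skipping `script` and `style` content is an assumption about what counts as useless text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring script and style content (assumed rule)."""

    def __init__(self):
        super().__init__()
        self.parts = []   # text fragments in document order
        self._skip = 0    # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text fragments of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


if __name__ == "__main__":
    page = "<html><body><p>Hello</p><script>var x;</script><p>World</p></body></html>"
    print(html_to_text(page))
```

A real page from a news site would probably also need encoding detection and site-specific cleanup, which may be why a separate converter was needed for RMRB.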
2. Sentence segmentation. ----init_text
Init_text is a useful tool which converts numbers into characters and deletes useless symbols. Also, "" "" is needed at the beginning and end of each sentence, because in HTK these two symbols mark the boundaries of one sentence.
I wrote this program last year for XWLB, but it has some bugs. First, when it encounters a very long number such as 12345556667778, the program stops because the number is too long. Second, I first delete all the files that are very small and have no useful information in them. I will write another, more effective initialization program...
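As a rough sketch of the steps described above (number conversion, symbol removal, sentence markers), the Python below reads numbers digit by digit, so the length of a number is never a limit. This is one possible approach, not the init_text program itself. The Chinese numeral table, the function names, and the `<BEGIN>`/`<END>` marker strings are all assumptions for illustration; the real markers should be whatever your HTK setup expects.

```python
import re

# Assumed digit-to-character table; the post does not say which characters are used.
DIGIT_CHARS = "零一二三四五六七八九"


def digits_to_chars(text):
    """Replace each ASCII digit with one character, so long numbers cannot overflow."""
    return re.sub(r"[0-9]", lambda m: DIGIT_CHARS[int(m.group())], text)


def strip_symbols(text):
    """Delete everything that is not a letter, digit, or whitespace (assumed rule)."""
    return re.sub(r"[^\w\s]|_", "", text)


def init_sentence(sentence, start, end):
    """Clean one sentence and wrap it in the caller's start/end markers."""
    cleaned = strip_symbols(digits_to_chars(sentence)).strip()
    return f"{start} {cleaned} {end}"


if __name__ == "__main__":
    # "<BEGIN>" and "<END>" are hypothetical placeholders, not HTK's actual symbols.
    print(init_sentence("共12345556667778元!", "<BEGIN>", "<END>"))
```

Reading digit by digit loses place values (it gives "one two three" rather than "one hundred twenty-three"), so a fuller version would need rules for units, dates, and decimals.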
-Katrina