Tuesday, March 3, 2009

Steps to initialize the training data

After downloading the html, shtml or htm pages, I have to pick out the content of the pages. After that, I initialize the training data, because HTK requires a certain input format.
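Picking the text content out of a downloaded page can be sketched roughly as below. This is only a minimal illustration using Python's standard html.parser module, not the shtml2txt program itself; the names TextExtractor and html_to_text are hypothetical, and the sketch assumes the page has already been decoded into a string.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script>/<style> bodies (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "\n".join(parser.parts)
```

A real news page would still need site-specific rules to drop menus and footers, which is probably why one generic converter does not fit every site.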
1. Convert html/shtml/htm pages into txt.  ----shtml2txt
    Although I wrote this program, it is not that suitable for RMRB. For RMRB, I have written a new one specially.
2. Sentence segmentation. ----init_text
    Init_text is a useful tool which converts numbers into characters and deletes useless symbols. Also, a start symbol and an end symbol are needed at the beginning and end of each sentence, because in HTK these two symbols mark one sentence.
    I wrote this program last year for XWLB, but it has some bugs. First, when it encounters a very long number like 12345556667778, my program stops because the number is too long. Second, I first delete all the files which are very small and have no useful information in them. I will write another, more effective initialization program...
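The init_text steps above can be sketched roughly as follows. This is a minimal Python illustration under several assumptions, not the original program: numbers are read digit by digit (so a very long number cannot overflow anything), only Chinese characters are kept as "useful" symbols, sentences are split on 。！？ and their ASCII forms, and "<s>" / "</s>" are assumed as the sentence start and end symbols. All function and constant names here are hypothetical.

```python
import re

DIGITS = "零一二三四五六七八九"          # digit-by-digit reading (assumption)
SENT_START, SENT_END = "<s>", "</s>"    # assumed sentence markers
SENTENCE_END = "。！？!?"                # assumed sentence-ending punctuation


def digits_to_chars(text):
    """Replace every ASCII digit with the matching Chinese character."""
    return re.sub(r"[0-9]", lambda m: DIGITS[int(m.group())], text)


def clean(sentence):
    """Delete everything that is not a CJK unified ideograph."""
    return re.sub(r"[^\u4e00-\u9fff]", "", sentence)


def init_text(raw):
    """Return one marked-up line per non-empty sentence in raw."""
    text = digits_to_chars(raw)
    pieces = re.split("[" + re.escape(SENTENCE_END) + "]", text)
    lines = []
    for piece in pieces:
        chars = clean(piece)
        if chars:  # skip sentences with no useful characters left
            lines.append(f"{SENT_START} {chars} {SENT_END}")
    return lines
```

Because each digit is converted on its own, the length of a number no longer matters; the price is that 12 is read as 一二 rather than 十二, which may or may not be acceptable for the training text.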

-Katrina
