Reply to thread

Message: <blockquote data-quote="nelik" data-source="post: 6836546" data-attributes="member: 60402">Corpus & Corpus Analysis Tool  The aim was to build an electronic corpus for various language processing tasks. Initially, it contains a large amount of Sinhala electronic text from a wide range of sources in Unicode format. This corpus, containing 10,000,000 words, can be obtained for research purposes through a written request to LTRL. At a later stage, it will be enhanced into a balanced corpus.  Since the existing corpus analysis tools did not support Sinhala Unicode properly, the need for a tool with full Sinhala support was apparent. The tool we deliver is a Java-based, platform-independent solution that supports virtually any Unicode text corpus. This tool is also available with the corpus.  Since the collected data used different proprietary font encodings, a tool was developed to convert them into Unicode. This tool, too, is available under <a href="http://www.ucsc.cmb.ac.lk/ltrl/?page=downloads&lang=en&style=default" target="_blank">downloads</a> for anyone who wishes to use it.
The corpus went through the following steps to reach its current state.<ol> <li data-xf-list-type="ol">Collating government documents</li> <li data-xf-list-type="ol">Negotiating and collecting publisher content</li> <li data-xf-list-type="ol">Collecting archived web content</li> <li data-xf-list-type="ol">Computerizing non-electronic content from the above sources in Unicode</li> <li data-xf-list-type="ol">Identifying frequent "font encodings" for non-Unicode content</li> <li data-xf-list-type="ol">Building mappings for converting these to Unicode</li> <li data-xf-list-type="ol">Converting all electronic content to Unicode</li> <li data-xf-list-type="ol">Compiling the corpus</li> </ol> <a href="http://www.ucsc.cmb.ac.lk/ltrl/images/panl10n_p1_ctool_full.jpg" target="_blank"><img src="http://www.ucsc.cmb.ac.lk/ltrl/images/panl10n_p1_ctool_sample.jpg" alt="" class="fr-fic fr-dii fr-draggable " style="" /></a></blockquote>
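
For anyone curious what the "building mappings" and "converting to Unicode" steps in the quoted list might look like in practice, here is a minimal sketch in Python. It is not the LTRL tool (which the post describes as Java-based), and the mapping table below is a hypothetical placeholder, not a real Sinhala font mapping. The name `convert` and the table `LEGACY_TO_UNICODE` are my own assumptions for illustration. The idea shown is a simple longest-match-first substitution from legacy font codes to Unicode code points.

```python
# Hypothetical mapping from legacy font-encoded sequences to Unicode.
# These entries are illustrative placeholders only; a real mapping would be
# built per font by inspecting which glyph each legacy code displays.
LEGACY_TO_UNICODE = {
    "l": "\u0d9a",         # assumed legacy code for the letter ka
    "u": "\u0db8",         # assumed legacy code for the letter ma
    "d": "\u0dcf",         # assumed legacy code for the vowel sign aa
    "l=": "\u0d9a\u0dd4",  # assumed two-character legacy sequence for "ku"
}


def convert(text, mapping=LEGACY_TO_UNICODE):
    """Replace legacy sequences with Unicode, trying the longest keys first."""
    # Longest keys first, so "l=" is matched before the shorter "l".
    keys = sorted(mapping, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            # No mapping entry applies: pass the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(convert("ud l="))
```

Note that this sketch only does substitution. Legacy Sinhala fonts often place some vowel signs before the consonant, whereas Unicode stores them after it, so a real converter would probably also need a reordering pass that is not shown here.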

