Corpus & Corpus Analysis Tool
The aim was to build an electronic corpus for various language processing tasks. It initially contains a large amount of Sinhala electronic text, drawn from a wide range of sources, in Unicode format. This corpus, containing 10,000,000 words, can be obtained for research purposes through a written request to LTRL. At a later stage it will be enhanced into a balanced corpus.
Since existing corpus analysis tools did not support Sinhala Unicode properly, the need for a tool with full Sinhala support was apparent. The tool we deliver is a Java-based, platform-independent solution that supports virtually any Unicode text corpus. This tool is also available with the corpus.
Since the collected data used different proprietary font encodings, a tool was developed to convert them into Unicode. This tool, too, is available under downloads for anyone who wishes to use it.
The corpus went through the following steps in growing into its current state.
- Collating government documents
- Negotiating and collecting publisher content
- Collecting archived web content
- Digitizing non-electronic content from the above sources in Unicode
- Identifying the frequently used "font encodings" in non-Unicode content
- Building mappings for converting these to Unicode
- Converting all electronic content to Unicode
- Compiling the corpus
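The mapping and conversion steps above can be illustrated with a minimal sketch. It is not the converter distributed with the corpus: the function name `convert_to_unicode` and the table `LEGACY_TO_UNICODE` are assumptions made for illustration, and the legacy sequences in the table are invented and do not correspond to any real font encoding. The sketch only shows the general idea of a longest-match lookup against a mapping table. A real conversion may also need to reorder characters such as vowel signs, which this sketch does not attempt.

```python
# Hypothetical mapping table: the legacy sequences on the left are invented
# for illustration and do not correspond to any real font encoding.
LEGACY_TO_UNICODE = {
    "l": "\u0d9a",   # SINHALA LETTER ALPAPRAANA KAYANNA
    "u": "\u0db8",   # SINHALA LETTER MAYANNA
    ",": "\u0dbd",   # SINHALA LETTER DANTAJA LAYANNA
    "w": "\u0d85",   # SINHALA LETTER AYANNA
    "wd": "\u0d86",  # SINHALA LETTER AAYANNA (two-character legacy sequence)
}


def convert_to_unicode(text, mapping):
    """Replace legacy sequences with Unicode, preferring the longest match."""
    max_len = max(map(len, mapping), default=0)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so "wd" wins over "w".
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in mapping:
                out.append(mapping[chunk])
                i += length
                break
        else:
            # No mapping entry starts here: keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

Trying the longest sequence first matters because a legacy encoding can use a multi-character sequence whose prefix is itself a valid entry. Characters without an entry, such as spaces, pass through unchanged.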
