<blockquote data-quote="nelik" data-source="post: 6836546" data-attributes="member: 60402"><p>Corpus & Corpus Analysis Tool</p><p> </p><p>The aim was to build an electronic corpus for various language processing tasks. Initially, it contains a large amount of Sinhala electronic text, drawn from a wide range of sources, in Unicode format. The corpus, which contains 10,000,000 words, can be obtained for research purposes through a written request to LTRL. At a later stage it will be enhanced into a balanced corpus.</p><p> </p><p>Because the existing corpus analysis tools did not support Sinhala Unicode properly, a tool with full Sinhala support was clearly needed. The tool we deliver is a Java-based, platform-independent solution that supports virtually any Unicode text corpus. It is available together with the corpus.</p><p> </p><p>Because the collected data used various proprietary font encodings, a tool was developed to convert them to Unicode. This tool is also available under <a href="http://www.ucsc.cmb.ac.lk/ltrl/?page=downloads&lang=en&style=default" target="_blank">downloads</a> for anyone who wishes to use it.</p>
<p>The corpus went through the following steps to reach its current state:</p><p></p><ol> <li data-xf-list-type="ol">Collating government documents</li> <li data-xf-list-type="ol">Negotiating and collecting publisher content</li> <li data-xf-list-type="ol">Collecting archived web content</li> <li data-xf-list-type="ol">Digitizing non-electronic content from the above sources in Unicode</li> <li data-xf-list-type="ol">Identifying the most frequent "font encodings" used for non-Unicode content</li> <li data-xf-list-type="ol">Building mappings for converting these to Unicode</li> <li data-xf-list-type="ol">Converting all electronic content to Unicode</li> <li data-xf-list-type="ol">Compiling the corpus</li> </ol><p> <a href="http://www.ucsc.cmb.ac.lk/ltrl/images/panl10n_p1_ctool_full.jpg" target="_blank"><img src="http://www.ucsc.cmb.ac.lk/ltrl/images/panl10n_p1_ctool_sample.jpg" alt="" class="fr-fic fr-dii fr-draggable " style="" /></a></p></blockquote><p></p>
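The mapping and conversion steps in the quoted list (building mappings for a font encoding, then converting content to Unicode) could look roughly like the sketch below. This is not the LTRL tool, which the post describes only in outline. The table `LEGACY_TO_UNICODE` and the function `convert` are hypothetical names, and the three table entries are illustrative placeholders rather than the mapping of any actual font. The sketch assumes a legacy encoding that stores Sinhala glyphs under ASCII keys, with some letters written as multi-character sequences, so it tries the longest key first at each position.

```python
# Hypothetical mapping table: legacy font sequences -> Unicode Sinhala.
# The entries are illustrative only and do not describe a real font.
LEGACY_TO_UNICODE = {
    "w": "\u0D85",   # SINHALA LETTER AYANNA
    "wd": "\u0D86",  # SINHALA LETTER AAYANNA
    "l": "\u0D9A",   # SINHALA LETTER ALPAPRAANA KAYANNA
}


def convert(text, table=LEGACY_TO_UNICODE):
    """Convert legacy-encoded text to Unicode by greedy longest-match lookup."""
    # Try longer keys first so that "wd" wins over "w" followed by "d".
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No mapping for this character: copy it through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(convert("wdl"))
```

A real converter would need one complete table per font encoding, and probably a reordering pass as well, since legacy fonts often place some vowel signs before the consonant while Unicode stores them after it. That pass is left out here.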