Language translation

nelik · Feb 27, 2010

Corpus & Corpus Analysis Tool

The aim was to build an electronic corpus for various language processing tasks. Initially it contains a large amount of Sinhala electronic text in UNICODE format, drawn from a wide range of sources. This corpus, containing 10,000,000 words, can be obtained for research purposes through a written request to LTRL. At a later stage it will be enhanced into a balanced corpus.

Since the existing corpus analysis tools did not support Sinhala Unicode properly, the need for a tool with full Sinhala support was apparent. The tool we deliver is a Java-based, platform-independent solution that supports virtually any Unicode text corpus. This tool is also available with the corpus.
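
To give an idea of what Sinhala-aware analysis involves at the simplest level, here is a minimal word-frequency sketch in Python. It is not the Java tool described above. It assumes a word is any run of letters, combining marks, digits and zero-width joiners, so that Sinhala vowel signs stay attached to their consonants; the function names are made up for illustration.

```python
import unicodedata
from collections import Counter

# Zero-width non-joiner and joiner; assumed here to belong inside words.
JOINERS = {"\u200c", "\u200d"}


def is_word_char(ch):
    """True for letters (L*), combining marks (M*), digits (N*) and joiners."""
    return ch in JOINERS or unicodedata.category(ch)[0] in ("L", "M", "N")


def tokenize(text):
    """Split text into words without breaking at combining vowel signs."""
    tokens, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def word_frequencies(text):
    """Count words after NFC normalization so equivalent spellings merge."""
    return Counter(tokenize(unicodedata.normalize("NFC", text)))
```

Calling `word_frequencies(text).most_common(20)` on a corpus file's contents would list the twenty most frequent words.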

Since the collected data used different proprietary font encodings, a tool was developed to convert them into UNICODE. This tool, too, is available under downloads for anyone who wishes to use it.
The corpus went through the following steps to reach its current state:

Collating government documents
Negotiating and collecting publisher content
Collecting archived web content
Computerizing non-electronic content from the above sources in UNICODE
Identifying frequent "font encodings" for non-UNICODE content
Building mappings for converting these to UNICODE
Converting all electronic content to UNICODE
Compiling the corpus
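
The mapping and conversion steps above can be pictured with a small Python sketch. This is not the actual converter. The mapping table is hypothetical and far too small to be useful, and the code only does longest-match substitution; converters for real legacy fonts usually also have to reorder some vowel signs, which is not handled here.

```python
# Hypothetical table: legacy font key sequences -> Unicode strings.
LEGACY_TO_UNICODE = {
    "l": "\u0d9a",          # SINHALA LETTER KA (assumed legacy key)
    "u": "\u0db8",          # SINHALA LETTER MA (assumed legacy key)
    ",": "\u0dbd",          # SINHALA LETTER LA (assumed legacy key)
    "ld": "\u0d9a\u0dcf",   # KA + vowel sign AELA-PILLA (assumed legacy keys)
}


def convert(text, mapping=LEGACY_TO_UNICODE):
    """Replace legacy sequences with Unicode, trying longer keys first."""
    keys = sorted(mapping, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            # No mapping for this character: pass it through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

Trying longer keys first matters because a multi-character legacy sequence would otherwise be converted piece by piece.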

nelik · Feb 27, 2010

Lexicon

The lexicon contains a list of more than 25,000 Sinhala words together with some grammatical features. The features currently identified are part of speech, number and gender, but more may be added as requirements arise. In addition, the lexicon contains English and Tamil translations of the corresponding Sinhala words, providing a resource for language translation work. This resource, too, is available for download.
The process of creating the lexical resource included the following steps:

Collecting dictionary data in printed formats
Computerizing this content in UNICODE
Collecting dictionary data in electronic formats
Converting all electronic content to UNICODE
Extracting information by parsing data
Correcting errors and typos
Compiling the lexicon
Building interface applications to the lexicon
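
As a rough picture of what one lexicon record and a lookup interface could look like, here is a Python sketch. The field names, the helper functions and the sample entry are assumptions for illustration only and are not taken from the actual lexicon or its file format.

```python
from dataclasses import dataclass, field


@dataclass
class LexiconEntry:
    word: str                  # Sinhala headword in Unicode
    pos: str                   # part of speech
    number: str                # e.g. "singular" or "plural"
    gender: str                # e.g. "masculine" or "feminine"
    english: list = field(default_factory=list)  # English translations
    tamil: list = field(default_factory=list)    # Tamil translations


# Illustrative entry only, not copied from the real resource.
SAMPLE_ENTRIES = [
    LexiconEntry("මිනිසා", "noun", "singular", "masculine", ["man"], ["மனிதன்"]),
]


def build_index(entries):
    """Group entries by headword so homographs share one key."""
    index = {}
    for entry in entries:
        index.setdefault(entry.word, []).append(entry)
    return index


def translate(index, word, language):
    """Return all translations of word in 'english' or 'tamil'."""
    if language not in ("english", "tamil"):
        raise ValueError("unsupported language: " + language)
    results = []
    for entry in index.get(word, []):
        results.extend(getattr(entry, language))
    return results
```

A word that is not in the index simply yields an empty list, which keeps the lookup easy to use from an interface application.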

nelik · Feb 27, 2010

Text To Speech (TTS) System

While some experimental TTS systems for Sinhala were already under development at the UCSC, the aim of this project was to produce one of commercial quality. To this end, considerable effort was spent on the quality aspects of this activity. Apart from identifying the phonetic alphabet of the language, recording relevant word sentences in the database and building a text analysis component, the project also produced a synthesis engine that produces a natural-sounding Sinhala voice. This application is available for download.
The methodology is based on the diphone concatenation approach to TTS and included the following components and development procedures:

Text analysis component:
1. Studying types of non-textual content and how to convert them to text
2. Defining the text analysis interface
3. Building the text analysis component
Phonetic component:
1. Studying the phonology and phonetics of Sinhala
2. Identifying the phonetic vocabulary
3. Constructing word sentences for recording the most common diphones
4. Defining phonetic processor components
5. Building the diphone database
6. Building the phonetic processor
Integrating all components and producing the TTS system.
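
For readers new to diphone concatenation, the core idea can be sketched in a few lines of Python. This is a toy illustration, not the project's engine: the phoneme symbols, the `"_"` silence marker and the dictionary-based diphone database are all assumptions, and the units are simply appended without any smoothing at the joins.

```python
def to_diphones(phonemes, silence="_"):
    """Turn a phoneme sequence into diphone names, padded with silence."""
    padded = [silence] + list(phonemes) + [silence]
    return [a + "-" + b for a, b in zip(padded, padded[1:])]


def synthesize(phonemes, diphone_db):
    """Concatenate the recorded samples of each diphone in order.

    diphone_db maps a diphone name such as "a-m" to a list of audio
    samples; a missing diphone raises KeyError.
    """
    samples = []
    for name in to_diphones(phonemes):
        samples.extend(diphone_db[name])
    return samples
```

For the phonemes `["a", "m", "a"]` the units looked up are `_-a`, `a-m`, `m-a` and `a-_`, which is why the recording step has to cover the most common diphones.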

nelik · Feb 27, 2010

Optical Character Recognition System

Previous work on OCR at the UCSC had concentrated on developing a technique best suited to recognizing printed Sinhala characters. This component of the work focused on turning that research into a real product by making it robust to variations in font size, particularly the sizes most commonly encountered, such as those in newspaper print and government publications. Later it will be developed into font-independent OCR software. You can download this software from the download section.
The methodology used to construct the OCR consisted of the following steps:

Preprocessing activities:
1. Scanning documents and skew detection
2. Noise detection and removal
3. Extraction of text characteristics and individual characters
Data collection:
1. Identification of representative texts
2. Separation of training, validation and testing sets
Processing:
1. Feature extraction and pattern matching
2. Testing of competing algorithms
3. Optimization of algorithms
4. Application development
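
As an illustration of the text extraction part of preprocessing, here is a Python sketch of line segmentation with a horizontal projection profile. This is a common textbook technique and only an assumption about how such a step could be done, not the method the project used. The image is assumed to be already binarized, deskewed and given as rows of 0 (background) and 1 (ink).

```python
def row_profile(image):
    """Count ink pixels in every row of a binary image."""
    return [sum(row) for row in image]


def segment_lines(image, min_ink=1):
    """Return (start, end) row ranges of text lines, end exclusive.

    A line is a maximal run of rows with at least min_ink ink pixels.
    """
    lines = []
    start = None
    for y, ink in enumerate(row_profile(image)):
        if ink >= min_ink and start is None:
            start = y                   # a text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y))    # the line ended on the row before
            start = None
    if start is not None:
        lines.append((start, len(image)))
    return lines
```

The same idea applied to the columns of each line gives a first cut at separating individual characters, although touching characters need more than this.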

boxman · Feb 27, 2010

thank u machan !!

nelik · Feb 27, 2010

boxman said:
thank u machan !!

ela ela machanzzz

K_ZONE · Mar 28, 2010

Cool, machan, thank you.
