A Systematic Literature Review on Spell Checkers for Bangla Language

Prianka Mandal; B M Mainul Hossain

Journal: International Journal of Modern Education and Computer Science (IJMECS)

Article in issue: 6 vol.9, 2017.

Spell checkers check whether a word is misspelled and provide suggestions to correct it. Detecting and correcting spelling errors in Bangla, the seventh most spoken native language in the world, is very onerous because of the complex rules of Bangla spelling. There is no systematic literature review on this research topic. In this paper, we present a systematic literature review on checking and correcting spelling errors in the Bangla language. We investigate the current methods used for spell checking and identify the challenges addressed by those methods. We also report the limitations of those methods. Recent relevant studies are selected based on a set of significant criteria. Our results indicate that there are research gaps in this topic and that it has potential for further investigation.

Keywords: Systematic Literature Review, Spelling Errors, Spell Detecting, Spell Checking, Spell Checker, Bangla Language, Misspelled Word

Short address: https://sciup.org/15014978

IDR: 15014978

Published Online June 2017 in MECS

I. Introduction

A Systematic Literature Review (SLR) is a process that identifies, evaluates and interprets all available research relevant to a particular research area of interest. An SLR can be used to summarize the existing evidence on a particular research topic, to identify research gaps, to provide suggestions for further investigation and to provide a basis for generating new hypotheses. However, an SLR requires more effort than a traditional literature review [1]. An SLR aims to find as much relevant information as possible about a particular research domain. Conducting a systematic literature review on a particular topic is very helpful to researchers working on that topic.

A misspelled word is a word in a text that is not a valid word of the language and is typically not found in a dictionary of that language. If a word is in the corresponding dictionary, it is considered correctly spelled. Spell checking is the process of detecting spelling errors and providing the most probable correct words to replace them. Spell checkers help users improve their writing by reducing spelling errors. Detecting misspelled words and correcting them automatically is a great research challenge. There are many well-established spell checkers for English and other western languages, but there is no well-established spell checker for Bangla, and there are few research works on Bangla spell checking. Therefore, a systematic literature review on checking and correcting spelling errors in the Bangla language would be very helpful for researchers interested in working on this topic.

In this paper, we present our findings from a systematic literature review on checking and correcting spelling errors in the Bangla language. Our review looks into the methods on which current Bangla spell checkers are built, the challenges they address and the limitations of existing work. By conducting an SLR on this topic, we make it easier for interested researchers to determine the present state of research on this topic. Our results make it possible for interested researchers to develop a Bangla spell checker based on the best knowledge and practice across many previous studies.

The remainder of the paper is organized as follows: Section II discusses the background of this study. Section III presents the methodology of the work. Section IV presents the validity of our review. Section V presents results of this SLR. Section VI provides a discussion about our findings. Finally, we summarize our conclusions in Section VII.

II. Background Study

In this section, we discuss concepts that are relevant to this study. We provide an overview of the Bangla language, since the main focus of this study is spell checking specifically for Bangla. Different types of spelling errors and spell checking techniques are also discussed here.

A. Bangla Language

Bangla or Bengali, a member of the Indo-Aryan languages, is the state language of Bangladesh and the second most spoken language in India. Over 210 million people speak Bangla, the majority of whom live in Bangladesh and in the Indian state of West Bengal. Bangla is the seventh most spoken native language in the world. Even though Bangla is easy to use verbally, its complex script makes it rather difficult to write correctly.

The Bangla alphabet has 50 letters, and the decimal number system has 10 digits [32]. The alphabet comprises 11 vowels and 39 consonants and has no concept of upper or lower case. Here, we discuss some challenges that need to be addressed because of the complex rules of Bangla:

1. Phonetically similar characters: There are some characters in Bangla which are phonetically similar. Example: ন (n) and ণ (N); শ (sh), ষ (Sh) and স (s).
2. Consonant clusters or Juktakkhors: A consonant cluster consists of up to four consonants that are not separated by vowels.
3. Use of Phalas: There are different types of phala, such as YA-phala, RA-phala and LA-phala.
4. Use of Matra: Matra is the horizontal headline drawn on top of many Bangla characters.
5. Conjuncts with unusual pronunciations.
6. Different pronunciations in different contexts.
7. Multiple pronunciations of some letters in the same context.
8. Use of vowel diacritics: Every vowel except the first (অ) has its diacritic. These vowel diacritics are used with consonants. They are া, ি, ী, ু, ূ, ৃ, ে, ৈ, ো and ৌ.
9. Use of modifier symbols: There are some modifier symbols in Bangla, such as ং, ঃ and ঁ.

B. Types of Spelling Errors

Kukich [2] classified spelling errors into two types: non-word errors and real-word errors. A non-word error is a word-level error that occurs when a word is not a valid word of the language. Example: “vol” is a non-word error, because it is not a valid word. A real-word error is a sentence-level error that occurs when a word is valid but inappropriate in the context of the sentence. Example: in the sentence “amar asa nei”, the word “asa” is a valid word but it is inappropriate in the context of this sentence.

Kukich [2] also provided an alternative classification of spelling errors into two types: cognitive errors and typographical errors. A cognitive error occurs when the user forgets the correct spelling while typing or does not know it. Example: typing “ridoy” instead of “hridoy”. A typographical error occurs when the user makes a mistake while typing. Example: a mistyped form of “vumika”, which differs from the correct word in the Bangla script even though both forms are romanized identically.

Many types of typographical spelling errors can occur, such as insertion, deletion, substitution and transposition errors. An insertion error occurs when a user types an extra character in a word. For example, the word “poribbar” contains an extra character; the correct word is “poribar”. A deletion error occurs when a user forgets to type a character in a word. For example, the word “sadhanata” is misspelled because a character is missing; the correct word is “sadharanata”. A substitution error occurs when a user types a wrong character at any position of a word. For example, the word “prathonik” is misspelled; replacing the character “n” with “m” gives the correct word “prathomik”. A transposition error occurs when a user types a word in which two characters exchange their places. For example, the word “pokol” is misspelled; the correct word “polok” is obtained when the characters “k” and “l” interchange their places.
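
The four typographical error types can be told apart mechanically when both the typed word and the intended word are known. The following is a minimal sketch, not taken from any of the reviewed studies: the function name `classify_error` is hypothetical, the words are the romanized examples from this section, and a transposition is taken to mean any two characters swapping places, as in the “pokol”/“polok” example.

```python
def classify_error(typed: str, correct: str) -> str:
    """Name the single typographical error that turns `correct` into `typed`."""
    if typed == correct:
        return "none"
    # Insertion: removing one character from the typed word gives the correct word.
    if len(typed) == len(correct) + 1:
        if any(typed[:i] + typed[i + 1:] == correct for i in range(len(typed))):
            return "insertion"
    # Deletion: removing one character from the correct word gives the typed word.
    if len(typed) + 1 == len(correct):
        if any(correct[:i] + correct[i + 1:] == typed for i in range(len(correct))):
            return "deletion"
    if len(typed) == len(correct):
        diff = [i for i in range(len(typed)) if typed[i] != correct[i]]
        # Substitution: exactly one position holds a wrong character.
        if len(diff) == 1:
            return "substitution"
        # Transposition: exactly two positions differ and hold each other's character.
        if len(diff) == 2:
            i, j = diff
            if typed[i] == correct[j] and typed[j] == correct[i]:
                return "transposition"
    return "other"


print(classify_error("poribbar", "poribar"))
print(classify_error("prathonik", "prathomik"))
print(classify_error("pokol", "polok"))
```

The sketch compares Python characters one by one. A single Bangla letter can correspond to several Latin letters in romanized form (and to several Unicode code points in Bangla script), so a real classifier would need to choose its unit of comparison with care.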

Most of the misspellings occur because of

• phonetic similarity of Bangla characters,
• the difference between the grapheme representation and phonetic utterances, and
• lack of proper knowledge of spelling rules [3].

C. Spell Checker

A spell checker is an application that is used to detect misspelled words and correct spelling errors. The main tasks of a spell checker are:

1. Check whether a word is correct or misspelled,
2. Generate candidate corrections if the word is misspelled and
3. Provide the most likely candidate corrections as suggestions to the user.
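
The three tasks above can be sketched as a single function. This is only an illustration under stated assumptions: the dictionary is a toy set of romanized words, the function name `check_word` is hypothetical, and candidate generation and ranking are delegated to `difflib.get_close_matches` from the Python standard library, which is a generic string matcher and not a method used by the reviewed studies.

```python
import difflib


def check_word(word, dictionary, max_suggestions=3):
    """Return (is_correct, suggestions) for a single word."""
    # Task 1: a word found in the dictionary is treated as correctly spelled.
    if word in dictionary:
        return True, []
    # Tasks 2 and 3: generate candidates and keep the most similar ones,
    # ordered from most to least similar.
    suggestions = difflib.get_close_matches(
        word, sorted(dictionary), n=max_suggestions, cutoff=0.6
    )
    return False, suggestions


toy_dictionary = {"poribar", "prathomik", "polok"}  # assumed toy lexicon
print(check_word("poribar", toy_dictionary))
print(check_word("poribbar", toy_dictionary))
```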

A spell checker may be a stand-alone application that takes text from users and provides suggestions if there are any misspelled words in that text. A spell checker can also be implemented as part of a larger application such as an email client, text editor or word processor.

Various algorithms are used to implement spell checkers that offer word suggestions. One approach is to encode all words into their corresponding phonological codes and then use these codes to check for spelling errors and generate suggestions. Phonetic similarity is generally measured by encoding algorithms such as Soundex [4], Metaphone [5], Double Metaphone [6] and PHONIX [7]. Soundex is a phonetic algorithm that groups phonetically similar letters together and assigns each group a numerical code. Soundex works on a letter-by-letter basis and cannot handle context-sensitive rules. Metaphone is another phonetic algorithm that is more accurate than Soundex because it considers the context-sensitive rules of English pronunciation. Double Metaphone is a newer version of the Metaphone algorithm that addresses its problems and produces more accurate results. PHONIX is an improved version of Soundex encoding. These algorithms are language-specific and typically designed for the English language.
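
To make the letter-grouping idea concrete, the following is a minimal sketch of the commonly described American Soundex variant for English. It is an illustration only: the function name `soundex` is ours, and the letter groups are the standard English ones, so the sketch does not apply to Bangla script as written.

```python
def soundex(word: str) -> str:
    """Encode an English word as a letter followed by three digits."""
    groups = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
              ("l", "4"), ("mn", "5"), ("r", "6"))
    codes = {ch: digit for letters, digit in groups for ch in letters}

    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    previous = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = codes.get(ch, "")  # vowels and y have no code
        if digit and digit != previous:
            result += digit
        previous = digit
    # Pad with zeros or truncate so the code is always four characters long.
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))
```

Words that sound alike, such as "Robert" and "Rupert", receive the same code, which is what allows a spell checker to retrieve phonetically similar suggestions.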

Structural similarity can also be used to detect and correct misspelled words. Edit distance [8] is used to estimate the structural similarity between a misspelled word and candidate corrections. Edit distance measures the minimum number of operations required to transform one string into the other. Three operations are allowed when measuring edit distance: inserting a new character into one of the strings, deleting an existing character, and replacing one character with another. However, computing the edit distance between a misspelled word and every word in the dictionary is highly inefficient.
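
The edit distance with these three operations can be computed by dynamic programming. The sketch below is a standard textbook formulation, not the implementation of any reviewed study; the helper `rank_candidates`, the threshold `max_distance` and the toy word list are assumptions for illustration. It also shows the inefficiency noted above, since every dictionary entry is compared with the misspelled word.

```python
def edit_distance(source: str, target: str) -> int:
    """Minimum number of insertions, deletions and replacements."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            replace_cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,                 # delete s_char
                current[j - 1] + 1,              # insert t_char
                previous[j - 1] + replace_cost,  # replace or keep
            ))
        previous = current
    return previous[-1]


def rank_candidates(word, dictionary, max_distance=2):
    """Scan the whole dictionary and return close entries, nearest first."""
    scored = sorted((edit_distance(word, entry), entry) for entry in dictionary)
    return [entry for distance, entry in scored if distance <= max_distance]


print(edit_distance("prathonik", "prathomik"))
print(rank_candidates("prathonik", ["polok", "prathomik", "poribar"]))
```

Note that with only these three operations, swapping two characters (as in “pokol” and “polok”) costs two replacements.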

Stemming is the process of splitting a word into its stem and affixes. Stemming algorithms are used to improve the performance and effectiveness of spell checkers. Stemming can reduce the size of the dictionary used in different natural language processing applications, particularly for highly inflected languages. Extracting root words with a stemming algorithm is easy for a language like English [31]. However, the design of stemmers is language-specific and requires moderate to significant linguistic expertise in the language, as well as an understanding of the needs of a spelling checker for that language [9]. The first published stemming algorithm is the Lovins stemming algorithm [10]. Porter’s algorithm [11] is the most common algorithm for stemming English; it reduces derived words to their stems [34]. The Porter stemmer applies a set of rules to iteratively remove suffixes from a word until none of the rules apply, whereas the Lovins stemmer has a larger set of suffixes and does not apply its rules iteratively.
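
The difference between single-pass and iterative suffix removal can be sketched with a toy suffix list. The suffixes, the minimum stem length and the function names below are assumptions chosen for illustration; they are not the actual rule sets of the Porter or Lovins stemmers, which are considerably more elaborate.

```python
# Toy English suffix list (an assumption, not a published rule set).
SUFFIXES = ("ness", "ing", "ful", "ed", "ly", "s")


def strip_once(word: str, min_stem: int = 3) -> str:
    """Remove the longest matching suffix once (single-pass style)."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


def strip_iteratively(word: str, min_stem: int = 3) -> str:
    """Keep removing suffixes until no rule applies (iterative style)."""
    while True:
        shorter = strip_once(word, min_stem)
        if shorter == word:
            return word
        word = shorter


print(strip_once("hopefulness"))
print(strip_iteratively("hopefulness"))
```

A spell checker can store only stems in its dictionary and look up the stem of each input word, which is how stemming reduces dictionary size for inflected languages.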

The n-gram model is a statistical prediction technique that is also used to check the correctness of a word. The idea of using n-grams in language processing was first discussed by Shannon [12]. An n-gram model is a type of probabilistic language model that predicts the next item in a sequence in the form of an (n - 1)-order Markov model. An n-gram of size one is referred to as a unigram, size two is a bigram, and size three is a trigram. Larger sizes are sometimes referred to by the value of n, such as four-gram, five-gram, and so on. One main advantage of the n-gram method is that it is language independent [13].
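
One simple way to use n-grams for word checking is to collect character n-grams from a list of known words and to flag any word that contains an unseen n-gram. The sketch below assumes this simplified scheme with character bigrams and word-boundary markers; the function names and the toy lexicon of romanized words are hypothetical, and the reviewed studies may use different n-gram statistics.

```python
from collections import Counter


def char_ngrams(word: str, n: int = 2) -> list:
    """Character n-grams of a word, with ^ and $ marking its boundaries."""
    padded = "^" + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(words, n: int = 2) -> Counter:
    """Count every character n-gram seen in the known words."""
    counts = Counter()
    for word in words:
        counts.update(char_ngrams(word, n))
    return counts


def looks_valid(word: str, counts: Counter, n: int = 2) -> bool:
    """A word is suspicious if any of its n-grams was never observed."""
    return all(counts[gram] > 0 for gram in char_ngrams(word, n))


lexicon = ["polok", "poribar", "prathomik"]  # assumed toy lexicon
model = train(lexicon)
print(looks_valid("polok", model))
print(looks_valid("pokol", model))
```

Because the model only counts character sequences, the same code works for any script, which reflects the language independence noted above.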

III. Methodology

Our systematic literature review was conducted in accordance with systematic review guidelines [1] and comprises several steps. The details of each step are described in this section.

A. Identify the Need for a Systematic Literature Review

A lot of significant research has been done on checking and correcting spelling errors for the English language, and to a lesser extent for some other languages. However, only a few research works have been conducted on checking and correcting spelling errors in the Bangla language.

A systematic literature review on checking and correcting spelling errors in the Bangla language is necessary for researchers who have worked on this topic or are interested in working on it. However, to the best of our knowledge, there is no paper presenting a systematic literature review on checking and correcting spelling errors in the Bangla language. Our motivation for this work is to provide an overview of checking and correcting Bangla spelling errors.

In this paper, a systematic literature review on checking and correcting spelling errors in the Bangla language is presented. Researchers can learn which research works have already been done on this topic, what the limitations of those works are and which key challenges are addressed.

B. Research Questions

Identifying research questions is an important step in systematic literature review. Three research questions were considered when conducting this study. The research questions and their motivations are presented in Table 1.

Table 1. Research Questions and their Motivations

Research Question		Motivation
RQ1	What are the current methods used to develop Bangla spell checker?	To identify existing methods and algorithms to develop Bangla spell checker
RQ2	What key challenges are being focused when developing a Bangla spell checker?	To identify most of the challenges that captured researchers’ attention when developing a Bangla spell checker
RQ3	What are the limitations of existing research?	To identify the limitations of existing research

C. Search for Studies

We used the following search strings in our searches:

- check* AND ((spell* AND error*) OR (misspelled AND word*)) AND ((generate* AND suggestion*) OR correct*) AND (Bangla OR Bengali)

- (develop* OR implement*) AND (Bangla OR Bengali) AND ((spell* AND check*) OR (misspelled AND word*))

D. Study Selection Criteria

The study selection criteria are based on the research questions. We included papers that focus on checking and correcting spelling errors in the Bangla language and that were published either in a journal or in conference proceedings.

There are few research works on Bangla spell checkers. Therefore, we included most of the papers that answered our research questions. We excluded only those that repeated earlier work; in that case, we included only the most recent version.

E. Study Selection Process

A manual search process was applied to find documents that provide answers to our research questions. Initially, we used the following sources for our search process.

i. IEEE Xplore
ii. ACM Digital Library
iii. Google Scholar
iv. Springer
v. ResearchGate
vi. CiteSeerX
vii. ScienceDirect

These sources were chosen because they cover most of the publications on the selected research topic. Then, we selected papers based on our selection criteria. Next, we checked the reference sections of the selected studies for any relevant papers, journals or books. We also checked papers that cite the selected studies.

F. Data Extraction

The extraction form shown in Table 2 was used to record the information gathered from the primary studies.

Table 2. Data Collection Form

Data Item	Value
Study Identifier	S#
Paper Name
Author Name
Paper Type	Journal / Book / Thesis / Conference
Name of e-library	IEEE, ACM or any other
Publication Year	2001-2016
Research Question	RQ1, RQ2, RQ3
Motivation of Paper
Method of the Paper	Approaches / Algorithms / Techniques
Limitation of the Paper

IV. Validity of the Systematic Review

We performed this systematic literature review to investigate the techniques for detecting and correcting spelling errors in Bangla. For this investigation, we accumulated all available evidence. The main threats to the validity of our study are that our publication selection may be biased and that there may be a lack of sufficient information resources. However, we made an effort to reach all possible relevant resources. We found that there are only a few publications in this research domain. The search process was manual rather than automated. Therefore, we may have missed some relevant resources. However, we searched most of the sources with our search strings many times, and our results were the same each time.

V. Results Analysis

At the beginning of the search process, the initial results returned many studies covering other related topics, such as English spell checkers, grammar checkers, spell checkers for other languages and other topics in Bangla language processing. We focused only on papers related to Bangla spell checkers. Using our inclusion and exclusion criteria, we identified 11 relevant papers (S1-S11) through the search process. One of the papers (S8) was a short version of another paper (S1); therefore, we used only one of them when answering our research questions. Thus, the number of identified papers is 10. We investigated those papers to answer our research questions. We also found some papers (S12-S14) in the reference sections of those papers, and we considered some university theses (S15-S16). Since Bangla is spoken mainly in Bangladesh and India, almost all authors of the selected papers are either Indian or Bangladeshi. The publication years of the selected papers range from 2001 to 2016.

These studies and their descriptions in terms of conference or journal name, publication year, where to access them and which research questions were answered are shown in Table 3.

A. RQ1: What are the current methods used to develop Bangla spell checker?

References

[1] Kitchenham, Barbara, and Stuart Charters. "Guidelines for Performing Systematic Literature Reviews in Software Engineering." EBSE Technical Report, Ver. 2.3. Keele University and Durham University, 2007.
[2] Kukich, Karen. "Techniques for Automatically Correcting Words in Text." ACM Computing Surveys (CSUR) 24.4 (1992): 377-439.
[3] P. Kundu and B.B. Chaudhuri (1999) "Error Pattern in Bangla Text". International Journal of Dravidian Linguistics. 28(2): 49-88.
[4] D. E. Knuth, The Art of Computer Programming, Vol. 3, Addison-Wesley Publishing Company, Reading, Massachusetts, 2nd edition, 1982.
[5] Lawrence Phillips, “Hanging on the Metaphone”, Computer Language, 7(12), 1990.
[6] Lawrence Phillips, “The Double Metaphone Search Algorithm”, C/C++ Users Journal, 18(6), June, 2000.
[7] T. N. Gadd, “PHONIX: The Algorithm”, Program, 24(4), pp. 363-366, 1990.
[8] Levenshtein, V. I. (1966). Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Soviet Physics Doklady, 10(8), 707–710.
[9] W. Kraaij and R. Pohlmann, “Viewing Stemming as Recall Enhancement”, In the Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996, pp. 40–48.
[10] Lovins, Julie Beth (1968). "Development of a Stemming Algorithm". Mechanical Translation and Computational Linguistics 11: 22–31.
[11] Porter, Martin F. 1980. An Algorithm for Suffix Stripping. Program 14 (3): 130-137.
[12] C. E. Shannon, “Prediction and Entropy of Printed English,” Bell Syst. Tech. J. (30):50–64, 1951.
[13] Farag Ahmed, Ernesto William De Luca, and Andreas Nürnberger, “Revised N-Gram based Automatic Spelling Correction Tool to Improve Retrieval Effectiveness”, August 22, 2009.
[14] Chaudhuri, Bidyut Baran. "Reversed Word Dictionary and Phonetically Similar Word Grouping based Spell-checker to Bangla Text." Proc. LESAL Workshop, Mumbai. 2001.
[15] Naushad UzZaman and Mumit Khan, “A Bangla Phonetic Encoding for Better Spelling Suggestions”, Proc. 7th International Conference on Computer and Information Technology, Dhaka, December, 2004.
[16] UzZaman, Naushad, and Mumit Khan. "A Double Metaphone Encoding for Bangla and its Application in Spelling Checker." 2005 International Conference on Natural Language Processing and Knowledge Engineering. IEEE, 2005.
[17] Islam, Md, Md Uddin, and Mumit Khan. “A Light Weight Stemmer for Bengali and its Use in Spelling Checker,” Proc. 1st Intl. Conf. on Digital Comm. and Computer Applications (DCCA07), Irbid, Jordan, March 19-23, 2007.
[18] N. UzZaman and M. Khan, “A Comprehensive Bangla Spelling Checker”, In the Proceeding of the International Conference on Computer Processing on Bengali (ICCPB), Dhaka, Bangladesh, 2006.
[19] Hoque, Md Tamjidul, and Md Kaykobad. "Coding System for Bangla Spell Checker." 5th International Conference on Computer and Information Technology. 2002.
[20] Abdullah, Md Munshi, Md Zahurul Islam, and Mumit Khan. "Error-tolerant Finite-state Recognizer and String Pattern Similarity Based Spelling-Checker for Bangla." Proceeding of 5th International Conference on Natural Language Processing (ICON). 2007.
[21] Chaudhuri, Bidyut Baran. "Towards Indian Language Spell-checker Design." Language Engineering Conference, 2002. Proceedings. IEEE, 2002.
[22] Abdullah, A. B. A., and Ashfaq Rahman. "A Generic Spell Checker Engine for South Asian Languages." Conference on Software Engineering and Applications (SEA 2003). 2003.
[23] Murshed, M. Manzur, Mahbubur Rahman Syed, and M. Kaykobad. "A Linguistically Sortable Bengali Coding System and its Application in Spell Checking: A Case Study of Multilingual Applications." Interactive multimedia systems (2002): 251.
[24] Khan, Nur Hossain, et al. "Checking the Correctness of Bangla Words using N-Gram." International Journal of Computer Application 89.11 (2014).
[25] Haque, Md Tamjidul, and M. Kaykobad. "Use of Phonetic Similarity for Bangla Spell Checker." Proc. 5th International Conference on Computer and Information Technology. 2002.
[26] Abdullah, A. B. A., and Ashfaq Rahman. "A Different Approach in Spell Checking for South Asian Languages." Proc. 2nd International Conference on Information Technology for Applications (ICITA), China. 2004.
[27] Abdullah, Arif Billah Al-Mahmud, and Ashfaq Rahman. "Spell Checker for Bangla Language: An Implementation Perspective." Proc. 6th International Conference on Computer and Information Technology, Dhaka, Bangladesh. 2003.
[28] UzZaman, Naushad. "Phonetic Encoding for Bangla and its Application to Spelling Checker, Name Searching, Transliteration and Cross Language Information Retrieval." Undergraduate thesis (Computer Science), BRAC University (2005).
[29] Bhowmik, Kowshik, Afsana Zarin Chowdhury, and Sushmita Mondal. Development of A Word Based Spell Checker for Bangla Language. Diss. Department of Computer Science and Engineering, Military Institute of Science and Technology, 2014.
[30] Asadullah, Munshi. Finite State Recognizer and String Similarity based Spelling Checker for Bangla. Diss. BRAC University, 2007.
[31] Govilkar, Sharvari S., J. W. Bakal, and Sagar R. Kulkarni. "Extraction of Root Words using Morphological Analyzer for Devanagari Script." International Journal of Information Technology and Computer Science (IJITCS) 8.1 (2016): 33.
[32] Aktaruzzaman, Md, and Md Farukuzzaman Khan. "A New Technique for Segmentation of Handwritten Numerical Strings of Bangla Language." International Journal of Information Technology and Computer Science (IJITCS) 5.5 (2013): 38.
[33] Doumi, Noureddine, et al. "A Semi-Automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs." International Journal of Information Technology and Computer Science (IJITCS) 8.2 (2016): 1.
[34] Divya, K. S., R. Subha, and S. Palaniswami. "Similar Words Identification Using Naive and TF-IDF Method." International Journal of Information Technology and Computer Science (IJITCS) 6.11 (2014): 42.