A partial string matching approach for named entity recognition in unstructured Bengali data

Authors: Nabil Ibtehaz, Abdus Satter

Journal: International Journal of Modern Education and Computer Science (IJMECS)

Issue: Vol. 10, No. 1, 2018

Open access

In today's data-driven, automated and digitized world, a significant stage of information extraction is to look for special keywords, more formally known as 'Named Entities'. This has been an active research topic for more than two decades, and significant progress has been made. Today we have models powered by deep learning that, although not perfect, achieve near-human accuracy on certain tasks. Unfortunately, these algorithms require a lot of annotated training data, which is scarce for the Bengali language. This paper proposes a partial string matching approach to identify named entities in an unstructured Bengali text corpus. The algorithm is based on Breadth First Search (BFS) over a Trie data structure, augmented with dynamic programming. This technique is capable of not only identifying named entities present in a text, but also estimating the intended named entities from erroneous data. To evaluate the proposed technique, we conducted experiments in a closed domain, applying the approach to a text corpus with some predefined named entities. The texts experimented on were both structured and unstructured, and our algorithm succeeded in both cases.


Keywords: Named Entity Recognition, Dynamic Programming, Trie, String Matching, Edit Distance

Short URL: https://sciup.org/15016728

IDR: 15016728 | DOI: 10.5815/ijmecs.2018.01.04

Text of the scientific article: A partial string matching approach for named entity recognition in unstructured Bengali data

Published Online January 2018 in MECS DOI: 10.5815/ijmecs.2018.01.04

The Named Entity Recognition (NER) problem holds a very important position in the domains of Natural Language Processing (NLP) and Information Retrieval (IR) [1]. Formally, a Named Entity (NE) is an abstract or real object, such as a person, a location, an organization or even numerical data, that can be classified and denoted with a proper name. NER is an Information Extraction (IE) task that identifies named entities in a text and tags them with predefined categories such as names of persons, organizations and locations, time expressions, quantities and numerical values. Early approaches to this problem used handcrafted algorithms, whereas now, with advances in data science and data mining and with access to big data, we can employ machine learning to solve it. However, little structured and annotated data is available for Bengali, so we cannot use state-of-the-art machine learning models. This is why our paper is limited to developing a partial string matching approach for solving the NER problem.

In today’s world, many of our tasks are automated. Work that was previously done by human agents is now done by computers. A very popular example is the scanning of ZIP codes in the USA using OCR technology. The time is not far off when all our day-to-day tasks will be governed by computers. Information plays a vital and inseparable role in our daily life, and many of our dealings are carried out through textual data. So, we need robust systems to retrieve information from textual data. This is an active research area in fields such as NLP and IR. NER covers a fair part of retrieving information from textual data. If we manage to identify the named entities in a text, the text becomes structured and it becomes easier to extract semantic meaning from it. Motivated by these needs, this paper explores a string matching based approach to solving the NER problem for the Bengali language.
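The abstract describes the technique only at a high level: a BFS over a Trie, augmented with dynamic programming, with edit distance listed among the keywords. The following is a minimal sketch of how such a lookup could work, not the authors' implementation. The names `TrieNode`, `build_trie` and `fuzzy_lookup`, the `max_dist` threshold and the transliterated sample entities are all assumptions made for illustration. Each BFS queue entry carries one row of the Levenshtein dynamic programming table, so the row for a trie node is computed once from its parent's row.

```python
from collections import deque


class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.entity = None   # full entity string if one ends at this node


def build_trie(entities):
    """Insert every predefined entity into a trie (hypothetical helper)."""
    root = TrieNode()
    for entity in entities:
        node = root
        for ch in entity:
            node = node.children.setdefault(ch, TrieNode())
        node.entity = entity
    return root


def fuzzy_lookup(root, word, max_dist=1):
    """Return (entity, distance) pairs within max_dist edits of word.

    Assumption: plain Levenshtein distance over Unicode code points.
    """
    # Row for the empty prefix: reaching word[:j] costs j insertions.
    first_row = list(range(len(word) + 1))
    queue = deque([(root, first_row)])
    matches = []
    while queue:
        node, prev_row = queue.popleft()
        if node.entity is not None and prev_row[-1] <= max_dist:
            matches.append((node.entity, prev_row[-1]))
        for ch, child in node.children.items():
            # Extend the DP table by one row for the trie character ch.
            row = [prev_row[0] + 1]
            for j in range(1, len(word) + 1):
                cost = 0 if word[j - 1] == ch else 1
                row.append(min(row[j - 1] + 1,          # insertion
                               prev_row[j] + 1,         # deletion
                               prev_row[j - 1] + cost)) # match or substitution
            # Row minima never decrease with depth, so this pruning is safe.
            if min(row) <= max_dist:
                queue.append((child, row))
    return sorted(matches, key=lambda m: (m[1], m[0]))


# Illustrative, transliterated entity list (not data from the paper).
root = build_trie(["dhaka", "khulna", "sylhet"])
print(fuzzy_lookup(root, "dhakka"))  # misspelled token
```

Because Python strings are sequences of Unicode code points, the same sketch accepts Bengali script directly, although combining vowel signs then count as separate characters when edits are measured.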

References

D. Nadeau, and S. Sekine, “A survey of named entity recognition and classification,” Lingvisticae Investigationes, vol. 30, no. 1, 2007, pp. 3–26.
E. F. Tjong Kim Sang, and F. De Meulder, “Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition,” in Proceedings of the Seventh Conference on Natural Language Learning, Association for Computational Linguistics, 2003.
E. F. Tjong Kim Sang, and F. De Meulder, “Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition,” in Proceedings of the 6th Conference on Natural Language Learning, Aug 2002, pp. 1–4.
R. Grishman, and B. Sundheim, “Message understanding conference-6: A brief history,” in Proceedings of the 16th International Conference on Computational Linguistics, 1996.
B. B. Chaudhuri, and S. Bhattacharya, “An Experiment on Automatic Detection of Named Entities in Bangla,” IJCNLP, pp.75–82, 2008.
L. F. Rau, “Extracting company names from text,” in Proceedings of the 7th IEEE Conference on Artificial Intelligence Applications, IEEE, Feb 1991, pp. 29–32.
S. Sekine, and H. Isahara, “IREX: IR & IE Evaluation Project in Japanese,” in LREC, 2000, pp. 1977–1980.
E. F. Tjong Kim Sang, and F. De Meulder, “Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition,” in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, vol. 4, Association for Computational Linguistics, 2003.
G. R. Doddington, A. Mitchell, M. A. Przybocki, L. A. Ramshaw, S. Strassel, and R. M. Weischedel, “The Automatic Content Extraction (ACE) Program-Tasks, Data, and Evaluation,” In LREC, vol. 2, pp. 837–840, 2004.
D. Santos, N. Seco, N. Cardoso, and R. Vilela, “HAREM: An advanced NER evaluation contest for Portuguese,” in N. Calzolari, K. Choukri, A. Gangemi, B. Maegaard, J. Mariani, J. Odijk, and D. Tapias (eds.), Proceedings of the 5th International Conference on Language Resources and Evaluation, May 2006.
D. Maynard, V. Tablan, C. Ursu, H. Cunningham, and Y. Wilks, “Named entity recognition from diverse text types,” in Recent Advances in Natural Language Processing 2001 Conference, 2001, pp. 257–274.
E. Minkov, R. C. Wang, and W. W. Cohen, “Extracting personal names from email: Applying named entity recognition to informal text,” in Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Oct 2005, pp. 443–450.
T. Poibeau, and L. Kosseim, “Proper name extraction from non-journalistic texts,” Language and Computers, vol. 37, no. 1, pp. 144–157, 2001.
M. Asahara, and Y. Matsumoto, “Japanese named entity extraction with redundant morphological analysis,” in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Association for Computational Linguistics, May 2003, pp. 8–15.
A. McCallum, and W. Li, “Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons,” in Proceedings of the seventh conference on Natural language learning, Association for Computational Linguistics, May 2003, vol. 4, pp. 188–191.
G. Zhou, and J. Su, “Named entity recognition using an HMM-based chunk tagger,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, Jul 2002, pp. 473–480.
R. C. Bunescu, and M. Pasca, “Using Encyclopedic Knowledge for Named Entity Disambiguation,” EACL, vol. 6, pp. 9–16, 2006.
Y. Shinyama, and S. Sekine, “Named entity discovery using comparable news articles,” in Proceedings of the 20th International Conference on Computational Linguistics. Association for Computational Linguistics, Aug 2004.
P. Selvaperumal, and A. Suruliandi, “Semi-Supervised Personal Name Disambiguation Technique for the Web,” International Journal of Modern Education and Computer Science(IJMECS), vol. 8, no. 3, pp. 28–36, Mar 2016.
C. N. Santos, and V. Guimaraes, “Boosting named entity recognition with neural character embeddings,” arXiv preprint arXiv:1505.05008 (2015).
J. P. Chiu, and E. Nichols, “Named entity recognition with bidirectional LSTM-CNNs,” arXiv preprint arXiv:1511.08308 (2015).
G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, “Neural architectures for named entity recognition,” arXiv preprint arXiv:1603.01360 (2016).
Z. Yang, R. Salakhutdinov, and W. Cohen, “Multi-task cross-lingual sequence tagging from scratch,” arXiv preprint arXiv:1603.06270 (2016).
X. Ma, and E. Hovy, “End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF,” arXiv preprint arXiv:1603.01354 (2016).
M. Al-Yahya, M. Al-Shaman, N. Al-Otaiby, W. Al-Sultan, A. Al-Zahrani, and M. Al-Dalbahie, “Ontology-Based Semantic Annotation of Arabic Language Text,” IJMECS, vol. 7, no. 7, pp. 53–59, 2015.
S. Kale, and S. Govilkar, “Survey of Named Entity Recognition Techniques for Various Indian Regional Languages,” International Journal of Computer Applications, vol. 164, no. 4, 2017.
M. S. Islam, and J. K. Das, “Design Analysis Rules to Identify Proper Noun from Bengali Sentence for Universal Networking Language,” IJMECS, vol. 6, no. 8, pp. 1–9, 2014.
K. M. Risvik, “Search system and method for retrieval of data, and the use thereof in a search engine,” United States Patent 6377945 B1, Apr 23, 2002.
H. Shang, and T. H. Merrett, “Tries for approximate string matching,” IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 4, pp. 540–547.
B. J. Oommen, and G. Badr, Pattern Analysis and Applications, vol. 10, p. 1, 2007. https://doi.org/10.1007/s10044-006-0032-z


Research article