Title: Language Identification, Named Entity Recognition and Word Sense Disambiguation
Authors: Bhatnagar, A.
Keywords: Computer Science & Engineering
Issue Date: 2015
Abstract: In natural language processing, language identification (or language guessing) is the problem of determining the natural language of a given piece of content. Computational approaches usually treat it as a special case of text categorization, solved with statistical methods. Language identification is a well-studied problem, but mostly in its canonical text classification formulation. Here the task is instead to identify the language of individual words within a document that contains multiple languages. We focus on two languages, English and Hindi, and label all remaining words as Others. The motivation for studying this problem stems from issues encountered while trying to build language resources for minority languages. We found that most web pages containing text in a minority language also tend to contain text in other languages, so a system was needed that can automatically detect which word belongs to which language. In this report, we explore techniques for identifying languages at the word level in a multilingual document, with the labels English, Hindi and Others. Our results show that we can do much better than classifying each word independently, because the context of a word provides clues. Words of a given language are very frequently surrounded by words of the same language, and many texts and web pages follow recognizable patterns marked by the presence of certain punctuation marks or even certain words. To evaluate our idea, we manually collected a corpus of over 19000 words of bilingual (mostly non-parallel) text from the web and social networking sites.
After testing many weakly supervised learning methods, we found that a conditional random field model trained with generalized expectation criteria was the most accurate, and it performed consistently as the amount of data was varied.
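The role of context described in the abstract can be illustrated with a small feature-extraction sketch. This is not the author's implementation: the function names `word_features` and `sentence_features`, the feature names, and the example sentence are all hypothetical, and the sketch only builds per-word features that include neighbouring words. It does not train a conditional random field or apply generalized expectation criteria.

```python
from typing import Dict, List


def word_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build features for token i (hypothetical feature set, for illustration).

    The features combine clues from the word itself with clues from its
    neighbours, since words tend to be surrounded by words of the same language.
    """
    word = tokens[i]
    feats: Dict[str, object] = {
        "lower": word.lower(),
        "suffix3": word[-3:].lower(),
        # Devanagari Unicode block: U+0900 to U+097F.
        "has_devanagari": any("\u0900" <= ch <= "\u097f" for ch in word),
        # Tokens with no letters or digits are treated as punctuation.
        "is_punct": not any(ch.isalnum() for ch in word),
    }
    # Context clues: the previous and next tokens, or boundary markers.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of sequence
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True  # end of sequence
    return feats


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Return one feature dictionary per token, in order."""
    return [word_features(tokens, i) for i in range(len(tokens))]


# Assumed example: a romanized Hindi sentence with one English word.
example = ["mujhe", "yeh", "movie", "pasand", "hai", "!"]
features = sentence_features(example)
```

A sequence model would consume one such list per sentence, paired with a label sequence over English, Hindi and Others. An independent word classifier would see only the features of the word itself, which is the baseline the abstract reports improving on.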
Appears in Collections: 01. CSE
