INVITED TALK: Named Entity Transliteration in a Variety of Scripts

Richard Sproat (University of Illinois at Urbana-Champaign)

In this talk I will review our work on unsupervised and lightly supervised approaches to automatic transliteration of names for a variety of languages, including English, Chinese, Arabic, Hindi, Korean and Russian. Our approach relies on a combination of several methods: estimates of phonetic similarity between strings; correlation of the distribution of terms over time in comparable corpora; and co-occurrence in documents across comparable corpora. I will discuss results reported at the recent EMNLP (2006) and ACL (2006/2007) conferences, as well as promising ongoing projects. Throughout the talk I will highlight the particular problems that Arabic-based scripts pose for this kind of research.


DAY 1 TALKS

Urdu morphology, orthography and lexicon extraction

Muhammad Humayoun, Harald Hammarström and Aarne Ranta

Urdu is a challenging language because of, first, its Perso-Arabic script and, second, its morphological system, which combines grammatical forms and vocabulary of Arabic, Persian and the native languages of South Asia. This paper describes an implementation of the Urdu language as a software API, covering orthography, morphology and the extraction of the lexicon. The morphology is implemented in a toolkit called Functional Morphology (Forsberg & Ranta, 2004), which is based on the idea of treating grammars as software libraries. This implementation can therefore be reused in applications such as intelligent keyword search, language training and infrastructure for syntax. We also present an implementation of a small part of Urdu syntax to demonstrate this reusability.

Algorithm for subject zero pronoun detection and restoration in Urdu discourse

Abid Khan, Aamir Khan, and Naveed Ali

Zero pronouns are an optional feature of the Urdu language. They occur naturally in both dialogue and monologue. Their occurrence is supported by the rich agreement system of the language. Finding and restoring these pronouns is important in natural language processing (NLP) applications such as machine translation. This paper presents an algorithm for detecting and restoring personal zero pronouns in subject position. The algorithm uses grammatical constraints and centering to restore zero pronouns.

Information Retrieval and the Arabic Noun Construct

Ali Farghaly

An analysis of query terms on the most popular information retrieval platforms reveals that over 85% of query terms are noun phrases. An analysis of 4,000 randomly selected queries on Oracle's Secure Enterprise Search (Krihnamurphy et al, 2007) shows that over 95% of query terms are nominal. Research in Computational Linguistics in recent years has focused on detecting mentions (Grishman and Sundheim, 1996; ACE, 2004, 2007; Mikheev et al, 1999; NIST, 2004; Florian et al, 2004; Zitouni, 2007) and identifying the entity that each mention belongs to, i.e. recognizing the real-life entity that each mention refers to. Identifying entities such as a person, a company, a geopolitical entity, a place, a date, etc. is a crucial step towards the "understanding" of texts. Most entities are by definition noun phrases. However, little work has been done on uncovering the complexities of noun phrase internal structure. This paper focuses on the analysis of a particular type of noun phrase, namely the Noun Construct. An entity recognition task must tag a noun construct such as "The ambassador of the government" as of type PERSON while recognizing "The government of the ambassador" as of type GEOPO. The distinction between the two in Arabic script-based languages is extremely subtle because of the absence of an overt preposition such as "of" to mark the relationship between the nouns involved. This paper is divided into three parts and a conclusion. In part 1, we deal with Entity Extraction in Information Retrieval. This is followed in part 2 by an analysis of the noun construct in Arabic. In part 3, we show the relevance of our analysis to the task of Entity Extraction.

Using OWA fuzzy operator to merge retrieval systems

Hadi Amiri, Farhad Oroumchian, Caro Lucas, and Masoud Rahgozar

With the rapid growth of information sources, it is essential to develop methods that retrieve the most relevant information according to the user's requirements. One way of improving the quality of retrieval is to use more than one retrieval engine, merge the retrieved results and show a single ranked list to the user. Some studies suggest that combining the results of multiple search engines improves ranking when these engines are treated as independent experts. In this study, we investigated the performance of Persian retrieval by merging four different language modeling methods and two vector space models with Lnu.ltu and Lnc.btc weighting schemes. The experiments were conducted on a large Persian collection of news archives called Hamshari Collection. Different variations of the Ordered Weighted Average (OWA) fuzzy operator, namely a quantifier-based OWA operator and a degree-of-importance-based OWA operator, were tested for merging the results. Our experimental results show that the OWA operators produce better precision and ranking than the weaker retrieval methods, but only minimal improvements over the stronger retrieval models.
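
As a rough illustration of the merging idea, a quantifier-based OWA operator can be sketched as follows. This is a minimal sketch, not the authors' implementation: the quantifier Q(r) = r**alpha, the function names and the assumption that each engine's scores are already normalised to [0, 1] are all choices made for this example.

```python
def quantifier_weights(n, alpha=2.0):
    """OWA weights derived from the quantifier Q(r) = r**alpha (an assumed choice):
    w_i = Q(i/n) - Q((i-1)/n), which always sums to 1."""
    return [(i / n) ** alpha - ((i - 1) / n) ** alpha for i in range(1, n + 1)]


def owa(scores, weights):
    """Ordered weighted average: the weights are applied to the scores
    after sorting them in descending order, not to particular engines."""
    ordered = sorted(scores, reverse=True)
    return sum(w * s for w, s in zip(weights, ordered))


def merge_rankings(engine_scores, alpha=2.0):
    """engine_scores maps a document id to a list of normalised scores,
    one per retrieval engine. Returns document ids, best first."""
    merged = {
        doc: owa(scores, quantifier_weights(len(scores), alpha))
        for doc, scores in engine_scores.items()
    }
    return sorted(merged, key=merged.get, reverse=True)


# Hypothetical scores from two engines for two documents.
scores = {"d1": [0.9, 0.1], "d2": [0.5, 0.5]}
strict = merge_rankings(scores, alpha=2.0)   # favours agreement between engines
lenient = merge_rankings(scores, alpha=0.5)  # favours a single high score
```

With alpha above 1 most of the weight falls on the lower scores, so a document needs support from most engines to rank highly; with alpha below 1 a single strong score is enough. With alpha equal to 1 the operator reduces to the plain average.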

A note on extracting 'sentiments' in financial news in English, Arabic and Urdu

Yousif Almas and Khurshid Ahmad

The growth of specialist texts in languages used by hundreds of millions of people, say, in the Middle East (Arabic) and in Pakistan (Urdu), provides benchmarks for Anglo-centric information extraction and polarity analysis systems. Public sentiment, openly expressed or surreptitiously recorded, is being analysed by applying methods and techniques of computational linguistics to documents comprising everyday or general language, especially telephone conversations, news reports and so on. It is important to evaluate whether studies of the use of language in special registers, say financial trading, may benefit sentiment analysis: sentiment, or consumer/institutional confidence, is measured through opinion polls on the basis that both parties occasionally indulge in exuberant behaviour and make decisions that are irrational or have bounded rationality. We show that the notion of the lexical signature of a specialism, together with the existence of local grammars within a specialist register, rooted in collocational patterns, helps in identifying and deploying the signature and patterns for detecting sentiment that is openly expressed in news. Our study originates in the analysis of (British) English financial texts (c. 3.5 million tokens), and the sentiment-bearing patterns found in these texts can help to extract sentiment-bearing patterns in Arabic (c. 3.5 million) and in Urdu (c. 1 million).

The first parallel multilingual corpus of Persian: Towards a Persian BLARK

Behrang Qasemizadeh, Saeed Rahimi and Behrooz Mahmoodi Bakhtiari

In this article, we introduce the first parallel corpus of Persian with more than 10 European languages. This article describes the primary steps toward preparing a Basic Language Resources Kit (BLARK) for Persian. So far, we have proposed a morphosyntactic specification of Persian based on the EAGLE/MULTEXT guidelines and specific resources of MULTEXT-East. The article introduces the Persian language, with emphasis on its orthography and morphosyntactic features; a new Part-of-Speech categorization and orthography for Persian in digital environments is then proposed. Finally, the corpus and related statistics are analyzed.

Supervised lexical acquisition for Persian from a web corpus

Nick Pendar and Serge Sharoff

This paper reports on the compilation of a large Persian Web corpus and the cyclic supervised development of a lexicon and lemmatizer. We discuss the strategies adopted in compiling the corpus as well as some of the challenges in processing and tokenizing it. We also present the word patterns developed for the lemmatizer and the algorithms designed for the supervised lexical acquisition.


DAY 2 TALKS

The challenges and pitfalls of Arabic romanization and arabization

Jack Halpern

The high level of ambiguity of the Arabic script poses special challenges to developers of NLP tools in areas such as morphological analysis, named entity extraction and machine translation. These difficulties are exacerbated by the lack of comprehensive lexical resources, such as proper noun databases, and the multiplicity of ambiguous transcription schemes. This paper focuses on some of the linguistic issues encountered in two subdisciplines that play an increasingly important role in Arabic information processing: the romanization of Arabic names and the arabization of non-Arabic names. The basic premise is that linguistic knowledge in the form of linguistic rules is essential for achieving high accuracy.

Transcription of names written in Farsi into English

Joshua Johanson

This paper details a method for transcribing names written in Farsi into English using a combination of dictionary lookup, orthographical and phonetic comparison and a simple transcriber. It takes into account variability in encoding, diacritic usage and common affixes.

Automatic transliteration of proper nouns from Arabic to English

Mehdi M. Kashani, Fred Popowich, and Anoop Sarkar

This paper proposes a novel spelling-based method for the automatic transliteration of proper nouns from Arabic to English which exploits various types of letter-based alignments. The approach consists of three phases: the first phase uses single letter alignments, the second phase uses alignments over groups of letters to deal with diacritics and missing vowels in the English output, and the third phase exploits various knowledge sources to repair any remaining errors. Our results show a top-20 accuracy rate of 88% and 86% on development and blind test sets respectively.

Implementation of reverse chain mechanism in Pango for rendering Nastaliq script

Aamir Wali and Safiqu-ru Rahman

This paper implements OpenType's reverse chaining table in Pango to enable complex Arabic scripts such as Nastaleeq to be rendered correctly in Linux. The OpenType formalism provides a number of glyph substitution tables to handle all forms of glyph switching, including the Type 8 reverse chaining table. This table is useful for modeling Arabic scripts. Although these scripts are written from right to left, each letter depends contextually on the letter to its left (i.e. on the following letter), resulting in a context dependency that runs from left to right. This behavior can best be modeled by the reverse chaining table. This table was implemented in Pango and has been incorporated into freeware online Pango libraries. Implementing this table has enabled Urdu script to be modeled according to its natural way of writing.
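
The effect of processing a glyph run in reverse can be illustrated with a toy model. This is a sketch under stated assumptions, not Pango code and not the OpenType table format: the glyph names and the rule dictionary are hypothetical, and a real Type 8 lookup uses coverage tables with backtrack and lookahead sequences rather than simple pairs.

```python
def reverse_chain_substitute(glyphs, rules):
    """Apply single-glyph contextual substitutions from the last glyph to the first.

    glyphs is a list of glyph names in logical (typing) order.
    rules maps (glyph, following_glyph) to a replacement glyph.
    Because the run is walked backwards, each glyph is matched against
    the already substituted form of the glyph that follows it.
    """
    out = list(glyphs)
    for i in range(len(out) - 2, -1, -1):
        key = (out[i], out[i + 1])
        if key in rules:
            out[i] = rules[key]
    return out


# Hypothetical rules: the shape of each letter depends on the final shape
# of the letter that follows it.
rules = {
    ("beh", "jeem.fina"): "beh.high",
    ("lam", "beh.high"): "lam.high",
}
shaped = reverse_chain_substitute(["lam", "beh", "jeem.fina"], rules)
```

In this example the second rule only fires because "beh" has already become "beh.high" by the time "lam" is examined. A single forward pass over the same run would look up ("lam", "beh"), find no rule, and leave "lam" unchanged.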

Frame approach to Persian verb generation for educational purposes

Artem Lukanin and Constance Bobroff

This paper describes an on-line Persian Verb Conjugator using a frame-based system. The rules created allow the generation of Persian verb forms in conjunction with a Past-Present stem lexicon to handle exceptions. This program conjugates prefixed and compound verbs as well as independent verbs and creates full paradigms of positive and negative verb forms. In addition, the program is used to generate interactive multiple-choice tests for self-practice.
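
The role of a past-present stem lexicon can be shown with a small sketch. It is only an illustration of stem-plus-ending generation, not the frame-based system described above: the romanized spelling, the single lexicon entry and the function name are assumptions made for this example, and only the simple past and present indicative are covered.

```python
# Personal endings in romanized form, ordered 1sg, 2sg, 3sg, 1pl, 2pl, 3pl.
ENDINGS_PAST = ["am", "i", "", "im", "id", "and"]
ENDINGS_PRESENT = ["am", "i", "ad", "im", "id", "and"]

# Hypothetical lexicon entry: infinitive -> (past stem, present stem).
# The present stem cannot be derived from the infinitive by rule here,
# which is why it is stored rather than computed.
STEMS = {"raftan": ("raft", "rav")}


def conjugate(infinitive, tense, negative=False):
    """Return the six person/number forms for the given tense."""
    past_stem, present_stem = STEMS[infinitive]
    if tense == "past":
        prefix = "na" if negative else ""
        return [prefix + past_stem + ending for ending in ENDINGS_PAST]
    if tense == "present":
        prefix = "nemi" if negative else "mi"
        return [prefix + present_stem + ending for ending in ENDINGS_PRESENT]
    raise ValueError("unsupported tense: " + tense)


past_forms = conjugate("raftan", "past")
negative_present = conjugate("raftan", "present", negative=True)
```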

A Persian morphological parser using POS tagging

Ali Azimizadeh and Mohamad Mehdi Arab

This paper studies a Persian morphological parser designed using a Finite State Transducer (FST). This FST is used in some parts of a Persian Text-to-Speech (TTS) system called ParsGooyan, such as text tokenization, lexicon and homograph disambiguation. The main features used in this stemmer are the words' orthography, the length of words without their affixes, the Part of Speech (POS) of the words without their affixes, and the POS of the words in their contexts. The last two features are obtained from a POS tagger based on Hidden Markov Models (HMM). The parser handles both derived and inflected words. The accuracy of the final system is 83.51%.

Statistical POS tagging experiments on Persian text

Fahimeh Raja, Samira Tasharofi and Farhad Oroumchian

Part-Of-Speech (POS) tagging is the process of marking up the words in a text with their corresponding parts of speech. It is an essential part of text and natural language processing. There are many models and software packages for POS tagging in English and other European languages. Little work has been done on POS tagging of Persian, which is written in the Arabic script. In these experiments we want to see how effective it would be simply to apply a POS tagger developed for a language such as English to Persian. Although English and Persian are both Indo-European languages, they have subtle differences. This paper presents the creation of a POS-tagged corpus for evaluation purposes and the evaluation of a statistical tagging method on Persian text. The results show that an overall tagging accuracy between 96.4% and 96.9% is achievable without adding any Persian linguistic knowledge to the tagging process. In this study we also looked at the effect of the size of the training and test corpora on the accuracy of POS tagging.

Part-of-speech tagging for Persian

Sanaz Jabbari and Ben Allison

In this paper we present our work on part-of-speech tagging of Persian. The work demonstrates how a morphologically complex language can be tagged with a large, complex tagset (over 800 combinatorial possibilities) with very high accuracy, and with a relative improvement over baseline performance that is substantially higher than for English. Persian occupies a unique position among Indo-European languages because of its many commonalities with Arabic, yet it is a language which has received little attention in the NLP literature. This study shows how a method previously applied to English can be used for Persian. We use an implementation (Hepple, 2000) of Error-Driven Transformation-Based Learning (Brill, 1992), and learn tagging rules for very coarse part-of-speech categories, and subsequently for a full, complex tagset. We show the performance of this tagger to be significantly better than previous work on part-of-speech tagging of Persian.

Evaluation of part of speech tagging on Persian text

Fahimeh Raja, Hadi Amiri, Samira Tasharofi, Hossein Hojjat and Farhad Oroumchian

One of the fundamental tasks in natural language processing is part of speech (POS) tagging. A POS tagger is a piece of software that reads text in some language and assigns a part of speech tag to each of the words. Our main interest in this research was to see how easy it is to apply methods used for a language such as English to a new and different language such as Persian (Farsi), and how well such approaches perform. This paper presents an evaluation of several part of speech tagging methods on Persian text: a statistical tagging method, a memory-based tagging approach and two different versions of Maximum Likelihood Estimation (MLE) tagging. The two MLE versions differ in the way they handle unknown words. We also demonstrate the value of simple heuristics and post-processing in improving the accuracy of these methods. The experiments were conducted on a manually POS-tagged Persian corpus with over two million tagged words. The results of the experiments are encouraging and comparable with those for other languages such as English, German or Spanish.
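
The basic MLE tagging idea can be sketched in a few lines: each known word receives the tag it carried most often in the training data. The abstract does not say how the two MLE versions treat unknown words, so the fallback used here (the most frequent tag overall) is an assumption, as are the function names and the toy romanized data.

```python
from collections import Counter, defaultdict


def train_mle(tagged_sentences):
    """Build a word -> most frequent tag lexicon from (word, tag) sentences.

    Also returns the most frequent tag overall, used here as an assumed
    fallback for words never seen in training.
    """
    word_tags = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_tags[word][tag] += 1
            tag_counts[tag] += 1
    lexicon = {w: counts.most_common(1)[0][0] for w, counts in word_tags.items()}
    default_tag = tag_counts.most_common(1)[0][0]
    return lexicon, default_tag


def tag(words, lexicon, default_tag):
    """Tag each word with its lexicon entry, or the default tag if unknown."""
    return [(w, lexicon.get(w, default_tag)) for w in words]


# Toy training data (hypothetical, romanized).
training = [
    [("ketab", "N"), ("khub", "ADJ")],
    [("ketab", "N"), ("ast", "V")],
    [("mard", "N")],
]
lexicon, default_tag = train_mle(training)
result = tag(["ketab", "ast", "xyz"], lexicon, default_tag)
```

A different unknown-word strategy, for example guessing the tag from a suffix, would only change the fallback in `tag`; the lexicon itself stays the same.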

Generating Arabic text from Interlingua

Khaled Shaalan, Ahmed Rafea, Azza Abdelmonem, and Hoda Baraka

In this paper, we describe a grammar-based generation approach for task-oriented interlingua-based spoken dialogue that transforms a shallow semantic interlingua representation called Interchange Format (IF) into Arabic text that corresponds to the intentions underlying the speakers' utterances. The generation approach is developed primarily within the framework of the NESPOLE! (NEgotiating through SPOken Language in E-commerce) multilingual speech-to-speech MT project. The IF-to-Arabic generator is implemented in SICStus Prolog. We conducted an evaluation experiment using the output from the English analyzer provided by Carnegie Mellon University (CMU). The results of this experiment were promising and confirmed the ability of the generation approach to generate Arabic text from the interlingua taken from the travel and tourism domain.

A rule-based semantic role labeling approach for Persian sentences

Mahrnoush Shamsfard and Maryam Sadrmousavi

Semantic role labeling (SRL) refers to finding the semantic relations between a predicate and syntactic constituents in a sentence. In this paper we present a rule-based semantic role labeling system for Persian sentences. The system uses a two-phase architecture to (1) identify the arguments and (2) label them for each predicate. For the first phase we developed a rule-based shallow parser to chunk Persian sentences, and for the second phase we developed a knowledge-based system to assign 16 selected thematic roles to the chunks. The experimental results of testing each phase are shown at the end of the paper.


DEMONSTRATIONS AND POSTERS

FieldWorks language explorer and Arabic script data

Beth Bryson

SIL International's FieldWorks Language Explorer (FLEx) is a powerful tool for linguistic exploration. It was designed from the ground up to handle multiple writing systems and complex scripts. I will demonstrate its various features with Urdu data to highlight its abilities to handle Arabic script for linguistic processing. The combination of processing features that have not been available before with the ability to do any of its linguistic analysis features with right-to-left Unicode data sets it apart from other linguistic analysis tools.

The Koran database

Mahmood Elsayess

The Read~Verse website has unique and powerful search engines developed by Read~Verse specialists that permit quick and easy searching of the entire Koran (Quran) in Arabic script by 1 word, 2 words, topic, and sura & verse. Users can also search in English by topic and sura & verse. Visitors can access over 17,000 Arabic words and their Arabic word roots. Read~Verse Internet-based applications display the Koran text in Arabic script along with 4 different English translations by 4 different authors.

Human vision inspired Optical Character Recognition

Mandana Hamidi, Ali Borji and Fariborz Mahmoudi

Humans are very good at recognizing alphanumeric characters. They show much greater accuracy, robustness and speed than the best engineering methods yet known for this task. This paper studies an approach to optical character recognition inspired by the visual nervous system, in the hope of approaching its performance to a limited extent. In particular, we investigate the application of features motivated by the hierarchical structure of the visual ventral stream to the recognition of Persian (Arabic) handwritten digits. A set of position- and scale-invariant edge detectors is combined over neighboring positions and multiple orientations to build more complex features. The extracted features are then used to train and test a classifier for recognition of handwritten digits. We examined three types of classifiers, ANN, SVM and kNN, to show that the features do not depend on a specific classifier, which is evidence of the strength of these features for our purpose. Evaluation of this method on a standard Persian handwritten digits dataset shows a high recognition rate of 99.63%.

Pashto-English machine translation using TranSphere

Craig Kopris

Several factors impede the development of machine translation for the Pashto language, including encoding, standardization, and the grammar of the language itself. The rule-based TranSphere system, based on Lexical Functional Grammar, has been extended through additions from Role and Reference Grammar to better handle these problems. This demo aims to show that despite the difficulties, it is possible to obtain useful translational capabilities.

Speech-Translation of languages with scarce resources

Hassan Sawaf and Craig Kopris

In this paper, we introduce a hybrid machine translation approach. We demonstrate the approach in machine translation applications in different environments and language pairs: Pashto into English for newswire text, and Iraqi Arabic into English for spoken utterances. We compare the quality to that of a state-of-the-art statistical approach trained on the same set of data.

Extensible integrated Treebank annotation environment

Otakar Smrž

The proposed demo would aim to present the technology behind the Prague Arabic Dependency Treebank (Hajič et al., 2004), a project of linguistic annotation having application in many areas of Natural Language Processing.
