INVITED TALK: Named Entity Transliteration in a Variety of Scripts
Richard Sproat (University of Illinois at Urbana-Champaign)
In this talk I will review our work on unsupervised and lightly supervised
approaches to automatic transliteration of names, for a variety of languages,
including English, Chinese, Arabic, Hindi, Korean and Russian. Our approach
relies on a combination of several methods: estimates of phonetic similarity
between strings; correlation of distribution of the terms over time in
comparable corpora; and cooccurrence in documents across comparable corpora.
I will discuss results reported recently at EMNLP (2006) and ACL
(2006/2007), as well as promising ongoing projects. Throughout the talk I
will highlight the particular problems that Arabic-based scripts pose
for this kind of research.
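As an illustration of the temporal-correlation idea mentioned above, the following is a minimal sketch, not the method used in the talk. It assumes that each term has already been reduced to a list of per-period frequency counts over aligned time periods in the two comparable corpora; the function names and the toy data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length frequency series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a flat series carries no temporal signal
    return cov / (sx * sy)

def rank_candidates(source_series, candidates):
    """Rank candidate target-language terms by temporal correlation."""
    scored = [(name, pearson(source_series, series))
              for name, series in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical counts of a source name over five time periods,
# and of three candidate transliterations in the other corpus.
source = [0, 5, 1, 0, 7]
candidates = {
    "cand_a": [0, 10, 2, 0, 14],  # rises and falls with the source
    "cand_b": [3, 3, 3, 3, 3],    # constant
    "cand_c": [7, 0, 1, 5, 0],    # peaks when the source is absent
}
ranked = rank_candidates(source, candidates)
```

In practice such a score would be only one signal, combined with phonetic similarity and document co-occurrence as the abstract describes.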
Muhammad Humayoun, Harald Hammarström and Aarne Ranta
Urdu is a challenging language because
of, first, its Perso-Arabic script and
second, its morphological system having
inherent grammatical forms and
vocabulary of Arabic, Persian and the
native languages of South Asia. This
paper describes an implementation of the
Urdu language as a software API, and we
deal with orthography, morphology and
the extraction of the lexicon. The
morphology is implemented in a toolkit
called Functional Morphology (Forsberg
& Ranta, 2004), which is based on the
idea of treating grammars as software
libraries. Therefore this implementation
could be reused in applications such as
intelligent search of keywords, language
training and infrastructure for syntax. We
also present an implementation of a small
part of Urdu syntax to demonstrate this
reusability.
Abid Khan, Aamir Khan, and Naveed Ali
Zero pronouns are an optional feature in
the Urdu language. They are naturally found
in both dialogue and monologue. The
occurrence of zero pronouns is supported
by the rich agreement system of the language.
Finding and restoring these pronouns is of
importance in natural language processing
(NLP) applications such as machine
translation. This paper presents an
algorithm for detecting and restoring
personal zero pronouns at subject position.
The algorithm uses grammatical
constraints and centering for restoring
zero pronouns.
Ali Farghaly
An analysis of query terms on the most popular information retrieval platforms reveals that over 85% of query terms are noun phrases. An analysis of 4000 randomly selected queries on Oracle's Secure Enterprise Search (Krihnamurphy et al, 2007) shows that over 95% of query terms are nominal. Research in Computational Linguistics in recent years has focused on detecting mentions (Grishman and Sundheim, 1996; ACE, 2004, 2007; Mikheev et al, 1999; NIST, 2004; Florian et al, 2004; Zitouni, 2007) and identifying the entities that each mention belongs to, i.e. recognizing the real-life entities that each mention refers to. Identifying entities such as a person, a company, a geopolitical entity, a place, a date, etc. is a crucial step towards the "understanding" of texts. Most entities are by definition noun phrases. However, little work has been done on uncovering the complexities of noun phrase internal structure. This paper focuses on the analysis of a particular type of noun phrase, namely the Noun Construct. An entity recognition task must tag a noun construct such as "The ambassador of the government" as of type PERSON while recognizing "The government of the ambassador" as of type GEOPO. The distinction between the two in Arabic script-based languages is extremely subtle because of the absence of an overt preposition such as "of" to mark the relationship between the nouns involved. This paper is divided into 3 parts and a conclusion. In part 1, we deal with Entity Extraction in Information Retrieval. This is followed in part 2 by an analysis of the noun construct in Arabic. In part 3, we show the relevance of our analysis to the task of Entity Extraction.
Hadi Amiri, Farhad Oroumchian, Caro Lucas, and Masoud Rahgozar
With the rapid growth of information sources,
it is essential to develop methods that retrieve
the most relevant information according
to the user requirements. One way of improving
the quality of retrieval is to use
more than one retrieval engine and then
merge the retrieved results and show a single
ranked list to the user. There are studies
that suggest combining the results of multiple
search engines will improve ranking
when these engines are treated as independent
experts. In this study, we investigated
performance of Persian retrieval by merging
four different language modeling methods
and two vector space models with
Lnu.ltu and Lnc.btc weighting schemes.
The experiments were conducted on a large
Persian collection of news archives called
Hamshahri Collection. Different variations
of the Ordered Weighted Average (OWA)
fuzzy operator, namely a quantifier-based
OWA operator and a degree-of-importance-based
OWA operator, have been tested
for merging the results.
Our experimental results show that the
OWA operators produce better precision
and ranking in comparison with weaker retrieval
methods. But in comparison with
stronger retrieval models they only produce
minimal improvements.
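For readers unfamiliar with OWA fusion, the following is a minimal sketch of a quantifier-based OWA operator applied to result merging. It is not the authors' implementation: the quantifier Q(r) = r^alpha, the assumption that each engine's scores are already normalized to [0, 1], the treatment of unretrieved documents as score 0, and all function names are illustrative assumptions.

```python
def quantifier_weights(n, alpha):
    """OWA weights w_i = Q(i/n) - Q((i-1)/n) for the assumed quantifier Q(r) = r**alpha."""
    q = lambda r: r ** alpha
    return [q(i / n) - q((i - 1) / n) for i in range(1, n + 1)]

def owa(scores, weights):
    """Ordered weighted average: weights apply to scores sorted in descending order."""
    ordered = sorted(scores, reverse=True)
    return sum(w * s for w, s in zip(weights, ordered))

def merge(runs, alpha=1.0):
    """Fuse several runs (dicts: doc_id -> normalized score) into one ranked list."""
    docs = set().union(*runs)
    weights = quantifier_weights(len(runs), alpha)
    # A document missing from a run is given score 0.0 for that engine (assumption).
    fused = {d: owa([run.get(d, 0.0) for run in runs], weights) for d in docs}
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))

# Two hypothetical engines scoring three documents.
runs = [{"d1": 0.9, "d2": 0.4}, {"d1": 0.5, "d2": 0.6, "d3": 0.7}]
ranking = merge(runs, alpha=1.0)
```

With alpha = 1 the weights are uniform and the operator reduces to the plain average. An alpha above 1 shifts weight toward the lower-ranked scores (closer to "all engines agree"), and an alpha below 1 shifts it toward the highest score (closer to "any engine agrees").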
Yousif Almas and Khurshid Ahmad
The growth of specialist texts in languages
used by hundreds of millions of
people, say, in the Middle East (Arabic)
and in Pakistan (Urdu), provides benchmarks
for the Anglo-centric information
extraction and polarity analysis systems.
Public sentiment, openly expressed or
surreptitiously recorded, is being analysed
by using methods and techniques of
computational linguistics on documents
comprising everyday or general language,
especially telephone conversations,
news reports and so on. It is important
to evaluate whether studies in the
use of language in special registers, say
financial trading, may benefit sentiment
analysis: sentiment analysis or consumer/
institutional confidence is measured
through opinion polls on the basis
that both parties occasionally indulge in
exuberant behaviour and make decisions
that are irrational or have bounded rationality.
We show that the notion of
lexical signature of a specialism together
with the existence of local grammars
within a specialist register, rooted in collocational
patterns, will help in identifying
and deploying the signature and patterns
for detecting sentiment that is
openly expressed in news. Our study
originates in the analysis of (British)
English financial texts (c. 3.5 million tokens)
and the sentiment-bearing patterns
found in these texts can help to extract
sentiment from Arabic (c. 3.5 million tokens)
and Urdu (c. 1 million tokens) texts.
Behrang Qasemizadeh, Saeed Rahimi and Behrooz Mahmoodi Bakhtiari
In this article, we introduce the first
parallel corpus of Persian with more than
10 European languages. This article
describes primary steps toward preparing a
Basic Language Resources Kit (BLARK)
for Persian. Up to now, we have proposed
morphosyntactic specification of Persian
based on EAGLES/MULTEXT guidelines
and specific resources of MULTEXT-East.
The article introduces the Persian language,
with emphasis on its orthography and morphosyntactic
features; then a new part-of-speech
categorization and orthography for
Persian in digital environments is proposed.
Finally, the corpus and related statistics
are analyzed.
Nick Pendar and Serge Sharoff
This paper reports on the compilation
of a large Persian Web corpus and the
cyclic supervised development of a lexicon
and lemmatizer. We discuss the
strategies adopted in compiling the corpus
as well as some of the challenges in
processing and tokenizing it. We also
present the word patterns developed for
the lemmatizer and the algorithms designed
for the supervised lexical acquisition.
Jack Halpern
The high level of ambiguity of the Arabic
script poses special challenges to
developers of NLP tools in areas such as
morphological analysis, named entity
extraction and machine translation.
These difficulties are exacerbated by the
lack of comprehensive lexical resources,
such as proper noun databases, and the
multiplicity of ambiguous transcription
schemes. This paper focuses on some of
the linguistic issues encountered in two
subdisciplines that play an increasingly
important role in Arabic information
processing: the romanization of Arabic
names and the arabization of non-
Arabic names. The basic premise is that
linguistic knowledge in the form of linguistic
rules is essential for achieving
high accuracy.
Joshua Johanson
This paper details a method for
transcribing names written in Farsi into
English using a combination of
dictionary lookup, orthographical and
phonetic comparison and a simple
transcriber. It takes into account
variability in encoding, diacritic usage
and common affixes.
Mehdi M. Kashani, Fred Popowich, and Anoop Sarkar
This paper proposes a novel spelling-based
method for the automatic transliteration
of proper nouns from Arabic to
English which exploits various types of
letter-based alignments. The approach
consists of three phases: the first phase
uses single letter alignments, the second
phase uses alignments over groups of letters
to deal with diacritics and missing
vowels in the English output, and the third
phase exploits various knowledge sources
to repair any remaining errors. Our results
show a top-20 accuracy rate of 88% and
86% on development and blind test sets
respectively.
Aamir Wali and Safiqu-ru Rahman
This paper implements
OpenType's reverse chaining table
in Pango to enable complex Arabic
scripts such as Nastaleeq to be
rendered correctly in Linux. The
OpenType formalism provides a
number of glyph substitution tables
to handle all forms of glyph
switching, including the Type 8
reverse chaining table. This table is
useful for modeling Arabic scripts.
Although these scripts are written
from right to left, when it comes
to context each letter is dependent
on the letter to its left (i.e. on the
following letter), resulting in a
context dependency that drifts from left
to right. This behavior can best be
modeled by the reverse chaining
table. This table was implemented in
Pango and has been incorporated
into freeware online Pango libraries.
Implementation of this table has
enabled the modeling of Urdu script
according to its natural way of
writing.
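The need for reverse processing can be illustrated with a toy sketch. This is not Pango or OpenType code: the glyph names and the rule table are hypothetical, and real Type 8 lookups use coverage tables for backtrack and lookahead context rather than a plain dictionary. The sketch only shows why substituting from the end of the glyph string lets each letter see the final shape of the letter that follows it.

```python
# Hypothetical rules: (glyph, glyph that follows it) -> replacement glyph.
RULES = {
    ("jeem", "yeh.fina"): "jeem.medi_a",
    ("beh", "jeem.medi_a"): "beh.init_b",
}

def reverse_chain(glyphs, rules):
    """Substitute from the last glyph backwards, so each lookup sees
    the already-substituted form of the following glyph."""
    out = list(glyphs)
    for i in range(len(out) - 2, -1, -1):
        key = (out[i], out[i + 1])
        if key in rules:
            out[i] = rules[key]
    return out

def forward_pass(glyphs, rules):
    """Same rules applied first-to-last, for comparison: each lookup
    sees only the unsubstituted form of the following glyph."""
    out = list(glyphs)
    for i in range(len(out) - 1):
        key = (out[i], out[i + 1])
        if key in rules:
            out[i] = rules[key]
    return out

word = ["beh", "jeem", "yeh.fina"]  # logical order, first letter first
shaped = reverse_chain(word, RULES)
naive = forward_pass(word, RULES)
```

In this toy word the reverse pass reshapes both the first and second letters, while the forward pass leaves the first letter unshaped because its rule depends on a form of the second letter that does not exist yet.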
Artem Lukanin and Constance Bobroff
This paper describes an on-line Persian
Verb Conjugator using a frame based
system. Created rules allow the generation
of Persian verb forms in conjunction with
a Past-Present stem lexicon to handle
exceptions. This program conjugates
prefixed and compound verbs as well as
independent verbs and creates full
paradigms of positive and negative verb
forms. In addition, the program is used to
generate interactive multiple-choice tests
for self-practice.
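The role of a past-present stem lexicon can be sketched as follows. This is a simplified illustration in romanized Persian, not the frame-based system described above: it covers only simple (non-prefixed, non-compound) verbs in two tenses, and the lexicon entries, romanization and function name are assumptions made for the example.

```python
# Hypothetical mini-lexicon: infinitive -> (past stem, present stem), romanized.
STEMS = {
    "raftan": ("raft", "rav"),  # to go
    "didan": ("did", "bin"),    # to see (present stem not predictable from the past stem)
}

# Personal endings for 1sg, 2sg, 3sg, 1pl, 2pl, 3pl.
ENDINGS_PRESENT = ["am", "i", "ad", "im", "id", "and"]
ENDINGS_PAST = ["am", "i", "", "im", "id", "and"]

def conjugate(infinitive, tense, negative=False):
    """Return the six person/number forms for a simple verb."""
    past, present = STEMS[infinitive]
    if tense == "present":
        prefix = "nemi" if negative else "mi"
        return [prefix + present + ending for ending in ENDINGS_PRESENT]
    if tense == "past":
        prefix = "na" if negative else ""
        return [prefix + past + ending for ending in ENDINGS_PAST]
    raise ValueError("unsupported tense: " + tense)

present_go = conjugate("raftan", "present")
past_see_neg = conjugate("didan", "past", negative=True)
```

Storing both stems in the lexicon is what lets the same endings and prefixes serve verbs whose present stem cannot be derived from the infinitive.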
Ali Azimizadeh and Mohamad Mehdi Arab
In this paper a Persian morphological parser designed
using a Finite State Transducer (FST) is studied.
This FST is used in several parts of a Persian Text-to-Speech
(TTS) system called ParsGooyan, such as text
tokenization, lexicon and homograph disambiguation.
The main features used in this stemmer are the words' orthography,
length of words without their affixes, the
Part of Speech (POS) of the words without their affixes,
and POS of the words in their contexts. The last two
features are received from a POS tagger that is based on
Hidden Markov Models (HMM). This parser parses
both derivational and inflectional words. The accuracy
of the final system is 83.51%.
Fahimeh Raja, Samira Tasharofi and Farhad Oroumchian
Part-Of-Speech (POS) tagging is the process
of marking up the words in a text with
their corresponding parts of speech. It is an
essential part of text and natural language
processing. There are many models and
software packages for POS tagging in English and
other European languages. Little work has
been done on POS tagging of the Persian language,
which uses the Arabic script for writing.
In these experiments we want to see how
effective it would be simply to apply a POS
tagger from a language such as English to
Persian. Although English and Persian are
both Indo-European languages, they
have subtle differences. This paper presents the
creation of a POS tagged corpus for evaluation
purposes and evaluation of a statistical
tagging method on Persian text. The results
show that an overall tagging accuracy between
96.4% and 96.9% is achievable
without the need to add any Persian linguistic
knowledge to the tagging process.
In this study we also looked at the effect
of the size of training and test corpora on
the accuracy of POS tagging.
Sanaz Jabbari and Ben Allison
In this paper we present our work
on part-of-speech-tagging Persian.
The work demonstrates how a
morphologically complex language
can be tagged with a large, complex
tagset (over 800 combinatorial
possibilities) with very high accuracy,
and a relative improvement
over baseline performance
which is substantially higher than
English. Persian occupies a unique
position among Indo-European
languages, because of its many
commonalities with Arabic, yet it
is a language which has received
little attention in the NLP literature.
This study shows how a method
previously applied to English
can be used for Persian. We use an
implementation (Hepple, 2000) of
Error-Driven Transformation-
Based Learning (Brill, 1992), and
learn tagging rules for very coarse
part-of-speech categories, and
subsequently for a full, complex
tagset. We show the performance
of this tagger to be significantly
better than previous work on part-of-
speech-tagging Persian.
Fahimeh Raja, Hadi Amiri, Samira Tasharofi, Hossein Hojjat and Farhad Oroumchian
One of the fundamental tasks in natural
language processing is part of speech
(POS) tagging. A POS tagger is a piece of
software that reads text in some language
and assigns a part of speech tag to each one
of the words. Our main interest in this research
was to see how easy it is to apply
methods used in a language such as English
to a new and different language such as
Persian (Farsi) and what would be the performance
of such approaches. This paper
presents an evaluation of several part of
speech tagging methods on Persian text.
These are a statistical tagging method, a
memory based tagging approach and two
different versions of Maximum Likelihood
Estimation (MLE) tagging on Persian text.
The two MLE versions differ in the way
they handle the unknown words. We also
demonstrate the value of simple heuristics
and post-processing in improving the accuracy
of these methods. These experiments
have been conducted on a manually
POS-tagged Persian corpus with over
two million tagged words. The results of
the experiments are encouraging and comparable
with those for other languages such as
English, German or Spanish.
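A baseline MLE tagger of the kind mentioned above can be sketched in a few lines. This is an illustration only, not the authors' system: each known word receives its most frequent training tag, and unknown words fall back to the most frequent tag overall, which is just one possible unknown-word strategy and is an assumption here. The romanized toy corpus and tag names are hypothetical.

```python
from collections import Counter, defaultdict

def train_mle(tagged_sentences):
    """Learn the most frequent tag per word and a default tag for unknown words."""
    counts = defaultdict(Counter)
    tag_totals = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
            tag_totals[tag] += 1
    lexicon = {word: c.most_common(1)[0][0] for word, c in counts.items()}
    default_tag = tag_totals.most_common(1)[0][0]
    return lexicon, default_tag

def tag(words, lexicon, default_tag):
    """Tag each word with its most likely tag; unknown words get the default tag."""
    return [(w, lexicon.get(w, default_tag)) for w in words]

# Hypothetical romanized training data.
train = [
    [("ketab", "N"), ("khub", "ADJ")],
    [("ketab", "N"), ("raft", "V")],
    [("man", "PRO"), ("ketab", "N")],
]
lexicon, default_tag = train_mle(train)
tagged = tag(["man", "ketab", "xyz"], lexicon, default_tag)
```

A second variant could, for example, guess the tag of an unknown word from its suffix instead of using a single default; the heuristics and post-processing mentioned in the abstract would sit on top of a baseline like this one.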
Khaled Shaalan, Ahmed Rafea, Azza Abdelmonem, and Hoda Baraka
In this paper, we describe a grammar-based generation approach for task-oriented interlingua-based spoken dialogue that transforms a shallow semantic interlingua representation called Interchange Format (IF) into Arabic text that corresponds to the intentions underlying the speakers' utterances. The generation approach is developed primarily within the framework of the NESPOLE! (NEgotiating through SPOken Language in E-commerce) multilingual speech-to-speech MT project. The IF-to-Arabic generator is implemented in SICStus Prolog. We conducted an evaluation experiment using the output from the English analyzer provided by Carnegie Mellon University (CMU). The results of this experiment were promising and confirmed the ability of the generation approach to generate Arabic text from the interlingua taken from the travel and tourism domain.
Mahrnoush Shamsfard and Maryam Sadrmousavi
Semantic role labeling (SRL) refers to finding
the semantic relations between a predicate
and syntactic constituents in a sentence.
In this paper we present a rule-based
semantic role labeling system for Persian
sentences. The system exploits a two-phase
architecture to (1) identify the arguments
and (2) label them for each predicate. For
the first phase we developed a rule based
shallow parser to chunk Persian sentences
and for the second phase we developed a
knowledge-based system to assign 16 selected
thematic roles to the chunks. The
experimental results of testing each phase
are shown at the end of the paper.
Beth Bryson
SIL International's FieldWorks Language
Explorer (FLEx) is a powerful tool for linguistic
exploration. It was designed from
the ground up to handle multiple writing
systems and complex scripts. I will
demonstrate its various features with Urdu
data to highlight its abilities to handle Arabic
script for linguistic processing. The
combination of processing features that
have not been available before with the
ability to do any of its linguistic analysis
features with right-to-left Unicode data
sets it apart from other linguistic analysis
tools.
Mahmood Elsayess
The Read~Verse website has unique and
powerful search engines developed by
Read~Verse specialists that permit quick
and easy searching of the entire Koran
(Quran) in Arabic script by 1 word, 2 words,
topic, and sura & verse. Also, users can
search in English by topic and sura & verse.
Visitors can access over 17,000 Arabic words
and their Arabic word roots. Read~Verse
Internet-based applications will display the
Koran text in Arabic script along with 4
different English translations by 4 different
authors.
Mandana Hamidi, Ali Borji and Fariborz Mahmoudi
Humans are very good at recognizing alphanumeric characters. They show much greater accuracy, robustness and speed than the best engineering methods yet known for this task. An approach to optical character recognition inspired by the visual nervous system is studied in this paper, in the hope of approaching its performance to a limited extent. In particular, the application of features motivated by the hierarchical structure of the visual ventral stream to the recognition of Persian (Arabic) handwritten digits is investigated. A set of position- and scale-invariant edge detectors is combined over neighboring positions and multiple orientations to build more complex features. Extracted features are then used to train and test a classifier for recognition of handwritten digits. We examined three types of classifiers, ANN, SVM and kNN, to show that the features are not dependent on a specific classifier, which is evidence of the strength of these features for our purpose. Evaluation of this method on a standard Persian handwritten digits dataset shows a high recognition rate of 99.63%.
Craig Kopris
Several factors impede the development of machine translation for the Pashto language, including encoding, standardization, and the grammar of the language itself. The rule-based TranSphere system, based on Lexical Functional Grammar, has been extended through additions from Role and Reference Grammar to better handle these problems. This demo aims to show that despite the difficulties, it is possible to obtain useful translational capabilities.
Hassan Sawaf and Craig Kopris
In this paper, we introduce a hybrid machine
translation approach. We show the
approach on applications to machine
translation in different environments and
language pairs: Pashto into English of
newswire text, and Iraqi Arabic into English
of spoken utterances. We compare
the quality with that of a state-of-the-art statistical
approach trained on the
same set of data.
Otakar Smrž
The proposed demo would aim to present the technology
behind the Prague Arabic Dependency Treebank
(Hajič et al., 2004), a project of linguistic annotation
having application in many areas of Natural
Language Processing.
Algorithm for subject zero pronoun detection and restoration in Urdu discourse
Information Retrieval and the Arabic Noun Construct
Using OWA fuzzy operator to merge retrieval systems
A note on extracting 'sentiments' in financial news in English, Arabic and Urdu
The first parallel multilingual corpus of Persian: Towards a Persian BLARK
Supervised lexical acquisition for Persian from a web corpus
DAY 2 TALKS
The challenges and pitfalls of Arabic romanization and arabization
Transcription of names written in Farsi into English
Automatic transliteration of proper nouns from Arabic to English
Implementation of reverse chain mechanism in Pango for rendering Nastaliq script
Frame approach to Persian verb generation for educational purposes
A Persian morphological parser using POS tagging
Statistical POS tagging experiments on Persian text
Part-of-speech tagging for Persian
Evaluation of part of speech tagging on Persian text
Generating Arabic text from Interlingua
A rule-based semantic role labeling approach for Persian sentences
DEMONSTRATIONS AND POSTERS
FieldWorks language explorer and Arabic script data
The Koran database
Human vision inspired Optical Character Recognition
Pashto-English machine translation using TranSphere
Speech-Translation of languages with scarce resources
Extensible integrated Treebank annotation environment