"Lexifanis" A Lexical Analyzer of Modern Greek

Yannis Kotsanis - Yanis Maestros
Computer Sc. Dpt. - National Tech. University
Heroon Polytechniou 9, GR - 157 73 - Athens, Greece

'l'écriture fait du savoir une fête' R. BARTHES

ABSTRACT

"Lexifanis" is a software tool designed and implemented by the authors to analyze the Modern Greek language (Δημοτική). This system assigns grammatical classes (parts of speech) to 95-98% of the words of a text which is read and normalized by the computer. By providing the system with the appropriate grammatical knowledge (i.e. dictionaries of non-inflected words, affixation morphology and limited surface syntax rules), any "variant" of the Modern Greek language (dialect or idiom) can be processed. In designing the system, special consideration is given to the morphological characteristics of the Greek language, primarily to inflection and accentuation. In Linguistics, Lexifanis can assist the generation of indexes or lemmata; readability or style analysis can also be performed using this software as a basic component. In Word Processing, this software may serve as a background for building dictionaries for a spelling checking and error detection package. Through this study our research group has set the basis for designing an "expert system" which is intended to "understand" and process Modern Greek texts. Lexifanis is the first working tool for the Modern Greek language.

PROLOGUE

In Linguistics, the systematic identification of word classes raises several questions with regard to morphemic analysis. In Computational Linguistics, several research areas use fundamental information such as the "word class" of a given word, isolated or in its context. In Computer Science, the automatic processing of Greek texts is based on relevant knowledge at the lexical level. In an effort to provide a software tool that identifies the grammatical classes of words, we have designed and implemented Lexifanis. We have used Modern Greek texts as a test bed for our system, but Lexifanis can process any "variant" of Modern Greek, and even the Ancient Greek language, provided that it is appropriately initialized.

In this paper, whenever we use the term Greek or Greek language we refer to the Modern Greek language (Δημοτική) in its recent monotonic version (i.e. a single accent is used, instead of three, and there are no breathings, "πνεύματα").

WORD CLASSES

"Λεξιφάνης": Who Brings the Words to Light. Name given by Lucian (circa [email protected] A.C.) to one of his dialogues.

We have found that morphological analysis of Greek words can provide adequate information for word class assignment. The majority of the words in a text can be assigned a unique (single) class. However, there exist some words that may be assigned two "possible" classes. This ambiguity is inherent in their morphology. Considering such words in their context may disambiguate the classification, if required. This work does not need any stem dictionary.


The fundamental information used by Lexifanis to provide the classes of all Greek words is extracted from the affixation morphology, and especially from a morphemic suffix analysis. In this domain, we follow three axes of investigation: the "Accentual Scheme", the "Ending" and the "Pre-ending" of each word.

Accentual scheme

The "accentual scheme" of a word reflects the position of the stress in the word. The stress may fall only on one of the last three syllables (law of the three syllables). This scheme is identified in our system by a code number. Table 1 lists all possible schemes and their corresponding identification codes (IC).

TABLE 1 : "accentual scheme" of the Greek words

accent. scheme   IC   example
'                0    θ'          : will
:e               1    θα, πως     : will, that
:é               2    πώς(;)      : what(?)
:eé              3    παιδί       : child
:ée              4    χάρη        : grace
eeé              5    αρχαϊκή     : archaic
eée              6    συνθέτω     : I compose
éee              7    πρόβλημα    : problem

Notation :  ":" "word start delimiter", "e" "syllable", "´" "accent", "'" "apostroph"

An example to illustrate the above feature is the following:

   δι-και-ο-σύ-νη   (: justice)   IC=6   NOUN
   χα-ρού-με-νη     (: joyful)    IC=7   ADJ

Ending

A detailed suffix analysis of the highly inflected Greek language [KOYP,67] [MIRA,59] indicates that there exist morphemes at the end of the word which can be used to identify the grammatical classes of the words. The morphological analysis presented in this paper is based on a right-to-left scanning of the words. This analysis identifies word suffixes, named henceforth endings. These endings may not necessarily coincide with the inflectional suffixes described in the Greek grammar [TPIA,41]. Consider for example the following pair of words, highlighting the difference in the ending of the two words. (In this example the ending is the inflexional suffix, as well.)

   εκτέλ-εσ-η   (: execution)         NOUN
   εκτέλ-εσ-α   (: I have executed)   VERB

Notice the identical accentual scheme of the above two words.

Pre-ending

On the other hand, these endings reflect the incidental cases of morphemic ambiguity [KOKT,85] in the inflectional Greek language. This ambiguity can be resolved if we penetrate further into the word to identify what we call the pre-ending. This pre-ending, in most cases, can be easily used to disambiguate word classes, and it yields a unique class assignment when the ending alone is not sufficient. Generally, the pre-ending does not coincide with the derivational suffix of the word under consideration [TPIA,41]. Let us now consider the following example:

   κάν-ατε   (: you have done)
   θάνατ-ε   (: death, in vocative case)

where the consideration of the linguistic inflectional suffixes -ατε and -ε is completely misleading, as far as the class assignment is concerned. You may notice that these two words have the same pre-ending -ατ-. In this case a further morphemic penetration into the word is required to resolve the ambiguity [KRAU,81]:

   κάν-ατ-ε   VERB
   θάν-ατ-ε   NOUN

The morphemes identified at this last penetration may not necessarily form the stem of these words. Our system classifies the first word as a verb and the second as a noun.
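The following is a minimal sketch, not the Lexifanis implementation, of how an accentual code in the sense of Table 1 could be computed and combined with a right-to-left ending match. The syllable counter is a simplification (it merges only the common vowel digraphs and ignores synizesis), and the names `accentual_code`, `classify` and `TOY_RULES` as well as the four rules themselves are assumptions made for illustration.

```python
# Toy sketch: codes follow Table 1; the syllable counting is deliberately simplified.
DIGRAPHS = {"αι", "ει", "οι", "υι", "ου", "αυ", "ευ", "ηυ"}
PLAIN = set("αεηιουω")
ACCENT = {"ά": "α", "έ": "ε", "ή": "η", "ί": "ι", "ό": "ο", "ύ": "υ", "ώ": "ω"}
STRESSED = set(ACCENT) | {"ΐ", "ΰ"}
VOWELS = PLAIN | STRESSED | {"ϊ", "ϋ"}


def syllable_stresses(word):
    """Return one flag per syllable nucleus, True where the accent falls."""
    w = word.lower()
    flags, i = [], 0
    while i < len(w):
        ch = w[i]
        if ch not in VOWELS:
            i += 1
            continue
        nxt = w[i + 1] if i + 1 < len(w) else ""
        if ch in PLAIN and ch + ACCENT.get(nxt, nxt) in DIGRAPHS:
            flags.append(nxt in ACCENT)  # the accent is written on the second vowel
            i += 2
        else:
            flags.append(ch in STRESSED)
            i += 1
    return flags


def accentual_code(word):
    """Map a monotonic Greek word to an identification code (IC) from 0 to 7."""
    if "'" in word or "’" in word:
        return 0                          # elided form written with an apostrophe
    flags = syllable_stresses(word)
    n = len(flags)
    if n == 0:
        raise ValueError("no syllable found in %r" % word)
    if n == 1:
        return 2 if flags[0] else 1       # monosyllable, accented or not
    if True not in flags:
        raise ValueError("polysyllabic word without accent: %r" % word)
    from_end = n - flags.index(True)      # 1 = last syllable, 2 = penult, ...
    if n == 2:
        return 3 if from_end == 1 else 4
    if from_end > 3:
        raise ValueError("accent violates the law of the three syllables")
    return 4 + from_end                   # 5, 6 or 7


# Hypothetical rules: (ending, required IC or None for any, word class).
TOY_RULES = [
    ("νη", 6, "NOUN"),     # δικαιοσύνη
    ("νη", 7, "ADJ"),      # χαρούμενη
    ("ση", None, "NOUN"),  # εκτέλεση
    ("σα", None, "VERB"),  # εκτέλεσα
]


def classify(word):
    """Scan the word from the right and return the class of the first matching rule."""
    ic = accentual_code(word)
    for ending, code, word_class in TOY_RULES:
        if word.endswith(ending) and code in (None, ic):
            return word_class
    return "UNCLASSIFIED"
```

In this sketch the same ending "νη" leads to two different classes depending on the accentual code, which is the role the paper gives to the accentual scheme. A fuller version would also test the pre-ending before falling back to "UNCLASSIFIED".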

Words in their Context

Finally, if more ambiguities exist in word class assignment, a consideration of the "words in their context" may be added to the affixation morphology. This classification technique is fruitful in poorly inflected languages, such as English [CHER,80], [KRAU,81], [ROBI,82].


This syntax analysis is recommended when the task is to determine the classes of the words in a whole text, as opposed to the class assignment to isolated words. By this analysis we gain information from up to two words that precede or follow the word under classification [TZAP,53]. The following is a classic disambiguation example:

   οι αντιθέσ-εις   (: the contrasts)   NOUN
   να αντιθέσ-εις   (: to contrast)     VERB
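As an illustration of this kind of contextual disambiguation, here is a small sketch that resolves a NOUN/VERB double assignment from the class of the preceding word: the noun reading after an article, the verb reading after the particle να. The rule table, the class labels and the function name `disambiguate` are hypothetical, and the sketch looks at only one preceding word rather than the two-word window mentioned in the text.

```python
# Hypothetical context rules:
# (class of the previous word, set of candidate classes) -> resolved class
CONTEXT_RULES = {
    ("ART", frozenset({"NOUN", "VERB"})): "NOUN",
    ("PARTICLE", frozenset({"NOUN", "VERB"})): "VERB",
}


def disambiguate(tagged):
    """tagged is a list of (word, set of possible classes) in text order.

    A word with more than one candidate class is resolved when the previous
    word has exactly one class and a rule exists for that combination.
    """
    result = []
    for word, classes in tagged:
        classes = set(classes)
        if len(classes) > 1 and result and len(result[-1][1]) == 1:
            previous = next(iter(result[-1][1]))
            resolved = CONTEXT_RULES.get((previous, frozenset(classes)))
            if resolved is not None:
                classes = {resolved}
        result.append((word, classes))
    return result
```

For example, `disambiguate([("οι", {"ART"}), ("αντιθέσεις", {"NOUN", "VERB"})])` leaves the second word with the single class NOUN, while the same word after `("να", {"PARTICLE"})` is resolved to VERB. A word with no usable left context keeps both candidate classes.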

IMPLEMENTATION

Dictionaries of Non-Inflected Words

The Greek language is highly inflected. However, since one out of two words of a text is a non-inflected word, we have constructed dictionaries of non-inflected words containing about 4~ entries. In these dictionaries we accommodated all the non-inflected words of Modern Greek that have no derivational suffix, such as particles, pronouns, prepositions, conjunctions, homonyms, etc., and the inflected articles. Each word that enters Lexifanis is first searched in these dictionaries. If there exists an identical entry, its class is assigned to this word. Fig. 1 lists some of the entries of these dictionaries. As an example consider "στο" (: to the, it). This word can be either "article with preposition" or "pronoun".

Fig. 1 : Part of the Dictionaries of Non-Inflected Words (entries grouped under the classes art, art_pron, art_prep, art_prep_pron, prep_pron, pron, prep, conj, homonym, particle, num, adv)

Morphological Analysis

The Morphological Analysis is performed using about 250 rules. The user may add, delete or modify any one of these rules. These rules contain all the information relevant to the endings and pre-endings. During this phase, the inflected words, mainly verbs and nouns, are identified. Efficient search is carried out using the accentual code, mentioned above.

EXAMPLE: "Five" Morphological Rules (rule patterns written with the notation "word start delimiter", "syllable", "accent" and "exclusive or"; the classes assigned include noun, verb and name)

Limited Syntax Analysis

When we want to analyze and classify the words of a text as a whole, Lexifanis examines the word under consideration in its context. This can be accomplished by invoking the nearly 25 Limited Surface Syntax Rules. This step is recommended in case a word is assigned two possible classes (double class assignment, see Table 2) using only the affixation morphology. This double class assignment is due to the ambiguity inherent in the morphology of the word.

EXAMPLE: "Two" of the limited surface syntax rules :

   <prep_pron> <verb>                 =>  <pron> <verb>
   <prep_pron> <art_pron> <unclass>   =>  <prep> <art> <name>

SOFTWARE SYSTEM

Lexifanis is a set of structured programs implemented in two versions:

* The BATCH system assigns classes to the words of a whole text. This system performs the limited syntax analysis, mentioned above, in addition to the morphology.

* The INTERACTIVE system assigns classes to isolated words. This system performs only the morphological analysis.
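The relation between the two versions can be sketched as follows. This is an illustrative toy under stated assumptions, not the Pascal implementation: the three dictionary entries, the single suffix rule and the function names `classify_word` and `classify_text` are hypothetical, and the two syntax rules are modelled on the two example rules of the Limited Syntax Analysis section.

```python
# Hypothetical data: a tiny dictionary of non-inflected words, one suffix rule,
# and two limited surface syntax rules written as (pattern, replacement).
DICTIONARY = {
    "με": {"prep", "pron"},   # double class: preposition or pronoun
    "το": {"art", "pron"},    # double class: article or pronoun
    "και": {"conj"},
}
SUFFIX_RULES = [("ει", {"verb"})]
SYNTAX_RULES = [
    ([{"prep", "pron"}, {"verb"}], ["pron", "verb"]),
    ([{"prep", "pron"}, {"art", "pron"}, set()], ["prep", "art", "name"]),
]


def classify_word(word):
    """INTERACTIVE style: dictionary search, then suffix rules, for one isolated word."""
    w = word.lower()
    if w in DICTIONARY:
        return set(DICTIONARY[w])
    for ending, classes in SUFFIX_RULES:
        if w.endswith(ending):
            return set(classes)
    return set()  # empty set stands for an unclassified word


def classify_text(text):
    """BATCH style: classify every word, then apply the syntax rules in context."""
    words = text.split()
    tags = [classify_word(w) for w in words]
    for i in range(len(tags)):
        for pattern, replacement in SYNTAX_RULES:
            if tags[i:i + len(pattern)] == pattern:
                tags[i:i + len(pattern)] = [{c} for c in replacement]
    return list(zip(words, tags))
```

With these toy entries, `classify_word("με")` returns both candidate classes, whereas `classify_text("με βλέπει")` resolves "με" to a pronoun because a verb follows.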

Structure of Lexifanis

The whole software system is designed and implemented in MODULES or PHASES, the structure of which is illustrated in the Block Diagram of Figure 2. The description of each module follows.

Fig. 2 : Structure of Lexifanis (block diagram; boxes in order: set up dictionaries of non-inflected words and generate morphological & limited surface syntax rules; input and normalize text; identify accentual scheme of words; search in dictionaries of non-inflected words; perform suffix (morphological) analysis; perform limited surface syntax analysis; process & output the results)

INITIALIZATION During this phase two processes take place:

* the creation of the Dictionaries of Non-Inflected Words, and
* the generation of the appropriate Automata required to express the morphological rules and the surface syntax rules.

INPUT AND NORMALIZATION OF THE TEXT The interactive version of the software system performs only the accentual scheme process, whereas the batch version performs this process in parallel with the input and normalization processes. Normalization or Word Recognition is the task of identifying what constitutes a word in a stream of characters.

SEARCH IN DICTIONARIES All the Non-Inflected Words with the same accentual scheme and word length are grouped together, forming a set of small dictionary-trees "cultivated in a two dimensional ... garden", thus minimizing the search time (Fig. 3).

Fig. 3 : the two dimensional ... garden

SUFFIX ANALYSIS This is the main process of our system, which is activated for words not contained in the dictionaries. Finite State Automata [AHO,79] are used to represent the morphological rules.

LIMITED SYNTAX ANALYSIS The relevant information is represented by automata.

RESULTS This module is best fitted to the batch version of our system, but it can be used in the interactive version as well.

TABLE 2 : Results obtained from a Scientific Text

                                 after morph.   after surface
                                 analysis %     syntax %
single classes
 1. article                         5.16          13.53
 2. article with prepos.            0.00          [email protected]
 3. pronoun                         5.11           6.42
 4. numeral                         3.91           3.91
 5. preposition                     2.96           5.26
 6. conjunction                     6.47           8.22
 7. adverb                          6.12           6.12
 8. particle                        0.60           0.70
 9. noun                           12.73          12.98
10. proper noun                     0.30           0.30
11. adjective                       7.27           7.27
12. participle                      1.50          [email protected]
13. verb                           13.18          13.18
    total                          65.31          80.60
double classes
14. art_pronoun                    11.78           2.16
15. art with prep_pron              1.25          @[email protected]
16. preposition_pronoun             2.36           0.05
17. non-inflected homonym           2.71           0.85
18. name : noun_adject             11.33          11.33
19. adject_adverb                   2.06          [email protected]
    total                          31.48          16.69
unclassified words                  3.21           2.71
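As a quick consistency check on the "after morph. analysis" column of Table 2, the short sketch below adds up the printed percentages with exact decimal arithmetic. The variable names are ours, and the figures are simply those listed in the table.

```python
from decimal import Decimal

# Percentages after morphological analysis, in the row order of Table 2.
single = ["5.16", "0.00", "5.11", "3.91", "2.96", "6.47", "6.12",
          "0.60", "12.73", "0.30", "7.27", "1.50", "13.18"]
double = ["11.78", "1.25", "2.36", "2.71", "11.33", "2.06"]

single_total = sum(Decimal(x) for x in single)
double_total = sum(Decimal(x) for x in double)

# Grand total using the subtotals as printed in the table.
printed_total = Decimal("65.31") + Decimal("31.48") + Decimal("3.21")

print(single_total, double_total, printed_total)
```

The single-class rows reproduce the printed subtotal of 65.31. The double-class rows add up to 31.49 against a printed 31.48, a difference of one hundredth that is consistent with rounding of the individual rows. The printed subtotals together with the unclassified share give exactly 100.00.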

The results concerning the classification of a Greek text are summarized in Table 2.

* A single class is assigned to 80-90% of the words of any text, 8-15% are assigned two possible classes (double class assignment), and the remaining 2-5% of the words are left unclassified.
* The variation of the above percentages is due to the difference in style of the texts being processed. A scientific writing, for example, contains fewer ambiguities than a poem.

COMPUTATIONAL DETAILS

Lexifanis' modules are written in the "Pascal" programming language. This software runs under the NOS operating system on a Cyber 171 mainframe computer. Top-down design and structured programming guarantee the portability of this product. The system uses about 35 Kilowords of the Cyber computer memory (60 bits/word) and it requires 12 seconds of "compilation time". The batch version classifies the words at a rate of 110 word classes per second.

APPLICATIONS

Lexifanis is a complete software tool which assigns classes to isolated words entered by the user or, alternatively, to all the words of an input text. This system can be useful in a variety of applications, some of which are listed below. The modularity of its design and implementation, along with the generality of the concepts implemented, guarantees a property of our system: it can be easily integrated into various software systems.

The most apparent application of Lexifanis is, in Lexicography, the generation of "morpheme-based" dictionaries and the generation of lemmata.

Lexifanis may serve as a background in a spelling checking and error detection package, or any "writer's aid" software system.

Finally, Machine Translation would be another major area of application where Lexifanis may be included, as a module or process, in an "expert system".

EPILOGUE

... we have presented a software tool which assigns grammatical classes to 95-98% of the words of a given text. This system performs suffix analysis to assign classes to all the Greek words. For the first time the accentual scheme has been proved useful in the classification of Greek words. Moreover, ambiguities inherent in the suffix morphology of Greek words can be resolved without any stem dictionary...

REFERENCES

[KOYP,67] : Γ. Κουρμούλη, Αντίστροφον Λεξικόν της Νέας Ελληνικής, Αθήναι, 1967
[TZAP,53] : Α. Τζάρτζανος, Νεοελληνική Σύνταξις, 2 Τόμοι, Αθήνα, 1946/1953
[TPIA,41] : Μ. Α. Τριανταφυλλίδης, Νεοελληνική Γραμματική, Αθήνα, 1941/1978
[AHO,79] : A. Aho, Pattern Matching in Strings, Symposium on Formal Language Theory, Santa Barbara, Univ. of California, Dec. 1979
[CHER,80] : L.L. Cherry, PARTS - A System for Assigning Word Classes to English Text, Computing Science Technical Report #81, Bell Laboratories, Murray Hill NJ 07974, 1980
[KOKT,85] : Eva Koktova, Towards a New Type of Morphemic Analysis, ACL, 2nd European Chapter, Geneva, 1985
[KRAU,81] : W. Krause and G. Willée, Lemmatizing German Newspaper Texts with the Aid of an Algorithm, Computers and the Humanities 15, 1981
[MIRA,59] : A. Mirambel, La Langue Grecque Moderne - Description et Analyse, Klincksieck, Paris, 1959
[ROBI,82] : J.J. Robinson, DIAGRAM : A Grammar for Dialogues, Comm. of the ACM, Vol. 25, No 1, 1982
[SOME,80] : H.L. Somers, Brief Description and User Manual, Institut pour les Etudes Sémantiques et Cognitives, Working Paper #41, 1980
[TURB,81] : T. N. Turba, Checking for Spelling and Typographical Errors in Computer-Based Text, Proceedings of the ACM SIGPLAN-SIGOA on Text Manipulation, Portland - Oregon, 1981
[WINO,83] : T. Winograd, Language as a Cognitive Process, Vol. I : Syntax, Addison - Wesley, 1983
