
Lexicography Manual

Samakalin Nepali Sabdakos (Contemporary Nepali Dictionary) Report Ref. No. 32

Prepared By: Yogendra Yadava and Pat Hall March 12th, 2008

Bhasha Sanchar Project

Madan Puraskar Pustakalaya

Table of Contents

1. Introduction
2. The design of lexical entries
2.1 Headword
2.2 Word class or part of speech
2.3 Meanings of words
2.4 Examples
2.5 Pragmatics
2.6 Guidewords
2.7 Idioms
2.8 Proverbs
2.9 References
3. Compilation of lexical entries
3.1 Selection of lexical entries
3.2 Compiling an entry
4. Editing
5. Conclusion
References
Appendix 1: Inventory of part of speech
Appendix 2: Nepali defining vocabulary
Appendix 3: Pragmatic terms
Appendix 4: Guide words

version 3, 13th March 2008


1. Introduction

The field of dictionary making, especially in English, has long been influenced by empirical and corpus-based approaches. For example, Johnson used texts as examples in his dictionary as early as 1755, and in the late 1800s this practice was followed again in compiling the Oxford English Dictionary. However, corpus-based lexicographical research has improved greatly with recent advances in computation. As a result, most dictionaries, not only of English but of several other languages, are now based on corpora.

We have produced the first corpus-based dictionary of the Nepali language and, to our knowledge, of any South Asian language. It has been produced as a digital on-line edition (www.nepalisabdakos.com), and an enlarged edition will be published later in book form as well.

To compile this Samakalin Nepali Sabdakos (`Contemporary Dictionary of Nepali') we depended upon having a written corpus of sufficient size. This corpus is described more fully in the report on the Nepali National Corpus (Nelralec 2008). It comprised two parts: the core corpus of 800,000 words and the general corpus, which contains about 14 million words from various genres, contemporary books, journals, and digitized materials. The corpus was tagged with parts of speech, and each sample carried a header describing its origins. The core corpus was produced first, with the general corpus growing alongside it.

We started the dictionary when the core corpus was only partially complete, at around 600,000 words. However, the distribution of word frequencies follows the Zipf distribution, so the most frequent words occurred in considerable numbers: for example, when there were 800,000 words, the 1000th most frequent word occurred 127 times, and there were 8375 words with 10 or more occurrences. Thus even with a relatively small corpus there was plenty of data on which to progress the initial entries.
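
The two figures quoted above (the count of the word at a given rank, and the number of words at or above a frequency threshold) are simple to derive from any token list. The following is a minimal sketch in Python, not the procedure used with Xaira; the function name `frequency_profile` and the toy token stream are assumptions made purely for illustration.

```python
from collections import Counter


def frequency_profile(tokens, rank, min_count):
    """Return (count of the word type at the 1-based rank,
    number of word types with at least min_count occurrences)."""
    counts = Counter(tokens)
    ranked = sorted(counts.values(), reverse=True)
    # A rank beyond the end of the list is reported as zero occurrences.
    count_at_rank = ranked[rank - 1] if rank <= len(ranked) else 0
    types_at_or_above = sum(1 for c in ranked if c >= min_count)
    return count_at_rank, types_at_or_above


# Toy romanised tokens standing in for the tagged corpus (illustrative only).
tokens = "ko ko ko ko ma ma ma ghar ghar kitab".split()
print(frequency_profile(tokens, rank=2, min_count=2))  # (3, 3)
```

On the real corpus the same call with `rank=1000` and `min_count=10` would correspond to the two statistics reported above.
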

English dictionaries like Collins and Longmans will have been compiled with corpora of 200 million words, but such a large corpus is not necessary, and indeed could be overwhelming and counterproductive.

A potentially much more serious impediment to early progress was the absence of software to support dictionary preparation. We had a mature tool with which to analyse the corpus: the Xaira tool from Oxford University (see OUCS 2008). What we did not have was a tool for recording and managing the dictionary entries as we proceeded, since software for this was either proprietary, or aimed at single users and not Unicode compliant. We therefore had to build our own dictionary software at the same time as we were compiling the dictionary, and so had to limit the dictionary entry elements that could be compiled to what the software supported at the time. See the report on the dictionary software (Nelralec 2008) for details of how the software eventually turned out, closely tied to the dictionary entries we were compiling.


When compiling the dictionary we followed a two-stage process, carrying out an initial analysis to produce entries for the first 20,000 words, and then editing these in depth to produce final entries that could be published. We call these stages compilation sketching and compilation finalizing respectively; they are covered in sections 3 and 4 below. In section 2 we describe the structure of the entries in detail, since it is central to the dictionary compilation process.


2. The design of lexical entries

The lexical entries in this dictionary comprise a number of fields, which are described below. These fields were discussed in depth before compilation of the dictionary began, but were not designed completely beforehand; they evolved as words were encountered and new needs were discovered.

2.1 Headword

The headword selected from among the wordlist is one of the following forms:

· a root
· a stem
· a citation form (the form most apt to come to the mind of native users, especially when they want to look up the meaning of a word)
· a form from which the greatest number of subentries can be derived
· an irregular inflected form (a suppletive)
· a particle
· a reduplicated word

Homographs (i.e. words having the same spelling but differing in word class, meaning, etc.) will be listed serially on the basis of frequency as separate headwords, with superscript numbers immediately following the boldface spelling.

Alternate spellings are listed as headwords, directing readers to the main entry (which will be decided by their frequencies); they are also shown with their main entries.

2.2 Word class or part of speech

The word class or part of speech, and where relevant its subcategory, has been given for each word using the inventory in Appendix 1. For example, the Nepali word for `meaning' is tagged with the part of speech for noun.

Similarly, each phrase is tagged with a phrasal category.

2.3 Meanings of words

Meanings of words have been discerned by observing their concordances and surrounding contexts (i.e. left and right collocates). As far as possible, the discerned meanings are defined using a limited vocabulary and described in terms of three parts: a genus, a criterial part and a comment.

Genus: a generic term to which the headword is semantically related, ideally a single word but often a phrase. It must not be too general, e.g. defining gnu as `an animal' instead of `an African antelope'.

Criterial part: the words that modify or limit the meaning of the genus part.

The genus and criterial part constitute the denotative meaning, i.e. the gloss proper.

Comment: intensional meaning, i.e. extralinguistic information.

In the annotated example definitions the parts occur in the order: (comment) (criterial part) (genus).


The meaning of a headword has been defined as far as possible in plain, simple terms. For this purpose a defining vocabulary was developed using criteria such as frequency and the intuitive judgement of Nepali speakers (see Appendix 2).
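
Checking a definition against a defining vocabulary can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the function name `words_outside_vocabulary` is hypothetical, the vocabulary and definition are English toy data, and tokenising on whitespace is a simplification that ignores inflected forms.

```python
def words_outside_vocabulary(definition, defining_vocabulary):
    """Return the words of a definition that are not in the defining
    vocabulary, in order of first appearance."""
    seen = set()
    outside = []
    for word in definition.split():
        # Strip common punctuation, including the Devanagari danda.
        token = word.strip(".,;:()'\"।")
        if token and token not in defining_vocabulary and token not in seen:
            seen.add(token)
            outside.append(token)
    return outside


# Toy English data; the actual defining vocabulary is the Nepali list in Appendix 2.
vocabulary = {"a", "large", "animal", "that", "lives", "in", "Africa"}
print(words_outside_vocabulary("a large antelope that lives in Africa.", vocabulary))
# ['antelope']
```

A real check for Nepali would probably need to reduce inflected forms to their headwords before the lookup; that step is left out here.
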

2.4 Examples

Most meanings are followed by authentic examples from concordances, which clarify the meaning of a word by showing it in use. Rare or very infrequent words, however, may have no examples. The examples are short phrases or sentences, depending on the requirement.

Examples of a word contain important information about the typical patterning associated with it. An attempt has been made to include all the possible patterns appearing in the concordance.

2.5 Pragmatics

Pragmatic signs are used to mark the uses of a headword in various contexts (see Appendix 3).

2.6 Guidewords

Guidewords help users to find the specific meaning they are looking for when a word has more than one main meaning.

2.7 Idioms

Certain collocations in the corpus are fixed phrases or idioms whose special meaning is not clear from the meanings of the separate words. Such collocations have generally been shown under the entry of their first main word.

2.8 Proverbs

Proverbs, which are short sentences usually stating something commonly experienced or giving advice, are also listed in the dictionary.

2.9 References

References to other dictionary entries are given to make a definition or usage note clearer, e.g. a reference to the word for `city' under the word for `village'.
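
The fields described in this section can be pictured as a small data structure. The sketch below is illustrative only and is not the schema of the dictionary software: the class and field names (`LexicalEntry`, `Sense`, `gloss`) are assumptions, and the sample entry is invented English toy data built around the gnu example from section 2.3.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sense:
    genus: str                       # generic term the headword falls under
    criterial_part: str = ""         # words limiting the genus
    comment: str = ""                # extralinguistic information
    guideword: Optional[str] = None  # distinguishes senses (section 2.6)
    pragmatics: List[str] = field(default_factory=list)
    examples: List[str] = field(default_factory=list)

    def gloss(self) -> str:
        """Denotative meaning: criterial part plus genus (English word order assumed)."""
        return " ".join(p for p in (self.criterial_part, self.genus) if p)


@dataclass
class LexicalEntry:
    headword: str
    part_of_speech: str
    homograph_number: Optional[int] = None
    variant_spellings: List[str] = field(default_factory=list)
    senses: List[Sense] = field(default_factory=list)
    idioms: List[str] = field(default_factory=list)
    proverbs: List[str] = field(default_factory=list)
    references: List[str] = field(default_factory=list)


# Invented toy entry, for illustration only.
entry = LexicalEntry(
    headword="gnu",
    part_of_speech="noun",
    senses=[Sense(genus="antelope", criterial_part="African",
                  examples=["a herd of gnu"])],
)
print(entry.senses[0].gloss())  # African antelope
```

Keeping the genus, criterial part and comment as separate fields, rather than one definition string, would make it possible to check each part independently during editing.
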


3. Compilation of Lexical Entries

To compile the entries, we needed to work through the words of Nepali in some systematic manner. One approach would be to do this in alphabetical order, but this raised two difficulties:

· If we had to stop before the dictionary was complete, we would have only a partial dictionary of no use at all.
· As the corpus was being collected in parallel, we would start with a relatively small corpus in which unusual words might occur only once or twice, if at all.

The alternative was to progress through the dictionary on the basis of frequency of use, so that common words are done before unusual and infrequently used words. It was this approach that we adopted, though there still remained difficulties, as described in subsection 3.1. Then, having selected a word, the lexicographer set about systematically creating a dictionary entry for it, as described in subsection 3.2.

3.1 Selection of lexical entries

A list of words and their frequencies was first derived using Xaira, the XML-aware text analysis software, from the written corpora (comprising both the core corpus and the general corpus). Figure 1 shows samples from three different sections of the list of the 10,000 most frequent words. The list comprised not only the Nepali words in common use, but also word suffixes (`clitics'), an artefact of our tagging method, as well as foreign terms that have become a part of the Nepali language, e.g. the words for TV, radio and quotation. The suffixes, proper nouns and foreign terms would not be included in the dictionary. Note that the list is in alphabetical order, showing the number of occurrences of each `word'. The `word' with the highest number of occurrences was ko (a clitic suffix), which occurred 68,589 times in the corpus. Words 7757 to 8375 all had frequency 10. If possible we wanted all dictionary entries to be based on at least 10 occurrences in the corpus.

The alphabetical list of the most frequent lexical entries was then divided between the lexicographers based on the first letter of the words, aiming to give each person roughly the same number of words. With 5 linguists dedicated to lexicography, this meant roughly 4,000 words each over the 21 months of dictionary compilation. In practice the corpus was itself still being accumulated, and initial allocations were of just a few thousand words, with new allocations made periodically as the corpus grew.

A particular problem that we had to address was variant spellings. For example, the "ba" and "va" distinctions are not marked in Nepali, so a name like "Vijay" might also be written "Bijay", and the word "samvad" is more commonly written "sambad". Similarly, the various forms of "s" are not usually distinguished, and thus words could be spelled with any of the alternative letters. Further, short and long vowels are not distinguished. This meant that alternative spellings would come up as distinct entries in the frequency list, and could have been assigned to different lexicographers. Our policy of assigning all the words with a particular initial letter to the same lexicographer was aimed at alleviating this problem: most spelling variants would be handled by the same person, who we hoped would recognise the variants, though this did not always happen. These alternatives clearly needed to be combined for the purposes of producing the dictionary entry, though of course the original corpus data should not be "corrected". The alternative spellings would then be recorded in the dictionary as described in subsection 2.1 above.
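
The allocation by initial letter described above can be sketched as a simple balancing procedure. This is an illustration in Python, not the method actually used by the project: the function name `allocate_by_initial` is hypothetical, and the greedy "largest group to the least-loaded person" rule is one assumed way of keeping the shares roughly equal while keeping every initial letter with a single person.

```python
from collections import defaultdict


def allocate_by_initial(words, n_people):
    """Split words among n_people so that all words sharing an initial
    letter go to the same person and the loads stay roughly equal."""
    groups = defaultdict(list)
    for w in words:
        if w:
            groups[w[0]].append(w)
    loads = [0] * n_people
    allocation = [[] for _ in range(n_people)]
    # Largest letter groups first; ties broken by letter for reproducibility.
    for letter, group in sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        person = loads.index(min(loads))  # least-loaded person so far
        allocation[person].extend(sorted(group))
        loads[person] += len(group)
    return allocation


# Romanised toy words, for illustration only.
words = ["ka", "kb", "kc", "ga", "gb", "ma", "na"]
for person, share in enumerate(allocate_by_initial(words, 2), start=1):
    print(person, share)
```

Because variants such as "Vijay" and "Bijay" differ in their first letter, a purely letter-based split like this one cannot by itself guarantee that such pairs land with the same person.
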

Figure 1. Part of the list of frequent words, transferred to an Excel spreadsheet (columns: Word form, Frequency).

Reconciling the alternative spellings was not easy, and we had to rely on the experience of the lexicographers in recognising that alternatives could exist, checking for all alternatives when compiling an entry, and notifying the other lexicographers that this was being done if some other lexicographer might have the variant in his list. We could envisage the common misspellings being captured in a set of rules; a small programme could then check the lists for these and combine the entries, and thus help expedite the lexicographic process. We did not do this during initial compilation, but did have a tool like this to help during editing (see the next section).

In compiling entries, we also saw the need to produce entries for "common words" that may not have appeared in the corpus. We produced a list of common words by consulting English dictionaries and ontologies and translating these words. These words were added towards the end, after we had accumulated a reasonably large corpus.
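
The rule-based check for spelling variants envisaged above might look something like the following Python sketch. It is an assumption-laden illustration, not the tool used during editing: the rule table covers only the three kinds of variation mentioned in this section (ba/va, the forms of s, short/long vowels), the Devanagari renderings of "samvad" and "sambad" are ours, and the frequencies are invented.

```python
from collections import defaultdict

# Illustrative rules only: each letter is mapped to one representative of its group.
RULES = str.maketrans({
    "व": "ब",              # va / ba
    "श": "स", "ष": "स",    # the forms of s
    "ी": "ि", "ू": "ु",     # long / short vowel signs
    "ई": "इ", "ऊ": "उ",    # long / short independent vowels
})


def spelling_key(word):
    """Collapse a word to a key that is shared by its likely spelling variants."""
    return word.translate(RULES)


def combine_variants(frequencies):
    """Group frequency-list entries by spelling key; return (variants, total) pairs."""
    groups = defaultdict(dict)
    for word, count in frequencies.items():
        groups[spelling_key(word)][word] = count
    return [(variants, sum(variants.values())) for variants in groups.values()]


# Invented counts, for illustration only.
frequencies = {"संवाद": 12, "संबाद": 30, "किताब": 7}
for variants, total in combine_variants(frequencies):
    print(variants, total)
```

The key is used only to group entries in the frequency list; the corpus text itself is left untouched, in line with the principle that the original data should not be "corrected".
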

3.2 Compiling an entry

When a lexicographer starts to compile a new entry, what does he or she do? The entry will be the next alphabetical entry on the list assigned to the lexicographer, and the first thing to do is to look at the examples in the corpus. For this we used the concordance facility of Xaira, which lists all the uses of the word (and its spelling variants), so that the meaning of the word can be determined by analysing the examples. Figure 3 shows an example using a web-based concordance lister similar to Xaira. The passages retrieved could equally well have been displayed with the search word down the centre.

Figure 3. Example concordance listing obtained using a web search engine similar to Xaira.
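
A concordance listing with the search word down the centre (a keyword-in-context display) can be sketched in a few lines of Python. This is a toy illustration and does not reproduce Xaira or the web-based lister; the function names `kwic` and `format_kwic` are hypothetical and the sample text is English.

```python
def kwic(tokens, keyword, width=5):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def format_kwic(lines):
    """Right-align the left contexts so that the keyword forms a column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left:>{pad}}  {kw}  {right}" for left, kw, right in lines]


tokens = "the bank of the river and the bank in town".split()
for line in format_kwic(kwic(tokens, "bank", width=2)):
    print(line)
```

The alignment here counts characters, which is adequate for the English sample; for Devanagari text, where combining vowel signs do not occupy their own column, a display-width calculation would be needed instead.
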


Ideally the concordance software should be able to list all occurrences of a word in its variant spellings within a single list. Xaira could have helped here by using a regular expression as the query, but our lexicographers were not familiar enough with Xaira to do this.

A major issue when using concordance listings arises with the potential multiple meanings or senses of homographs, words that are spelt the same. Where the different senses belong to different parts of speech, the POS tagging easily disambiguates them. However, where the different word senses have the same POS tag, they need to be recognised by the lexicographer. Here again we looked to the linguistic experience of the lexicographers to distinguish the different senses of a word, and to compile entries for each of these senses. Some words may have as many as twenty senses, but this is very unusual, and most words have only one sense. We could in principle have looked for software support from within Xaira, since word sense disambiguation has been well studied computationally (see for example the review by Ide and Veronis 1998). The sense of a word is signalled by the words surrounding it, with typically up to five words on either side being viewed as sufficient. These contexts could be clustered by well-known methods, and distinct clusters could indicate distinct senses. Xaira can help by using more complex queries, but once again our lexicographers were not familiar enough with Xaira to use these facilities.

A second issue lay in the construction of succinct definitions. Here we aimed at following the standard lexicographic practice of using a small set of words with which to express meanings. The list of the 3093 words that we used is shown in Appendix 2. This was constructed in major part by referring to the equivalent lists used in English dictionaries; we started by translating the Oxford list, and then consulted other dictionaries and ontologies. We had hoped to enforce the use of this list using software, but did not carry through this plan, as the software developers were too busy constructing the software for compiling the entries in the dictionary.

A third issue was the selection of good examples. If possible we wanted to use an actual example, a full sentence, from the corpus. In many cases this could be done, but sometimes we had to shorten the example by deleting extraneous words or phrases. We did not construct any artificial examples.

By mid 2007, after 20 months of dictionary compilation using two full-time and three half-time lexicographers (three and a half full-time equivalents), we had prepared 20,000 entries. We were producing entries at a rate of around 16 per person-day. This was slower than we would have hoped, but understandable given that the corpus and the lexicographic software were both being developed at the same time, and that the lexicographers themselves were learning to do corpus-based dictionary compilation.

20,000 to 25,000 entries would be perfectly adequate for an Advanced Learners Dictionary, and we thought that this would be a good contribution to Nepali literature. However, when we came to edit our entries we had to adjust our expectations.
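
The idea mentioned earlier in this subsection, that the words within a few positions of a homograph signal its sense and that similar contexts can be grouped, can be illustrated with a deliberately simple Python sketch. The grouping rule used here (greedy assignment by Jaccard overlap of context word sets) is a stand-in chosen for brevity, not one of the methods reviewed by Ide and Veronis and not something the project implemented; the function names, the threshold and the English sentences are all assumptions.

```python
def context_windows(tokens, keyword, width=5):
    """Collect the set of words within `width` tokens either side of each occurrence."""
    windows = []
    for i, token in enumerate(tokens):
        if token == keyword:
            neighbours = tokens[max(0, i - width):i] + tokens[i + 1:i + 1 + width]
            windows.append(set(neighbours))
    return windows


def jaccard(a, b):
    """Share of words common to both sets, out of all words in either."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_contexts(windows, threshold=0.3):
    """Greedy grouping: a window joins the first group whose pooled words
    overlap enough with it, otherwise it starts a new group."""
    groups = []  # each item: (pooled word set, list of window indices)
    for index, window in enumerate(windows):
        for pooled, members in groups:
            if jaccard(window, pooled) >= threshold:
                pooled |= window          # grow the pooled vocabulary in place
                members.append(index)
                break
        else:
            groups.append((set(window), [index]))
    return [members for _, members in groups]


sentences = [
    "she sat on the bank of the river",
    "fish swim near the bank of the river",
    "he took a loan from the bank in town",
    "the bank in town gave him a loan",
]
# A width of 3 is used because the toy sentences are short.
windows = [w for s in sentences for w in context_windows(s.split(), "bank", width=3)]
print(group_contexts(windows))  # [[0, 1], [2, 3]]
```

On real concordance lines the result would depend heavily on the threshold and on how function words are treated, so the groups could only be a starting point for the lexicographer's own judgement.
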


4. Editing

Having created the dictionary entries using a number of lexicographers, we appreciated that there was a need to harmonise the entries and make an editorial pass through them. At initial compilation the forms of words that came up might not have been the headword but some derived form; these needed to be collected together and placed in the proper relationship. Some multiple entries will inevitably have been created based on variant spellings, and these needed to be combined. A number of entries, particularly early ones, had been left incomplete and needed completion. Definitions needed to be given in a consistent way: some entries used synonyms, and those needed to be replaced by our preferred method of explaining the word in simple terms. Proper nouns, including surnames, needed to be deleted, since explaining them fully would take significant effort.

Initially we assigned an experienced linguist to the job of editing, arranging weekly meetings with two linguistic experts, both professors from Tribhuvan University, at which he could raise difficult issues with the experts.

One issue we wanted to resolve during this editorial process was what spelling to adopt as the "standard" spelling. This spelling would be set as the headword for the definitions and examples, with other spellings marked as variants. The two experts we had hired had been responsible for determining the approved spellings to be used in schools, and after much debate as to whether we should take a descriptive approach and adopt the most common spelling, we accepted a prescriptive approach and adopted the preferred spelling as advised to schools by our experts.

However, when we came to actually edit entries we found that all entries had to be examined in depth. This meant looking into the corpus using Xaira as well as updating the dictionary, and this took a long time. To help expedite the work, a special editing facility was created to generate the potential variant spellings for an entry being worked on. Effectively every entry was being rewritten, though starting from the existing entry. We increased our editing effort, using our four most experienced linguists. At this stage we also added the use of guide-words to differentiate between different senses of a word. The list of guide-words used is given in Appendix 4. This use of guide-words is new to Nepal.

We found that the editing process did proceed faster than the initial entry, 2 to 3 times faster, but that this was not fast enough to edit the complete set of entries. Over six months the team of three part-time editors could only produce 7,600 finalised entries, enough for a small on-line dictionary, but not large enough to print.
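
How the editing facility generated potential variant spellings is not described in this report. Purely as an illustration of the idea, the Python sketch below expands a word into every combination of interchangeable letters; the function name `potential_variants` and the table of alternatives (limited to ba/va, the forms of s, and two short/long vowel sign pairs) are assumptions.

```python
from itertools import product

# Assumed groups of letters that writers may use interchangeably.
ALTERNATIVES = [
    {"ब", "व"},         # ba / va
    {"श", "ष", "स"},    # the forms of s
    {"ि", "ी"},         # short / long i sign
    {"ु", "ू"},         # short / long u sign
]
CHOICES = {letter: sorted(group) for group in ALTERNATIVES for letter in group}


def potential_variants(word):
    """Return all other spellings obtained by swapping interchangeable letters."""
    options = [CHOICES.get(ch, [ch]) for ch in word]
    return sorted({"".join(combo) for combo in product(*options)} - {word})


# "samvad" in an assumed Devanagari rendering.
variants = potential_variants("संवाद")
print(len(variants), "संबाद" in variants)  # 5 True
```

The number of candidates multiplies with every interchangeable letter in the word, so in practice the candidates would be filtered against the spellings actually attested in the corpus.
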


5. Conclusion

Compiling a corpus-based dictionary was a new experience for the team of lexicographers and software engineers, and hence time-consuming, as the team learned how to compile a dictionary from a corpus using a sophisticated tool like Xaira. An online version with 7,600 entries was produced and placed on the web, while plans are being made to edit the remaining compiled entries and bring out the dictionary in book form, keeping in mind that such an authentic dictionary, in which words are defined in plain terms and illustrated with actual examples, can be useful for school-level learners of Nepali. Besides adding more entries, the enlarged version will also include various forms such as derivatives and inflections, idioms and proverbs.

We still feel that the use of two phases, a first to sketch the entries and a second to finalise them, is appropriate. But we would recommend that the corpus is gathered completely before dictionary compilation starts, and that mature software, such as that which we have created, is used, perhaps with some customisation to fit the particular language and group of lexicographers involved.


References

Acharya, J. (1991) A descriptive grammar of Nepali. Washington, D.C.: Georgetown University Press.
Adhikari, HR (1998) Samasmayik Nepal Vyakarana (`Contemporary Nepali Grammar'). Kathmandu: Vidyarthi Pustak Bhandar.
Hardie, A (2005) Automated part-of-speech analysis of Urdu: conceptual and technical issues. In: Yadava, Y, Bhattarai, G, Lohani, RR, Prasain, B and Parajuli, K (eds.) Contemporary issues in Nepalese linguistics. Kathmandu: Linguistic Society of Nepal.
Hardie, A, Lohani, RR, Regmi, BN and Yadava, YP (forthcoming) A morphosyntactic categorisation scheme for the automated analysis of Nepali.
Ide, Nancy and Jean Veronis (1998) Word Sense Disambiguation: The State of the Art. Computational Linguistics 24(1), pages 1 to 40.
McEnery, T., Richard Xiao and Yukio Tono (2007) Corpus-based Language Studies. London: Routledge.
McEnery, T. and Xiao, Z. (2005) Character encoding in corpus construction. In: Wynne, M. (ed.) Developing Linguistic Corpora: A Guide to Good Practice. AHDS Literature, Languages and Linguistics, Oxbow Books.
Nelralec (2008) Nepali National Corpus. Report 32a, Nelralec project.
Newell, Leonard (1995) Manual on Lexicography. Manila: Linguistic Society of the Philippines.
OUCS (2008) Oxford University Computing Services, Xaira. http://www.oucs.ox.ac.uk/rts/xaira/
Royal Nepal Academy (1983) Kathmandu: Royal Nepal Academy.
Schmidt, Ruth Laila (ed.) (1993) A Practical Dictionary of Modern Nepali. New Delhi: Ratna Sagar.
Sinclair, J.M. (ed.) (1987) Looking Up: an Account of the Cobuild Project. London: Collins ELT.
Sinclair, John (2000) Introduction, ix-xiii. Collins Cobuild English Dictionary for Advanced Learners. Glasgow: Harper Collins.
Singh, Ram Adhar (1982) An Introduction to Lexicography. Mysore: CIIL.
Yadava, Y.P. and T.R. Kansakar (eds.) (1997) Lexicography in Nepal. Kathmandu: Royal Nepal Academy.


Appendix 1: Inventory of part of speech


Appendix 2: Nepali defining vocabulary


Appendix 3: Pragmatic terms

Stylistic & Pragmatic categories: informal, offensive, rude, taboo, vagueness, disapproval, formal, computing, legal, emphasis, colloquial, technical, court/royal, politeness, journalism, old-fashioned, baby talk, spoken, feelings, dialect, regional, social, diminutive, written, business, literary, military, formulae, slang, medical, approval, humorous, pejorative.


Appendix 4: Guide words

