
The On/Off (LMCA) Dual Arabic Handwriting Database

Monji Kherallah, REGIM: Research Group on Intelligent Machines, [email protected]
Abdelkarim Elbaati, REGIM: Research Group on Intelligent Machines, [email protected]
Haikal El Abed, Technical University Braunschweig, Germany, [email protected]
Adel M. Alimi, REGIM: Research Group on Intelligent Machines, [email protected]

Abstract

Handwriting recognition has been an active research topic for many years, and both online and off line recognition are widely covered in the literature. Studies of Latin script have largely relied on the UNIPEN or IRONOFF databases. Arabic script, by contrast, has received far less attention because no common Arabic database exists. In this paper, we present an Arabic database called LMCA (from the French "Lettres, Mots et Chiffres Arabe"), composed of letters, words and digits. We also present some related works that use this database. The LMCA database can be used for both online and off line recognition of Arabic handwritten script.

Keywords: Online/off line handwriting, LMCA database.

1. Introduction

In the last few years, handwriting analysis and recognition has become a subject of major interest to researchers. Work in this area has been validated thanks to the use of databases. Two sorts of databases are considered: those for on line studies, such as UNIPEN, and those for off line studies, such as CEDAR, IRONOFF, NIST and FENIT. All these databases are important for the research community in order to test new ideas and algorithms, to perform benchmarks, and thereby to measure progress and general tendencies. The paper is organized as follows: Section 2 deals with on/off line studies, Section 3 briefly presents the Arabic script, and Section 4 presents our LMCA database in detail along with some related works.

2. On/Off line Handwriting recognition

Two axes of research are available in handwriting recognition. The first is called on line and the second off line. According to figure 1, the on line approach uses a digital tablet and a special pen that provide interactive dynamic information as a sequence of point coordinates, whereas the off line approach uses a scanner that provides static information as pixels.

Figure 1. On/Off line handwriting recognition process

The recognition concerns handwritten characters or handwritten words. According to figure 1, a recognition system comprises three phases: pre-processing, feature extraction and classification. The advantage of the IRONOFF database is that both the off line image and the online trajectory are available. One application is the evaluation of skeleton algorithms: the online data provide a way to compare the skeleton points (off line image) with an objective trajectory (online coordinates). It also becomes possible to study the correlation that may exist between the speed of the pen and the gray level distribution or the width of the corresponding strokes. If the online data are jointly accessible with the off line images, they can be used to recover the temporal order of strokes from the off line images and thereby guide and train the segmentation to provide a relevant frame description [10]. In that sense, such approaches bridge the gap between online and off line character recognition methods [7], [8], which is very attractive since online handwriting recognition has been shown to give better results than off line recognition [9].

This paper presents a methodology for the construction of a dual on/off database intended for research on the use of online information in the design and training of an off line handwriting recognition system. We are confident, however, that it will enable many other experiments. A large number of samples of isolated characters, digits and Arabic words have already been collected. We briefly present the content of the resulting database (ON/OFF LMCA database).

3. Arabic script presentation

Arabic handwriting is consonantal and cursive, in both printed and handwritten documents. The Arabic alphabet is composed of 28 main characters (with diacritics and in isolated form) and is written from right to left (see figure 2).

Figure 2. Isolated Arabic letters.

Most characters have four different shapes. The letters differ in their position in the word, in the number and position of the diacritic dots, and in the presence of the "hamza" and vowels. In fact, the majority of letters change slightly in shape according to their position in the word (initial, medial or final) and according to whether the letter is joined to another or isolated. We therefore have 56 Arabic letter shapes without diacritics, as shown in figure 3.

Figure 3. The 56 different shapes of the Arabic letters.

3.1. Diacritical symbols influence

Some Arabic letters have the same form but are distinguished from each other by the addition of dots in different positions relative to the main stroke. Some Arabic characters use special marks to modify the character accent. When diacritical symbols (dots, special marks) are used, they appear above or below the characters and are drawn as isolated entities, as shown in figure 2.

Diacritical symbols are positioned at a certain distance from the character, which makes it difficult to delimit a text line: diacritical symbols can generate redundant separate lines [2]. Of the 28 letters of the alphabet, 15 contain dots. Some letters carry a zigzag shape called "Hamza". It has the same shape as the letter "Ain" (first letter of the second line of figure 3) but is located above the letter "Alif" (see the first letter in figure 2). The "Hamza" is considered an accent ("vowel") in the Arabic alphabet. In most cases, Arabic writing does not use vowels, and the sense of a word is often determined by the context of the sentence. Vowels are not considered in our work.

4. Database formulation

A database is of fundamental interest for training character recognition methods. We developed our own database, which contains 30,000 digits, 100,000 Arabic letters and 500 Arabic words. This database was developed in our laboratory, REGIM (REsearch Group on Intelligent Machines). Both on line and off line handwritten characters and words are considered. The online procedure is based on collecting the (x, y) coordinates of the handwritten trajectory, whereas the off line procedure is based on collecting images of the handwritten trajectory (see figure 4). These two types of information should be available within the same coordinate system, with the same origin, resolution and orientation. 55 participants were invited to contribute to the development of the handwritten LMCA database. The dataset of words of each participant is stored in one data file. When producing the data file, each participant was asked to write some Arabic words.
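The requirement that online and off line data share one coordinate system can be illustrated with a small Python sketch. It assumes, for illustration only, a tablet resolution of 200 dpi and a scan resolution of 300 dpi (the values the paper reports for its online and image data), a common origin, and the same axis orientation. The function name `tablet_to_pixels` and its parameters are hypothetical and not part of the LMCA tools.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def tablet_to_pixels(
    points: List[Point],
    tablet_dpi: float = 200.0,   # assumed online sampling resolution
    image_dpi: float = 300.0,    # assumed scan resolution
    origin: Point = (0.0, 0.0),  # assumed tablet position of the image origin
) -> List[Tuple[int, int]]:
    """Map tablet coordinates to pixel coordinates of the scanned image.

    Both systems are assumed to share orientation, so only a shift to the
    common origin and a uniform rescaling are applied.
    """
    scale = image_dpi / tablet_dpi
    ox, oy = origin
    return [(round((x - ox) * scale), round((y - oy) * scale)) for x, y in points]


if __name__ == "__main__":
    print(tablet_to_pixels([(0.0, 0.0), (200.0, 100.0)]))
```

If the tablet and the scanner used different orientations, an axis flip would have to be added before the rescaling.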

Figure 6. The "Handwriter" interface

Figure 4. On/Off line dataset collection

We collected 500 words written by different writers. The data for each participant are stored in one data file. To build the digit dataset, some participants were asked to write sets of all the digits (1000 to 1500 digit samples).

The same procedure was applied to prepare the 100,000 Arabic letters. About two thirds of the writers were male, about 90% were right-handed, the youngest writer was 8 years old, and the oldest was 66. In the online domain, the forms were sampled with a spatial resolution of 200 dpi and a sampling rate of 100 points/s (Wacom UltraPad A4) and stored in the UNIPEN format. To collect the data, a graphical user interface called "Handwriter" was developed in a PC/NT Windows environment. The online information of the handwriting is kept in a text file, as shown in figure 6. Pen-up and pen-down states are recorded as 0 and 1, respectively. The trajectory of the handwritten script is collected from the digital tablet as x and y coordinates (see figure 6).
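As an illustration of how such a text file could be consumed, the Python sketch below splits a sequence of samples into strokes using the pen state and estimates the pen speed from the 100 points/s sampling rate. The exact layout of the LMCA text files is not given here, so the "x y pen" line format is an assumption, and the names `parse_samples`, `split_strokes` and `pen_speeds` are hypothetical.

```python
import math
from typing import List, Tuple

Sample = Tuple[float, float, int]    # (x, y, pen) with pen = 1 down, 0 up
Stroke = List[Tuple[float, float]]


def parse_samples(text: str) -> List[Sample]:
    """Parse lines of the assumed form 'x y pen'; other lines are skipped."""
    samples = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        samples.append((float(parts[0]), float(parts[1]), int(parts[2])))
    return samples


def split_strokes(samples: List[Sample]) -> List[Stroke]:
    """Group consecutive pen-down samples into strokes."""
    strokes: List[Stroke] = []
    current: Stroke = []
    for x, y, pen in samples:
        if pen == 1:
            current.append((x, y))
        elif current:
            strokes.append(current)
            current = []
    if current:
        strokes.append(current)
    return strokes


def pen_speeds(stroke: Stroke, rate_hz: float = 100.0) -> List[float]:
    """Speed between consecutive points, in tablet units per second."""
    return [
        math.hypot(x1 - x0, y1 - y0) * rate_hz
        for (x0, y0), (x1, y1) in zip(stroke, stroke[1:])
    ]


if __name__ == "__main__":
    demo = "0 0 1\n3 4 1\n3 4 0\n10 10 1\n10 12 1\n"
    for stroke in split_strokes(parse_samples(demo)):
        print(stroke, pen_speeds(stroke))
```

Speeds computed this way are the kind of online quantity that could be correlated with stroke width or gray level in the scanned image.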

We asked each writer to write the same digit ten times, for each digit from 0 to 9, on the same page, so that one page contains 100 digits. Each writer was asked to prepare only one page per day. We collected 30,000 digits in total. More than half of them are regularly written. The remaining ones either contain noise, are poorly written, or are deliberately written in strange and unusual ways. Figures 5a and 5b show examples of scanned words and digits, stored as images in JPG format.

(a): Example of some handwritten words

(b): Example of some handwritten digits

Figure 5. Example of scanned handwritten words and digits extracted from LMCA database

4.1. Online handwriting recognition system based on trajectory and velocity modeling using LMCA database

Digit recognition was studied ten years ago, and that work showed that the fuzzy approach enhances classification performance. In this study, the feature extraction system was based on the "Beta-elliptical" representation [4]. One of the main classification problems is the variability of the feature vector size (35, 42, ..., 63), which depends on the number of strokes in each digit. The recognition process is divided into preprocessing steps and subsequent classification. Given the complexity of handwriting recognition, the use of multiple, hybrid and associated classifier systems has attracted increasing interest in recent years [4]. Because the classifiers are complementary, their association increases the performance of the recognition system while limiting the errors tied to the use of a single classifier. Multiple classifier systems benefit from the strong points of every classifier. In (Kherallah et al.), the recognition system was based on neural networks developed in a fuzzy concept [4]. The desired outputs of the MLPNN are formed using the SOM and FKNN algorithms (see figure 7). This system is therefore a neuro-fuzzy network based on the association of SOM and FKNNA in the learning process [4]. The global recognition rate obtained is about 95.08 %. When testing our system, the global average squared error obtained is about 0.065.
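The variable feature vector size mentioned above (35, 42, ..., 63) is a practical obstacle for classifiers with a fixed input layer. One common workaround, shown here only as an assumption since the text does not state how the size variation was handled, is to zero-pad every vector to the largest size. The function name `pad_features` is hypothetical.

```python
from typing import List, Sequence


def pad_features(vec: Sequence[float], target_len: int = 63, fill: float = 0.0) -> List[float]:
    """Return a copy of vec extended with `fill` up to target_len entries.

    target_len defaults to 63, the largest size quoted in the text.
    """
    if len(vec) > target_len:
        raise ValueError("feature vector longer than target length")
    return list(vec) + [fill] * (target_len - len(vec))


if __name__ == "__main__":
    short = [1.0] * 35              # a digit with the smallest quoted size
    padded = pad_features(short)
    print(len(padded), padded[34], padded[35])
```

Padding keeps the original parameters in place, so a fixed-input network can be trained on digits with different stroke counts.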

Figure 7. On-line recognition system of the handwritten digits (block labels: learning database, test database, SOM, fuzzy KPPV, PMC, desired Y, Y found, error)

In this study, our aim is to validate the digit set extracted from the LMCA database. It is known that the MLP and SVM techniques give the same performance. The first experiment used the LMCA digit dataset, whereas the second used the UNIPEN digit dataset. The results obtained were similar, which indicates that the LMCA digit dataset is a sound benchmark (see table 1).

Table 1. Comparative study between UNIPEN digit dataset and LMCA digit dataset

Classifier   Modeling system   Dataset                  Recognition rate
MLP          Beta-elliptical   30,000 digits of LMCA    94.14%
SVM          Beta-elliptical   Digit set of UNIPEN      94.78%

4.2. Online recognition of the handwritten Arabic word based on visual encoding and GA using LMCA database

A handwritten word is represented by a sequence of visual codes of Arabic letters (see table 2) [11]. In this case, the order of these letters is taken into account. We denote by N the number of basic letters extracted from a cursive word.

Table 2. Visual codes of handwritten trajectory

Meaning                   Visual code   Form and position index
Valley                    Va            1
Loop                      Lo            2
Pocket                    Po            3
Arabic letter Alif        Al            4
Descender                 Des           5
Ascender                  As            6
Right open curvature      Roc           7
Left open curvature       Loc           8
Left oblique ascender     Loa           9
Right oblique ascender    Roa           10
Arabic letter "Ain"       Ain           11
Arabic letter "Sad"       Sad           12
Space                     #             13

Therefore, every gene of the population has N chromosomes, and every chromosome takes one of 58 possible values (1 to 57 for the basic Arabic characters and 0 for characters with more than one visual indication), numbered from right to left. The extraction rate obtained is about 72 %. In the second stage, which aims to correct the weaknesses of the previous method, we developed a GA to select the best combination of visual codes extracted from a word by the heuristic method [11]. The GA approach permits the recognition of cursive handwriting without the limitation of a lexical dictionary (see figure 8). The convergence of the GA is ensured by the fitness function, which uses the visual codes of Arabic words and compares the strings of visual indices according to table 3 (see the example given in table 4). The number of generations (500) and the fitness value (0.5) were fixed as convergence criteria. According to figure 15, with a population size of 100 individuals, the recognition rate is about 99.85 %. These results are encouraging.

In this experiment we used the 500 words and the 57 Arabic letters extracted from the LMCA database. 200 words were used as prototypes for selecting the initial population of the GA; the others were used for testing our system. Figure 5a shows an example of some handwritten Arabic words extracted from the LMCA database.
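The comparison of two visual-code strings according to Table 3 can be sketched in Python as follows. The pairwise values are transcribed from Table 3 (0 for identical codes, 0.5 for the listed similar pairs, 1 otherwise), and the code labels follow that table's headers. The aggregation rule is an assumption, since the paper does not spell it out: this sketch sums the position-wise values and pads the shorter string with the space code "#". The names `code_distance` and `word_fitness` are hypothetical.

```python
from typing import Sequence

# Visual codes in the order of the Table 3 headers.
CODES = ["Va", "Oc", "Po", "Al", "Ja", "Ha", "Cod", "Cog", "Hog", "Hod", "Ain", "Sad", "#"]

# Unordered pairs scored 0.5 in Table 3; all other distinct pairs score 1.
HALF_PAIRS = {
    frozenset(pair)
    for pair in [
        ("Oc", "Cod"), ("Oc", "Cog"), ("Oc", "Ain"), ("Oc", "Sad"),
        ("Po", "Ja"),
        ("Al", "Ha"), ("Al", "Hog"), ("Al", "Hod"),
        ("Ha", "Hog"), ("Ha", "Hod"),
        ("Cod", "Ain"), ("Cod", "Sad"), ("Cog", "Ain"), ("Cog", "Sad"),
        ("Hog", "Hod"), ("Ain", "Sad"),
    ]
}


def code_distance(a: str, b: str) -> float:
    """Pairwise value of Table 3 for two visual codes."""
    if a not in CODES or b not in CODES:
        raise ValueError(f"unknown visual code: {a!r} or {b!r}")
    if a == b:
        return 0.0
    return 0.5 if frozenset((a, b)) in HALF_PAIRS else 1.0


def word_fitness(first: Sequence[str], second: Sequence[str]) -> float:
    """Assumed aggregation: sum of position-wise Table 3 values.

    The shorter sequence is padded with '#' (space); lower means closer.
    """
    n = max(len(first), len(second))
    a = list(first) + ["#"] * (n - len(first))
    b = list(second) + ["#"] * (n - len(second))
    return sum(code_distance(x, y) for x, y in zip(a, b))


if __name__ == "__main__":
    print(word_fitness(["Va", "Oc", "Al"], ["Va", "Cod", "Ha"]))
```

With this convention a value of 0 means the two code strings are identical, which fits the use of a small fitness value as a stopping criterion.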

Figure 8. On-line recognition of the handwritten Arabic word (processing chain: input word, pre-processing, visual encoding, GA, proposed word)

Table 3. Fitness values of every two visual codes

IV    Va    Oc    Po    Al    Ja    Ha    Cod   Cog   Hog   Hod   Ain   Sad   #
Va    0     1     1     1     1     1     1     1     1     1     1     1     1
Oc    1     0     1     1     1     1     0.5   0.5   1     1     0.5   0.5   1
Po    1     1     0     1     0.5   1     1     1     1     1     1     1     1
Al    1     1     1     0     1     0.5   1     1     0.5   0.5   1     1     1
Ja    1     1     0.5   1     0     1     1     1     1     1     1     1     1
Ha    1     1     1     0.5   1     0     1     1     0.5   0.5   1     1     1
Cod   1     0.5   1     1     1     1     0     1     1     1     0.5   0.5   1
Cog   1     0.5   1     1     1     1     1     0     1     1     0.5   0.5   1
Hog   1     1     1     0.5   1     0.5   1     1     0     0.5   1     1     1
Hod   1     1     1     0.5   1     0.5   1     1     0.5   0     1     1     1
Ain   1     0.5   1     1     1     1     0.5   0.5   1     1     0     0.5   1
Sad   1     0.5   1     1     1     1     0.5   0.5   1     1     0.5   0     1
#     1     1     1     1     1     1     1     1     1     1     1     1     0

Table 4. Example of fitness function calculation between two Arabic words (rows: first word, its visual indices, second word, its visual indices, fitness value)

4.3. Temporal order reconstruction from Arabic word images using LMCA database

The word image, captured in gray level at a resolution of 300 dpi, is preprocessed in four stages: binarisation, filtering, skeleton extraction and elimination of the diacritical signs (see figure 9). A suitable algorithm then segments the skeleton into three types of segments: connection segments, occlusion segments and end-of-stroke segments. The starting segment is located by scanning the skeleton image from right to left and applying further tests.

Figure 9. Image of the Arabic word KASSIRON (meaning "short") before and after the preprocessing

Another algorithm orders these segments on the basis of a set of heuristic rules. These rules rely on the fact that Arabic script is written from right to left and take into account the natural order of stroke generation [10]. To validate this approach, we tested it on a set of words extracted from the on/off LMCA database. The reconstructed temporal order signal (see figure 10) is compared with the original online trajectory signal (see figure 11).
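The comparison between a reconstructed temporal order and the original online trajectory can be made quantitative in several ways. The paper does not name a metric, so the Python sketch below is only one possible choice: both point sequences are resampled to the same number of points, evenly spaced by arc length, and the mean point-to-point distance is reported. The names `resample` and `mean_trajectory_distance` are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def resample(points: List[Point], n: int) -> List[Point]:
    """Return n points evenly spaced by arc length along the polyline."""
    if len(points) < 2 or n < 2:
        raise ValueError("need at least two points and n >= 2")
    cum = [0.0]  # cumulative arc length at each input point
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    if total == 0.0:
        return [points[0]] * n
    out: List[Point] = []
    seg = 0
    for i in range(n):
        target = total * i / (n - 1)
        # Advance to the segment that contains the target arc length.
        while seg < len(points) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        t = 0.0 if span == 0.0 else (target - cum[seg]) / span
        (x0, y0), (x1, y1) = points[seg], points[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


def mean_trajectory_distance(a: List[Point], b: List[Point], n: int = 50) -> float:
    """Mean distance between corresponding resampled points of a and b."""
    ra, rb = resample(a, n), resample(b, n)
    return sum(math.hypot(xa - xb, ya - yb) for (xa, ya), (xb, yb) in zip(ra, rb)) / n


if __name__ == "__main__":
    online = [(10.0, 0.0), (0.0, 0.0)]          # written right to left
    wrong_order = list(reversed(online))        # same ink, opposite order
    print(mean_trajectory_distance(online, online))
    print(mean_trajectory_distance(online, wrong_order, n=3))
```

Because the metric follows the order of the points, a skeleton traced in the wrong direction scores poorly even though it covers the same pixels.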

Figure 10. Restoration of the temporal order of the off-line Arabic word

Figure 11. Original on-line trajectory signal of the same word showing the correct order

5. Conclusion

The works presented and their results show that our LMCA database can be used for both modeling and recognition of Arabic handwriting. They also show that the LMCA database is a standard database with the same format as the common UNIPEN or IRONOFF databases. Our perspective is to increase the number of writers in the LMCA database, which will make it more effective for any modeling or classification technique applied to handwritten Arabic script.

Acknowledgement

The authors thank all the participants for their contribution to the LMCA database. They sincerely appreciate the suggestions and contributions of Dr. Volker Margner. In addition, they acknowledge the financial support of this work by grants from the German Academic Exchange Service (DAAD) and the General Direction of Scientific Research and Technological Renovation (DGRST), Tunisia, under the ARUB program 01/UR/11/02.

References

[1] Kherallah M., Haddad L., Mitiche A. and Alimi A. M. "Towards the design of handwriting recognition system by neurofuzzy and Beta Elliptical approaches". Proc. AIAI, 18th IFIP World Computer Congress, 2004, pp. 187-196.

[2] Jouini B., Kherallah M. and Alimi A. M. "A new approach for online visual encoding and recognition of handwriting script by using neural network system". In: David W. Pearson (Ed.), Artificial Neural Nets and Genetic Algorithms. Springer, Wien, 2003, pp. 161-167.

[3] Gader P., Mohamed M. and Chiang J. H. "Comparison of Crisp and Fuzzy character neural networks in handwritten word recognition". IEEE Trans. on Fuzzy Systems, Vol. 3, 1995, pp. 357-363.

[4] Kherallah M., Haddad L., Mitiche A. and Alimi A. M. "On-Line Recognition Of Handwritten Digits Based On Trajectory And Velocity Modelling". Pattern Recognition Letters, Vol. 29, 2007, pp. 580-594.

[5] Coté M., Cheriet M., Lecolinet E. and Suen C. Y. "Building a perception Based Model for Reading Cursive Script". Proc. 7th Int. Conf. on Document Analysis and Recognition ICDAR, vol. I, 1995, pp. 898-901.

[6] Lorigo L. M. and Govindaraju V. "Offline Arabic handwriting recognition: a survey". IEEE Trans. Pattern Anal. Mach. Intell., 28(5), May 2006, pp. 712-724.

[7] Jäger S. "Recovery dynamic information from static, handwritten word images". Ph.D. Thesis, Daimler-Benz AG Research and Tech., Verlag Dietmar Fölbach, 1998.

[8] Lallican P. M. and Viard-Gaudin C. "Offline handwriting modeling as a trajectory tracking problem". IWFHR'6, Taejon, Korea, Aug. 1998, pp. 347-356.

[9] Seiler R., Schenkel M. and Eggimann F. "Offline cursive handwriting recognition compared with On line recognition". ICPR'96, Vienna, pp. 505-509.

[10] Elbaati A., Kherallah M., Alimi A. M. and Ennaji A. "De Hors-Ligne Vers un Système de Reconnaissance En-Ligne : Application à la Modélisation de l'Écriture Arabe Manuscrite Ancienne". ANAGRAM 2006, Fribourg, Suisse.

[11] Kherallah M., Bouri F. and Alimi A. M. "Toward an On-Line Handwriting Recognition System Based on Visual Coding and Genetic Algorithm". Book chapter, David W. Pearson, Spring 2005, pp. 502-505. Coimbra, Portugal.
