
Abstract

This paper presents a novel approach to Sinhala handwritten character
recognition using a part-based matching technique. Sinhala is the language of
the Sinhalese, the major ethnic group in Sri Lanka. The characters of the
Sinhala character set share some common parts, so each character can be split
into its parts. Each part can in turn be considered an atomic element of which
the characters are composed. The proposed method splits the characters into
their atomic parts and then conducts the recognition process. Template matching
is used to compare the character parts with the characters. To improve the
recognition process, the global characteristics of the characters are also used.

 

Keywords: Sinhala characters, Part-based approach

 

Introduction

Character recognition is the process of converting images of handwritten,
typewritten or printed text into machine-encoded text (Schantz, 1982). Most
handwritten character recognition methods have been proposed for scripts such
as English and Chinese. Few attempts have been made for other Asian languages
such as Sinhala. Sinhala is the language used by the Sinhala people, the major
ethnic group in Sri Lanka. The characters in the Sinhala alphabet share some
common parts. Hence Sinhala characters can be split into a set of parts, which
can be considered the basic elements of which these characters are composed.
Each Sinhala character can be formed from a set of distinct parts.

Part-based approaches have been applied to object recognition with promising
results. Most of this work considers the problem of matching corresponding
parts of objects across different images. Part-based methods can be applied to
character recognition as well. No research has been done on evaluating the
influence of decomposing Sinhala characters, recognizing them part by part, and
combining the results for character recognition.

Most of the related work on decomposing characters has been carried out on
Chinese characters (Cao, R. & Tan, C. L., 2000), (Lin, F. & Tang, X., 2002).
The reason is that Chinese characters can be easily decomposed into a set of
basic character parts. Because Chinese characters are largely built from
straight lines, some researchers have used the method of identifying the
strokes of characters (Lin, F. & Tang, X., 2002), (Su, Y. M. & Wang, J. F.,
2003). This method is not applicable to Sinhala characters, as straight lines
are almost non-existent in the Sinhala character set.

The work of Matsuo et al. (Matsuo, T., Song, W., Feng, Y., & Uchida, S., 2013)
is more relevant to the present research. They work on Chinese characters, but
they decompose the characters into parts that are short segments of an entire
character. Each part is represented as a segment comprising (2k+1) consecutive
points, where k is the radius of the part. They used 80 elementary Chinese
character classes and extracted 25 representative parts. Each handwritten
character was then resampled to 50 points and thereby converted into a set of
50 parts. Those parts are then represented as a bag-of-features, which is a
histogram showing how many parts similar to each representative part exist in
the character. They show that, without any global structure information, an
accuracy of 50-60% can be attained for 80 Chinese character classes.
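
The part extraction and bag-of-features steps described above can be sketched
roughly as follows. This is only an illustrative sketch, not the implementation
of Matsuo et al.: the function names, the repetition of end points so that
every point gets a part, the centring of each part on its middle point, and the
nearest-representative assignment are all assumptions made for the example.

```python
import numpy as np

def extract_parts(points, k):
    """Return one part per point: the (2k+1) consecutive points around it."""
    pts = np.asarray(points, dtype=float)
    # Assumption: repeat the first/last point so that end points also get a part.
    padded = np.pad(pts, ((k, k), (0, 0)), mode="edge")
    parts = np.stack([padded[i:i + 2 * k + 1] for i in range(len(pts))])
    # Assumption: subtract the centre point so parts do not depend on position.
    return parts - parts[:, k:k + 1, :]

def bag_of_features(parts, representatives):
    """Histogram of how many parts are closest to each representative part."""
    flat = parts.reshape(len(parts), -1)
    reps = np.asarray(representatives, dtype=float)
    reps = reps.reshape(len(reps), -1)
    # Euclidean distance from every part to every representative part.
    dists = np.linalg.norm(flat[:, None, :] - reps[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    return np.bincount(nearest, minlength=len(reps))

# Usage: a character resampled to 50 points yields 50 parts of 2k+1 points each.
stroke = [(i, 0) for i in range(50)]          # hypothetical horizontal stroke
horizontal = [(-2, 0), (-1, 0), (0, 0), (1, 0), (2, 0)]
vertical = [(0, -2), (0, -1), (0, 0), (0, 1), (0, 2)]
parts = extract_parts(stroke, k=2)
histogram = bag_of_features(parts, [horizontal, vertical])
```

The histogram discards where each part occurred in the character, which is why
such a representation carries no global structure information.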

In digit recognition, only 10 classes (0 to 9) need to be distinguished,
whereas in character recognition the number of classes is much higher. Wang et
al. (Wang, S., Uchida, S., Liwicki, M., & Feng, Y., 2013) presented a study of
the behavior of several part-based methods for handwritten digit recognition.
According to them, even without using the global structure of the digits, the
part-based method can achieve promising recognition rates for digit
recognition.

Template matching is a technique for locating a template image within another
image: it searches the source image for the pattern most similar to the
template image. Template matching is used in the work of (Kumar, S. & Sharma,
P., 2013) for offline handwritten and typewritten character recognition.
(Qatran, 2011) also used a template matching method to recognize the Musnad
alphabet, which is considered the basic alphabet of the modern Arabic language.

 

Methodology

A set of 24 characters from the Sinhala alphabet was selected for the
investigation. The selected character set excludes the modifier symbols and
less frequently used characters.

A dataset was created consisting of two sets: a set of character parts and a
set of characters. To create the dataset, several samples of each character
were written on a blank A4 sheet and scanned at a resolution of 300 dpi to
create an image in JPEG format. Character parts and characters were extracted
from the scanned images. When creating the images, it was ensured that the
character images were larger than the largest part image. All the images were
then thresholded to create binary images.
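
The thresholding step can be illustrated with a minimal sketch. The text does
not state which thresholding method or threshold value was used, so the fixed
value of 128 and the function name below are assumptions for illustration only.

```python
import numpy as np

def to_binary(gray, threshold=128):
    """Return a 0/1 image: 1 where a pixel is darker than the threshold (ink)."""
    gray = np.asarray(gray)
    # Assumption: dark ink on a light page, so ink pixels fall below the threshold.
    return (gray < threshold).astype(np.uint8)

# Usage with a tiny hypothetical greyscale patch (0 = black, 255 = white).
patch = [[0, 200],
         [127, 128]]
binary = to_binary(patch)
```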

Because character parts assemble into complete characters, a set of patterns
can be identified that helps to increase the accuracy of character
identification. An example of such a pattern is that if part no. 03 is present,
it is always the character ?.
The set of character parts is shown in Table (i).

 

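Table (i). Set of character parts

Part No.   Part      Part No.   Part      Part No.   Part      Part No.   Part
01         [image]   08         [image]   15         [image]   22         [image]
02         [image]   09         [image]   16         [image]   23         [image]
03         [image]   10         [image]   17         [image]   24         [image]
04         [image]   11         [image]   18         [image]   25         [image]
05         [image]   12         [image]   19         [image]   26         [image]
06         [image]   13         [image]   20         [image]
07         [image]   14         [image]   21         [image]
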

To obtain a rating for the match between a particular character and a
character part, template matching is used. Template matching searches the
source image for the pattern most similar to the template image, so it requires
both a template image and a source image. In this case the template is the
character part image and the source image is the character image.
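
A minimal sketch of such a matching step is given below. The text does not
specify the similarity measure, so normalized cross-correlation is assumed
here; the function name and the brute-force search are likewise illustrative
rather than the implementation used in this work.

```python
import numpy as np

def match_template(source, template):
    """Slide the template over the source image.

    Returns (best_score, (row, col)), where (row, col) is the top-left corner
    of the best-matching window and best_score lies in [-1, 1].
    """
    src = np.asarray(source, dtype=float)
    tpl = np.asarray(template, dtype=float)
    th, tw = tpl.shape
    sh, sw = src.shape
    if th > sh or tw > sw:
        raise ValueError("template must not be larger than the source image")
    tpl_c = tpl - tpl.mean()
    tpl_norm = np.sqrt((tpl_c ** 2).sum())
    best_score, best_pos = -1.0, (0, 0)
    for r in range(sh - th + 1):
        for c in range(sw - tw + 1):
            win = src[r:r + th, c:c + tw]
            win_c = win - win.mean()
            denom = tpl_norm * np.sqrt((win_c ** 2).sum())
            # Flat windows (or a flat template) carry no pattern: score them 0.
            score = (win_c * tpl_c).sum() / denom if denom > 0 else 0.0
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_score, best_pos

# Usage: a hypothetical 3x3 "part" placed inside a 6x6 "character" image.
part = np.array([[1, 0, 0],
                 [1, 0, 0],
                 [1, 1, 1]])
character = np.zeros((6, 6))
character[2:5, 3:6] = part
score, position = match_template(character, part)
```

The requirement that character images be larger than the largest part image
corresponds to the size check at the top of the function.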

In recognition, all the parts are compared with a particular character and a
rating for each match is obtained. Then, for each character class, the matching
score is calculated from the individual part matching ratings for that
character class.
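
The simple-average matching score can be sketched as follows. The mapping from
character classes to their parts is not listed in the text, so the `class_parts`
dictionary, the class labels and the rating values below are hypothetical
placeholders.

```python
def class_scores(part_ratings, class_parts):
    """Average the ratings of the parts that make up each character class."""
    scores = {}
    for cls, parts in class_parts.items():
        scores[cls] = sum(part_ratings[p] for p in parts) / len(parts)
    return scores

def recognise(part_ratings, class_parts):
    """Return the character class with the highest matching score."""
    scores = class_scores(part_ratings, class_parts)
    return max(scores, key=scores.get)

# Usage with hypothetical ratings for three parts and two classes.
part_ratings = {1: 0.9, 2: 0.8, 3: 0.2}      # part no. -> template match rating
class_parts = {"A": [1, 2], "B": [2, 3]}     # hypothetical class compositions
scores = class_scores(part_ratings, class_parts)
best = recognise(part_ratings, class_parts)
```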

The matching score above, a simple average, does not use the character rules
defined earlier. A better matching score can be calculated using the character
rule set and the positions of the character parts. Three horizontal and two
vertical regions were considered when matching the character parts. Considering
all combinations of horizontal and vertical regions therefore gives six
possible character regions. Table (ii) shows the identified regions of a
character.
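
One way to assign a matched part to one of the six regions is sketched below.
The text does not give the region boundaries, so equal thirds for the
upper/middle/lower bands and an even left/right split are assumptions, as is
the function name.

```python
def region_of(center_row, center_col, height, width):
    """Return (band, side) for a point inside a height x width character image."""
    bands = ("upper", "middle", "lower")
    # Assumption: the three horizontal bands have equal height.
    band = bands[min(int(3 * center_row / height), 2)]
    # Assumption: the left/right split is at the middle column.
    side = "left" if center_col < width / 2 else "right"
    return band, side

# Usage: centres of matched parts in a hypothetical 90 x 60 character image.
first = region_of(5, 10, 90, 60)
second = region_of(45, 40, 90, 60)
third = region_of(89, 0, 90, 60)
```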

 

Table (ii). Regions of a character

Example Image   Region                       Example Image   Region
[image]         Upper region (Horizontal)    [image]         Left region (Vertical)
[image]         Middle region (Horizontal)   [image]         Right region (Vertical)
[image]         Lower region (Horizontal)

 

 

 

Results and Discussion

A total of 200 handwritten characters were compared with each character part.
Thereafter, a weighted average was calculated for each test character and each
character class.
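
A weighted average of part ratings can be sketched as below. The weights used
in the experiments are not stated, so the values here are purely hypothetical;
for illustration, a part found in its expected region is given full weight and
a part found elsewhere is given half weight.

```python
def weighted_score(ratings, weights):
    """Weighted average of part ratings for one character class."""
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(r * w for r, w in zip(ratings, weights)) / total

# Usage: two part ratings; the second part was found outside its expected
# region, so it receives the hypothetical reduced weight of 0.5.
example = weighted_score([0.9, 0.4], [1.0, 0.5])
```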

The individual accuracies of the final results are shown in Table (iii).

 

Table (iii). Individual character class matching accuracy

Character   Percentage   Character   Percentage
?           100%         ?           44%
?           44%          ?           22%
?           67%          ?           56%
?           11%          ?           22%
?           22%          ?           11%
?           77%          ?           67%
?           89%          ?           44%
?           67%          ?           66%
?           89%          ?           22%
?           55%          ?           11%
?           55%          ?           11%
?           11%

 

 

 

 

Conclusion

We have experimented with a simple approach to Sinhala handwritten character
recognition using a part-based matching technique. The matching score was
computed using template matching to identify similar patterns between the test
character and the character part. Experimental results show that the proposed
method gives moderate results, with individual character class accuracies
ranging from 11% to 100% (Table (iii)).
