Abstract
This paper presents a novel approach to Sinhala handwritten character recognition using a part-based matching technique. Sinhala is the language used by the Sinhalese, the major ethnic group in Sri Lanka. The Sinhala character set contains some common parts, so the characters can be split into their parts. Each part can in turn be considered an atomic element of which these characters are composed. The proposed method splits the characters into their atomic parts and then conducts the recognition process. Template matching is used to compare the character parts and characters. To improve the recognition process, the global characteristics of the characters are used.
Keywords: Sinhala characters, Part-based approach

Introduction
Character recognition is the process of converting images of handwritten, typewritten or printed text into machine-encoded text (Schantz, 1982). Most handwritten character recognition methods have been proposed for scripts such as English and Chinese. Few attempts have been made on Asian languages such as Sinhala. Sinhala is the language used by the Sinhala people, the major ethnic group in Sri Lanka. The characters in the Sinhala alphabet share some common parts. Hence Sinhala characters can be split into a set of parts, which can be considered the basic elements of which these characters are composed. Each Sinhala character can be formed from a set of distinct parts. Part-based approaches have been tried for object recognition with promising results.
Most of this work considers the problem of matching corresponding parts of objects across different images. Part-based methods can be applied to character recognition as well. No research has been done on evaluating the influence of decomposing Sinhala characters, recognizing them part by part, and combining the results for character recognition.
Most of the related work on decomposing characters has been carried out on Chinese characters (Cao, R. & Tan, C. L., 2000), (Lin, F. & Tang, X., 2002).
The reason is that Chinese characters can be easily decomposed into a set of basic character parts. Because Chinese characters are largely composed of straight lines, some researchers have used the method of identifying the strokes of characters (Lin, F. & Tang, X., 2002), (Su, Y. M. & Wang, J. F., 2003). This method is not applicable to Sinhala characters, as straight lines are almost non-existent in the Sinhala character set.
The work of Matsuo et al. (Matsuo, T., Song, W., Feng, Y., & Uchida, S., 2013) is more relevant to the present research. They work on Chinese characters, but they decompose the characters into parts that are short segments of an entire character. They represent each part as a segment comprising (2k+1) consecutive points, where k is the radius of the part. They used 80 elementary Chinese character classes and extracted 25 representative parts. Each handwritten character was then resampled to 50 points, and through that each character was converted into a set of 50 parts. Those parts are then represented as a bag-of-features, which is a histogram showing how many parts similar to a specific representative part exist in the character.
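As a rough illustration of the bag-of-features representation described above, the following sketch counts, for one character, how many of its parts lie nearest to each representative part. It is not the implementation of Matsuo et al.: the function name `bag_of_features`, the flattening of each part into a fixed-length vector, and the use of Euclidean distance to decide which representative a part is "similar to" are all assumptions made for illustration.

```python
import numpy as np

def bag_of_features(parts, representatives):
    """Histogram of nearest-representative assignments (illustrative only).

    parts:           array of shape (n_parts, d), one flattened part per row
    representatives: array of shape (n_reps, d), one representative part per row
    """
    parts = np.asarray(parts, dtype=float)
    representatives = np.asarray(representatives, dtype=float)
    # Squared Euclidean distance between every part and every representative,
    # giving an array of shape (n_parts, n_reps).
    diff = parts[:, None, :] - representatives[None, :, :]
    dists = np.sum(diff ** 2, axis=2)
    # Assumption: each part votes for its single nearest representative.
    nearest = np.argmin(dists, axis=1)
    return np.bincount(nearest, minlength=len(representatives))
```

With 50 parts per character and 25 representatives, as in the setup described above, the result would be a 25-bin histogram whose counts sum to 50.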
Matsuo et al. show that, without any global structure information, 50-60% accuracy can be attained for 80 Chinese character classes.
In digit recognition, only 10 classes (0 to 9) need to be distinguished, whereas in character recognition the number of classes is much higher. (Wang, S., Uchida, S., Liwicki, M., & Feng, Y., 2013) have presented a study of the behavior of several part-based methods for handwritten digit recognition.
According to them, even without using the global structure of the digits, the part-based method can achieve promising recognition rates for digit recognition.
Template matching is a technique for locating a template image within another image: it searches the source image for the pattern most similar to the template image. Template matching is used in the work of (Kumar, S. & Sharma, P., 2013) for offline handwritten and typewritten character recognition. (Qatran, 2011) has also used a template matching method to recognize the Musnad alphabet, which is considered the basic alphabet of the modern Arabic language.

Methodology
A set of 24 characters from the Sinhala alphabet was selected for the investigation. The selected character set excludes the modifier symbols and less frequently used characters. A dataset for the characters was created, consisting of two sets: a set of character parts and a set of characters. To create the dataset, several samples of each character were written on a blank A4 sheet and scanned at 300 dpi resolution to create an image in JPEG format. Character parts and characters were taken from the scanned images.
When creating the images, it was ensured that the character images were larger than the largest part image. All the images were then thresholded to create binary images. Because character parts are assembled into complete characters, a set of patterns can be identified that helps to increase the accuracy of an identified character. An example of a pattern is that if part no. 03 is present, it is always the character ?. The set of character parts is shown in Table (i).
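The thresholding step mentioned above can be sketched as follows. The source does not state which thresholding method or threshold value was used, so the fixed global threshold of 128 and the helper name `to_binary` are assumptions for illustration. Dark (ink) pixels are mapped to 1 and background pixels to 0.

```python
import numpy as np

def to_binary(gray, threshold=128):
    """Convert an 8-bit grayscale image to a binary image (illustrative only).

    Assumption: pixels darker than `threshold` are ink (1), all others are
    background (0). The threshold value is hypothetical, not from the paper.
    """
    gray = np.asarray(gray)
    return (gray < threshold).astype(np.uint8)
```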
Table (i). Set of character parts
Part No.  Part    Part No.  Part    Part No.  Part    Part No.  Part
01                08                15                22
02                09                16                23
03                10                17                24
04                11                18                25
05                12                19                26
06                13                20
07                14                21

In order to get a rating based on the match between a particular character and a character part, template matching is used.
Template matching searches the source image for the pattern most similar to the template image. It requires a template image and a source image. In this case the template is the character part image and the source image is the character image. In recognition, all the parts are compared with a particular character and a rating based on the match is obtained for each part. Then, for each character class, a matching score can be calculated from the individual part matching ratings for that class. This matching score, computed as a simple average, does not use the character rules defined earlier.
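The rating and simple-average steps above can be sketched as follows. The source does not specify the similarity measure, so zero-mean normalized cross-correlation is an assumption here, as are the function names `match_rating` and `class_score`. The sketch slides the part image over the character image and keeps the best rating, which relies on the character image being at least as large as the part image, as the dataset was prepared.

```python
import numpy as np

def match_rating(source, template):
    """Best zero-mean normalized cross-correlation of `template` over `source`.

    Assumption: this similarity measure is illustrative; the paper does not
    name the measure it used. Ratings lie in [-1, 1], higher is better.
    """
    source = np.asarray(source, dtype=float)
    template = np.asarray(template, dtype=float)
    th, tw = template.shape
    sh, sw = source.shape
    if th > sh or tw > sw:
        raise ValueError("template must not be larger than the source image")
    t = template - template.mean()
    t_norm = np.sqrt(np.sum(t ** 2))
    best = -1.0
    # Slide the template over every position where it fits entirely.
    for y in range(sh - th + 1):
        for x in range(sw - tw + 1):
            w = source[y:y + th, x:x + tw]
            w = w - w.mean()
            denom = t_norm * np.sqrt(np.sum(w ** 2))
            # A flat window or flat template carries no pattern: rate it 0.
            score = float(np.sum(w * t) / denom) if denom > 0 else 0.0
            best = max(best, score)
    return best

def class_score(character, part_images):
    """Simple average of the part ratings for one character class."""
    ratings = [match_rating(character, part) for part in part_images]
    return sum(ratings) / len(ratings)
```

Here `part_images` stands for the part images associated with one character class; calling `class_score` once per class and taking the highest score gives a rule-free baseline of the kind described above.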
A better matching score can be calculated using the character rule set and the position of the character parts. Three horizontal and two vertical regions were considered when matching the character parts. Therefore, considering all the combinations of horizontal and vertical regions, there were six possible character regions. Table (ii) shows the identified regions of a character.

Table (ii). Regions of a character
Example Image    Region                        Example Image    Region
                 Upper region (Horizontal)                      Left region (Vertical)
                 Middle region (Horizontal)                     Right region (Vertical)
                 Lower region (Horizontal)

Results and Discussion
200 handwritten characters were compared against each character part.
Thereafter, weighted average calculations were obtained for each test character for each character class. The individual accuracies of the final results are shown in Table (iii).

Table (iii). Individual character class matching accuracy
Character    Percentage    Character    Percentage
?            100%          ?            44%
?            44%           ?            22%
?            67%           ?            56%
?            11%           ?            22%
?            22%           ?            11%
?            77%           ?            67%
?            89%           ?            44%
?            67%           ?            66%
?            89%           ?            22%
?            55%           ?            11%
?            55%           ?            11%
?            11%

Conclusion
We have experimented with a simple approach to Sinhala handwritten character recognition using a part-based matching technique.
The matching score was computed using template matching to identify similar patterns between the test character and the character part. Experimental results show that the proposed method gives average results, with individual character class accuracies ranging from 11% to 100%.