Abstract:
The selection of features for extraction and classification
is an essential factor in achieving high performance in
character recognition. The feature extraction process produces
feature vectors that describe the shape and characteristics of a
pattern so that it can be identified uniquely. Many feature extraction
and classification approaches are available for Tamil and
other languages, but there is still room to identify a better set
of features that yields a higher recognition rate in
Optical Character Recognition (OCR) of printed Tamil text.
This research aims to produce an efficient set of features for
extraction that increases accuracy and reduces runtime, thereby
improving on the performance of the current best
OCR system for classifying isolated printed Tamil characters. The
proposed set of features is evaluated on a large dataset using
a One-versus-All (OVA) Support Vector Machine (SVM). Two
pools of feature vectors are created from the
features used in this study, such as basic, density, histogram
of oriented gradients (HOG), and transition features. In comparison
with the current best approach, the testing results of Pool 1
give better recognition accuracy, 94.87% for the OVA SVM
and 97.07% for the Unbalanced Decision Tree (UDT) SVM
algorithms, but do not improve recognition speed.
Likewise, Pool 2 improves the performance of
the system, giving not only better recognition accuracy,
94.30% for the OVA SVM and 96.35% for the UDT SVM
algorithms, but also a higher recognition speed
than the selected best OCR approach. The proposed set of
features improves the recognition rate by 2.57–3.14 percentage
points on OVA SVM and 3.22–3.94 percentage points on UDT SVM.
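
To make two of the feature types named above concrete, the following
minimal sketch computes zone density and transition counts for a small
binary glyph and concatenates them into one feature vector. The
definitions are common textbook forms and are assumptions here, not
necessarily those used in this study; the function names, the 2 x 2
zoning, and the toy glyph are hypothetical.

```python
def zone_density(img, zones=2):
    """Fraction of foreground (1) pixels in each cell of a zones x zones grid.

    Assumed definition of a "density" feature; the study may define it differently.
    """
    h, w = len(img), len(img[0])
    feats = []
    for zr in range(zones):
        for zc in range(zones):
            r0, r1 = zr * h // zones, (zr + 1) * h // zones
            c0, c1 = zc * w // zones, (zc + 1) * w // zones
            cell = [img[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            feats.append(sum(cell) / len(cell))
    return feats


def transitions(img):
    """Count background/foreground changes along each row, then each column.

    Assumed definition of a "transition" feature.
    """
    rows = [sum(a != b for a, b in zip(row, row[1:])) for row in img]
    cols = [sum(a != b for a, b in zip(col, col[1:])) for col in zip(*img)]
    return rows + cols


def feature_pool(img):
    """Concatenate the feature groups into a single vector (a hypothetical pool)."""
    return zone_density(img) + [float(t) for t in transitions(img)]


# Toy 4 x 4 binary glyph (1 = ink, 0 = background), for illustration only.
glyph = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]
vector = feature_pool(glyph)
```

A vector built this way could then be passed to any multi-class
classifier, for example a one-versus-all SVM; that step is omitted here.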