The Chars74K dataset

Character Recognition in Natural Images

[ jump to download ]

Character recognition is a classic pattern recognition problem for which researchers have worked since the early days of computer vision. With today's omnipresence of cameras, the applications of automatic character recognition are broader than ever. For Latin script, this is largely considered a solved problem in constrained situations, such as images of scanned documents containing common character fonts and uniform background. However, images obtained with popular cameras and hand held devices still pose a formidable challenge for character recognition. The challenging aspects of this problem are evident in this dataset.

In this dataset, symbols used in both English and Kannada are available.

In the English language, Latin script (excluding accents) and Hindu-Arabic numerals are used. For simplicity we call this the "English" characters set. Our dataset consists of:

This gives a total of over 74K images (which explains the name of the dataset).

The compound symbols of Kannada were treated as individual classes, meaning that a combination of a consonant and a vowel leads to a third class in our dataset. Clearly this is not the ideal representation for this type of script, as it leads to a very large number of classes. However, we decided to use this representation for our baseline evaluations present in [deCampos et al] as a way to evaluate a generic recognition method for this problem.


The following paper gives further descriptions of this dataset and baseline evaluations using a bag-of-visual-words approach with several feature extraction methods and their combination using multiple kernel learning:

T. E. de Campos, B. R. Babu and M. Varma. Character recognition in natural images. In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP), Lisbon, Portugal, February 2009.
Bibtex | Abstract | PDF

Follow this link for a list publications that have cited the above paper and this link for papers that mention this dataset.


Disclaimer: by downloading and using the datasets below (or part of them) you agree to acknowledge their source and cite the above paper in related publications. We will be grateful if you contact us to let us know about the usage of the our datasets.

Sample usage

For experiments with Chars74K-15, i.e., train with 15 samples per class and test with other 15 samples per class for the images of "English" characters in the wild:
  1. Download and unpack EnglishImg.tgz and ListsTXT.tgz
  2. In Octave (or MatLab), run list_English_Img to load the lists to the memory.
  3. The training set for this particular experiment will be indexed by list.TRNind(:,end) (so please copy the result of this to a variable), i.e., it is defined by the last split of training data. Note that sum(list.TRNind(:,end)>0)/list.NUMclasses results in 15, confirming that there are 15 training samples per class.
  4. The test set will be defined by list.TSTind(:,end) (Note that here the index end can be replaced by any valid number because the columns of list.TSTind are repetitions of the same list. This is because we have fixed the test set for all the experiments with 15 samples per class.)
  5. The class labels (ground truth) are obtained with list.ALLlabels(list.TRNind(:,end)) for the training set and list.ALLlabels(list.TSTind(:,end)) for the test set. Again, note that for each class label, there should be 15 samples, i.e., sum(list.ALLlabels(list.TSTind(:,end))==x) results in 15 for x=1:list.NUMclasses
  6. You can then proceed with your experiments by selecting training files following this: list.ALLnames(list.TRNind(c,end),:) for training and list.ALLnames(list.TSTind(c,end),:) for testing, where c is the index that iterates through the samples (from 1 to 930 in the case of this experiment, but it can be bound by class labels, depending on how you implement your experiment). Note that list.ALLnames does not include the file extension (png) and the absolute path of the images, you need to append these to the string.
Note that the above can also be done for the validation and vocabulary lists if they are used in the same experiment (list.VALind and list.TXNind)

Sample characters

(In addition to the ones shown following the links above)

Credits and Acknowledgements

This dataset and the experiments present in the paper were done at Microsoft Research India by T de Campos, with the mentoring support from M Varma. Additional SVM and MKL experiments were performed by BR Babu.

We would like to acknowledge the help of several volunteers who annotated this dataset. In particular, we would like to thank Arun, Kavya, Ranjeetha, Riaz and Yuvraj. We would also like to thank Richa Singh and Gopal Srinivasa for developing some of the tools for annotation (one of the tools used is described here). We are grateful to CV Jawahar for helpful discussions.

We thank the CVSSP/Surrey for hosting this web page. T de Campos also thanks Xerox RCE for the support while he finalized the paper.

Main contact: T.deCampos
Last modified: Mon Oct 15 10:47:12 BST 2012