
Cyrillic-oriented MNIST. A dataset of Latin and Cyrillic letter images for text recognition.

DataScienceRetreat/CoMNIST

 
 


Cyrillic-oriented MNIST

CoMNIST services

A repository of images of hand-written Cyrillic and Latin alphabet letters for machine learning applications.

The repository currently contains more than 28,000 PNG images of 278x278 pixels, covering all 33 letters of the Russian alphabet and the 26 letters of the English alphabet. The images were hand-written on touch screens and collected through crowd-sourcing.

The dataset will be regularly extended with more data as the collection progresses.

An API that reads words in images

CoMNIST also provides a web service that reads a drawing and identifies the word or letter you have drawn. Along with the image, you can submit an expected word and get back the original image with mismatches highlighted (for educational purposes).

The API is available at http://35.187.34.5:5002/api/word and is accessed via a POST request with the following input:

{
    'img': Mandatory b64 encoded image, with letters in black on a white background
    'word': Optional string, the expected word to be read
    'lang': Mandatory string, either 'en' or 'ru', respectively for the Latin or Cyrillic (Russian) alphabet
    'nb_output': Mandatory integer, the "tolerance" of the engine
}
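As a rough illustration of the input described above, the following Python sketch builds the request fields and sends them with the standard library. Only the URL and the four field names come from this README. The helper names `build_payload` and `post_word` are hypothetical, and sending the fields as a JSON body (rather than form data) and getting JSON back are assumptions that should be checked against the service.

```python
import base64
import json
import urllib.request

# Address given in this README.
API_URL = "http://35.187.34.5:5002/api/word"


def build_payload(image_bytes, lang, nb_output, word=None):
    """Build the request fields (hypothetical helper, not part of CoMNIST)."""
    if lang not in ("en", "ru"):
        raise ValueError("lang must be 'en' or 'ru'")
    payload = {
        # The image should show letters in black on a white background.
        "img": base64.b64encode(image_bytes).decode("ascii"),
        "lang": lang,
        "nb_output": int(nb_output),
    }
    # 'word' is optional, so it is only added when supplied.
    if word is not None:
        payload["word"] = word
    return payload


def post_word(payload, url=API_URL, timeout=30):
    """POST the payload to the service.

    Assumption: the service accepts a JSON body and replies with JSON.
    The README does not state the encoding.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call might look like `post_word(build_payload(open("drawing.png", "rb").read(), "ru", 1))`, where `drawing.png` and the `nb_output` value of 1 are placeholders chosen for illustration.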

The return information is the following:

{
    'img': b64 encoded image; if a word was supplied as input, a modified version of that image with mismatches highlighted
    'word': string, the word that was read by the API
}
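The returned fields can be unpacked along these lines. This is a minimal sketch that assumes the reply has already been parsed into a Python dict with the two keys listed above. The function name `parse_reply` is hypothetical, and treating a missing 'img' as `None` is a defensive choice rather than documented behaviour.

```python
import base64


def parse_reply(reply):
    """Return (word, image_bytes) from a reply dict (hypothetical helper)."""
    word = reply["word"]
    encoded = reply.get("img")
    # Decode the base64 image back to raw bytes if one was returned.
    image_bytes = base64.b64decode(encoded) if encoded is not None else None
    return word, image_bytes
```

The decoded bytes can then be written to a file opened in binary mode to inspect the image, including any highlighted mismatches.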

Participate

The objective is to gather at least 1000 images of each class, so your contribution is more than welcome! One minute of your time is enough, and don't hesitate to ask your friends and family to participate as well.

English version - Draw Latin-only letters plus letters common to Cyrillic and Latin

French version - Draw Latin-only letters plus letters common to Cyrillic and Latin

Russian version - Draw Cyrillic only

Find out more about CoMNIST on my blog

Credits and license

A big thanks to all the contributors!

These images have been crowd-sourced thanks to the great web design by Anna Migushina, available on her GitHub.

CoMNIST logo by Sophie Valenina

Creative Commons License
CoMNIST by Gregory Vial is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
