Wikisource:Tesseract OCR

The Tesseract OCR tool adds a toolbar button in the Page namespace that extracts text from the current page's image using the Tesseract.js OCR engine, a pure JavaScript port of the popular Tesseract OCR engine.

Note that only some languages are supported.

Setting up

You can load Tesseract OCR for yourself by adding the following to your common.js:

mw.loader.load( '//wikisource.org/w/index.php?title=User:Putnik/TesseractOCR.js&action=raw&ctype=text/javascript' );
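Because the tool only adds its button in the Page namespace, the load call can optionally be wrapped in a namespace check so the script is not fetched on other pages. The following is a minimal sketch, not part of the gadget itself: the helper name loadTesseractOcr is hypothetical, and the check assumes the canonical namespace name reported by mw.config is 'Page'.

```javascript
// URL copied from the load call above.
var TESSERACT_OCR_URL = '//wikisource.org/w/index.php?title=User:Putnik/TesseractOCR.js&action=raw&ctype=text/javascript';

// Hypothetical helper: loads the gadget only on Page-namespace pages.
// `mwObject` is the global `mw` object provided by MediaWiki.
function loadTesseractOcr( mwObject ) {
	// Assumption: the canonical namespace name is 'Page'.
	if ( mwObject.config.get( 'wgCanonicalNamespace' ) !== 'Page' ) {
		return false;
	}
	mwObject.loader.load( TESSERACT_OCR_URL );
	return true;
}

// In common.js the global `mw` exists; the guard keeps the snippet
// harmless if it is run outside MediaWiki.
if ( typeof mw !== 'undefined' ) {
	loadTesseractOcr( mw );
}
```

Passing mw in as a parameter, rather than reading the global inside the function, keeps the helper easy to try out with a stand-in object.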

To translate the gadget's messages into your language, define a tesseractOcrI18n object ahead of the load call, as shown here, replacing each right-hand string with your translation:

var tesseractOcrI18n = {
	'loading tesseract core': 'Loading Tesseract core',
	'initializing tesseract': 'Initializing Tesseract',
	'loading language traineddata': 'Loading language traineddata',
	'initializing api': 'Initializing API',
	'recognizing text': 'Recognizing text',

	'no text': 'No text retrieved from Tesseract',
	'image not found': 'No image found on this page',
	'button label': 'Get text via Tesseract OCR',
	'loading indicator': 'Animated loading indicator',
};

mw.loader.load( '//wikisource.org/w/index.php?title=User:Putnik/TesseractOCR.js&action=raw&ctype=text/javascript' );
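When translating, it is easy to drop or misspell one of the message keys. The sketch below is a hypothetical helper, not part of the gadget, that compares a translation object against the key list from the example above and returns the keys that are missing. The names TESSERACT_OCR_KEYS and missingTesseractOcrKeys are made up for illustration, and this page does not say how the gadget itself treats a missing key.

```javascript
// Message keys taken from the tesseractOcrI18n example above.
var TESSERACT_OCR_KEYS = [
	'loading tesseract core',
	'initializing tesseract',
	'loading language traineddata',
	'initializing api',
	'recognizing text',
	'no text',
	'image not found',
	'button label',
	'loading indicator'
];

// Hypothetical helper: returns the keys for which `translation`
// has no string value.
function missingTesseractOcrKeys( translation ) {
	return TESSERACT_OCR_KEYS.filter( function ( key ) {
		return typeof translation[ key ] !== 'string';
	} );
}

// Usage: an incomplete translation object with placeholder values.
var partialTranslation = {
	'button label': 'Translated button label',
	'no text': 'Translated no-text message'
};
console.log( missingTesseractOcrKeys( partialTranslation ) );
```

Running the check in the browser console before saving common.js lists the messages that still need a translation.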

Development

See also