User:Seb35/Reverse OCR

From Wikisource
Jump to navigation Jump to search

Reverse OCR[edit]


By discussing with fr:User:Jean-Frédéric and others, the idea of reinjecting the corrected wikitext of Wikisource in the DjVuS appeared. This is not very complicated (also there is some work to handle properly the -- poor compared to Wikipedia -- wikitext of Wikisource), but, related to my PhD thesis, I thought we can make more than that.

We have three sources of data:

  • the corrected wikitext of Wikisource, which is probably a very good approximation of the true text (humans can do errors, but less than the machines in this domain);
  • the original OCR, if this one has a presumed good quality;
  • the layout of the page and particularly the position of each word, when the OCR gives this type of information (the XML ALTO widely used by the BnF and probably by the LoC, for instance) or by using a specialized program (document segmentation and so on in research subjects).

So we can do three types of delivrables:

  • retrieve the position of each word/character on the page (and so reinjected properly the wikitext into DjVu text layers, and/or the XML ALTO, etc.);
  • give a score to each word/page/book by matching the image of each WS-word and/or OCR-word on the original image to give a score to the quality of the WS-book and/or to the quality of the OCR;
  • try to correct/suggest errors on the remaining/non-matching words, either by using the original OCR if this one match better either by re-do an OCR pass on these regions/words, but at least point out the non-matching words.

Finally, such a thing would help 1) to evaluate the quality of the corrected WS book and the quality of the OCR; 2) to point out and/or correct the remaining "errors"; 3) to reinject the triple-checked text (double-check by WS and one check by this process) into the XML ALTO and/or DjVu books.

Another type of use of a Reverse_OCR is the possibility of coupling this with an OCR by doing a loop between OCR and Reverse_OCR until convergence on a word: the OCR proposes one or many solutions, the Reverse_OCR checks and either gives a better answer either refuses all solutions, and this information could be given to the OCR (perhaps coupling this with a neural network for automatic learning but I don’t know this field of research).


Layout analysis[edit]

Fields of research linked and keywords: classification, automatic document processing, pattern recognition (too large field), computer vision (too huge field)

Example of work:


I thought particularly to methods I use in my PhD thesis, although it need some adaptation compared to my thesis: probabilistic methods with particles filters, Hidden Markov Models, etc. In fact particularly particle filters [1] which is very flexible in the model you use.

Basically you have N particles which represent possibilities/eventualities, each of them is more or less probable.

  1. Initialization: initialize the position of the first word and the first word, this could be the more tricky operation
  2. Iterate on the words
    1. Forecast: predict roughly the position of the next word
    2. Correction/update: weight the particles according to the likelihood of the observation; resample the set to remove lesser probable particles

It needs a model in the prediction. In first approximation this would be the position (x,y) of the previous word ("current state") plus the length (in pixels) of the previous word. This is a (very bad) approximation because of: 1) linebreaks; 2) typography (font, bold, italics, subpixel, kerning, etc.); 3) caesurae (the current "word" could be only a part of the word); 4) columns; 5) "orphan" words (header, footer, nonlinear texts like exotic poetry, etc.)

It needs a model of noise on the dynamical model. It could be a gaussian noise as usual when one don’t know exactly the type of noise we are face to, zero-mean and some pixels of standard deviation (or estimated from the uncertainty of the previous word). The std could be more or less high for certain parameters: for instance generally the font isn’t supposed to change frequently.

It needs an observation model. This could be the image of the estimated word given all parameters of the state vector, which would be checked against the original image.

Finally it needs an observation noise, which could be a gaussian noise. This noise represents the noise of the digitization. It is probably hightly decorrelated, or perhaps correlated only with the immediate neighbour pixels.

State space[edit]

Very basically the state space contains the word and its position, but whether it is the better one will probably be for the results.

  • Word length in letters
  • Word
  • Position x
  • Position y
  • Font
  • Is bold?
  • Is italics?
  • Caesura state (no one, first part, second part)
  • Caesura index (index of the letter in the word)

Tools I intend to use[edit]

  • Language: C++
    I began with Python, but FreeType doesn’t have a proper Python interface (I aggree there exists some tools (I saw in Boost libraries) to automatically use C/C++ libraries in Python), but I’m not really comfortable with Python, but I’m comfortable with C/C++ and a lot of libraries I intend to use are in C/C++);
  • Unicode handling: ICU [2]
  • Particle filter: probably SMCTC [3]
  • Image handling: CImg with FreeType plug-in [4], an alternative is OpenCV [5]
    I know CImg and it is quite easy to use, although I agree it could have limitation on the long term
    I don’t know OpenCV but it is the right tool to handle powerfully images
  • Image creation of text: FreeType [6], there exists also Pango [7] and Cairo [8]
    FreeType is the leader in image creation of text (your Unix system probably use it in lower layers) and is very well documented; since it is a low-layer library it doesn’t handle complex layouts (which must be done in upper layers): bold, italics (the face must be chosen), color, BiDi and i18n (I think and these types of features)
    Pango is the GNOME layout layer and could depend of FreeType (or other font rendering engine), it handles complex layouts, but very seriously lacks of documentation (or I didn’t find it)
    I don’t know Cairo, it could also depend of FreeType (or other font rendering engine), but at first sight it seems to be quite easy and powerful


If you are interested or have some idea about a such thing, please let me know either by writing below either by sending me an email (I am rarely here, so I could not see your answer).

State of art[edit]

We are just exploring djvu files into it.source, there is so much to develop! One of ideas is to replace djvu text layer with wikisource reviewed text. From preliminary tests, consider that you could inject into djvu layer both "pure text" and wiki markup too! en:User:Hesperian runs routines to:

  1. extract mapped djvu layer;
  2. parse words;
  3. select words which don't match with a list of common words;
  4. edit them manually;
  5. re-inject edited words into djvu file.

Just to test if it's possible, we developed a python routine, using KompoZer as a wysiwyg editor, to:

  1. extract mapped djvu layer of a page;
  2. build a html page, simply as the list of words with a trick to save their coordinates as a "fingerprint";
  3. edit html page with KompoZer;
  4. parse edited words again, and replace the fixed words into djvu text layer, using "fingerprints".

For any word, associated with its coordinates, it's possible to extract exactly the segment of djvu image layer containing it as a tiff file (the image of the word). This could be the basis of a "wikicaptcha" or of a different editing tool, coupling selected words and their image.

When trying to match a djvu, mapped text and a fixed, not mapped text a great help can be gained from difflib python routines; nevertheless to build an algorithm to replace exactly edited words into djvu text layer, saving existing mapping, is far from simple, since there are many kinds of possible edits:

  1. changing one or more characters into a word (simple case);
  2. merging words; (not so difficult)
  3. splitting words; (difficult!)
  4. deleting words; (not so difficult)
  5. moving words or sentences (the very hard case of references using ref tag, where large portions of text is moved into the page!).

--Alex brollo 09:30, 21 February 2011 (UTC)

Thanks for your interest! I just see your message as I intend to work on my project during Developer Days during Wikimania (I warned I don’t visit often this page, I think a saw an email saying this page was modified but I forgot it)
Although my goal is a bit different as yours, we have very common issues. From what I understand you want to check an unproofread text by using Internet users (captchas) and inject that in DjVu (the last is easy by using DjVuLibre library or binaries); and I try to check proofread texts against image; in another words you are doing OCR (in a unusual way) and I am doing Reverse OCR.
Some of our (quite) common issues are:
  • Need of understand in a certain way wikitext (linked to a WYSIWYG counterpart), at least in a minimal way (bold, italics, refs, noinclude header and footer, columns) but the better we understand the better it is (see also, wikitext-l mailing-list, parser review and Peter17’s GSoC between other active things)
  • Creating an image of some text (you don’t need such an operation): from a distant point of view it seems easy, by viewing in details (in my case) it is quite difficult to recreate an image of a text (in order to compare with the original image) because there are many parameters:
    • font
    • size
    • subpixel positionning (the original paper is a "continous" medium digitized a posteriori during digitisation, so the re-creation of a word needs to fit this criterion and the word needs to begin in a (possibly) non-integer pixel); this constraint leads to the fact that the same word could have different images (in other words the function f: word -> image(word) is not injective)
    • kerning [9] [10]
    • caesurae of words (and possibly false positives/false negative [11]: for instance sub-\nmersion, assumed as being a caesura, is a true positive; under-\nstand, assumed not to be a caesura, is a false positive; under-\nwater, assumed not to be a caesura, is a false negative; under-\nstand, assumed not to be a caesura, is a true negative; non-\nlinear is ill-posed since it could be "nonlinear" or "non-linear" (the form "non linear" exist also, and I can certify the three forms exist).
      Recognized asCaesuraNon-caesura
      Is a caesuratrue positivefalse positive
      Isn’t a caesurafalse negativetrue negative
  • Splitting words: in strict generality (all human languages), this is a difficult problem. There exists algorithms which handle that, see the ICU library and possibly Unicode standards behind; their algorithms split accurately words in almost every languages apart one or two (thai and/or vietmamese, I cannot remember) where they use tables
  • Replacement of words: if you need to replace one or many words in an existing text, depending of your problem it could range from quite easy (with careful implementation) to very hard. Easy if you pick up one word and save its position in the text (for instance line number and either column either word (refer to splitting section)). Very hard and probably ill-posed (many solutions without possibility of deciding what is the "good" solution) if you don’t save the place of your replacement (e.g. if you want to replace the word "and" which appears many times in a text). You could have a look to Levenshtein distances related to this fact or others.
  • I cannot remember: TIFF have coordinates of words? I think yes because DjVu and TIFF are quite similar. If so you don’t have problems to extract words, but be careful because if you lost coordinates of words in some operation it will be very difficult to retrieve them (it my main problem), and so you MUST have coordinates at the beginning of your process. Be careful also because coordinates could have a systematic biais (I observed that on books provided by BnF) and this fact could lead to the fact all words are misread (even by humans) and so lead to an entirely false interpretation of the text. Even more difficult is if some words are mispositioned and others correctly positioned (which, I suppose, is very possible with automatic OCR), which could lead to a half-false interpretation (where the "half" could range from 0% to 100%).
  • Manipulation of XML files, even HTML files (be careful that HTML isn’t an XML dialect, although very near but not as much strict as XML, but XHTML is); in addition, strictly speaking, some "XML files" could not be real XML files if they fail in one or more rules of XML (bad encoding of the text file, bad tags, bad order of tags <t1><t2></t1></t2>); I’m not an XML expert but it seems XML experts are quite frightened by this fact so presumably it appears quite often.
  • Manipulation of Unicode strings: must be careful about that. Python correctly handle that (but always add u in front of string u"string"), C/C++ (my language) needs ICU.
  • I don’t know precisely the world of captchas but it seems the current level of automatic recognition of machines is very high (at the point it is sometimes easier for a machine to recognize a word than for humans), so I have some doubts about the use of a Wikisource-based captcha because presumably it will be very easy for a machine to recognize words and the captcha will not protect the blog/forum/wiki account creation/whatever you want; instead you could use it as an honeypot for captcha-spammers :-) but I’m not sure it is a good long-term alternative (or with the help of spam-specialized organisations like a "Spamhaus" specialized in captchas (I don’t know such an organization)).
~ Seb35 [^_^] 20:16, 26 July 2011 (UTC)