Wikisource:Djvu text layer

From Wikisource

The aim of this page and its subpages is to collect ideas, details, scripts and examples about the DjVu text layer. Every example, screenshot and script will refer to a short, locally uploaded DjVu file: File:Poems.djvu, which comes from the Internet Archive.

All examples and tests will use DjVuLibre routines, run either from the command line or from Python scripts. DjView will be used to browse the DjVu file and to explore it in detail.

Please add any file, page or subpage that you create for use here to Category:Djvu text layer.

Brief introduction of djvu files

DjVu files can be viewed as an open, fully editable counterpart of PDF files, specifically designed for paginated, highly compressed, mixed content (images and text). Page content is structured in layers; the text layer is hidden in normal viewing, but it allows full-text search within the document, and its text can be copied and pasted after selecting an area of the page.

DjVu files can be built in two different formats: the "bundled" format and the "indirect" format. The first collects all the needed files into a single file; the second has a small main index file that points to individual DjVu files for the pages. MediaWiki supports bundled files only.
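Since only bundled files are accepted, it can be useful to check a file before uploading it. The following is a minimal sketch, not an official tool: it assumes the header layout given in the public DjVu3 format description (an "AT&T" magic, a FORM container of type DJVU or DJVM, and for DJVM a leading DIRM chunk whose first data byte has its high bit set for bundled files). The function names are hypothetical, and the byte offsets should be checked against your own files.

```python
def djvu_container_type(header: bytes) -> str:
    """Classify a DjVu file from its first bytes.

    Assumed layout (to be verified against the format description):
      bytes 0-3    b"AT&T" magic
      bytes 4-7    b"FORM"
      bytes 8-11   big-endian length of the form
      bytes 12-15  form type: b"DJVU" (single page) or b"DJVM" (multi-page)
      bytes 16-19  b"DIRM" (first chunk of a DJVM form)
      bytes 20-23  big-endian length of the DIRM chunk
      byte  24     flags: high bit (0x80) set means "bundled"
    """
    if len(header) < 16 or header[:4] != b"AT&T" or header[4:8] != b"FORM":
        raise ValueError("not a DjVu file")
    form_type = header[12:16]
    if form_type == b"DJVU":
        return "single-page"
    if form_type != b"DJVM":
        raise ValueError("unexpected form type: %r" % form_type)
    if len(header) < 25 or header[16:20] != b"DIRM":
        raise ValueError("DIRM chunk not found where expected")
    flags = header[24]
    return "bundled" if flags & 0x80 else "indirect"


def djvu_file_type(path: str) -> str:
    """Read the first bytes of the file at `path` and classify it."""
    with open(path, "rb") as handle:
        return djvu_container_type(handle.read(32))

# Example call (needs a local copy of the file):
#   djvu_file_type("Poems.djvu")
```

Reading only the first 32 bytes keeps the check cheap even for large scans; the sketch does not validate anything beyond the header.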

Open source tools exist both to view DjVu files (with stand-alone viewers or with browser plugins) and to build or edit any part of them.

Here DjView will be used as the viewer and the DjVuLibre command-line routines as the editing tools; you can download them from SourceForge, where you will also find some documentation. Every example, script, screenshot and text will be built using File:Poems.djvu, so please download it into a folder and install DjView and the DjVuLibre routines on your PC. Python is also needed if you want to run scripts that build simple applications based on calls to DjVuLibre routines.
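As a first illustration of calling DjVuLibre routines from Python, here is a minimal sketch. It assumes that the djvused and djvutxt programs are installed and on the PATH and that Poems.djvu is in the current folder. The helper names are hypothetical; the djvused command "n" (print the number of pages) and the djvutxt --page option come from the DjVuLibre manual pages.

```python
import subprocess


def djvused_command(djvu_path, script):
    """Build the argument list for running a djvused script on a file."""
    return ["djvused", "-e", script, djvu_path]


def djvutxt_command(djvu_path, page):
    """Build the argument list for extracting the plain text of one page."""
    return ["djvutxt", "--page=%d" % page, djvu_path]


def run(command):
    """Run a command and return its standard output as text.

    Raises subprocess.CalledProcessError if the routine reports an error,
    and FileNotFoundError if DjVuLibre is not installed.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout

# Example calls (need DjVuLibre and a local copy of the file):
#   page_count = int(run(djvused_command("Poems.djvu", "n")))
#   page_text = run(djvutxt_command("Poems.djvu", 11))
```

Passing the arguments as a list rather than as a single shell string avoids quoting problems with file names that contain spaces.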

Mediawiki and its use of djvu files

Among Wikimedia projects, DjVu files are widely used only in the Wikisource projects. They are currently the recommended file format for the proofreading procedure, which takes advantage of both their high compression and the presence of a text layer. Only bundled DjVu files can be uploaded to Commons; when a bundled DjVu file with a text layer is uploaded to Commons:

  1. the Index page is built according to the number of pages of the DjVu file that matches its name;
  2. every page is individually linked by its original page number or by an "alias" defined with the parameters of the Index pagelist tag;
  3. as soon as a new Page: page is created from an Index page link, the image layer of the DjVu page is converted into a JPG image and shown in the right-hand box of the Page: page;
  4. when a new Page: page is created, the DjVu text layer (if any) is automatically extracted and loaded into the text box of the Page: page. There is no other procedure to extract the DjVu text layer but this one.

Images from DjVu files can also be used with the usual wiki markup for images; the first page is displayed by default, but any page can be shown by adding a page= parameter. The gallery tag does not support the page= parameter. On the right: the default image for File:Poems.djvu (top) and an image using the page=11 parameter (bottom).
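To make the page= parameter concrete, the small sketch below builds the image markup as a string. The function name and the default "thumb" option are illustrative choices, not part of any library; the output follows the usual [[File:...|option|...]] image syntax.

```python
def djvu_image_markup(filename, page=None, options=("thumb",)):
    """Return wiki image markup for a DjVu file.

    When `page` is None the page= parameter is omitted, so the wiki
    displays the first page of the file by default.
    """
    parts = ["File:" + filename]
    if page is not None:
        parts.append("page=%d" % page)
    parts.extend(options)
    return "[[" + "|".join(parts) + "]]"


# Markup for the default image and for page 11 of the example file.
default_markup = djvu_image_markup("Poems.djvu")
page_11_markup = djvu_image_markup("Poems.djvu", page=11)
```

Here default_markup is [[File:Poems.djvu|thumb]] and page_11_markup is [[File:Poems.djvu|page=11|thumb]].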

While exploring the DjVu page structure and the hidden text layer here, we will mainly use page 11 of File:Poems.djvu (the latter image).




Exploring djvu layers with DjView

Djvu text layer structure: Lisp and XML representation

Djvused routine: extracting and uploading text layer

Image segments of words

reCAPTCHA and (?) wikicaptcha