Wikisource:ProofreadPage/Improve index pages

This page is a work in progress.

The Index namespace contains metadata about scanned books and their proofreading status. There is one page per scanned book and, for PDF and DjVu files, the page name is the same as the file name.

The wikitext of an index page is a single template call to MediaWiki:Proofreadpage index template. This wikitext is edited through a form that JavaScript generates from a configuration message.

The current system has several issues:

  1. Users without JavaScript can easily break the wikitext of an index page and, as a consequence, the transclusion system for the book.
  2. The JavaScript form generator loads after the edit toolbar, so users confusingly see raw wikitext first and then a form.
  3. Data is stored as a template call, so a "|" in a value causes problems that are very confusing for users who do not know how the data is stored.
  4. The template-based storage prevents the use of multiple-value fields, which would improve the usability of the form and the reusability of the content.
  5. There is no API to retrieve this content.
  6. Field names differ between Wikisources, which makes writing a generic metadata harvesting system difficult.

Several improvements are possible:

  1. Rewrite the index form generator in PHP to solve issues 1 and 2 and improve its usability.
  2. Use a JSON-based storage system to solve issues 3 and 4.
  3. Map the fields used by each Wikisource to a standardized set of properties such as Dublin Core (issue 6).
  4. Create an API (issue 5).

Improvement of the index form

A patch that rewrites the index form generator in PHP has already been written and will be deployed on Wednesday, October 31, with MediaWiki 1.21-wmf3. It adds new features such as the ability to show a help text for each field and to define select fields. The patch is already live on labs.

To support these new features, the configuration system of the Index namespace will change. Here is a short description of the new configuration system:

The configuration pages MediaWiki:proofreadpage_index_attributes and MediaWiki:proofreadpage_js_attributes are merged into a new page, MediaWiki:Proofreadpage index data config, which contains a JSON array. ProofreadPage uses the new configuration page only if MediaWiki:Proofreadpage index data config exists; otherwise, the two old configuration pages are used.

In the future, a form will be added to make the new configuration file easier to edit.

New configuration: MediaWiki:Proofreadpage index data config

The configuration is a JSON array of properties. Here is the structure of a property in the array. All parameters are optional; default values are applied to those that are omitted:

  "ID": { //ID of the metadata field (first parameter of proofreadpage_index_attributes)
    "type": "string", //the property type (for compatibility reasons, stored values are not required to be of this type). Possible values: string, number, page
    "size": 1, //only for the string type: number of lines of the input (third parameter of proofreadpage_index_attributes)
    "values": {"a":"A", "b":"B", "c":"C", "d":"D"}, //a map of value: label pairs listing the possible values (for compatibility reasons, stored values are not required to be one of these)
    "default": "", //the default value
    "header": false, //whether to add (true) or not (false) the property to the MediaWiki:Proofreadpage_header_template template
    "label": "ID", //the label in the form (second parameter of proofreadpage_index_attributes)
    "help": "" //a short help text
  }

An example is available here.
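
As a rough illustration of how the optional parameters might be resolved, here is a minimal Python sketch. It assumes the configuration is a JSON object keyed by property ID, takes the defaults from the sample structure above, and assumes that an omitted "values" means a free-text field. The function name, the sample properties and their labels are hypothetical; this is not the extension's PHP code.

```python
import json

# Defaults taken from the sample property structure. "values" defaulting
# to None (free-text field) is an assumption; the page does not state it.
DEFAULTS = {"type": "string", "size": 1, "values": None,
            "default": "", "header": False, "help": ""}

def normalize_config(raw_json):
    """Return {property_id: property dict} with omitted parameters filled in."""
    config = json.loads(raw_json)
    normalized = {}
    for prop_id, prop in config.items():
        full = dict(DEFAULTS)
        full["label"] = prop_id  # the label defaults to the property ID
        full.update(prop)        # explicit settings override the defaults
        normalized[prop_id] = full
    return normalized

# Hypothetical configuration: one bare property and one select field.
example = ('{"Title": {}, "Progress": {"values": {"V": "Validated", '
           '"T": "Proofread"}, "header": true}}')
config = normalize_config(example)
print(config["Title"])
print(config["Progress"])
```

Keeping the defaults in one place means a property such as "Title" can be declared with an empty definition and still be rendered as a one-line string input.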

You can generate the new configuration from the old one using this tool.
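
The kind of conversion such a tool performs can be sketched in a few lines of Python. This is only an illustration, not the tool itself: it assumes that the old page lists one field per line in the form ID|label|size (the page above only says that ID, label and size are the first, second and third parameters), and the function name and sample input are hypothetical.

```python
import json

def convert_index_attributes(text):
    """Convert old-style attribute lines to new-style property definitions.

    Assumed old format: one field per line, written as ID|label|size,
    where the label and the size may be left out.
    """
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = line.split("|")
        prop_id = parts[0].strip()
        prop = {"type": "string"}
        if len(parts) > 1 and parts[1].strip():
            prop["label"] = parts[1].strip()
        if len(parts) > 2 and parts[2].strip():
            prop["size"] = int(parts[2])
        config[prop_id] = prop
    return config

# Hypothetical old configuration with three fields.
old = "Author\nYear|Year of publication\nRemarks||10"
new = convert_index_attributes(old)
print(json.dumps(new, indent=2))
```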

If you reference these items (e.g. in MediaWiki:Proofreadpage header template, where they serve as metadata for EPUB export with WSexport), you must write them in lowercase (e.g. {{auteur}} instead of {{Auteur}}).

JSON-based storage system

Implement a new storage system based on ContentHandler, with an adapted diff system built on Extension:Diff.

Template-based system:

|Title=[[The Original Fables of La Fontaine|The Original Fables of La Fontaine rendered into English prose]]
|Author=[[Author:Jean de La Fontaine|Jean de La Fontaine]]
|Translator=[[Author:Frederick Colin Tilney|Frederick Colin Tilney]]
|Publisher=J. M. Dent; E. P. Dutton
|Address=London; New York
|Key=Original Fables of La Fontaine
|Pages=<pagelist 1to6="–" 7=1
21="image" 22="–" 23=15
43="image" 44="–" 45=35
57="–" 58="image" 59=47
87="image" 88="–" 89=75
103="image" 104="–" 105=89
115="image" 116="–" 117=99
129="–" 130="image" 131=111
148to152="–" />
|Remarks=<div style="margin:5%;">
{{Page:La Fontaine - The Original Fables Of, 1913.djvu/13}}
{{Page:La Fontaine - The Original Fables Of, 1913.djvu/14}}

JSON-based system:

{
    "Type": "book",
    "Title": "[[The Original Fables of La Fontaine|The Original Fables of La Fontaine rendered into English prose]]",
    "Volume": "",
    "Author": "[[Author:Jean de La Fontaine|Jean de La Fontaine]]",
    "Translator": "[[Author:Frederick Colin Tilney|Frederick Colin Tilney]]",
    "Editor": "",
    "School": "",
    "Publisher": ["J. M. Dent", "E. P. Dutton"],
    "Address": ["London", "New York"],
    "Year": 1913,
    "Key": "Original Fables of La Fontaine",
    "Source": "djvu",
    "Image": 1,
    "Progress": "V",
    "Pages": "<pagelist 1to6=\"–\" 7=1
21=\"image\" 22=\"–\" 23=15
43=\"image\" 44=\"–\" 45=35
57=\"–\" 58=\"image\" 59=47
87=\"image\" 88=\"–\" 89=75
103=\"image\" 104=\"–\" 105=89
115=\"image\" 116=\"–\" 117=99
129=\"–\" 130=\"image\" 131=111
148to152=\"–\" />",
    "Volumes": "",
    "Remarks": "<div style=\"margin:5%;\">,
{{Page:La Fontaine - The Original Fables Of, 1913.djvu/13}},
{{Page:La Fontaine - The Original Fables Of, 1913.djvu/14}},
    "Width": "",
    "Css": "",
    "Header": "",
    "Footer": "<references/>"
}

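To make the difference between the two storage formats concrete, the following Python sketch contrasts a deliberately naive "|" splitter with JSON parsing. The splitter only stands in for the way a bare "|" ends a template parameter; it is not MediaWiki's parser, and the sample values are hypothetical.

```python
import json

def naive_template_params(wikitext):
    """Split 'name=value' parameters on every '|', as a naive parser would."""
    params = {}
    for chunk in wikitext.split("|"):
        name, sep, value = chunk.partition("=")
        if sep:  # chunks without '=' are treated as unnamed and dropped
            params[name.strip()] = value.strip()
    return params

# Template-style storage: the "|" inside the Remarks value ends the parameter.
template = "|Title=Fables|Remarks=Part I | Part II"
broken = naive_template_params(template)

# JSON storage: "|" is an ordinary character, and a field can hold a list.
stored = json.dumps({
    "Title": "Fables",
    "Remarks": "Part I | Part II",
    "Publisher": ["J. M. Dent", "E. P. Dutton"],
})
intact = json.loads(stored)

print(broken["Remarks"])    # value cut at the "|"
print(intact["Remarks"])    # value preserved
print(intact["Publisher"])  # multiple values kept as a list
```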

The goal of this project is to define a set of standard property types so that ProofreadPage can be told "this field in the index form holds this kind of value". A field in an index page may, of course, be related to no ProofreadPage property.

Here is the current state of work of the mapping project:

The Simple Dublin Core set consists of these 15 elements:

  1. Title
  2. Creator
  3. Subject
  4. Description
  5. Publisher
  6. Contributor
  7. Date
  8. Type
  9. Format
  10. Identifier
  11. Source
  12. Language
  13. Relation
  14. Coverage
  15. Rights

Each Dublin Core element is optional and may be repeated. Repetition should be supported in the system, if possible, through a new configuration parameter that lists the possible delimiters between the parts of a value.

    "delimiter": [] //list of delimiters between two parts of a value. For example, ["; ", " and "] for strings like "J. M. Dent; E. P. Dutton and A. D. Robert"

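A minimal Python sketch of how a "delimiter" list could be applied to a stored string is shown below. The function name is hypothetical, and the handling of empty parts and surrounding whitespace is an assumption, since the page does not specify it.

```python
import re

def split_values(value, delimiters):
    """Split a field value into its parts using the configured delimiters."""
    if not delimiters:
        return [value]
    # Try longer delimiters first so that a delimiter containing a shorter
    # one is matched as a whole.
    ordered = sorted(delimiters, key=len, reverse=True)
    pattern = "|".join(re.escape(d) for d in ordered)
    return [part.strip() for part in re.split(pattern, value) if part.strip()]

publishers = split_values("J. M. Dent; E. P. Dutton and A. D. Robert",
                          ["; ", " and "])
print(publishers)  # ['J. M. Dent', 'E. P. Dutton', 'A. D. Robert']
```

Each resulting part can then be emitted as one repetition of the corresponding Dublin Core element.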
A new configuration parameter will be added to Proofreadpage_index_data_config to provide the mapping to the extension:

    "data": "", //the ProofreadPage metadata type that the property is equivalent to

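The way the "data" parameter could drive a Dublin Core export can be sketched as follows. The field names and Dublin Core elements are taken from the French column of the mapping table in this section; the function name, the shape of the record and the sample index values are hypothetical.

```python
# ProofreadPage property -> Simple Dublin Core element (rows of the mapping table).
PROPERTY_TO_DC = {
    "title": "title",
    "author": "creator",
    "translator": "contributor",
    "year": "date",
    "publisher": "publisher",
}

# Hypothetical configuration for a wiki that uses French field names.
fr_config = {
    "Titre": {"data": "title"},
    "Auteur": {"data": "author"},
    "Traducteur": {"data": "translator"},
    "Annee": {"data": "year"},
    "Editeur": {"data": "publisher"},
    "Cle": {},  # no "data": the field is related to no ProofreadPage property
}

def to_dublin_core(index_fields, config):
    """Build {dc_element: [values]} from the fields of one index page."""
    record = {}
    for field, value in index_fields.items():
        prop = config.get(field, {}).get("data")
        element = PROPERTY_TO_DC.get(prop)
        if element is None or value in ("", None):
            continue
        values = value if isinstance(value, list) else [value]
        record.setdefault(element, []).extend(str(v) for v in values)
    return record

index_page = {"Titre": "Fables", "Auteur": "Jean de La Fontaine", "Annee": 1913,
              "Editeur": ["J. M. Dent", "E. P. Dutton"], "Cle": "Fables"}
record = to_dublin_core(index_page, fr_config)
print(record)
```

Because the harvester only reads "data", the same code would work for a wiki that names its fields Title, Author and Year, which is the point of the mapping.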
Here is a first draft of the mapping between Index fields and ProofreadPage metadata types:

Proofread Page property oldwikisource en fr it Dublin Core Book Type
type Type Type libro, rivista, raccolta, Tesi di Dottorato, dizionario[1] type? book, journal, collection, phdthesis, dictionary
language en fr it language language langcode
title Title Title Titre Titolo title Title
Volume Volume Volume
author Author Author Auteur Autore creator Author
translator Translator Translator Traducteur Traduttore[1] contributor Translator
editor Editor Editor Editeur_scientifique Editore?[1] ? Editor
illustrator Illustrator Illustrateur Illustratore[1] contributor
school School School Scuola[1] ? -
year Year Year Annee Anno date Date number
publisher Publisher Publisher Editeur Editore publisher Publisher
place Address Address Lieu Città di edizione spatial City
Bibliotheque Biblioteca[1] -
Key Key Cle Chiave[1] -
Source Scans Fac-similés Fonte[1] format
Image Image Image Immagine[1] Image
progress Progress Progress Avancement Stato Avanzamento Qualità ? "", "L", "X", "OCR", "MS", "C", "V", "T"
Volumes Volumes Tomes Volumi Volume
Pages Pages Pages Sezione indice
Remarks Remarks Sommaire Sommario
Epigraphe -
Width Width Ampiezza[1]
Css Css Css[1]
Header Intestazione
Footer Piè di pagina
subject[1] subject[1] subject[1] Argomento[1] subject[1]


  1. Not implemented at the present moment.


We will also add support for the usual library identifiers. Each one should be tagged in the config with "data": "identifier".

Here is a non-exhaustive list of identifier types that could be implemented.

Identifier | type in the data_config | Notes | Example | Format of URI
ISBN | isbn | ISBN 10 or 13. | 978-2-86889-006-1 | urn:ISBN:XXXXXXXXXXXXX
ISSN | issn | Eight-digit number, divided by a hyphen into two four-digit numbers. | 2049-3630 | urn:ISSN:XXXX-XXXX
LCCN | lccn | Eight or ten digits without hyphen. | 44033150 |
OCLC | oclc | Number. | 3415579 |
ARC | arc | Number. | 341579 |
ARK | ark | If the index field is restricted to a single library, the field should contain only the ID part of the ARK URL, and the NAAN part is set in data_config with the "naan" property (list of all NAAN). Otherwise, put the whole URL in this field without the "ark:/" part. | cb30821485q with "naan": 12148, or 12148/cb30821485q |
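
The URI formats listed in the table can be illustrated with a short Python sketch. The function name and the validation rules are assumptions made for the example; only the urn:ISBN, urn:ISSN and ARK forms come from the table, so the other identifier types return no URI here.

```python
import re

def identifier_uri(id_type, value, naan=None):
    """Return a URI for an identifier, or None when the table gives no URI format."""
    value = value.strip()
    if id_type == "isbn":
        digits = value.replace("-", "").replace(" ", "").upper()
        if not re.fullmatch(r"\d{9}[\dX]|\d{13}", digits):
            raise ValueError("not an ISBN 10 or 13: " + value)
        return "urn:ISBN:" + digits
    if id_type == "issn":
        if not re.fullmatch(r"\d{4}-\d{3}[\dX]", value.upper()):
            raise ValueError("not an ISSN: " + value)
        return "urn:ISSN:" + value.upper()
    if id_type == "ark":
        # With a configured NAAN the field holds only the ID part;
        # otherwise it holds everything after "ark:/".
        name = f"{naan}/{value}" if naan is not None else value
        return "ark:/" + name
    return None  # lccn, oclc, arc: no URI format given in the table

print(identifier_uri("isbn", "978-2-86889-006-1"))
print(identifier_uri("issn", "2049-3630"))
print(identifier_uri("ark", "cb30821485q", naan=12148))
print(identifier_uri("ark", "12148/cb30821485q"))
```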


User:Tpt is working on an OAI-PMH API to export the content of Index pages. This API will publish data in two formats: Simple Dublin Core (the format required by the OAI-PMH specification) and Qualified Dublin Core with some custom elements for Wikisource-related data (number of pages proofread, progress, ...).

You can find a demo of an early version of this API on labs.
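
For readers unfamiliar with OAI-PMH, a harvester talks to such an API with plain HTTP requests that carry a "verb" parameter. The Python sketch below builds two request URLs; "Identify", "ListRecords" and the "oai_dc" metadata prefix come from the OAI-PMH specification, while the base URL is a placeholder, since the real endpoint is not given on this page.

```python
from urllib.parse import urlencode

# Placeholder endpoint; the address of the actual API is not stated here.
BASE_URL = "https://example.org/oai"

def oai_request(verb, **arguments):
    """Build the URL of an OAI-PMH request for the given verb and arguments."""
    query = {"verb": verb}
    query.update(arguments)
    return BASE_URL + "?" + urlencode(query)

identify_url = oai_request("Identify")
records_url = oai_request("ListRecords", metadataPrefix="oai_dc")
print(identify_url)
print(records_url)
```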