Index talk:Vocabolardlladinleterar.pdf


Test case

@Mizardellorsa: The PDF will be processed with the pdftohtml routine (XML output), then post-processed in Python (mainly with bs4.BeautifulSoup); the aim is to keep as much formatting detail as possible (bold, italic, paragraph breaks). A few pages will be uploaded, just to evaluate the results and also to test a TemplateStyles approach. The HTML output of pdftohtml has been tried too, with unsatisfactory results (text lines are not listed in column order). Please try it yourself, if you have the needed tools and skills, but please don't upload OCR! :-) --Alex brollo (talk) 22:35, 14 March 2020 (UTC)
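A minimal sketch of the post-processing idea, not the actual bot code: it uses the standard-library xml.etree.ElementTree as a stand-in for bs4.BeautifulSoup so that it runs without extra packages. The sample XML, the attribute names (top, left, width, number), the two-column layout and the function names are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample imitating the XML output of pdftohtml
# (element and attribute names are assumed, not taken from the real file).
SAMPLE = """<pdf2xml>
<page number="40" top="0" left="0" height="800" width="600">
<text top="100" left="320" width="200" height="12" font="0">right column, first line</text>
<text top="100" left="40" width="200" height="12" font="0"><b>abe</b> <i>s.m.</i> fir tree</text>
<text top="115" left="40" width="200" height="12" font="0">left column, second line</text>
</page>
</pdf2xml>"""

# Wiki markup used for the inline tags we want to keep.
WIKI_MARKS = {"b": "'''", "i": "''"}


def to_wikitext(elem):
    """Serialize an element, turning <b>/<i> children into wiki bold/italic."""
    parts = [elem.text or ""]
    for child in elem:
        mark = WIKI_MARKS.get(child.tag, "")
        parts.append(mark + to_wikitext(child) + mark)
        parts.append(child.tail or "")
    return "".join(parts)


def page_lines(page):
    """Return the lines of a page in reading order: left column, then right."""
    middle = float(page.get("width")) / 2  # assumes a two-column page

    def reading_order(text_elem):
        left = float(text_elem.get("left"))
        top = float(text_elem.get("top"))
        return (left >= middle, top, left)

    ordered = sorted(page.iter("text"), key=reading_order)
    return [to_wikitext(t).strip() for t in ordered]


def extract(xml_string):
    """Map each page number to its list of wikitext lines."""
    root = ET.fromstring(xml_string)
    return {int(p.get("number")): page_lines(p) for p in root.iter("page")}


if __name__ == "__main__":
    for number, lines in extract(SAMPLE).items():
        print(number, lines)
```

Sorting on the pair (column, top) is what restores the column order that the plain HTML output loses; a real page would probably need a tolerance on the column boundary.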

Two new templates (Template:TemplateStyle with its redirect Template:Tst and Template:Lemma) are being tested to add a little bit of formatting and to add anchors to lemmas. --Alex brollo (talk) 08:23, 15 March 2020 (UTC)

First upload attempts

Pages 40-46 have been uploaded (and re-uploaded after debugging) by my bot.

Please note that they need a careful review and some manual editing; every page needs at least a check of the one or two words at the top of the text, since I found it difficult to filter out the text coming from the header (usually a lemma and the page number). These words need to be deleted or moved into the right parameter of the Rh template.
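One possible heuristic for the header problem, given only as a sketch: if each extracted line keeps its vertical position, anything above a fixed threshold can be set aside as running header. The function name, the (top, text) pair layout and the threshold value of 60 are assumptions, not part of the bot.

```python
def split_header(lines, header_bottom=60.0):
    """Separate (top, text) pairs into running-header words and body lines.

    header_bottom is a hypothetical vertical threshold; a real page would
    need it measured from the PDF layout.
    """
    header = [text for top, text in lines if top < header_bottom]
    body = [text for top, text in lines if top >= header_bottom]
    return header, body


if __name__ == "__main__":
    sample = [(30.0, "abe"), (30.0, "40"), (100.0, "'''abe''' fir tree")]
    print(split_header(sample))
```

The header words returned this way could then be placed in the Rh template parameters instead of being left in the page body.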

Please let me know if any of you finds other extraction errors. --Alex brollo (talk) 17:11, 15 March 2020 (UTC)

First global upload

User:BrolloBot is now uploading the whole book. The result is good IMHO, but needs some touches:

  1. some empty rows have to be deleted;
  2. when a page begins with a lemma, two empty rows are needed before the lemma (sometimes there are only one or three empty rows);
  3. when a lemma is split across two pages, no empty row or just one is needed at the top of the second page;
  4. small caps style has to be added where needed;
  5. words hyphenated across two pages need to be fixed with tl|Pt (itwikisource style) or with tl|Hyphenated word start + tl|Hyphenated word end (enwikisource style);
  6. a tl|tst is needed before the pages tag in ns0 transclusions (see example Vocabolar dl ladin leterar/a);
  7. ... (please add other remarks/suggestions here)
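Remarks 2 and 3 above could be applied mechanically once it is known whether a page opens with a new lemma. The sketch below is only an illustration: the function name is hypothetical, the caller is assumed to supply the starts_with_lemma flag, and for continuation pages it picks the "no empty row" option.

```python
import re

# Matches any run of empty (or whitespace-only) rows at the very top of a page.
LEADING_EMPTY_ROWS = re.compile(r"\A(?:[ \t]*\n)+")


def normalize_top(page_text, starts_with_lemma):
    """Remove leading empty rows, then restore exactly two before a lemma."""
    body = LEADING_EMPTY_ROWS.sub("", page_text)
    prefix = "\n\n" if starts_with_lemma else ""
    return prefix + body


if __name__ == "__main__":
    print(repr(normalize_top("\n\n\n'''abe''' fir tree", True)))
    print(repr(normalize_top("\ncontinued entry", False)))
```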

The Python code used to extract text and formatting will be published here: User:Alex brollo/; it runs pdftohtml (xpdf library) and some pure Python modules. --Alex brollo (talk) 23:24, 16 March 2020 (UTC)

Transclusion issue

Four ns0 subpages have been split in two, since they raised an error (too many templates): subpages C, S, Indice italiano-ladino, Indice tedesco-ladino. --Alex brollo (talk) 14:26, 18 March 2020 (UTC)