
Proofreading flow of work


Usually, when djvu files have a decent OCR text layer and the first wikisource user doesn’t know all the tricks of advanced formatting, the first step is to create the Page from its OCR, fix scannos and apply some basic formatting code (level 1); then a more expert user applies fine-tuned formatting (level 3), and finally another user pushes the page to level 4.

We (nap.source users) are testing a different strategy on mul.source: take the OCR text offline and apply most of the formatting there (somewhat similar to the fr.source mise en page), then have a bot upload the pre-formatted text into nsPage; the pre-formatted text then needs little more than a careful review for scannos. Here is an example.

Another possible strategy is to load the offline pre-formatted text back into the djvu text layer, so that it pops up when a user creates the nsPage page; this will be the next test, with python editing the text contents of a djvused .dsed file.
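To make the .dsed idea concrete, here is a minimal sketch, not the code actually used in the test. It assumes the text layer was dumped with something like `djvused book.djvu -e output-txt > book.dsed` and is written back with `djvused book.djvu -f book.dsed -s` (the file names are made up), and that words appear as `(word x0 y0 x1 y1 "text")` entries. The names `edit_dsed_words`, `fix_scanno` and the `SCANNOS` table are hypothetical. The fix function sees the string exactly as it is escaped in the file, so it is only safe for plain ASCII fixes.

```python
import re

# One word entry: "(word x0 y0 x1 y1 " followed by a quoted, backslash-escaped string.
WORD_RE = re.compile(r'(\(word(?:\s+-?\d+){4}\s+)"((?:[^"\\]|\\.)*)"')

# Hypothetical scanno table, for illustration only.
SCANNOS = {"tbe": "the", "aud": "and"}


def fix_scanno(word):
    """Return the corrected word, or the word unchanged if it is not a known scanno."""
    return SCANNOS.get(word, word)


def edit_dsed_words(dsed_text, fix):
    """Apply fix() to the text of every word entry; coordinates are left untouched."""
    def repl(match):
        return match.group(1) + '"' + fix(match.group(2)) + '"'
    return WORD_RE.sub(repl, dsed_text)


sample = '''(page 0 0 2479 3508
 (line 402 3165 2072 3230
  (word 402 3165 605 3230 "tbe")
  (word 640 3165 900 3230 "cat")))'''

print(edit_dsed_words(sample, fix_scanno))
```

Only the quoted strings change, so the word boxes stay aligned with the scan; anything that changes the number of words per line would need the boxes recomputed as well.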


I’ve often wondered about this idea of editing the text layer in the djvu file. It seems an offline desktop application for proofreading would be cool, and it could work entirely with djvu files which could then be uploaded to Wikisource. There would be issues around templates and previewing, but for the bulk of stuff it’d be great.

Do you know if there are any such djvu editors? Have you experimented with editing the text layer directly?


Yes, I know of at least three editors, but I know one of them best, since I wrote it (if you are bold enough to browse the python code of a DIY “programmer” like me, I could send you the zipped code; please don’t ask me to post it on GitHub, for the simple reason that… using Git and GitHub needs skills that I don’t have!)

The editor is based on a local python server/client ajax architecture, and the GUI is simply a local html page - very similar to the usual source nsPage; the GUI has some simple javascript editing tools.

Here is a screenshot:


that’s pretty slick
would you then upload the edited text layer to internet archive?

pre-editing the text layer would smooth out a lot of proofreading speed bumps at WS
since the text layers of pre-1870 works are lossy, and tend to get put on the back burner


Thanks. The editor needs only a djvu file with its text layer, nothing more. Page by page, the text is extracted as xml by DjVuLibre routines, converted into a “hOCR-like” html, sent to the browser, edited, sent back, re-converted into xml and re-uploaded into the djvu text layer.
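The xml-to-html step could look roughly like the sketch below. It is not the editor's real code: it assumes a simplified DjVuXML fragment with `LINE` and `WORD` elements carrying a `coords` attribute, and the function name `djvuxml_to_html` and the `data-coords` attribute are hypothetical. The coordinates are passed through unchanged rather than converted to real hOCR `bbox` values, so that they can be written back as they were after editing.

```python
import html
import xml.etree.ElementTree as ET


def djvuxml_to_html(xml_text):
    """Turn LINE/WORD elements into hOCR-like spans, keeping coords as data attributes."""
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.iter("LINE"):
        words = []
        for word in line.iter("WORD"):
            coords = html.escape(word.get("coords", ""), quote=True)
            text = html.escape(word.text or "")
            words.append(
                f'<span class="ocrx_word" data-coords="{coords}">{text}</span>'
            )
        lines.append('<span class="ocr_line">' + " ".join(words) + "</span>")
    return '<div class="ocr_page">\n' + "\n".join(lines) + "\n</div>"


# Simplified fragment for illustration; real djvutoxml output has more wrapping.
sample_xml = """<HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
<LINE><WORD coords="402,3230,605,3165">Hello</WORD>
<WORD coords="640,3230,900,3165">world</WORD></LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT>"""

print(djvuxml_to_html(sample_xml))
```

The reverse step would read the spans back, take the edited text and the untouched `data-coords`, and rebuild the `WORD` elements.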

The whole thing is fast, but not faster than a normal edit in wikisource, so I didn’t develop it any further.

At present I’m thinking of a different idea, i.e. a “mise en page” of the djvu text layer by a set of regexes, with no user editing; I’d like to find a trick to wrap the header and footer code.
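One possible shape for such a regex pass, as a rough sketch under stated assumptions rather than a finished trick: the first line is treated as a running header if it is short and starts or ends with a page number, and the last line as a footer if it is only a number. Both parts are wrapped in `<noinclude>` so they stay out of transclusion. The patterns, the 60-character limit and the function name `wrap_header_footer` are all guesses that would need tuning for each book.

```python
import re

# Heuristic: a running header starts or ends with a page number.
HEADER_RE = re.compile(r"^\s*(?:\d+\s+\S.*|\S.*\s+\d+)\s*$")
# Heuristic: a footer line holds nothing but a number.
FOOTER_RE = re.compile(r"^\s*\d+\s*$")


def wrap_header_footer(page_text):
    """Split off a likely header and footer and wrap them in <noinclude>."""
    lines = page_text.strip("\n").split("\n")
    header = footer = ""
    if lines and len(lines[0]) <= 60 and HEADER_RE.match(lines[0]):
        header = lines.pop(0).strip()
    if lines and FOOTER_RE.match(lines[-1]):
        footer = lines.pop().strip()
    body = "\n".join(lines).strip("\n")
    # Naive rejoin of words hyphenated across a line break.
    body = re.sub(r"(\w)-\n(\w)", r"\1\2", body)
    return f"<noinclude>{header}</noinclude>{body}<noinclude>{footer}</noinclude>"


page = "12 THE HISTORY OF NAPLES\nThe city was foun-\nded long ago.\n3\n"
print(wrap_header_footer(page))
```

A body line that happens to begin with a year would be mistaken for a header, so a real rule set would probably also compare the candidate line with the headers of neighbouring pages.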