
Module:Nap: WIP on mul.source


#1

To explore the power of Lua/Scribunto, we are testing https://wikisource.org/wiki/Module:Nap. Its name comes from the language code for Neapolitan, a fast-growing sub-project within mul.source. There are two core ideas behind it:

  1. the Lua code gets the whole wikitext of the hosting page;
  2. it then parses that text, searching standard templates and standard code fragments for key data (see the sketch below).

We are testing it in the Index: and Page: namespaces, with exciting results for self-categorization of content; so far, MediaWiki:Proofreadpage index template is parsed in the Index: namespace, and the hidden header code is parsed in the Page: namespace. Nap also detects the namespace, and behaves differently in each one. No parameters are needed.
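
A minimal sketch of the approach (illustrative only, not the actual Module:Nap code; the field name and category are invented):

local p = {}

function p.categorize(frame)
    local title = mw.title.getCurrentTitle()
    -- core idea 1: read the whole wikitext of the hosting page,
    -- with no parameters passed to the module
    local text = title:getContent()
    if not text then
        return ''
    end
    local cats = {}
    -- the namespace decides what to look for
    if title.nsText == 'Index' then
        -- core idea 2: parse the text for key data in standard
        -- templates, e.g. a |Language=... field of the index template
        local lang = mw.ustring.match(text, '|%s*Language%s*=%s*([^|}\n]+)')
        if lang then
            table.insert(cats, '[[Category:Indexes in ' .. mw.text.trim(lang) .. ']]')
        end
    elseif title.nsText == 'Page' then
        -- here, parse key data out of the hidden header code instead
    end
    return table.concat(cats)
end

return p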

Are we simply reinventing the wheel?


#2

Be aware of performance problems. A few modules on en.wiktionary did the same thing and brought down some pages in the process; see

https://en.wiktionary.org/wiki/Wiktionary:Grease_pit/2016/February#a:_The_time_allocated_for_running_scripts_has_expired.

In general it’s not good practice for a module to read the whole surrounding page; a module should get its input via parameters. Maybe there is a different way of achieving the same thing?
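
For comparison, the usual pattern looks something like this (the invocation and names are only illustrative):

local p = {}

function p.categorize(frame)
    -- called as {{#invoke:Nap|categorize|language=nap}}: the caller
    -- passes the data in, and the module never reads the page
    if frame.args.language == 'nap' then
        return '[[Category:Neapolitan texts]]'
    end
    return ''
end

return p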


#3

I saw that reading another page is server-expensive, while reading the surrounding page is not. Why exactly do you say that it is not good practice? Do you have a link to some documentation or discussion? The core of the idea is precisely to avoid passing any parameters, so if you are right I have to stop my tests immediately.
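
For instance, this is the difference I mean (both are real Scribunto calls; the page name is just a placeholder, and the cost notes reflect what I observed):

local here = mw.title.getCurrentTitle()
local this = here:getContent()              -- the surrounding page: cheap

local other = mw.title.new('Some other page')
local that = other and other:getContent()   -- another page: the server-expensive read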

NewPP limit report
Parsed by mw1173
Cached time: 20160309234238
Cache expiry: 2592000
Dynamic content: false
CPU time usage: 0.094 seconds
Real time usage: 0.122 seconds
Preprocessor visited node count: 483/1000000
Preprocessor generated node count: 0/1500000
Post‐expand include size: 7406/2097152 bytes
Template argument size: 1195/2097152 bytes
Highest expansion depth: 15/40
Expensive parser function count: 0/500
Lua time usage: 0.008/10.000 seconds
Lua memory usage: 587 KB/50 MB

This is the report for a page using the Nap module in its “heaviest” form. I don’t see anything alarming there: am I wrong?


#4

It really depends on what you do with the data, and how much data is stored on the page. The example I linked ran a bunch of regexes against the surrounding page, which on large pages consumed around 10% CPU. My main point is that normally a module should not need to know about the context it is called from; everything it needs to perform its task is passed as parameters. That’s just good general practice in software development. If the module depends on the context, then changing the page might inadvertently break the module.

In your case it is obviously not possible to just use parameters, or it would be very cumbersome, so reading the page might be the only way to do it.

Another question is why the components that actually “own” the data don’t generate the correct categories themselves - the index page status (not-proofread etc.) is definitely already exposed, so you might be duplicating some categories. Reading and parsing the surrounding page should be the last resort.


#5

I’m developing the idea while working as a normal user with no sysop privileges, so I can’t (and don’t want to!) edit system templates such as MediaWiki:Proofreadpage index template; nor, obviously, can I access extension tags such as <pages>. The Nap module can access any data inside such protected structures, and it can very simply build categories from intersections of data - e.g. it can filter by the language field and categorize by author only if language=nap; by pretty simple string manipulation it can also handle unlimited multiplicity of data in a field, or normalize content.
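
A simplified sketch of the kind of filtering I mean (the field names and categories are invented, not the actual Nap code):

local p = {}

function p.authorCategories(frame)
    local text = mw.title.getCurrentTitle():getContent() or ''
    local lang = mw.ustring.match(text, '|%s*Language%s*=%s*([^|}\n]+)')
    local author = mw.ustring.match(text, '|%s*Author%s*=%s*([^|}\n]+)')
    local cats = {}
    -- intersection of data: categorize by author only if language = nap
    if lang and mw.text.trim(lang) == 'nap' and author then
        -- a field may hold many values ('A / B / C'):
        -- split it and categorize by each one
        for _, name in ipairs(mw.text.split(author, '%s*/%s*')) do
            table.insert(cats, '[[Category:' .. mw.text.trim(name) .. ']]')
        end
    end
    return table.concat(cats)
end

return p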