
Friday, February 28, 2025


Exploring structured data on commons

 Recently I've been helping Yaron do some SPARQL query optimization for his site Commons Walkabout.

It's a cool site. It lets you explore the media on Wikimedia Commons by filtering through various metadata fields.

For example - bodies of water located in a place that is in Ireland.

It's addictive too. The media with the best-quality metadata tend to be those donated by museums, which often means they are quite interesting.

An image the commons walkabout app showed me that I thought was pretty: Arabah desert in 1952 photographed by Benno Rothenberg

Structured data on Commons

As a result of helping with the optimization, I've been exploring structured (Wikidata-style) data on commons.

The structured data project has been a bit controversial. I think there is a feeling in the community that WMF abandoned the project at the 90%-done point. It does mostly work, but there are still a lot of rough edges that make it less successful than it would otherwise be. Tools for authoring, interacting with, and maintaining the metadata are lacking, from what I understand. Most files are not as well described by metadata as they ought to be. Importantly, there is now a dual system of organization - the traditional free-text image description pages and category system, alongside the newer structured metadata. Doing both means we're never fully committed to either.

Source: XKCD

The biggest criticism is the situation with the commons query service (See T297995 and T376979). Right now the service requires you to log in first. In the beginning this sounded like it would be a temporary restriction, but it now appears permanent.
 
Logging in is a big issue because it makes it impossible to build client-side apps that act as a front-end to the query service (it's not theoretically impossible, but the WMF's implementation of logging in doesn't support that). The auth implementation is not very user friendly, which is a significant hindrance, especially when many people who want to run queries aren't necessarily professional programmers (for example, the official instructions suggest using the browser dev console to look up the value of certain cookies as one of the steps to use the system). Some users have described the authentication system as such a hindrance that it makes more sense to shut the whole thing down than to keep it behind auth.

The SPARQL ecosystem is designed to be a linked ecosystem where you can query remote databases. The auth system means commons query service can't be used in federation. It can talk to other servers but other servers cannot talk to it.

It's a bit hard to understand why WMF is doing this. Wikidata is fully open, and that is a much larger dataset of interest to a much broader group of people. If running Blazegraph is hard (which, don't get me wrong, I am sure it is), the Commons instance should be trivial compared to the Wikidata one. You can just look at the usage graphs, which clearly show almost nobody using the Commons query service relative to the Wikidata query service. The Commons query service seems to average about 0 requests per minute with occasional spikes up to 15-40 reqs/min. In comparison, the Wikidata query service seems to average about 7500 reqs/minute.

I've heard various theories: that this is the beginning of an enshittification process so that Wikimedia can eventually sell access under the Wikimedia Enterprise banner, or that they don't want AI companies to scrape all the data (why an AI company would want to scrape this but not Wikidata, and why we would want to prevent them, I have no idea). Neither is really convincing to me.

I suspect the real reason is that WMF has largely cut funding to the structured data project. That just leaves a sysadmin team responsible for the Blazegraph query endpoint. Normally such a team would work in concert with another team more broadly responsible for SDC. With no such other team, the Blazegraph sysadmin team is very scared of being sucked into a position where suddenly they are solely responsible for things that should be outside their team's remit. They really don't want that to happen (hard to blame them), so they are putting the brakes on moving things forward with the Commons query service.

This is just a guess. I don't know for sure what is happening or why, but that is the theory that makes the most sense to me.

The data model

Regardless of the above, the structured data project and SPARQL are really cool. I actually really like them.
 
While playing with it though, some parts do seem kind of weird to me.

Blank nodes for creator

The creator property says who created the image. Ideally the value is supposed to be a Q-number, but many creators don't have one.

The solution is to use a blank node. This makes sense; blank nodes in RDF are placeholders. They aren't equal to any other node, but allow you to specify properties.

If the relationship chain was:

<sdc:Some image> → wdt:P170 (creator) → <blank node> → wdt:P4174 (Wikimedia username) → "Some username"

That would be fine. However, it's not. Instead, the creator property is reified, so that <wdt:P4174 Wikimedia username> "Some username" hangs off the creator statement (as a qualifier) instead of being a property of the blank node.
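
In SPARQL terms, querying the actual structure looks roughly like this sketch (the standard query service prefixes are assumed, and the exact qualifiers vary from file to file):

# A sketch of the actual shape: the username hangs off the P170 statement
# node as a qualifier, while the statement's value is just a blank node.
SELECT ?file ?username WHERE {
  ?file p:P170 ?statement .         # the creator statement
  ?statement ps:P170 ?creator .     # ?creator is typically a blank node
  ?statement pq:P4174 ?username .   # "Wikimedia username" qualifier
}
LIMIT 10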

This feels so ontologically weird to me. It is kind of weird that we have to resort to such a hack for what is undoubtedly the main use case of this system.

Functionally dependent predicates

Some predicates functionally depend on the file in question. I feel like these should be added by the system. They should not be controlled or edited by the user.

For example P3575 data size. That's technical metadata. Users should not be responsible for inserting it. Users should not be able to change it. The fact that they are is a poor design in my opinion. Similarly for P2048 height, P2049 width, P1163 media type, P4092 checksum.

I find P1163 media type (aka the mime type of the file) especially weird. Why is this a string? Surely it should be the Q number of the file format in question if we're going to be manually filling it out?

The especially weird part is that some of this data is automatically added to the system in the schema namespace. The system automatically adds schema:contentSize, schema:encodingFormat, schema:height, schema:width and schema:numberOfPages, which are equivalent to some of these properties. So why are we duplicating them by hand (or by bot)?

At the same time, there seems to be a lot missing from schema. I don't see the sha1 hash (sha1 isn't the best hash since it is now broken, but it's the one MediaWiki uses internally for images). I'd love to see XMP file metadata included here as well, since it is already in RDF format.

The thing that is really missing (unless I missed it) is that it seems impossible to get the URL of the file description page, or even the MediaWiki page title, without string manipulation. This seems like a bizarre omission. I should be able to go from the canonical URL at Commons to the SDC item.

Querying the system

Querying the system can be a bit tricky sometimes because the data is usually spread between Commons and Wikidata, so you have to make use of SERVICE clauses to query the remote data, which can be slow.

The main trick seems to be to try and minimize the cross database communication. If you have a big dataset, try and minimize it before communicating with the remote database (or the label service).

Copious use of named subqueries (due to them being isolated by the optimizer) can really help here.

If you are fetching distinct terms (or counts of distinct terms), ensuring that Blazegraph can use the distinct term optimization is very helpful. It seems like the Blazegraph query optimizer isn't very good and often cannot use this optimization even when it should. Making the group by very simple and putting it into a named subquery can help with this.
 
The distinct term optimization is critically important for running fast aggregation queries. Often it makes sense to first get the list of distinct terms you are interested in and their count (if applicable) in a named subquery, then go and fetch information for each item or filter them instead of doing it all in one group by.

If you have a slow query that involves any sort of group by, the first thing I would suggest is to try and extract a simple group by into a subquery (by simple I mean: only 1 basic graph pattern, no services, grouping by only 1 term, and either no aggregate functions or the only aggregate function being count(*)) and then use the results of that query as the basis of the rest of your query.
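
As a rough sketch of that shape (P180 "depicts" is just an example property here; adjust to whatever you are actually grouping on):

# Sketch: do the simple group by in a named subquery, so Blazegraph can use
# the distinct term optimization, then only fetch labels for the small
# result set afterwards.
SELECT ?item ?itemLabel ?count WITH {
  SELECT ?item (COUNT(*) AS ?count) WHERE {
    ?file wdt:P180 ?item .   # one basic graph pattern, one grouping term
  }
  GROUP BY ?item
  ORDER BY DESC(?count)
  LIMIT 50
} AS %counts
WHERE {
  INCLUDE %counts
  SERVICE <https://query.wikidata.org/sparql> {
    ?item rdfs:label ?itemLabel .
    FILTER( LANG( ?itemLabel ) = "en" )
  }
}
ORDER BY DESC(?count)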

If it's still too slow, using the bd:sample service can be really helpful. This runs the query over a random subset of the data instead of the whole thing. If you just want to get the broad trends, this can often be good enough.
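
A sketch of what that looks like (the sample limit is arbitrary; scale your counts up accordingly if you want estimates):

# Sketch: count creators over a random sample of the P170 triples instead
# of all of them. bd:sample.limit controls the sample size.
SELECT ?creator (COUNT(*) AS ?count) WHERE {
  SERVICE bd:sample {
    ?file wdt:P170 ?creator .
    bd:serviceParam bd:sample.limit 200000 .
  }
}
GROUP BY ?creator
ORDER BY DESC(?count)
LIMIT 25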

The most complicated thing to query seems to be the P170 creator predicate. There are 72 million items with that predicate, and the vast majority are blank nodes, so the number of distinct terms is in the millions (even files with the same creator are considered distinct if the creator is a blank node). Queries involving it seem to almost always time out.
 
Initially I thought the best that could be done was sampling and interpolating. For example, this query, which gives you the top creators (who have Q numbers). The numbers aren't exactly right, but they seem to be within the right order of magnitude.

Unfortunately filtering via wikibase:isSomeValue() is very slow, so we can't just filter out the blank nodes. I did find a hack though. In general Blazegraph arranges blank nodes at the end of the result set (or at least, it seems to in this case). If you do a subquery of distinct terms with a limit of about 10,000 you can get all the non-blank nodes (since there are only about 7200 of them and they are at the beginning). This is hacky since you can't use range queries with URI values, and you can't even put an order by on the query or it will slow down, so you just have to trust that Blazegraph consistently returns things in the order you want, even though it is by no means required to. It seems to work. For example, here is an example table using this method, counting the number of images created by creators (with Q numbers) grouped by their cause of death. A bit morbid, but it is fascinating that you can make such an arbitrary query.
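
The general shape of the hack looks something like this sketch (my reconstruction, not the exact query behind the table above):

# Sketch: grab the distinct creator values with a generous LIMIT and no
# ORDER BY, trusting Blazegraph to put the blank nodes after the ~7200
# real Q-items, then filter and count per creator afterwards.
SELECT ?creator (COUNT(?file) AS ?count) WITH {
  SELECT DISTINCT ?creator WHERE {
    ?file wdt:P170 ?creator .
  }
  LIMIT 10000
} AS %creators
WHERE {
  INCLUDE %creators
  FILTER( !wikibase:isSomeValue( ?creator ) )   # cheap now: at most 10,000 rows
  ?file wdt:P170 ?creator .
}
GROUP BY ?creator
ORDER BY DESC(?count)
LIMIT 50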

Conclusion


Anyways, I do find RDF and SPARQL really cool, so it's been fun poking around in Commons' implementation of it. Check out Yaron's site, https://commonswalkabout.org/ - it is really cool.


Thursday, January 30, 2025


Happy (belated) new year

I was going to write a new year's post. Suddenly it's almost February.

Last year I had the goal of writing at least 1 blog post a month.

 I was inspired by Tyler Cipriani who wrote a post talking about his goal of writing at least once a month. It seemed like a reasonable goal to help practice writing. I've often wanted to maintain a blog in the past, but would always find I petered out after a few posts. So last year I set myself the goal of at least once a month.

How'd I do?

Ok I guess. I wrote 9 blog posts last year. Short of my goal of 12, but still not terrible. Looking back, I realize that I wrote 9 in 2023 and 11 in 2022, so I guess the goal didn't actually increase my writing. Nonetheless I'm still pretty happy that I was writing posts throughout the year.

Based on blogger's internal analytics, it seems like people like when I do CTF writeups. For some unclear reason my post about MUDCon got a tremendous amount of views relative to my other posts. However, the real goal is more to practice writing than to get views, so I suppose it doesn't matter much. That said, I do write the CTF writeups in the hope that others can learn from them, so it is nice to see that people read them.

On to the next one. Maybe this year I'll actually make it to once a month - I'd like to keep going with this goal. I think I'll try to write more short, off-the-cuff things (like this) and fewer big technical posts.

This year

It's already been a month and I've already mostly forgotten about my goals, but I did write some down.

I want this to be a year of trying new things. I want to try to branch out to new and different things.

I want to explore my creative side a bit more. To be clear, I do think there is a lot of creativity in computer programming, but I also want to try more traditionally creative things. Paint some pictures! I've been taking a beginner acrylic painting class at the local community center, which has been great so far. Maybe I should try and join the two interests and make a silly computer game.

I'd also like to explore different computer things. I've been doing MediaWiki stuff for a while now, but I don't want that to be the only thing I ever do. I'd like to try to direct my open source contributions towards things I haven't done before. Maybe things that are more low level than PHP web apps. Perhaps I should learn Rust! I'd like to work on stuff where I feel like I'm learning, and I think it would be cool to spend some time learning more about traditional CS concepts (algorithms and whatnot). It's been a long time since I was in university learning about that sort of thing. Maybe it would be fun to brush up. In my current programming work it's very rare for that sort of thing to be relevant.

Where I do keep making open source contributions in the MediaWiki/Wikimedia sphere of influence, I want to work on things that are unique or new. Things that, whether they are good or bad ideas, at least open up new possibilities for discussion. I can't help but feel that MediaWiki land has been a little stagnant. So many of the core ideas are from the late 2000s. They've been incrementally improved upon, but there really isn't anything new, both in Wikimedia land and in third-party MediaWiki land. Perhaps that is just a sign of a maturing ecosystem, but I think it's time to try some crazy new ideas. Some will work and some will fail, but I want to feel like I'm working on something new that makes people think of the possibilities, not just improving what is already there (not that there is anything wrong with that; maintenance is critical and underappreciated, it's just not what I want to work on right now, at least not as a volunteer).

I think I've gotten a start in that direction with the calculator project that WikiProject Med funded me to do. It spurred a lot of interesting discussion and ideas, which is the sort of thing I want to be involved with.

Maybe I'll explore Wikidata a bit more. I always found RDF databases kind of interesting.

On the third-party MW side, I've felt for a long time that we could use some different ideas for querying and reporting in MediaWiki (Cargo and SMW are cool, but I don't think they quite make the right trade-offs). I'd love to explore ideas in that space a little bit.

So in conclusion, what are the yearly goals?

  • I think I want to be a little more intentional in planning out my life
  • I want my open source MediaWiki contributions to be more along the lines of prototyping new and unique ideas. I want to avoid being sucked into just fixing things at Wikimedia (after all, that's why WMF allegedly has paid staff).
  • I want to try and pursue creative hobbies outside of computer programming.
  • I want to try and do more programming outside of my MediaWiki comfort zone.

See you back here this time next year to see how I did.


Thursday, January 16, 2025


Signpost article on {{Calculator}}

 The most recent issue of the Wikipedia Signpost published an article I wrote about the {{calculator}} series of templates, which I worked on.

The Signpost is Wikipedia's internal newspaper. I read it all the time. It's really cool to have contributed something to it.

So what's this all about?

Essentially, I was hired by Wiki Project Med to make a script to add medical calculators to their wiki, MDWiki. For example, you might have a calculator to calculate BMI, where the reader enters their weight and height. After being used on MDWiki, the script made its way to Wikipedia.

 The goal of this was to make something user programmable, so that users could solve their own problems without having to wait on developers. The model we settled on was a spreadsheet-esque one. You can define cells that readers can write in, and you can define other cells that update based on some formula.

Later we also allowed manipulating CSS, to provide more interactivity. This opened up a surprising number of opportunities, such as interactive diagrams and demonstrations.

See the Signpost article for more details: https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2025-01-15/Technology_report and let me know what you think.

Tuesday, December 31, 2024


Drawing with CSS in MediaWiki

One thing you aren't allowed to do when writing pages in MediaWiki is use SVG elements.

You can include SVG files, but you can't dynamically draw something based on the contents of the page.

There are two schools of thought on adding features to MediaWiki page formatting.

Some people think we should give users low-level tools and let them figure it out. An example (in my opinion) of this is Scribunto (Lua support). This lets editors create small scripts in the Lua programming language as macros. It's fairly low level - learning to program is complicated. However, it gives users ultimate flexibility, and only a few users need to learn how to make Lua scripts. The other users can just use them without understanding how they work.

The other school of thought is to give users high-level tools. For example, the upcoming Charts extension or the older EasyTimeline extension. This gives users some flexibility, but many of the details are hidden. The lack of flexibility sounds like a downside, but there are some benefits. It's easier to ensure everything is consistent when it's managed by the system instead of the user. It's easier to ensure everything meets performance requirements, accessibility requirements, etc. It's easier to ensure there is alternative content where the rich content can't be supported (e.g. perhaps in some app).

I tend to lean towards the former, as I think the extra flexibility spurs creativity. Users will take the tools and use them in ways you never imagined, which I think is a good thing (not to mention more scalable), but I understand where the other side is coming from.

One of the limitations of MediaWiki is that a template author can't really draw things dynamically. Sure, they can use uploaded images and fancy HTML, but they can't use true server-side dynamically generated graphics, because SVG elements aren't allowed in wikitext. Sometimes people use quite ingenious HTML hacks to get around this - e.g. {{Pie chart}} uses CSS borders to make pie graphs. For the most part though, this niche is filled by high-level extensions like the former Graph extension or the upcoming Charts extension, which let users insert a graph based on some specification. This is great if the graph the extension generates is what the user needs, but what if their requirements differ slightly?

Personally I think we should just do both - let users have at it with SVG, while also providing the fancier high-level extensions. See what they like best. (Some people argue against embedded SVG on security grounds, but it's really no different from HTML. You just need to whitelist safe elements/attributes and not allow risky stuff.)

Modern CSS

Recently though, I realized that CSS has advanced a lot, and you don't really need SVG to do drawings. You can do it in pure CSS.

CSS has a property called "clip-path". This allows specifying a clipping path in SVG path syntax. Only the parts inside the path show through.

SVG path syntax is essentially how to tell the computer, go to this point, draw a line to this point, from there make a curve to some other point, and so on and so forth. Essentially instructing the computer how to draw a picture.

This allows drawing a lot of stuff you would normally need SVG for. To be sure, there are a lot of other things SVGs do, but at its core, this is the main drawing part.
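
As a minimal standalone sketch of the idea (plain HTML/CSS, nothing wiki-specific; the shapes and colours are arbitrary):

<!-- A filled triangle and an (approximate) circle, drawn purely with CSS
     clip-path using SVG path syntax - no <svg> element involved. -->
<div style="width:100px; height:100px; background:#c0392b;
            clip-path: path('M 50 5 L 95 95 L 5 95 Z');"></div>
<div style="width:100px; height:100px; background:#2c6fbb;
            clip-path: path('M 50 0 A 50 50 0 1 1 49.99 0 Z');"></div>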

Here is an example of a smiley face and the word "Example" in fancy writing (taken from [[File:Example.svg]]) in pure CSS.

In fact, I made a start on an implementation of the canvas API in Lua using this as the output mode: https://en.wikipedia.org/wiki/Module:Sandbox/Bawolff/canvas

Limitations

A big limitation is lack of stroking. The clip-path thing works great for filling regions, but it doesn't work for stroking paths.

When drawing, to stroke means to draw an outline, whereas to fill means to draw a shape and fill in everything inside of it.

This doesn't seem insurmountable though. Most actual rasterization software works by flattening curved paths into lines; we could do the same thing in Lua, although it might not be the most efficient solution in terms of outputted markup.
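
For example, a rough sketch of what that flattening could look like in Lua (not the module's actual code):

-- Flatten a cubic Bézier curve into straight line segments by sampling
-- the curve at evenly spaced values of t.
local function flattenCubic( p0, p1, p2, p3, segments )
    local points = {}
    for i = 0, segments do
        local t = i / segments
        local mt = 1 - t
        points[#points + 1] = {
            x = mt^3 * p0.x + 3 * mt^2 * t * p1.x + 3 * mt * t^2 * p2.x + t^3 * p3.x,
            y = mt^3 * p0.y + 3 * mt^2 * t * p1.y + 3 * mt * t^2 * p2.y + t^3 * p3.y,
        }
    end
    return points
end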

I was actually surprised by how much of the Canvas spec has CSS equivalents. Even obscure stuff - what interpolation do you want for rendered images? CSS has the image-rendering property for that.

There are some things with no equivalent. We can't do anything that depends on the client side, can't get font metrics, can't get the pixel data of an image, etc. Centering text the way canvas does it actually seems kind of hard. Composition operators are mostly not supported. Blending operators can be, but won't work properly with images. A bunch of other small things aren't supported, mostly involving more complex aspects of text layout.

On the whole though, I'm shocked at how far I got and how much of the stuff I don't have yet seems like a simple matter of programming.

There is also the possibility of limited interactivity with this, using :hover CSS and CSS animations (although transitioning paths probably doesn't work great, it seems possible in theory). Possibly even combined with {{calculator}} to allow limited interaction via form controls.

Try out [[Module:Sandbox/Bawolff/canvas]] if you are curious.

Tuesday, October 1, 2024


Hashtags and implementing extensions in MediaWiki by modifying services

 Recently I decided to make a Hashtags extension for MediaWiki.

The idea is - if you add something like #foo to your edit summary, that becomes a link in RecentChanges/history/etc, which you can click on to view other edits so tagged.

I implemented this using MediaWiki's newish services framework. This was the first time I used this as an extension mechanism, so I thought I'd write a blog post on my thoughts.

Why Hashtags


The main idea is that users can self-group their edits if they are participating in some sort of campaign. There are already users doing this, and external tools to gather statistics about it (for example, Wikimedia hashtag search), so there seems to be user demand.

MediaWiki also has a feature called "change tags". This allows edits to be tagged in certain ways and searched. Normally this is done automatically by software such as AbuseFilter. It is actually possible for users to tag their own edits, provided that tag is on an approved list. They just have to add a secret url parameter (?wpChangeTags) or edit by following a link with the appropriate url parameter. As you can imagine, this is not easily discoverable.

I've always been excited about the possibility of hashtags. I like the idea that it is a user-determined taxonomy (a so-called folksonomy). You don't have to wait for some powerful gate-keeper to approve your tag. You can just be bold, and organize things in whatever way is useful. You don't have to discover some secret parameter. You can just see that other people write a specific something in their edit summaries and follow suit.

This feels much more in line with the wiki-way. You can be bold. You can learn by copying.

The extension essentially joins both methods together, by automatically adding change tags based on the contents of the edit summary.

Admittedly, the probability of my extension being installed on Wikipedia is pretty low. Realistically I am not even trying for that. But one can dream.

For more information about using the extension see https://www.mediawiki.org/wiki/Extension:Hashtags

How the extension works


There are a couple things the extension needs to do:

  • When an edit (or log entry) is saved, look at the edit summary, add appropriate tags based on that edit summary
  • If an edit summary is revision deleted or oversighted, remove the tags
  • Make sure hashtag related change tags are marked as "active" and "hidden".
  • When viewing a list of edits (history, Special:RecentChanges, etc), ensure that hashtags are linked and that clicking on them shows only edits tagged with that tag

 The first three parts work like a traditional extension, using normal hooks.

I used the RevisionFromEditCompleteHook and ManualLogEntryBeforePublishHook to run a callback anytime an edit or log entry is saved. Originally I used RecentChange_saveHook; however, that didn't cover some cases where an edit was created but not published to RecentChanges, such as during page moves. ManualLogEntryBeforePublishHook covers more cases than it might appear at first glance, because it will also tag the revision associated with the log entry. All this is pretty straightforward, and allowed tagging most cases. Restricted (confidential) logs still do not get tagged. It seems difficult to do so with the hooks MediaWiki provides, but perhaps it's best not to tag such log entries, lest information be leaked.
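
As a rough sketch of how that looks (this is not the extension's actual code; the class name and tag naming scheme are invented, and the hook interface import is omitted since its namespace varies between MediaWiki versions):

// Sketch: tag an edit based on #hashtags found in its edit summary.
class HashtagHookHandler {
    /**
     * Called after a revision is created from an edit. The &$tags
     * parameter lets us add change tags to the edit and its recent change.
     */
    public function onRevisionFromEditComplete(
        $wikiPage, $rev, $originalRevId, $user, &$tags
    ) {
        $comment = $rev->getComment();
        $summary = $comment ? $comment->text : '';
        if ( preg_match_all( '/#([\p{L}\p{N}_-]{2,})/u', $summary, $matches ) ) {
            foreach ( $matches[1] as $hashtag ) {
                // Hypothetical tag naming scheme
                $tags[] = 'hashtag-' . mb_strtolower( $hashtag );
            }
        }
    }
}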

Similarly, I used the ArticleRevisionVisibilitySetHook to delete/undelete tags after revdel. Unfortunately MediaWiki does not provide a similar hook for log entries. I proposed adding one, but that is still pending code review.

Marking tags as active was also quite straightforward using normal hooks.

All that leaves is ensuring hashtags are linked. For this I leveraged MediaWiki's newish dependency injection system.

Services and dependency injection

For the last few years, MediaWiki has been in the process of being re-architected, with the goal of making it more modular, easier to test, and easier to understand. A core part of that effort has been the move to a dependency injection system.

If you are familiar with MediaWiki's dependency injection system, you might want to skip past this part.

Traditionally in MediaWiki, if some code needed to call some other code from a different class, it might look like this:

class MyClass { 
    function foo($bar) {
        $someObject = Whatever::singleton();
        $result = $someObject->doSomething( $bar );
        return $result;
    }
}

In this code, when functionality of a different class is needed, you either get it via global state, call some static method or create a new instance. This often results in classes that are coupled together.

In the new system we are supposed to use, the code would look like this:

class MyClass {
    private SomeObject $someObject;
    public function __construct( SomeObject $someObject ) {
         $this->someObject = $someObject;
    }
    public function foo( $bar ) {
         return $this->someObject->doSomething( $bar );
    }
}

The idea being, classes are not supposed to directly reference one another. They can reference the interfaces of objects that are passed to it (typically in the constructor), but they are not concretely referencing anything else in the system. Your class is always given the things it needs; it never gets them for itself. In the old system, the code was always referencing the Whatever class. In the new system, the code references whatever object was passed to the constructor. In a test you can easily replace that with a different object, if you wanted to test what happens when the Whatever class returns something unexpected. Similarly, if you want to reuse the code, but in a slightly different context, you can just change the constructor args for it to reference a different class as long as it implements the required interface.

This can make unit testing a lot easier, as you can isolate just the code you want to test, without worrying about the whole system. It can also make it quite easy to extend code. Imagine you have some front-end class that references a back-end class that deals with storing things in the database. If you suddenly need to use multiple storage backends, you can just substitute the implementation of the backend class, without having to implement anything special.

MediaWikiServices

All this is great, but at some point you actually need to do stuff and create some classes that have concrete implementations.

The way this works in "new" MediaWiki, is the MediaWikiServices class. This is responsible for keeping track of "services" which are essentially the top level classes.

Services are classes that:

  • Have a lifetime of the entire request.
  • Normally only have one instance existing at a time.
  • Do not depend on global state (Config is not considered state, but anything about the request, like what page is being currently viewed is state. Generally these services should not depend on RequestContext)

You register services in a service wiring file. This also lets you declare which services your service needs as constructor arguments.
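
As a sketch, a wiring entry looks something like this (the service name and class are invented for illustration):

// ServiceWiring.php - returns a map of service name => factory callback.
// (Extensions point at this file with ServiceWiringFiles in extension.json.)
use MediaWiki\MediaWikiServices;

return [
    'MyExtension.TagRenderer' => static function ( MediaWikiServices $services ) {
        // The factory declares its dependencies by pulling them from the
        // container; nothing else constructs this class directly.
        return new TagRenderer(
            $services->getMainConfig(),
            $services->getLinkRenderer()
        );
    },
];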

Some classes do not fit these requirements. For example, for a class that represents some piece of data, we would expect multiple instances with shorter lifetimes, one per piece of data in question. Generally the approach for such classes is to create a factory class that is itself a service, which makes the individual instances, passing along dependencies as appropriate.

But still the question remains, how do you get these services initially. There is an escape hatch, where you can call MediaWikiServices::getInstance()->getService( $foo ), however that is strongly discouraged. This essentially uses global state, which defeats the point of dependency injection, where the goal is that your class is passed everything it needs, but never reaches out and gets anything itself.

The preferred solution is that the top level entrypoint classes in your extension are specified in your extension's extension.json file.

Typically extensions work on the levels of hooks. This is where you register a class, which has some methods that are called when certain events in MediaWiki happen. You register these hook handlers in your extension's manifest (extension.json) file.

In old MediaWiki, you would just register the names of some static methods. In new MediaWiki, you register a class (HookHandlers), along with details of what services its constructor needs. MediaWiki then instantiates this class for you with appropriate instances of all its dependencies, thus handling all the bootstrapping for you.
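
In extension.json, that registration looks roughly like this (the handler name and class are illustrative; CommentParserFactory and MainConfig are real core service names):

{
    "HookHandlers": {
        "HashtagsMain": {
            "class": "MediaWiki\\Extension\\Hashtags\\HookHandler",
            "services": [ "CommentParserFactory", "MainConfig" ]
        }
    },
    "Hooks": {
        "RevisionFromEditComplete": "HashtagsMain"
    }
}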

To summarize, you tell MediaWiki you need certain services for your hook class. When the hook class is instantiated, it is constructed with the default version of the services you need (Possibly including services you defined yourselves). Those services are in turn instantiated with whatever services they need, and so on.

All this means you can write classes that never reach out to other classes, but instead are always provided with the other classes they need.

What does this have to do with Hashtags?

All this is not generally meant as an extension mechanism. The goal is to make it so classes are isolated from each other for easier testing, modification and understanding.

However, the services bootstrap process does have an idea of default services, which it provides when creating objects that have dependencies. The entire point is to be able to easily replace services with different services that implement the same interface. Sure it is primarily meant as a means of better abstraction and testability, but why not use it for extensibility? In this extension, I used this mechanism to allow extending some core functionality with my own implementation.

One of those services is called CommentParserFactory. This is responsible for parsing edit summaries (Technically, it is responsible for creating the CommentParser objects that do so). This is exactly what we need if we want to change how links are displayed in edit summaries.

In the old system of MediaWiki, we would have no hope in making this extension. In the old days, to format an edit summary, you called Linker::formatComment(). This was a static method of Linker. There would be no hope of modifying it, unless someone explicitly made a hook just for that purpose.

Here we can simply tell MediaWiki to replace the default instance of CommentParserFactory with our own. We do this by implementing the MediaWikiServicesHook. This allows us to dynamically change the default instantiation of services.

There are two ways of doing this. You can either call $services->redefineService() which allows you to create a new version of the Service from scratch. The alternative is to call $services->addServiceManipulator(). This passes you the existing version of the service, which you can change, call methods on, or return an entirely different object.
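
A sketch of the second approach (the wrapper class name here is made up; this is not the extension's actual code):

// Sketch: swap out the default CommentParserFactory via the
// MediaWikiServices hook (handlers for this hook cannot themselves
// have service dependencies).
class ServiceWiringHooks {
    public static function onMediaWikiServices( MediaWikiServices $services ) {
        $services->addServiceManipulator(
            'CommentParserFactory',
            static function ( $factory, MediaWikiServices $container ) {
                // Returning a new object here replaces the service; the
                // wrapper delegates to $factory for everything else.
                return new HashtagCommentParserFactory( $factory );
            }
        );
    }
}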

Hashtags uses the latter method. I essentially implemented it by wrapping the existing CommentParser. I wanted to just replace hashtags with links, but continue to use MediaWiki core for all the other parsing.

Using Services as an extension mechanism

How did this go?

On the bright side, I was able to make this extension, which I otherwise would not have been able to. It indeed works.

Additionally, I feel like simply replacing a class and implementing its interface is a much cleaner abstraction layer for extensibility than executing a callback at random points in the code that can do anything.

There are some frustrations though with using this approach.

First of all, everything is type hinted with the concrete class. This meant I had to extend the concrete class in order for all the type hints to work, even though I didn't really need or want to inherit any methods. Perhaps it would be nice to have some trait that automatically adds all the forwarding magic with __get().

I also intentionally did not call the parent constructor, which I guess is fine because I never call any of the parent's methods, but it does feel a bit odd. The way I implemented my class was to take, in the constructor, an instance of the "base" class that I forward calls to after wrapping them in my own stuff. Part of the reason for this is I wanted to use the old CommentParser object, because constructor functions in MediaWiki are highly unstable between versions, so I didn't want to have to manually construct anything from MW core.

The part where this really falls down is if you want to have multiple extensions layering things. This really only works if they both do so in the same way.

At one point in my extension, I wanted to access the list of tags parsed. In retrospect, perhaps this would have been better implemented by having a separate service alias that is explicitly for my class. Instead, I implemented it by adding an additional method to return this data and just relied on the normal service name. But I ran into the problem that I can't be sure nobody else redefined this service, so how do I know that the service I got is the one I created with this method? I just tested using instanceof, and if not, tried to wrap the returned service again in my class. This has the benefit that if anyone else modifies the CommentParser service in a layered way, I will hopefully run their code and the list of tags will match the result. On the other hand it's icky, so maybe the better method was the aforementioned separate service alias explicitly for my class. However again, that has the problem of how to bootstrap, since I didn't want to ever have to instantiate the base CommentParser class, with MediaWiki's highly unstable constructors.

This whole part seemed icky, and the various options seemed bad. In practice it maybe doesn't matter, as it's unlikely two extensions will wrap this service, but it seems like there should be a better way. I guess wrapping the service is fine; it's extracting additional information that's the problem.

I think documenting best practices on how to "decorate" services would go a long way here to making this better. Right now it is very unclear how to do this in a way that is composable and extensible.

I've also seen examples of people using reflection when trying to "extend" services, but luckily I didn't need to resort to that.

Testability

The last thing I wanted to mention is that I do think this general system has paid dividends. The path to get here was a little rocky, but I do think this style of code is much clearer.

I used to absolutely hate writing tests for mediawiki code. It was such a frustrating developer experience. Now it is much less painful. For this extension I have 100% test coverage (excluding maintenance scripts).  Admittedly it is a relatively small extension, but five years ago, I couldn't imagine bothering to do that for any MediaWiki code.

Conclusion


I had fun writing this extension and am proud of the result. It was interesting to experiment with using MediaWikiServices as an extension mechanism.

Consider trying it out on your wiki, and let me know what you think! The extension should be compatible with MediaWiki 1.39 and higher (using the appropriate REL branch).

More information about it can be found at https://www.mediawiki.org/wiki/Extension:Hashtags



Tuesday, April 23, 2024


MediaWiki Users and Developers Conference Spring 2024

Last week I went to Portland for the MediaWiki Users and Developers conference (née EMWCon). This is primarily a conference for people doing stuff with MediaWiki outside of Wikimedia. I had a blast.



I always enjoy conferences on the smaller side. They feel so much more personal. This year's conference had Ward Cunningham as the guest of honour. Ward was a fascinating person to meet and get to talk to.

I also must say hats off to the organizers - the conference ran smoothly, the venue was great, and the food was amazing. Seriously some of the best food I've ever had at any Wikimedia conference.

This was also my first time in Portland. Portland is a beautiful city. I didn't have a huge amount of time to explore the city, but I did manage to go to the Chinese garden, which was absolutely stunning. I also loved how many interesting murals there were in the city. Even the graffiti seemed prettier than normal.

 
While listening to the talks, I realized that a good talk is very similar to a good design doc. Perhaps this is an obvious comparison, but I never really noticed before how similar the two things are. In both cases, you want to give the reader/viewer context about the problem you want to solve, what solution you chose, why you chose it and how it worked out. At the same time you want to avoid the temptation to go too far into implementation details.

I think my favourite talk was Jeffery's. He demo'd using LLMs to answer questions based on the content of the Wiki. The demo deities weren't fully in his favour, but I think it also demonstrated an important point that LLMs are cutting edge technologies that don't always give the expected answer 100% of the time. In any case, he did a great job presenting.

I did get the sense that I think some participants were disappointed that there was very little representation of WMF management (whether "real" management or product management) at the conference. Birgit did give a remote talk and Selena did come to a happy hour event after the conference, but neither really participated.

I don't think the participants necessarily wanted anything from WMF management, but there is a little bit of a feeling of being unseen. Many of the conference goers use MediaWiki for their own purposes and are interested to know what WMFs plans are for the future and how it will affect them (as do we all).

 

 

I think some participants were hoping to maybe make some connections for better mutual understanding and just reduce uncertainty about what is on the roadmap for MediaWiki. In theory Birgit's talk was about the plans for MediaWiki, but I suspect it was too laden with annual planning corporate buzzwords for anyone to figure out what it actually meant concretely.

The flip side of that, of course, is that open source is a do-ocracy. The corporate MediaWiki users as a general rule do not contribute back to MediaWiki core all that often, which is the price of admission to the various power structures of MediaWiki.

Create Camp

At the create camp, I had a long chat with Mark about what parts of the documentation are unclear to users new to MediaWiki. While I think all of us will admit that our documentation is sub-par (bug 1), it was great to get a fresh perspective on it.
 
I think adding screencasts in addition to the written documentation can help with the problem of assumed knowledge and missing implied steps.

I also heard a bit about SemanticMediaWiki (SMW) bug 5392. This is a bug where sometimes SMW drops properties associated with a page. It seems like there is a lot of frustration among the SMW community over this bug. At the same time, it doesn't seem like anyone has seriously tried to debug it. The bug does look a bit annoying to track down. It appears to be some sort of race condition, appearing somewhat randomly and more often when there are multiple things going on at the same time (e.g. running the job queue with more threads seems to make it more common), but nobody really knows, so there are no steps to reproduce. Additionally, there have been no attempts to create a minimal test case (e.g. what extensions are needed for the bug to appear), nor has anyone posted any debug logs from the parses in question. No one has even determined whether the properties are missing at parse time or are being overridden at a later time. Anyways, I suspect it's going nowhere unless people post a lot more information on the task or hand over a server experiencing the bug to someone good at debugging.

Conclusion

I had a great time. Hopefully I'll be able to come again next year.

Saturday, March 30, 2024


MediaWiki edit summary XSS write-up

 Back in January, I discovered a stored XSS vulnerability in core MediaWiki (T355538; CVE-2024-34507). Essentially by setting a specific edit summary when editing a page, you could run javascript (And take over the account of anyone viewing the edit summary, for example on the history page or recentchanges)

MediaWiki core is generally pretty good when it comes to security. There are many sketchy extensions, and sometimes there are issues where an admin might be able to run javascript, but by and large unauthenticated XSS vulns are fairly rare. I think the last one was CVE-2021-44858 from back in 2021. The one before that was CVE-2017-8815 in 2017, which only applied to wikis configured to have a site language of certain languages (e.g. Serbian and Chinese). At least, those were the ones I found when looking through the list. Hopefully I didn't miss any. In any case, finding XSS triggerable by an unprivileged attacker in MediaWiki core is pretty hard.

So what is the bug? The proof of concept looks like this - Create an edit with the following edit summary:

[[Special:RecentChanges#%1b0000000|link1]] [[PageThatExists#/autofocus/onfocus=alert("xss\n"+document.domain)//|link2]]

This feels a bit random at first glance. How does it work?

The edit summary parser

Whenever you edit a page on MediaWiki, there is a box for your edit summary. This is essentially MediaWiki's version of a commit message.

Very little formatting is allowed in this summary. A major exception is links. You can link to other pages by enclosing the link in [[ and ]].

So this explains a little bit about the proof-of-concept - it involves 2 links. But why 2? It doesn't work with just 1. What is with the weird link targets? They are clearly abnormal, but they also don't look like typical XSS. There are no < or >, there aren't even any unclosed quotes.

Let's take a deeper look at how MediaWiki applies formatting to these edit summaries. The code where all this happens is includes/CommentFormatter/CommentParser.php.

The first thing we might notice is the following line in CommentParser::preprocessInternal: "// \x1b needs to be stripped because it is used for link markers"

In the proof of concept, the first part is [[Special:RecentChanges#%1b0000000|link1]], where %1b appears. This is a good hint that it has something to do with link markers, whatever those are.

Link markers

But what are link markers?

When MediaWiki makes a link, it needs to know whether the page being linked to exists or not, since missing pages use a red colour. The most natural way of doing this is, when encountering a link, to check in the DB whether or not the page exists.

However, there is a problem. When rendering a long page with a lot of links, we have to do a lot of DB lookups. The lookups are simple, but still go to a separate (albeit nearby) server. Each page to look up involves a local network request to fetch the page status. While that is happening, MW just sits and waits. This is all very fast, but even still it adds up a little bit if you have, say, 500 links on a page.

The solution to this problem was to batch the queries. Instead of immediately looking up the page, MW would put a small link marker in the page at that point and carry on. Once it is finished, it would look up all the links all at once, and then do another pass to replace all the link markers.

So this is what a link marker is, just a little marker to tell MW to come back to this spot later after it figured out if all the links exist. The format of this marker is \x1B<number> (So \x1B0000000 for the first one, \x1B0000001 for the second, and so on). \x1B is the ASCII escape character.

Back to the PoC

This explains the first part of the proof of concept: [[Special:RecentChanges#%1b0000000|link1]] - the link target is a link marker. The code has a line:

// Fix up urlencoded title texts (copied from Parser::replaceInternalLinks)
if ( strpos( $match[1], '%' ) !== false ) {
    $match[1] = strtr(
        rawurldecode( $match[1] ),
        [ '<' => '&lt;', '>' => '&gt;' ]
    );
}


Which normalizes titles using percent encoding to use the real characters. Thus the %1B gets replaced with an actual 0x1B byte sequence. The code did try and strip 0x1B characters earlier, but at that point, it was still just %1b and did not match the check.

We now have a link with a link marker inside of it. An important note here is that Special:RecentChanges is not a normal page. It is a special page. MediaWiki knows it exists without having to consult the database, so it does not get the link marker treatment. This is important because we cannot use it as a fake link marker if it gets replaced by a real link marker.

At this stage after inserting link markers, the proof of concept becomes the following string:

<a href="/w/index.php/Special:RecentChanges#\x1B000000" title="Special:RecentChanges">link1</a> \x1B0000000

A link with a link marker inside it!

The second link

The \x1B0000000 is a stand in for [[PageThatExists#/autofocus/onfocus=alert("xss\n"+document.domain)//|link2]].

The replacement at the end is a normal replacement, and everything is fine. However there are now two replacements - there is also the replacement inside the link: href="/w/index.php/Special:RecentChanges#\x1B000000"

This is the fake link marker that we contrived to get inserted. Unlike the normal link markers, this is inside an attribute. The replacement text assumes it is being inserted as normal HTML, not as an attribute. Since it is a full link that also has quotes inside it, the two layers of quotes will interfere with each other.

Once the replacements happen we get the following mangled HTML for our proof of concept:

<a href="/w/index.php/Special:RecentChanges#<a href="/w/index.php/Test#/autofocus/onfocus=alert(&quot;xss\n&quot;+document.domain)//" title="Test">link2</a>" title="Special:RecentChanges">link1</a> <a href="/w/index.php/Test#/autofocus/onfocus=alert(&quot;xss\n&quot;+document.domain)//" title="Test">link2</a>

This obviously looks wrong, but it's a bit unclear how browsers interpret it. A little-known fact about HTML: slashes can separate attributes as long as no equals sign has been encountered yet. After the browser hits the second " mark, it thinks the href attribute is closed and that the remaining text is some additional attributes. The browser essentially parses the above HTML as if it was:

<a href="/w/index.php/Special:RecentChanges#<a href=" w="" index.php="" Test#="" autofocus onfocus="alert(&quot;xss\n&quot;+document.domain)//&quot;" title="Test">link2</a>" title="Special:RecentChanges"&gt;link1</a> <a href="/w/index.php/Test#/autofocus/onfocus=alert(&quot;xss\n&quot;+document.domain)//" title="Test">link2</a>

In other words, an <a> tag, that has an attribute named autofocus and an onfocus event handler. On page load, the link is automatically focused, which triggers the javascript in the onfocus attribute to run, allowing the attacker to do what they want.

Takeaways

I think the major takeaway is that running regexes over partially parsed HTML is always scary. We've had similar issues in the past, for example T110143.

The general pattern we've used to fix this and similar issues, is make sure the replacement token has special characters that would be mangled if it appeared in an unexpected context. Concretely, we added " and ' to the token, which would get escaped if placed in an attribute, and thus no longer matching and no longer being replaced.

More generally though, I think this is a good example of why even a minimal CSP policy would be helpful.

CSP is a complex standard that can do a lot of things and has a lot of pieces. One of the things it can do is disable "unsafe-inline" javascript. This means javascript from attributes (like onfocus) and javascript URLs. Usually this also includes inline <script> tags without a nonce, but that part is optional. A key point here is that this also generally means you cannot execute javascript via .innerHTML anymore, which is a fairly common vector for XSS via javascript.

Normally disabling unsafe-inline would be part of a broader effort to secure javascript, however it's possible to take things a step at a time. This vulnerability would have been stopped just by disabling event attributes. A surprising portion of MediaWiki & extension XSS vulns (excluding boring "an admin can change something in an unsafe way" issues) involve just HTML attributes (or javascript: URLs), which is a web feature that nobody really needs for legitimate reasons and is generally considered bad practice in normal usage. Even the most minimal CSP policy might really help MediaWiki's overall security posture against XSS vulns.

For more info about the vulnerability, please see the original report at https://phabricator.wikimedia.org/T355538.

Wednesday, January 10, 2024


Imagining Future MediaWiki

 As we roll into 2024, I thought I'd do something a little different on this blog.

A common product vision exercise is to ask someone, imagine it is 20 years from now, what would the product look like? What missing features would it have? What small (or large) annoyances would it no longer have?

I wanted to do that exercise with MediaWiki. Sometimes it feels like MediaWiki is a little static. Most of the core ideas were implemented a long time ago. Sure, there is a constant stream of improvements, some quite important, but the core product has been fixed for quite some time now. People largely interact with MediaWiki the same way they always have. When I think of new fundamental features of MediaWiki, I think of things like Echo, Lua and VisualEditor, which can hardly be considered new at this point (in fairness, maybe DiscussionTools, which is quite recent, should count as a new fundamental feature). Alternatively, I might think of things that are on the edges. Wikidata is a pretty big shift, but it's a separate thing from the main experience and also over a decade old at this point.

I thought it would be fun to brainstorm some crazy ideas for new features of MediaWiki, primarily in the context of large sites like Wikipedia. I'd love to hear feedback on if these ideas are just so crazy they might work, or just crazy. Hopefully it inspires others to come up with their own crazy ideas.

What is MediaWiki to me?


Before I start, I suppose I should talk about what I think the goals of the MediaWiki platform is. What is the value that should be provided by MediaWiki as a product, particularly in the context of Wikimedia-type projects?

Often I hear Wikipedia described as a top 10 document hosting website combined with a medium scale social network. While I think there is some truth to that, I would divide it differently.

I see MediaWiki as aiming to serve 4 separate goals:

  • A document authoring platform
  • A document viewing platform (i.e. Some people just want to read the articles).
  • A community management tool
  • A tool to collect and disseminate knowledge

The first two are pretty obvious. MediaWiki has to support writing Wikipedia articles. MediaWiki has to support people reading Wikipedia articles. While I often think the difference between readers and editors is overstated (or perhaps counter-productive as hiding editing features from readers reduces our recruitment pool), it is true they are different audiences with different needs.

What I think is a bit under-appreciated, but just as important, is that MediaWiki is not just about creating individual articles; it is about creating a place where a community of people dedicated to writing articles can thrive. This doesn't just happen at the scale of tens of thousands of people; all sorts of processes and bureaucracy are needed for such a large group to work together effectively. While not all of that is in MediaWiki, the bulk of it is.

One of my favourite things about the wiki world is that it is a socio-technical system. The software does not prescribe specific ways of working, but gives users the tools to create community processes themselves. I think this is one of our biggest strengths, which we must not lose sight of. However, we also shouldn't totally ignore this area and assume the community is fine on its own - we should still be on the lookout for better tools to allow the community to make better processes.

Last of all, MediaWiki aims to be a tool to aid in the collection and dissemination of knowledge¹. Wikimedia's mission statement is: "Imagine a world in which every single human being can freely share in the sum of all knowledge." No one site can do that alone, not even Wikipedia. We should aim to make it easy to transfer content between sites. If a 10-billion-page treatise on Pokemon is inappropriate for Wikipedia, it should be easy for an interested party to set up their own site that can house knowledge that does not fit in existing sites. We should aim to empower people to do their own thing if Wikimedia is not the right venue. We do not have a monopoly on knowledge, nor should we.

As anyone who has ever tried to copy a template from Wikipedia can tell you, making forks or splits from Wikipedia is easy in theory but hard in practice. In many ways I feel this is the area where we have most failed to meet the potential of MediaWiki.

With that in mind, here are my ideas for new fundamental features in MediaWiki:

As a document authoring/viewing platform


Interactivity

Detractors of Wikipedia have often criticized how text based it is. While there are certainly plenty of pictures to illustrate, Wikipedia has typically been pretty limited when it comes to more complex multimedia. This is especially true of interactive multimedia. While I don't have first hand experience, in the early days it was often negatively compared to Microsoft Encarta on that front.

We do have certain types of interactive content, such as videos, slippy maps and 3D models, but we don't really have any options for truly interactive, scriptable content. For example, physics concepts might be better illustrated with interactive experiments, e.g. where you can push a pendulum with the mouse and watch what happens.

One of my favourite illustrations on the web is this one of an Enigma machine. The Enigma machine, for those not familiar, was a mechanical device used in World War II to encrypt secret messages. The interactive illustration shows how an inputted message travels through various wires and rotates various disks to produce the scrambled output. I think it conveys what an Enigma machine fundamentally is better than any static picture or even video ever could.

Right now there is no satisfactory way to make this kind of content on Wikipedia. There was a previous effort in the vein of interactive content, the Graph extension, which allowed using the Vega domain-specific language to make interactive graphs. I've previously written about how I think that was a good effort but ultimately missed the mark. In short, I believe it was too high level, which caused it to lack the flexibility necessary to meet users' needs, while also being difficult to build simplifying abstractions on top of.

I am a big believer that instead of making complicated projects that prescribe certain ways of doing things, it is better to make simpler, lower-level tools that can be combined in complex ways and abstracted over, so that users can build simple interfaces themselves (essentially the Unix philosophy). On wiki, I think this has been borne out by the success of Lua scripting in templates. Lua is low level (relative to other wiki interfaces), but users were able to use it to accomplish their goals without MediaWiki developers having to anticipate every possible thing they might want to do. Users were then able to make abstractions that hide the low-level details in everyday use.

To that end, what I'd like to see is Lua extended to the client side: special Lua interfaces that let a module designate Lua functions to be run on the client (executed by JavaScript), so that parts of a wiki page are scriptable while being viewed, not just while being generated.

I did make some early proof-of-concepts in this direction - see https://bawolff.net/monstranto/index.php/Main_Page for a demo of Extension:Monstranto. See also a longer piece I wrote, as well as an essay by Yurik on the subject that I found inspiring.
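
To give a flavour of what I mean (and not as a description of how Monstranto actually works), here is a minimal hypothetical sketch. The mw.client interface below is entirely made up; the point is just that one Lua module could provide both the server-side rendering and the viewer-side behaviour:

```lua
-- Module:InteractivePendulum (hypothetical)
-- The mw.client interface referenced below is invented for illustration;
-- nothing like it exists in Scribunto today.
local p = {}

-- Runs server-side during parse, like any normal module: produces the
-- initial markup for the illustration.
function p.render( frame )
	return '<div class="pendulum-demo" data-angle="0">Drag me</div>'
end

-- Intended to run in the reader's browser (a JS-hosted Lua runtime),
-- called whenever the reader drags the pendulum.
function p.onDrag( state, event )
	state.angle = math.max( -90, math.min( 90, event.dx ) )
	return state
end

-- Hypothetical registration binding the handler to a client-side event:
-- mw.client.on( '.pendulum-demo', 'drag', p.onDrag )

return p
```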

Mobile editing

This is one where I don't really know what the answer is, but if I imagine MW in 20 years, I certainly hope this is better.

It's not just MediaWiki - I don't think any website has really figured out authoring long text documents on mobile.

That said, I have seen some interesting ideas floating around that I think are worth exploring (none of them are my own):

Paragraph or sentence level editing

This idea was originally proposed about 13 years ago by Jan Paul Posma. In fact, he wrote a whole bachelor's thesis on it.

In essence, mobile editing gets more frustrating the longer the text is. MediaWiki usually works at the granularity of a section, but what about editing at the granularity of a paragraph or a sentence instead? Especially if you just want to fix a typo on mobile, it would be much easier to hit an edit button on a sentence rather than on the entire section.

Even better, I suspect Parsoid makes this a lot easier to implement now than it would have been back in the day.

Better text editing UI (e.g. Eloquent)

A while ago I was linked to a very interesting article by Scott Jenson about the problems with text editing on mobile. I think he articulates very well why it is frustrating, and he also proposes a better UI, which he calls Eloquent. I highly recommend reading the article and seeing if it makes sense to you.

In many ways we can't really do this ourselves, as it is an Android-level UI, not something we control from a web app. Even if we did manage to build it in a web app somehow, it would probably be a hard sell to ordinary users not used to the new UI. Nonetheless, I think it would be incredibly beneficial to experiment with alternative UIs like this and see how far we can get. The world is increasingly going mobile, and Wikipedia is increasingly getting left behind.

Alternative editing interfaces (e.g. voice)

Maybe traditional text editing is not the way of the future. Can we do something with voice control?

It seems like voice-controlled IDEs are increasingly becoming a thing. For example, here is a blog post by someone who programs with voice programming software called Talon. There seem to be a couple of other options out there as well; I see Serenade mentioned quite a bit.

A project in this space that looks especially interesting is Cursorless. The demo looked really cool, and I can imagine that a power user would find a system like this easier for editing large blobs of wikitext than the normal text editing interface on mobile. Anyway, I recommend watching the demo video and seeing what you think.

All this is to say: I think we should look really hard at the possibilities in this space for editing MediaWiki from a phone. On-screen keyboards are always going to suck; we might as well look at other options.

As a community building platform

Extensibility

I think it would be really cool if we had "Lua" extensions. Instead of a normal PHP extension, a user would be able to register or upload some Lua code that subscribes to hooks and does stuff. In this vision, these extensions would not be able to do anything unsafe like output raw HTML, but they would be able to do all sorts of things that users currently reach for JavaScript to do.

These could be per-user or global. Perhaps they could be integrated with a permission system to control what they can and cannot do.
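
As a rough sketch of what I'm imagining - the registerHook() function and the page object passed to the handler are entirely hypothetical; only the hook name corresponds to a real MediaWiki hook - a "Lua extension" might look something like this:

```lua
-- Hypothetical "Lua extension". registerHook() and the page/user objects
-- are invented for illustration; PageSaveComplete is a real MediaWiki
-- hook name, but no Lua binding for it exists today.
local ext = {}

-- Tag newly created proposal pages so a review backlog category stays
-- up to date - the kind of chore currently handled by bots or gadgets.
function ext.onPageSaveComplete( page, user, isNew )
	if isNew and page.namespace == 'Proposal' then
		page:addCategory( 'Proposals awaiting review' )
	end
end

registerHook( 'PageSaveComplete', ext.onPageSaveComplete )

return ext
```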

I'd also like to see a super-stable API abstraction layer for these (and for normal extensions). Right now our extension API is fairly unstable. I would love a simple abstraction layer with hard stability guarantees. It wouldn't replace the normal API entirely, but it would allow simpler extensions to be written in a way that stays stable in the long term.

Workflows

I think we could do more to support user-created workflows. The wiki is full of user-created workflows and processes, some quite complex, others simple - for example, nominating an article for deletion or !voting in an RFC.

Sometimes the more complicated ones get turned into JavaScript wizards, but I think that's the wrong approach. As I said earlier, I am a fan of simpler tools that can be used by ordinary users, not complex tools that do one specific task but can only be edited by developers and exist "outside" the wiki.

There is already an extension in this area (not used by Wikimedia) called PageForms. It is in the vein of what I am imagining, but I think still too heavyweight. Another option in this space is the PageProperties extension, which also doesn't really do what I am thinking of.

What I would really want to see is an extension of the existing InputBox/preload feature.

As it stands right now, when starting a new page or section you can pass a URL parameter to preload some text, along with parameters that get substituted into that text's $1-style markers.

We also have the InputBox extension, which provides a text box where you can enter the name of an article to create with specific text preloaded.

I'd like to extend this idea to allow users to add arbitrary widgets² (form elements) to a page, and to bind those widgets to specific parameters of the preloaded text.

If further processing or complex logic is needed, there could be an option to have the preloaded text pre-processed by a Lua module, allowing complex logic in how the page is edited based on the user's inputs. If there is one theme in this blog post, it is that I wish Lua could be used for more things on wiki.
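
For instance - and this is purely a sketch; the preload entry point and the assumption that the form's widget values arrive as frame arguments are mine - a module backing an RFC-nomination form might look like:

```lua
-- Module:RfcPreload (hypothetical). Assumes the form widgets' values are
-- passed in as frame arguments, and that the returned wikitext is what
-- gets preloaded into the edit box for the user to review and save.
local p = {}

function p.preload( frame )
	local args = frame.args
	local lines = {
		'== RFC: ' .. ( args.title or 'Untitled' ) .. ' ==',
		'{{rfc|' .. ( args.category or 'misc' ) .. '}}',
		args.statement or '',
		'~~~~',
	}
	return table.concat( lines, '\n' )
end

return p
```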

I still imagine the user would be presented with a diff view and have to press save, in order to prevent shenanigans where users are tricked into doing something they don't intend.

I believe this is a very lightweight solution that still gives the community a lot of flexibility to create custom workflows in the wiki that are simple for editors to participate in.

Querying, reporting and custom metadata

This is the big controversial one.

I believe there should be a way for users to attach custom metadata to pages and do complex queries over that metadata (including aggregation). This is important both for organizing articles and for organizing behind-the-scenes workflows.

In the broader MediaWiki ecosystem, this is usually provided by either the SemanticMediaWiki or Cargo extension. On third-party wikis this is often considered MediaWiki's killer feature. People use these extensions to create complex workflows, including things like task trackers. In essence, they turn MediaWiki into a no-code/low-code user-programmable workflow designer.

Unfortunately, these extensions all scale poorly, which prevents their use on Wikimedia sites. Essentially, I dream of seeing the features they provide available on Wikipedia.

The existing approaches are as follows:

  • Vanilla MediaWiki: Category pages, and some query pages.
    • This is extremely limited. Category pages give an alphabetical list, and query pages give some limited pre-defined maintenance lists, such as double redirects or longest articles. Despite these limitations, Wikipedia makes great use of categories.
  • Vanilla MediaWiki + bots:
    • This is essentially Wikipedia's approach to solving this problem: have programs run queries off-site and put the results on a page. I find it a really unsatisfying solution. A Wikipedian once told me that every bot is just a hacky workaround for MediaWiki failing to meet its users' needs, and I tend to agree. Less ideologically, the main issue is that it is very brittle - when bots break, often nobody knows who has access to the code or how to fix it. They also often have significant latency for updates (if they run once a week, the latency is up to 7 days), and ordinary users are not really empowered to create their own queries.
  • Wikidata (including the WDQS SPARQL endpoint)
    • Wikidata is adjacent to this problem but not quite trying to solve it. It is meant more as a central clearinghouse for facts than as a way to do querying inside Wikipedia. That said, Wikidata does have very powerful query features in the form of SPARQL, and results are sometimes copied into Wikipedia via bots. SPARQL, of course, has difficult-to-quantify performance characteristics that make it unsuitable for direct embedding into Wikipedia articles under the MediaWiki architecture. Perhaps it could be iframed, but that is far from a full solution.
  • SemanticMediaWiki
    • This allows adding semantic annotations to articles (i.e. subject-predicate-object style relations) and then querying them with a custom semantic query language. The complexity of the query language makes performance hard to reason about, and it often scales poorly.
  • Cargo
    • This is very similar to SemanticMediaWiki, except it uses a relational paradigm instead of a semantic one. Essentially, users can define DB tables. Typically the workflow is template-based: a template is attached to a table, and specific parameters of the template are populated into the database. Users can then run (sanitized) SQL queries against these tables. The system's indexing strategy is to add one index for every attribute in the relation.
  • DPL
    • DPL is an extension for complex querying and display using MediaWiki's built-in metadata, such as categories. There are many different versions of this extension, but all of them allow queries that scale linearly with the number of pages in the database, and sometimes even worse.

I believe none of these approaches really work for Wikipedia. They either do not support complex queries, or they allow queries so complex that their performance is unpredictable. I think the requirements are as follows:

  • Good read scalability (by "read" I mean scalability when generating pages - during "parse" in MediaWiki speak). On Wikipedia, pages are read and regenerated far more often than they are edited.
    • We want any queries to have very low read latency. Long pauses waiting for I/O during page parsing are bad in the MediaWiki architecture.
    • Queries should scale consistently: at worst roughly O(log n) in the number of pages on the wiki. If using a relational-style database, we would want the number of rows the DBMS has to examine to be bounded by a fixed maximum.
  • Eventual write consistency
    • It is OK if it takes a few minutes for things using the custom metadata to update after it is written. Templates already have a delay before updates propagate.
    • That said, it should still be relatively quick - on the order of minutes, ideally. If it takes a day, or scales badly with the size of the database, that would be unacceptable.
    • Write performance does not have to scale quite as well as read performance, but it should still scale reasonably well.
  • Predictable performance.
    • Users should not be able to do anything that negatively impacts site performance
    • Users should not have to be an expert (or have any knowledge) in DB performance or SQL optimization.
    • Limits should be predictable. Timeouts suck; they vary depending on how much load the site is under and other factors. Queries should either work or not work - their validity should not be run-time dependent. It should be obvious to the user whether their query is acceptable before they try to run it, and there should be clear rules about what the limits of the system are.
  • Results should be usable for further processing
    • e.g. you should be able to use the result inside a Lua module and format it in arbitrary ways.
  • [Ideally] Able to be isolated from the main database, shardable, etc.
  • Be able to query for a specific page, a range of pages, or aggregates over pages (e.g. count how many pages are in a range, average of some property, etc.)
    • Essentially we want just enough complexity to do interesting user-defined queries, but not enough that a user can take any action that affects site performance.
    • There are some other query types that are more obscure and maybe harder, for example geographic queries. I don't think we need to support those.
    • Intersection queries are an interesting case, as they are often useful on wiki. Ideally we would support that too.

Given these constraints I think the CouchDB model might be the best match for on-wiki querying and reporting.

Much of CouchDB's marketing material is centred on its local-data, eventually consistent replication story, which is cool and all, but not what I'm interested in here. A good starting point for how its data model works is the documentation on views. To be clear, I'm not necessarily suggesting we use CouchDB itself, just that its data model seems like a good match for the requirements.

CouchDB is essentially a document database built around the ideas of map-reduce. You can define views, which are somewhat like an index on a virtual column in MySQL. You can also define reduce functions, which compute some function over a view. The interesting part is that the reduce results are indexed in a tree fashion, so you can efficiently get the value of the function applied to any contiguous range of rows in logarithmic time. This makes computing aggregations of the data very efficient - essentially all read queries are cheap. Writes can potentially be less cheap, but it is easy to build controls around that. Creating or editing a reduce function is expensive because it requires regenerating the index, but that should be a rare operation, and users can be told that results may be unreliable until it completes.
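
To illustrate the shape of the model (CouchDB itself writes views in JavaScript; this Lua rendering is just for consistency with the rest of the post): a reduce function has to be able to combine already-reduced partial results, which is exactly what lets the engine cache them at the inner nodes of the index tree and answer any contiguous key range in logarithmic time.

```lua
-- Conceptual sketch of CouchDB-style map/reduce, not real CouchDB code.

-- Map: each document emits zero or more (key, value) pairs into the view.
local function map( doc, emit )
	emit( doc.country, doc.population )
end

-- Reduce: called on raw emitted values, or - when rereduce is true - on
-- partial results previously computed for subtrees of the index. Because
-- partial sums can be combined, a range query only needs to touch
-- O(log n) nodes of the tree.
local function reduce( keys, values, rereduce )
	local total = 0
	for _, v in ipairs( values ) do
		total = total + v
	end
	return total
end
```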

In short, the way the CouchDB data model works as applied to MediaWiki could be as follows:

  • An emit( relationName, key, data ) function is added to Lua. In many ways this is very similar to adding a page to a category named relationName with a sortkey specified by key; data is optional extra data associated with the item. For performance reasons, there may be a (high) limit on the number of emit() calls per page, to keep the DB size from exploding.
  • Lua gets a query( relationName, startKey, endKey ) function. This returns all pages between startKey and endKey along with their associated data. If there are more than X (e.g. 200) matching pages, only the first X are returned.
  • Lua gets a queryReduced( relationName, reducerName, startKey, endKey ) function, which returns the result of the reduction function over the specified range. (The main limitation is that the reduce function's output must be small for this to be efficient.)
  • A way is added to associate a Lua module with a relation as its reduce function. Adding or modifying these functions is potentially an expensive operation, but it is probably acceptable to the user that it takes some time.
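
Here is a sketch of how this might be used on-wiki. The functions match the proposal above, but the exact return shapes (e.g. rows carrying a data field) and the sumPopulation reducer are assumptions of mine; none of this exists in Scribunto today.

```lua
-- Hypothetical usage of the proposed emit()/query()/queryReduced()
-- interfaces.
local p = {}

-- Called from a city infobox template: record metadata about this page.
function p.declare( frame )
	emit( 'cities-by-country', frame.args.country, {
		page = mw.title.getCurrentTitle().fullText,
		population = tonumber( frame.args.population ) or 0,
	} )
	return ''
end

-- Called from a report page: list a country's cities and show a total
-- computed by the (hypothetical) sumPopulation reducer.
function p.report( frame )
	local country = frame.args.country
	local rows = query( 'cities-by-country', country, country )
	local total = queryReduced( 'cities-by-country', 'sumPopulation', country, country )

	local out = {}
	for _, row in ipairs( rows ) do
		table.insert( out, '* [[' .. row.data.page .. ']] - ' .. row.data.population )
	end
	table.insert( out, 'Total population: ' .. tostring( total ) )
	return table.concat( out, '\n' )
end

return p
```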

All the query types here are efficient. It is not as powerful as arbitrary SQL or semantic queries, but it is still quite powerful: it allows computing fairly arbitrary aggregations as well as returning results in a user-specified order. The main slow part is when a reduction function is edited or added, which is similar to how a template used on very many pages can take a while to update. Emitting a new item may also be a little slower than reading, since the reducers have to be updated up the tree (with possible contention on the root node), but that is a much rarer operation, and users would likely see it as similar to the current delays in updating templates.

I suspect such a system could also potentially support intersection queries with reasonable efficiency subject to a bunch of limitations.

All the performance limitations are easy for the user to understand. There is a maximum number of items that can be emitted from a page, to prevent someone emit()ing 1000 things per page; a maximum number of results that can be returned from a query, to prevent querying the entire database; and a maximum number of queries allowed per page. The queries involve reading a limited number of rows, often sequentially. The system could probably be sharded fairly easily if a lot of data ends up in the database.

I really do think this sort of query model hits the sweet spot of complex querying with predictable, good performance, and it would be ideal for a MediaWiki site running at scale that wanted SMW-style features.

As a knowledge collection tool

Wikipedia can't do everything. One thing I'd love to see is better integration between different MediaWiki servers to allow people to go to different places if their content doesn't fit in Wikipedia.

Template Modularity/packaging

Anyone who has ever tried to use Wikipedia's templates on another wiki knows it is a painful process. Just finding all the dependencies is complex, not to mention handling templates that rely on Wikidata or JsonConfig (the Commons Data: namespace).

The templates on a wiki are not just user content, but complex technical systems. I wish we had a better system for packaging and distributing them.

Even within the Wikimedia movement there are frequent calls for global templates. Certainly a good idea, but it would be less critical if templates could be bundled up and shared. Even then, having distinct boundaries around templates would probably make global templates easier than the current mess of dependencies.

I should note that there are already extensions in this vein, for example Extension:Page_import and Extension:Data_transfer. They are nice and all, but I think it would be even better to have the concept of discrete template/module units on wiki, so that related components are organized together in a way that is easier to follow.

Easy forking

Freedom to fork is the freedom from which all others flow. In addition to giving people who disagree with the status quo an avenue to do their own thing, easy forking/mirroring is critical when censorship is at play and people want to mirror Wikipedia somewhere we cannot normally reach. However, running a wiki the size of English Wikipedia is quite hard, even if you have no traffic. Simply importing an XML dump into a MySQL DB can be a struggle at the sizes we are talking about.

I think it would be cool if we made ready-to-go SQLite DB dumps, perhaps packaged with MediaWiki as a phar archive, so you could essentially just download one huge 100 GB file, plop it somewhere, and have a mirror/fork.

Even better if it could integrate with EventStream to automatically keep things up to date.

Conclusion

So those are my crazy ideas about what is missing in MediaWiki (with an emphasis on the Wikipedia use case rather than the third-party use case). Agree? Disagree? Hate it? I'd love to know. Maybe you have your own crazy ideas. You should post them - after all, your crazy idea cannot become reality if you keep it to yourself!

Notes:

¹ I left out "Free", because as much as I believe in "Free Culture" I believe the free part is Wikimedia's mission but not MediaWiki's.

² To clarify, by widgets I mean buttons and text boxes. I do not mean widgets in the sense of the MediaWiki extension named "Widgets".