Talk: Using Natural Language Processing to determine the quality of Wikipedia articles

Okay, I know that Wikimania is over, but I still have a backlog of talk notes that I simply didn’t have time to post whilst at the conference, PLUS WikiBlogPlanet stalled for 3 days whilst I was at the conference. Yes, I know, I suck. I’ll try to make up for that by posting my talk notes over the next few days. So I know they’re not super-fresh, but they are only a few days old. Oh, and they’re in note form, rather than complete sentences.

Talk notes: “Using Natural Language Processing to determine the quality of Wikipedia articles” by Brian Mingus. Speaker’s background is in studying brains.

Quality should not be imposed. It should come from the community.

Brains transform information like this: letters -> words -> sentences -> paragraphs -> sections -> discourse -> argument structure.

Wikipedia has many more readers than editors.

Learning is critical. See [[Maximum Entropy]]. Similar to neural networks. Neurons that fire together wire together. The brain is an insanely complicated machine that finds meta-associations. The more examples of a phenomenon it sees, the smarter it gets. Things associated with relevant features get strengthened; things that are irrelevant get weakened. Each brain is unique, so everyone’s idea of quality is different. Each individual brain is invaluable, based on a lifetime of experience.

Some qualitative measures of article quality:

  • clarity

  • flow

  • precision

  • unity (outline)

  • authority (footnotes)

  • and many more things besides…

Quantitative versus qualitative.

Picking a good apple for beginners:

  1. Apple should be round

  2. Apple should have even colouring

  3. Apple should be firm

It takes about 10 years to become an expert in something. E.g. someone who has farmed apples for 10 years would not go through a checklist – they “just know” whether it’s a good apple.

Some example quantitative indicators of article quality:

  • number of editors

  • images

  • length

  • number of internal and external links

  • featured article status.

Has a function for testing each of these and many more – each function gets passed the wikitext, the HTML text and the plain text (as 3 arguments).
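I imagine the feature functions look something like this (a minimal Python sketch; the function names and regexes are my guesses, not from the talk):

```python
import re

# Hypothetical feature functions in the style described in the talk:
# each takes the wikitext, the HTML text and the plain text, and
# returns a single number.

def num_images(wikitext, htmltext, plaintext):
    """Count image inclusions in the wikitext."""
    return len(re.findall(r"\[\[(?:File|Image):", wikitext))

def article_length(wikitext, htmltext, plaintext):
    """Length of the readable text, in characters."""
    return len(plaintext)

def num_external_links(wikitext, htmltext, plaintext):
    """Rough count of external links in the wikitext."""
    return len(re.findall(r"\[https?://", wikitext))

FEATURES = [num_images, article_length, num_external_links]

def feature_vector(wikitext, htmltext, plaintext):
    """One number per feature, in a fixed order."""
    return [f(wikitext, htmltext, plaintext) for f in FEATURES]
```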

On the English Wikipedia, the “editorial team” hand-rates the quality of articles. They also rate the importance of each article. E.g. there are around 1,428 Featured Articles on the English Wikipedia.

Wanted software that would try to predict the rating of each article.

Assigns a weight to each feature.
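A maximum-entropy classifier over those features is equivalent to multinomial logistic regression, so a toy version (using scikit-learn; the numbers here are made up) might look like:

```python
from sklearn.linear_model import LogisticRegression

# X: one feature vector per article (e.g. from feature_vector() above);
# y: the hand-assigned rating of each article (0 = bad, 1 = okay, 2 = good).
X = [[0, 300, 0], [2, 1500, 4], [15, 45000, 60]]  # toy data
y = [0, 1, 2]

# Training assigns a weight to every (feature, rating) pair.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

print(model.coef_)  # the learned weights, one row per rating class
```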

The model learned that Wikipedia is mostly low-quality material, so in the first iteration it learned to classify everything as bad.

To prevent this, he had to restrict learning to a subset of articles, because the number of stubs vastly outweighs the number of complete articles on Wikipedia.
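A minimal sketch of that kind of class balancing (my own code, not his):

```python
import random
from collections import defaultdict

def balanced_subset(articles, ratings, seed=0):
    """Downsample so each rating class contributes equally many articles.

    Without this, stubs dominate and the model can score well by
    simply calling everything bad.
    """
    by_class = defaultdict(list)
    for article, rating in zip(articles, ratings):
        by_class[rating].append(article)
    n = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    subset = []
    for rating, group in by_class.items():
        for article in rng.sample(group, n):
            subset.append((article, rating))
    rng.shuffle(subset)
    return subset
```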

Used 3 categories / ratings for quality: 1) Good, 2) Okay, 3) Bad.

Can test how well the model works using a “confusion matrix”, which shows how confused the model is by tallying what it gets wrong (e.g. classifying an FA as “bad”, or a stub article as “good”).
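E.g. with scikit-learn (the ratings here are toy data):

```python
from sklearn.metrics import confusion_matrix

labels = ["bad", "okay", "good"]
y_true = ["good", "bad", "okay", "bad", "good"]  # hand ratings
y_pred = ["good", "bad", "bad",  "bad", "okay"]  # the model's guesses

# Rows are true ratings, columns are predictions; the off-diagonal
# cells show exactly where the model is "confused", e.g. an FA
# classified as "bad".
print(confusion_matrix(y_true, y_pred, labels=labels))
```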

Result: The model was learning.

PLoS ONE has an alternative rating system. They use 3 categories:

  • Insight

  • Style

  • ?? (sorry, I missed this item)

Need to avoid groupthink – raters should only see ratings from others after giving their own rating. Can’t allow a rater to know everyone else’s ratings before they rate, or else you get weird effects (like groupthink) – avoid this by keeping the ratings independent.

Performance: Takes 24 hours to run on 120,000 articles, on a cluster with 26 nodes.
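(That works out to roughly 120,000 / 24 ≈ 5,000 articles per hour across the cluster, or about 190 articles per node per hour.)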

Link to talk’s page.