Talk: Embeddable Wiki Engine; Proof of Concept

Wikimania 2007 talk notes: “Embeddable Wiki Engine; Proof of Concept” – by Ping Yeh, Google Taiwan R&D Centre.

Want to make a Wikipedia off-line client.

Problem: Currently only one piece of software is guaranteed to correctly render the database dumps – MediaWiki itself.

Wants a reusable MediaWiki parser. This would:

  • Ensure the correctness of the Parser

  • Reduce manpower required

Showed a diagram of the typical software architecture of a wiki system.

MediaWiki is very tied to a SQL engine. But for embeddable use you can only really assume flat files for data storage.

Project EWE is the code name of the project to test some of these ideas. An attempt to make components and the wiki engine reusable by many programming languages. A preliminary version is ready.

Split the parser into a parser (transforms wiki mark-up into a document tree), and a formatter.

The document tree uses DocNodes – page -> section -> header / paragraph / list. Each component is separate, so each can be replaced independently.
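
A rough sketch of what such a tree could look like (the class and field names are illustrative only, not EWE’s actual API):

    # Illustrative only: a tiny document-tree node, not EWE's real classes.
    class DocNode:
        def __init__(self, kind, text="", children=None):
            self.kind = kind                # "page", "section", "header", "paragraph", "list", ...
            self.text = text                # text content for leaf nodes
            self.children = children or []

    # page -> section -> header / paragraph
    page = DocNode("page", children=[
        DocNode("section", children=[
            DocNode("header", "Introduction"),
            DocNode("paragraph", "Hello, wiki world."),
        ]),
    ])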

MediaWiki parser: uses GNU flex to specify the syntax, based on the help pages. No template support yet. Has a manually crafted parser to turn the tokens into a document tree.

HTMLformatter: trivial conversion from DocTree to HTML tags.
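
A formatter in that style can just walk the tree recursively; a minimal sketch, reusing the hypothetical DocNode above (the tag mapping is invented for illustration):

    # Hypothetical recursive formatter: maps each node kind to an HTML tag.
    TAGS = {"page": "div", "section": "section", "header": "h2", "paragraph": "p"}

    def to_html(node):
        inner = node.text + "".join(to_html(child) for child in node.children)
        tag = TAGS.get(node.kind, "div")
        return "<%s>%s</%s>" % (tag, inner, tag)

    print(to_html(page))  # <div><section><h2>Introduction</h2><p>Hello, wiki world.</p></section></div>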

Intending to make this a wiki library that can be called from C++ or PHP.

Things they want to add:

  • Language bindings for Python.

  • MoinMoin compatibility.

  • XML formatter.

  • MediaWiki formatter – need to add templates.

  • A search engine.

Problems:

  • Compatibility with extensions – e.g. the <math> extension changes output for MediaWiki.

  • Wikipedia has too many extensions. Re-implementing each extension would be a lot of work – essentially a duplication of effort.

Link to talk’s page.

Note: I’ll post more notes from other talks in a few days’ time.

Talk: Metavid & MetaVidWiki

Wikimania 2007 talk notes: “Metavid & MetaVidWiki” by Michael Dale

Basic idea: Video + wiki

Metavid:

  • GPLv2 license

  • Free database dump.

Currently seems focussed on US congress speeches.

Can drag and drop videos in a list of videos to make a play list.

Current Metavid limits:

  • Very focussed on the US Congress (because funded by a foundation that particularly cares about this).

  • No way to collectively improve the meta data.

  • Not browser neutral (prefers Firefox).

  • Not very scalable.

Currently rewriting (got a grant / funding) to make MetaVidWiki, using:

  • MediaWiki (so an open wiki architecture)

  • Semantic MediaWiki (so proper machine-readable relationships between things).

Have 3 namespaces for wiki video (a rough sketch follows the list):

  • Metadata (e.g. who is speaking).

  • time information (e.g. when speaker changes or starts a new chapter, etc.).

  • editing sequences.
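
A rough sketch of the kind of per-segment record those three namespaces suggest (all names are invented for illustration; MetaVidWiki’s actual schema will differ):

    # Illustrative only: one annotated segment of a video, combining
    # metadata (speaker), time information, and editable wikitext.
    from dataclasses import dataclass

    @dataclass
    class SegmentAnnotation:
        speaker: str          # metadata: who is speaking
        start_seconds: float  # time information: where the segment starts
        end_seconds: float    # time information: where the segment ends
        wikitext: str         # editable description of this segment

    clip = SegmentAnnotation("Example Speaker", 120.0, 185.5, "Opening remarks.")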

HTML5 will include a <video> tag, with Ogg Theora as the suggested baseline codec. Will be compatible with this, or can fall back to plug-in methods for browsers that do not support HTML5.

Saw an interactive demo – looks pretty cool – has a video playback window, plus a coloured timeline showing the splices between video segments, plus an area where people can view and edit wikitext about each segment of the video. Can also drag-and-drop the segments to rearrange the video.

Looking to go live with MetaVidWiki (with a reduced feature set) within a month or two.

Link to talk’s page.

Talk: A content driven reputation system for Wikipedia

Wikimania 2007 talk notes: “A content driven reputation system for Wikipedia.” by Luca De Alfaro.

Trusting content. How can we determine automatically if an edit was probably good, or probably bad, based on automated methods?

Authors of long-lived content gain reputation.

Authors who get reverted lose reputation.

Longevity of edits.

Labels each bit of text by the revision in which it was added.

Text is short-lived when <= 20% of it survives to the next revision.

Have to keep track of live / current text and the deleted text.

Did you bring the article closer to the future? Did you make the page better? If so reward. If you went against the future, then you should be punished.

Ratio that was kept versus ratio that was reverted.

+1 = everything you changed was kept

-1 = everything you did was reverted.
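
A back-of-the-envelope sketch of the per-edit score this implies (my own simplification for illustration, not De Alfaro’s actual algorithm):

    def edit_quality(chars_changed, chars_kept_in_next_revision):
        """Score an edit on [-1, +1]: +1 if everything changed was kept,
        -1 if everything was reverted. Illustrative simplification only."""
        if chars_changed == 0:
            return 0.0
        kept_ratio = chars_kept_in_next_revision / chars_changed
        return 2.0 * kept_ratio - 1.0   # keep-ratio 0..1 mapped onto -1..+1

    print(edit_quality(200, 150))   # an edit that is 75% kept scores 0.5
    print(edit_quality(200, 0))     # a fully reverted edit scores -1.0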

Run it on a single CPU, no cluster :-)

Reputation – data shown only for registered users.

89% of the edits by low reputation users were against the future direction of the page.

Versus only 5% of the edits by high reputation users were against the future direction of the page.

I.e. there is a reasonable correlation between the reputation of the user, and whether edit was with the future direction of the page.

Can catch around 80% of the things that get reverted.

See demo: http://trust.cse.ucsc.edu/

An author lends 50% of their reputation to the text they create.
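
A minimal sketch of that lending rule (the 50% figure is from the talk; the scale and names are invented):

    def initial_text_trust(author_reputation, lend_fraction=0.5):
        # New text starts with a fraction of its author's reputation as its
        # trust value; illustrative only, not the actual trust algorithm.
        return lend_fraction * author_reputation

    print(initial_text_trust(9.0))  # text by a high-reputation author starts at 4.5
    print(initial_text_trust(1.0))  # text by a low-reputation author starts at 0.5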

Want an option or special page to show the trustedness of a page. Time to process an edit is less than 1 second. Storage required is proportional to the size of the last edit.

Instead of trusting the whole article or not, can have partial trust in parts of the article. Could provide a way of addressing concerns about whether Wikipedia can be trusted – this way vandalism would likely be flagged as untrusted content.

Speaker would like a live feed of Wikipedia Recent Changes to continue and improve this work. Or perhaps it could run on toolserver? Erik later emphasised that having a good and friendly UI would make it easier to help with getting weight behind this tool.

Link to talk’s page.

Talk: Using Natural Language Processing to determine the quality of Wikipedia articles

Okay, I know that Wikimania is over, but I still have a backlog of talk notes that I simply didn’t have time to post whilst at the conference, PLUS WikiBlogPlanet stalled for 3 days whilst I was at the conference. Yes, I know, I suck. I’ll try to make up for that by posting my talk notes over the next few days. So I know they’re not super-fresh, but they are only a few days old. Oh, and they’re in note form, rather than complete sentences.

Talk notes: “Using Natural Language Processing to determine the quality of Wikipedia articles” by Brian Mingus. Speaker’s background is in studying brains.

Quality should not be imposed. It should come from the community.

Brains transform information like this: letters -> words -> sentences -> paragraphs -> sections -> discourse -> argument structure.

Wikipedia has many more readers than editors.

Learning is critical. See [[Maximum Entropy]]. Similar to neural networks. Neurons that fire together wire together. The brain is an insanely complicated machine that finds meta-associations. More examples of a phenomenon lead to more smartness. Things associated with relevant features get strengthened; things that are irrelevant get weakened. Each brain is unique, so everyone’s idea of quality is different. Each individual brain is invaluable, based on a lifetime of experience.

Some qualitative measures of article quality:

  • Clarity

  • flow

  • precision

  • unity (outline)

  • authority (footnotes)

  • and many more things besides….

Quantitative versus qualitative.

Picking a good apple for beginners:

  1. Apple should be round

  2. Apple should have even colouring

  3. Apple should be firm

Takes about 10 years to become an expert in something. E.g. someone who has been an apple farmer for 10 years would not go through a checklist – they “just know” if it’s a good apple.

Some example quantitative indicators of article quality:

  • Number of editors

  • images

  • length

  • number of internal and external links

  • featured article status.

Has a function for testing each of these and many more – these functions get passed the wikitext, the HTML text, and the plain text (as 3 arguments).
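
Based on that description, each feature function might look roughly like this (the signatures and regexes are my guess at the shape described, not the speaker’s actual code):

    import re

    # Hypothetical feature functions: each takes the wikitext, the rendered
    # HTML, and the plain text, and returns a single number.
    def num_internal_links(wikitext, htmltext, plaintext):
        return len(re.findall(r"\[\[[^\]]+\]\]", wikitext))

    def num_images(wikitext, htmltext, plaintext):
        return htmltext.count("<img")

    def article_length(wikitext, htmltext, plaintext):
        return len(plaintext)

    FEATURES = [num_internal_links, num_images, article_length]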

On the English Wikipedia, the “editorial team” hand-rates the quality of articles. They also rate the importance of each article. E.g. there are around 1,428 Featured Articles on the English Wikipedia.

Wanted some software that would try to predict the ratings of each article.

Assigns weights to each feature.

Model learned that Wikipedia was mostly low quality material. Therefore it learned to classify everything as bad, in the first iteration.

To prevent this, had to restrict learning to a subset of articles, because the number of stubs vastly outweighs the number of complete articles in the Wikipedia.

Used 3 categories / ratings for quality: 1) Good, 2) Okay, 3) Bad.

Can test how well the model works using “a confusion matrix”, which shows how confused the model is, by showing how much stuff it gets wrong (e.g. by classifying a FA as “bad”, or a stub article as “good”)
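
For example, the matrix can be tallied by counting (true rating, predicted rating) pairs – a generic sketch, not the speaker’s code:

    from collections import Counter

    LABELS = ["good", "okay", "bad"]

    def confusion_matrix(true_ratings, predicted_ratings):
        # Rows are the true rating, columns the predicted rating;
        # off-diagonal counts show where the model is "confused".
        counts = Counter(zip(true_ratings, predicted_ratings))
        return [[counts[(t, p)] for p in LABELS] for t in LABELS]

    # e.g. a featured article predicted as "bad" lands off the diagonal:
    print(confusion_matrix(["good", "bad", "bad"], ["bad", "bad", "okay"]))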

Result: The model was learning.

PLoS One has an alternative rating system. They use 3 categories:

  • Insight

  • Style

  • ?? (sorry, I missed this item)

Need to avoid group-think – show ratings from others only after someone has given their own rating. Can’t allow a rater to know everyone else’s ratings before they give theirs, or else you get weird effects (like groupthink) – avoid this through independence.

Performance: Takes 24 hours to run on 120,000 articles, on a cluster with 26 nodes.

Link to talk’s page.

Wikimania Talk notes: “Where have all the editors gone?”

I’ll copy and paste my notes for some of the talks at Wikimania 2007 here, in case it helps everyone follow what’s going on. As such they will be in point / summary form, rather than well-formed prose:

 

Talk: “Where have all the editors gone?” by Seth Anthony. Background in chemistry education. A user since 2003 – and he has seen people come and go. This raised some questions for him:

 

Who adds real content to the Wikipedia? Not just correcting typos and wikification.

 

Not all edits are created equal. Some are negative (e.g. vandalism). Some are positive (e.g. tweaks, spelling, formatting, wikifying), and some are really positive (adding content). Some have possible value (e.g. admin / bureaucracy / discussion).

 

Made a study using a sample of edits to the Wikipedia. (Size of sample not clear.)

Facts & figures on findings:

  • 28% of edits: outside article namespace.
  • 10% article talk pages.
  • 62% article namespace.

(i.e. 1/3 of the edits are not about the articles)

 

Breakdown of edits:

  • 5% vandalism
  • 45% of edits are tweaking / minor changes / adding categories.
  • 12% content creation. Of that, 10% is adding content to already existing articles, and 2% is creating new articles.
  • Rest? (Probably discussion?)

 

So only 12% of edits create fresh content.

 

He was most interested in these 12%, so he broke them down:

  • 0% were made by admins
  • 69% were registered users.
  • 31% were created by anon users, or non-logged in users.

 

… and only 52% were by people who had a user page. I.e. only half of the people had a name-based online identity.

 

Editors are not homogeneous.

 

Content creators versus admins (for the English Wikipedia, in 2007):

  • Number of edits: admins 12900 vs content creators 1700 (admins edit more)
  • % of edits in the main namespace: admins 51% vs content creators 81% (admins spend a smaller proportion of their edits on content)
  • Average number of edits per page: admins 2.27 vs content creators 2.2 (about the same)
  • Edits per day: admins 16 vs content creators 5 (admins are more active)

 

Breakdown of each group:

 

Breakdown of Content creators –

  • Anons – 24% of high-content edits are by anon users. Drive-by editors. Who are they? One-time editors. Are they normal editors who are accidentally logged out?
  • 28% are “dabblers”: fewer than 150 edits, editing for more than 1 month, not very likely to have a user page.
  • 48% are “wikipedians”: more than 1 edit per day, almost all have edited within the last week, they create articles. They tend to have a focus area, e.g. “Swedish royalty” – great work in a specific area. They are subject-area editors. Generally have > 500 edits.

 

Admins breakdown – have 2 groups – “A” and “B”

  • Admins A: 70% of admins make 20%-60% of their edits in the article space.
  • Admins B: 30% of admins make 60-90% of their edits in the article space. Only 1/3 of admins are in this class.

 

Were group “B” admins once “anon wikipedians”? The short answer seems to be “yes”.

  • Admins A: Early edits were less on articles.
  • Admins B: Early edits were more on articles.

 

So 4 distinct groups of editors:

  • Anons / drive-by editors
  • Occasional dabblers
  • Subject area editors – Anons + Admins B
  • Professional bureaucrats – Admins A.

 

A possible early indicator: Type A admins create their user page sooner than Admins B :=)