Rough cartogram of Wikipedia edits by country

A very rough cartogram of Wikipedia edits by country, from the data on Meta:

Wikipedia edits by country

Disclaimers:

  • I really really don’t know what I am doing with cartogram software, so please take this with a huge grain of salt.
  • There was no country for Singapore, so it is under-represented.
  • I don’t think there was any data for Africa at all, so I’m not sure why it’s even showing!
  • Similarly, any countries not listed in the data are under-represented. This discriminates against countries with less than 0.1% of the edits (i.e. small countries, or countries with only a few editors).
  • The data is not hugely accurate (it’s only given to a tenth of a percent).
  • The data is from about a year ago, so things may have changed since (e.g. China, with its on-again-off-again firewalling of Wikipedia, was not in this data, but Hong Kong was).

Wikimania 2007 talk: “Virtual and national cultures”

Talk: “Virtual and national cultures: Wikimedia, projects and organisation” by Delphine Ménard.

[If you’re watching the video of this, then skip forward a few minutes until the slides start working, because there was an A/V glitch at the start with getting the slides to display]

One of the funniest edit wars on the French Wikipedia was over “endive” (a plant) – “chicon” versus “endive”, two different names for the same thing (“endive” in France, “chicon” in Belgium). The page now uses both “chicon” and “endive” all the way through.

Yoghurt versus yogurt – which spelling do we use? On the English Wikipedia, it seems to be first come, first served – whoever writes the page first gets to determine the page’s spelling.

How much do real life cultures impact the Wikipedia?

The Spanish Wikipedia is one of the few Wikipedias that calls admins something else – namely “bibliotecarios”, which means “librarians”. Which kind of makes sense: like librarians, they keep the place clean, stop bad behaviour, make the place welcoming, and help people when they need help.

The German Wikipedia banned “Fair Use” outright pretty early on (it is not recognised in the German legal system). English allows it. French is somewhere in between these two perspectives.

[[Henry the Navigator]] – the same article, on 3 different Wikipedias. German: just the facts. Portuguese: has the facts, plus says he was good. The English one has a dispute about whether or not he was a homosexual!

Village pump: In French this is called “Le Bistro” – the café – more informal.

Requests for adminship – on the French Wikipedia, you CAN self-nominate. On the English one, it’s far less common to self-nominate.

How culture affects local Wikimedia organisations: today we have 10 official chapters, on 3 continents. In the US: do we have one US chapter, or state-based chapters? Versus Argentina, which wants a country-wide organisation.

Q&A: Heard from the Indonesian Wikipedia: European conflicts spill over onto the Indonesian articles, e.g. conflicts over the geographical name of something (which name comes first). They have to try to work together to find a solution that is acceptable to both sides.

Link to talk’s page.

Wikimania 2007 talk: “Enhancing wikis with social networking tools”

Talk: “Enhancing wikis with social networking tools” by Evan Prodromou (surname pronunciation guide: “Pro-Dro-Mo”, the “u” is silent)

WikiTravel founder

Kei.ki co-founder

Q: What makes wikis great? A: The cumulative effect – everyone can edit. Easy and adaptive. Progressive improvement. You get something close to consensus opinions.

… but wikis can’t do everything! Don’t over-apply wikis. Example: the WikiClock – it lets users update the current time, manually. Surprisingly accurate (only a few minutes behind).

Wikis are not good for:

  • Automatable jobs

  • Personal opinion

  • “Protected” content & structured content

2 types of wiki communities:

  • Community of practice (let’s get something done)

  • Community of interest (shared areas of interest)

Numbers:

  • 65% of WikiTravel users are engaged for 1 day or less (1 edit – never seen again)

  • 95% for less than 1 month

  • This accounts for 70% of edits – so lots of content comes from people who don’t stick around.

Want to retain users, keep users engaged.

Features that work well with wikis:

  • Social networking

  • Blogs

  • Photo sharing

  • Forums

  • Social bookmarking

I.e. the whole web 2.0 playbook.

This works well, and complements wikis well.

Tagging / concepts / categories. Gives a way to associate wiki content with non-wiki content. Unity of content (wiki) versus a multitude of chaotic content (social side)

Don’t repeat yourself: people don’t want yet another blog, or a new Flickr site. They want tagging. This is done via RSS, FOAF, and web APIs.
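
(Note from me: as a rough sketch of what this gluing looks like in practice – my own illustration, not the talk’s code; the feed URL and tag are made up:)

    import feedparser  # pip install feedparser

    # Hypothetical RSS feed of photos tagged "montreal" on some photo-sharing site.
    FEED_URL = "https://photos.example.com/feeds/tag/montreal.rss"

    for entry in feedparser.parse(FEED_URL).entries:
        # Each entry carries enough metadata (title, link, tags) to be shown
        # alongside the wiki content for the same topic.
        tags = [t.term for t in getattr(entry, "tags", [])]
        print(entry.title, entry.link, tags)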

There are various ways of gluing stuff together (he likes Drupal, for example).

Case study #1: WikiTravel Extra.

  • Travel is very personal

  • Opinions

  • Reviews! Want to capture these.

  • Travel photos

Technology:

  • Drupal

  • Shared login via OpenID (WikiTravel was already using it)

  • Lots and lots of plugins

  • Custom software and glue code to bring it all together

Showed WikiTravel Extra:

  • Same logo and look-and-feel as WikiTravel

  • Lots of different content and blog posts

  • Has geographic forums

  • Has photo sharing

Results: Have had good feedback from Extra.

Case study #2: Kei.ki (the name is Hawaiian for “child”).

Decided to create a free content parenting guide, open to everyone, edited by everyone.

Even more than travellers, parents like to share their experiences.

There are privacy concerns (e.g. sharing kids’ photos).

There is no existing wiki community, so they need to focus more on content production, and less on the wiki community.

Results: gave a Kei.ki demo. It just opened today! Launching in French and English.

URL: http://kei.ki/

Q: Will Kei.ki be made into a paper book?

A: Hopefully, yes.

Link to talk’s page.

Wikimania 2007 talk: “MediaWiki API”

Talk: “MediaWiki API” by Yuri Astrakhan

Summary: “It’s all about the data, stupid!”

Yurikbot: More than 3 million edits.

The API adds a new layer of access. Allows new clients to access data (e.g. JavaScript browser extensions, Standalone apps like vandal fighter, data gathering for researchers).

Does not use the HTML rendering code.

An example of this: the navigation pop-ups extension uses the API (because it is faster).

Current situation:

  • We have login

  • Can query existing data

  • Multiple output formats.

Coming:

  • Change data (this is currently in development by the Spanish Vodafone folks, among others)

The API is very modular – you can add things, and they will just plug in.

Can get some of the following:

  • Page information

  • Lists (e.g. list of backlinks to a page)

  • Metadata (servers)

  • Can get multiple types of information all in one query

  • Conveniences to avoid gotchas – e.g. normalisation, resolving redirects, etc.

(gave some demos of these on the live Wikipedia).
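
(Note from me: to give a flavour of what such a query looks like – my own example, not the one demoed; the page titles are arbitrary:)

    import requests

    API = "https://en.wikipedia.org/w/api.php"

    params = {
        "action": "query",
        "format": "json",
        "titles": "WWII",       # a redirect...
        "redirects": 1,         # ...which the API resolves for us
        "prop": "info",         # page information
        "list": "backlinks",    # pages linking to bltitle
        "bltitle": "Wiki",
        "bllimit": 5,
        "meta": "siteinfo",     # server metadata, all in the same query
    }
    data = requests.get(API, params=params).json()
    print(data["query"]["pages"])
    print(data["query"]["backlinks"])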

Tries to be very quick – only does the work that you ask it for.

Gave API performance figures: the API gets 2% of Wikimedia site hits, yet uses 0.1% of CPU load – i.e. the API is 20 times more efficient than the main UI. However, it is currently a read-only API, so you would expect better performance than the main UI (which also has to do writes).

Most of the hits are currently coming from the OpenSearch module.

Future: Want to add unit tests for the API (Note from me: it’s coming, but don’t hold your breath!).

He asked API users to use GZIP compression when calling it – support was recently added to PyWiki – as it saves a lot of bandwidth.
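
(Note from me: at the HTTP level, opting in to compression is just one request header – higher-level libraries like requests do this for you automatically; a minimal sketch:)

    import gzip
    import urllib.request

    url = "https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json"
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})

    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        # The server only compresses if we asked for it; check before decoding.
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    print(body[:80])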

In the Q&A time, I asked the Vodafone folks about the API write capability they’re working on, and very roughly when they thought it might be ready. Their estimate: it should be ready around Christmas (i.e. in about 4.5 months). They want it for their mobile phone customers, who want to be able to modify Wikipedia, as well as view it, on their phones.

Demo time: Showed some examples of some API code (e.g. added a simple module, showed how to add new supported formats to the API).

Link to talk’s page.

Wikimania 2007 talk: “Special:Contributions/newbies: User study of newly registered user behaviour”

Talk: “Special:Contributions/newbies: User study of newly registered user behaviour” by Brianna Laugher. User name: pfctdayelise (pronounced “perfect day elise”). Most active on Commons.

Why are we interested in what new users do? New users are really important to community growth. Wikis have to keep growing – they grow or they die.

Interested in the English Wikipedia as it is the trailblazer, and it’s the first point of contact with wikis for many people (so we still want it to give a good impression of wikis in general).

Why do new users sign up? Do they think it’s a social site? Do they want to add articles about their employers?

What were other people’s experiences? How did other people interact with you when you first joined? When did you first feel that you were part of the wiki community?

The attendees of Wikimania are the success stories.

Did a study of 1 day’s worth of new users. Picked a random day, and just observed all users on EN who registered on that day. What happened to those new users?

Disclaimers:

  • Deleted edits are not available through the MediaWiki API (would like this; would also like to be able to get a diff of an edit via the API, which currently is not possible).

The day: February 1st, 2007. 10,641 users signed up. This is a fairly typical number.
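
(Note from me: not how the study was actually done, but as a sketch, the registration log and edit counts for a day like that can be pulled from the API along these lines:)

    import requests

    API = "https://en.wikipedia.org/w/api.php"
    session = requests.Session()

    def new_users(start, end):
        """Yield account names from the new-user log between two timestamps."""
        params = {
            "action": "query", "format": "json",
            "list": "logevents", "letype": "newusers",
            "lestart": start, "leend": end, "ledir": "newer",
            "lelimit": "max",
        }
        while True:
            data = session.get(API, params=params).json()
            for event in data["query"]["logevents"]:
                yield event["title"].removeprefix("User:")
            if "continue" not in data:
                break
            params.update(data["continue"])

    def edit_count(name):
        """Total (non-deleted) edit count for one user."""
        data = session.get(API, params={
            "action": "query", "format": "json",
            "list": "users", "ususers": name, "usprop": "editcount",
        }).json()
        return data["query"]["users"][0].get("editcount", 0)

    users = list(new_users("2007-02-01T00:00:00Z", "2007-02-02T00:00:00Z"))
    zero = sum(1 for name in users if edit_count(name) == 0)
    print(len(users), "signed up;", zero, "of them never edited")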

Showed some of the usernames:

Some were crazy, and you could be sure that they were not coming back.

Not every account represents one person. For example, there were a lot of Stephen Colbert usernames!

Total number of edits that they made:

7,000 out of the 10,000-odd users had zero edits (i.e. roughly 70%). They did nothing with their account.

How many edits until you are part of the community? 50? 100? 1000?

Only 5% of users made more than 10 edits.

Images uploaded by these users:

1329 images uploaded. 40% of these images were deleted.

Some very troubling cases:

  • 58 images uploaded, of which 57 were deleted. This seems soul-destroying for the people involved – it’s a lot of effort to upload images. She suspects the majority of these issues are copyright problems.

Wikipedia = social networking site? To test this, she looked at talk pages: 14% of the sample users had a talk page. She had a look at the edit summaries on those users’ talk pages:

  • 14% of the people got a page deletion notice (e.g. you added something not notable).

  • 36% got a vandalism warning.

  • 19% got an image deletion message.

  • 20% got a welcome message.

  • Some people left talk messages for themselves.

Some people seemed to use Wikipedia as a help lifeline – there were some kind of disturbing messages from people asking for help with bad domestic situations.

Community dynamics:

  • “All the low-hanging fruit has been picked” – Andrew Lih

  • Backlogs

Discussion of user warnings that we show to people:

“Don’t bite the newbies”. We have template warnings. They are pre-recorded messages. They are officious. It’s not people talking to other people. Templates use the royal “we”.

{{uw-test-2}} –> {{uw-test-3}} –> {{uw-test-4}} –> blocking

Not a good system for socialisation, or for introducing people to Wikipedia and what we are about.

Are new users potential Wikipedians, or are they just pests mucking stuff up?

[[Wikipedia: WikiProject user warnings]]. It would be nicer to have [[Wikipedia: WikiProject user socialization]] – instead of scolding people, it would be good to socialise them.

What are the goals of Wikipedia?

New users DO muck stuff up. How do we reward good behaviour and encourage people?

There is no page that identifies users who add new content (not just spelling or formatting changes), especially new users, so that they can be encouraged and rewarded. Should we add this?

Everyone has something valuable to share. The hard thing is finding what that is, and getting them to share it.

There is a wiki ethos.

“Laugher’s Law”: if you are going to act as if X is not allowed (an existing social restriction), you may as well stop letting people do X (introduce a technical restriction).

Recommendations:

  • Change the community attitude.

  • Recognise and reward and promote GOOD behaviour, rather than punishing BAD behaviour.

We treat too much stuff as vandalism. Sometimes people are just confused, or don’t know what to do. Not everything we currently call vandalism is done by vandals. E.g. blanking a page can be done by someone who really knows their stuff (but doesn’t know wikis), and who knows that the page as presented is just wrong – so they blank it.

People being templated to death is a consequence of admins being overloaded – that’s why the templates were created in the first place.

Link to talk’s page.

Talk: Embeddable Wiki Engine; Proof of Concept

Wikimania 2007 talk notes: “Embeddable Wiki Engine; Proof of Concept” – by Ping Yeh, Google Taiwan R&D Centre.

He wants to make an off-line Wikipedia client.

Problem: currently one and only one piece of software is guaranteed to correctly render the database dumps – i.e. MediaWiki itself.

Wants a reusable MediaWiki parser. This would:

  • Ensure the correctness of the parser

  • Reduce the manpower required

Showed a diagram of the typical software architecture of a wiki system.

MediaWiki is very tied to a SQL engine, but for embeddable use you can only really assume a flat file for data storage.

Project EWE is the code name of the project to test some of these ideas: an attempt to make the components and the wiki engine reusable from many programming languages. A preliminary version is ready.

The engine is split into a parser (which transforms wiki mark-up into a document tree) and a formatter.

The document tree uses DocNodes – page -> section -> header / paragraph / list. Each bit is separate, so each bit can be replaced.

MediaWiki parser: uses GNU flex to specify the syntax, based on the help pages. No template support yet. Has a manually crafted parser to turn the tokens into a document tree.

HTML formatter: a trivial conversion from the document tree to HTML tags.
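
(Note from me: a toy illustration of this parser/formatter split – my sketch in Python rather than the project’s C++, with invented node names and only a sliver of the syntax:)

    class DocNode:
        """One node of the document tree: a page, header, or paragraph."""
        def __init__(self, kind, text="", children=None):
            self.kind = kind
            self.text = text
            self.children = children or []

    def parse(wikitext):
        """Tiny parser: '== x ==' lines become headers, the rest paragraphs."""
        page = DocNode("page")
        for line in wikitext.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith("==") and line.endswith("=="):
                page.children.append(DocNode("header", line.strip("= ")))
            else:
                page.children.append(DocNode("paragraph", line))
        return page

    def to_html(node):
        """Trivial formatter; an XML or MediaWiki formatter would walk the same tree."""
        if node.kind == "page":
            return "\n".join(to_html(child) for child in node.children)
        if node.kind == "header":
            return "<h2>%s</h2>" % node.text
        return "<p>%s</p>" % node.text

    print(to_html(parse("== Hello ==\nSome wiki text.")))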

Intending to make this a wiki library that can be called from C++ or PHP.

Things they want to add:

  • Language bindings for Python.

  • MoinMoin compatibility.

  • XML formatter.

  • MediaWiki formatter – need to add templates.

  • A search engine.

Problems:

  • Compatibility with extensions – e.g. the <math> extension changes output for MediaWiki.

  • Wikipedia has too many extensions. Re-implementing each extension would be a lot of work – essentially a duplication of effort.

Link to talk’s page.

Note: I’ll post more notes from more talks in a few days’ time.

Talk: Metavid & MetaVidWiki

Wikimania 2007 talk notes: “Metavid & MetaVidWiki” by Michael Dale

Basic idea: Video + wiki

Metavid:

  • GPLv2 license

  • Free database dump.

Currently seems focussed on US congress speeches.

Can drag and drop videos in a list to make a playlist.

Current Metavid limits:

  • Very focussed on US congress (because funded by a foundation who particularly care about this).

  • No way to collectively improve the meta data.

  • Not browser neutral (prefers Firefox).

  • Not very scalable.

It is currently being rewritten (they got a grant / funding) as MetaVidWiki, using:

  • MediaWiki (so an open wiki architecture)

  • Semantic MediaWiki (so proper machine-readable relationships between things).

Have 3 namespaces for wiki video:

  • Metadata (e.g. who is speaking).

  • Time information (e.g. when the speaker changes or a new chapter starts, etc.).

  • Editing sequences.

HTML5 will include a <video> tag based on Ogg Theora. MetaVidWiki will be compatible with this, and can fall back to plug-in methods for browsers that do not support HTML5.

Saw an interactive demo – it looks pretty cool – there is a video playback window, plus a coloured timeline showing the splices between video segments, plus a bit where people can add and edit wikitext about each segment of the video. You can also drag-and-drop the segments to rearrange the video.

Looking to go live with MetaVidWiki (with a reduced feature set) within a month or two.

Link to talk’s page.

Talk: A content driven reputation system for Wikipedia

Wikimania 2007 talk notes: “A content driven reputation system for Wikipedia.” by Luca De Alfaro.

Trusting content: how can we determine automatically whether an edit was probably good or probably bad?

Authors of long-lived content gain reputation.

Authors who get reverted lose reputation.

Longevity of edits.

Labels each bit of text by the revision in which it was added.

Short-lived text is when <= 20% of the text survives to the next revision.

Have to keep track of live / current text and the deleted text.

Did you bring the article closer to its future state? Did you make the page better? If so, you are rewarded. If you went against the future, you are punished.

Ratio that was kept versus ratio that was reverted.

+1 = everything you changed was kept

-1 = everything you did was reverted.
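
(Note from me: a toy version of this idea – my illustration, not De Alfaro’s actual algorithm, which tracks each piece of text across many revisions:)

    import difflib

    def kept_ratio(revision, next_revision):
        """Rough fraction of a revision's text that survives into the next one."""
        matcher = difflib.SequenceMatcher(None, revision, next_revision)
        kept = sum(block.size for block in matcher.get_matching_blocks())
        return kept / max(len(revision), 1)

    def update_reputation(reputation, ratio):
        """Map the kept ratio [0, 1] onto the [-1, +1] scale above and accumulate."""
        return reputation + (2 * ratio - 1)

    rep = 0.0
    rep = update_reputation(rep, kept_ratio("the quick brown fox", "the quick red fox"))
    print(rep)  # positive - most of the edit survived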

Run it on a single CPU, no cluster :-)

Reputation – data shown for registered users only.

89% of the edits by low reputation users were against the future direction of the page.

Versus only 5% of the edits by high reputation users were against the future direction of the page.

I.e. there is a reasonable correlation between the reputation of the user and whether an edit was with the future direction of the page.

Can catch around 80% of the things that get reverted.

See demo: http://trust.cse.ucsc.edu/

Authors lend 50% of their reputation to the text they create.

He wants an option or special page to show the trustworthiness of a page. Time to process an edit is less than 1 second. The storage required is proportional to the size of the last edit.

Instead of saying you trust the whole article, you can have partial trust in parts of the article. This could provide a way of addressing the concern about whether Wikipedia can be trusted – this way vandalism would likely be flagged as untrusted content.

The speaker would like a live feed of Wikipedia Recent Changes to continue and improve this work. Or perhaps it could run on the toolserver? Erik later emphasised that a good and friendly UI would make it easier to get weight behind this tool.

Link to talk’s page.

Talk: Using Natural Language Processing to determine the quality of Wikipedia articles

Okay, I know that Wikimania is over, but I still have a backlog of talk notes that I simply didn’t have time to post whilst at the conference, PLUS WikiBlogPlanet stalled for 3 days whilst I was at the conference. Yes, I know, I suck. I’ll try to make up for that by posting my talk notes over the next few days. So I know they’re not super-fresh, but they are only a few days old. Oh, and they’re in note form, rather than complete sentences.

Talk notes: “Using Natural Language Processing to determine the quality of Wikipedia articles” by Brian Mingus. Speaker’s background is in studying brains.

Quality should not be imposed. It should come from the community.

Brains transform information like this: letters -> words -> sentences -> paragraphs -> sections -> discourse -> argument structure.

Wikipedia has many more readers than editors.

Learning is critical. See [[Maximum Entropy]]. Similar to neural networks: neurons that fire together wire together. The brain is an insanely complicated machine that finds meta-associations. More examples of a phenomenon lead to more smartness. Things associated with relevant features get strengthened; things that are irrelevant get weakened. Each brain is unique, so everyone’s idea of quality is different. Each individual brain is invaluable, based on a lifetime of experience.

Some qualitative measures of article quality:

  • Clarity

  • Flow

  • Precision

  • Unity (outline)

  • Authority (footnotes)

  • … and many more things besides.

Quantitative versus qualitative.

Picking a good apple for beginners:

  1. Apple should be round

  2. Apple should have even colouring

  3. Apple should be firm

It takes about 10 years to become an expert in something. E.g. someone who has farmed apples for 10 years would not go through a check-list – they “just know” if it’s a good apple.

Some example quantitative indicators of article quality:

  • Number of editors

  • Images

  • Length

  • Number of internal and external links

  • Featured article status.

He has a function for testing each of these and many more – each function gets passed the wikitext, the HTML text, and the plain text (as 3 arguments).
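
(Note from me: I imagine the functions have roughly this shape – my sketch, not his code; the three features here are invented examples:)

    import re

    # Each feature function takes (wikitext, htmltext, plaintext) and returns
    # one number; the model then learns a weight for each feature.

    def n_internal_links(wikitext, htmltext, plaintext):
        return len(re.findall(r"\[\[", wikitext))

    def n_images(wikitext, htmltext, plaintext):
        return len(re.findall(r"\[\[(?:Image|File):", wikitext))

    def text_length(wikitext, htmltext, plaintext):
        return len(plaintext)

    FEATURES = [n_internal_links, n_images, text_length]

    def feature_vector(wikitext, htmltext, plaintext):
        return [f(wikitext, htmltext, plaintext) for f in FEATURES]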

On the English Wikipedia, the “editorial team” hand-rates the quality of articles. They also rate the importance of each article. E.g. there are around 1,428 Featured Articles on the English Wikipedia.

He wanted some software that would try to predict the rating of each article.

Assigns weights to each feature.

The model learned that Wikipedia is mostly low-quality material, and so in the first iteration it learned to classify everything as bad.

To prevent this, he had to restrict learning to a subset of articles, because the number of stubs vastly outweighs the number of complete articles on Wikipedia.

Used 3 categories / ratings for quality: 1) Good, 2) Okay, 3) Bad.

Can test how well the model works using a “confusion matrix”, which shows how confused the model is by showing how much stuff it gets wrong (e.g. classifying a Featured Article as “bad”, or a stub article as “good”).
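
(Note from me: for instance, with the three ratings above – my illustration, with made-up data:)

    from sklearn.metrics import confusion_matrix

    labels = ["good", "okay", "bad"]
    actual    = ["good", "bad", "okay", "bad", "good"]
    predicted = ["good", "bad", "bad",  "bad", "okay"]

    # Rows are true ratings, columns predicted ones; anything off the
    # diagonal is a confusion (e.g. an "okay" article classified as "bad").
    print(confusion_matrix(actual, predicted, labels=labels))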

Result: The model was learning.

PLoS One is an alternate rating system. They use 3 categories:

  • Insight

  • Style

  • ?? (sorry, I missed this item)

Need to avoid group-think – raters see ratings from others only after giving their own rating. You can’t allow a rater to know everyone else’s rating before they rate, or else you get weird effects (like groupthink) – this is avoided through independence.

Performance: Takes 24 hours to run on 120,000 articles, on a cluster with 26 nodes.

Link to talk’s page.

Wikimania Talk notes: “Where have all the editors gone?”

I’ll copy and paste my notes from some of the talks at Wikimania 2007 here, in case it’s helpful and so that everyone can follow what’s going on. As such they are in point / summary form, rather than well-formed prose:

Talk: “Where have all the editors gone?” by Seth Anthony. Background in chemistry education. A user since 2003, he has seen people come and go. This raised some questions for him:

Who adds real content to the Wikipedia? Not just correcting typos and wikification.

Not all edits are created equal. Some are negative (e.g. vandalism), some are positive (e.g. tweaks, spelling, formatting, wikifying), and some are really positive (adding content). Some have possible value (e.g. admin / bureaucracy / discussion).

Made a study using a sample of edits to the Wikipedia. (Size of sample not clear.)

Facts & figures on findings:

  • 28% of edits: outside article namespace.
  • 10% article talk pages.
  • 62% article namespace.

(i.e. about a third of the edits are not about the articles)

Breakdown of edits:

  • 5% vandalism
  • 45% of edits are tweaking / minor changes / adding categories.
  • 12% content creation. Of that, 10% is adding content to already existing articles, and 2% is creating new articles.
  • Rest? (Probably discussion?)

So only 12% of edits create fresh content.

He was most interested in these 12%, so he broke them down:

  • 0% were made by admins
  • 69% were registered users.
  • 31% were made by anon (non-logged-in) users.

… and only 52% were by people who had a user page. I.e. only half of the people had a name-based online identity.

Editors are not homogeneous.

Content creators versus admins (for the English Wikipedia, in 2007):

                        Admins    Content creators
  Number of edits       12,900    1,700      (admins edit more)
  % main namespace      51%       81%        (admins spend a smaller share of their time on content)
  Avg edits per page    2.27      2.2        (about the same)
  Edits per day         16        5          (admins are more active)

Breakdown of each group:

Breakdown of content creators:

  • Anons – 24% of high-content edits are by anon users. Drive-by editors. Who are they? One-time editors. Are they normal editors who are accidentally logged out?
  • 28% are “dabblers”: fewer than 150 edits, editing for more than 1 month, not very likely to have a user page.
  • 48% are “wikipedians”: more than 1 edit per day, and almost all have edited within the last week. They create articles. They tend to have a focus area, e.g. “Swedish royalty” – great work in a specific area. They are subject-area editors. Generally they have > 500 edits.

Admins breakdown – there are 2 groups, “A” and “B”:

  • Admins A: 70% of admins make 20%-60% of their edits in the article space.
  • Admins B: 30% of admins make 60%-90% of their edits in the article space – only about a third of admins are in this class.

Were group “B” admins once “anon wikipedians”? The short answer seems to be “yes”.

  • Admins A: Early edits were less on articles.
  • Admins B: Early edits were more on articles.

So 4 distinct groups of editors:

  • Anons / drive-by editors
  • Occasional dabblers
  • Subject area editors – Anons + Admins B
  • Professional bureaucrats – Admins A.

A possible early indicator: Type A admins create their user page sooner than Admins B :-)