Recovering from a corrupted MySQL install due to a dying hard disk

Might as well write these steps down in case I ever need them again:

Background: A test box started making a faint high-pitched squealing sound, and then powering off. Happened semi-randomly, but most commonly during periods of hard disk access (such as boot-up). Fixed the hardware problem by replacing the Power Supply Unit ($25 from MSY).


Then the above PSU semi-random-poweroff problem in turn caused the box’s 5-year-old hard disk to start playing up (age, plus having the power repeatedly die mid-write probably doesn’t help any).

Steps for moving to a new hard disk:

  • Buy a new hard disk of equal or larger size.
  • Install it in a USB external single-disk enclosure that supports both SATA and IDE hard disks ($23 at MSY).
  • Download, burn, and boot from System Rescue CD (v1.5.6 is the current latest). Take the default boot menu option, and the default keymap.
  • After it boots, turn on and plug in the USB drive.
  • At the shell, see which disk has which device name: fdisk -l
  • Recover from the old to the new disk: ddrescue -b 2M /dev/hda /dev/sda ./ddres.txt
  • The above copied about 17 MB per second, and claimed zero disk errors were found.
  • Power off, swap the old disk out of the machine, and put the new disk in.
  • Boot from the new disk. I was dropped into a shell during boot due to file system errors carried over from the old disk. Run fsck on the affected partition: fsck /dev/hda8 -y
  • Reviewing the SMART warnings in the syslog from the old disk seemed to indicate that it probably was dying, confirming that swapping the disks was the correct course of action (querying SMART directly, as sketched below, gives the same answer sooner).
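
For reference, the old disk’s SMART data can also be queried directly rather than waiting for warnings to show up in the syslog. A minimal sketch, assuming the smartmontools package (which provides smartctl) is available, and adjusting the device name to match what fdisk -l reported:

# quick overall health verdict for the old disk
smartctl -H /dev/hda
# full attribute dump, including reallocated/pending sector counts and the drive's error log
smartctl -a /dev/hda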

Then I found that the previous HDD corruption had in turn corrupted a MySQL database (aren’t cascading failures great?). This manifested itself as at least 10 different MySQL errors/warnings/problems:

Error in /var/log/syslog when starting mysqld: “Failed to open log (file ‘/var/log/mysql/mysql-bin.000348’, errno 2)”. The cheat was to delete the last line (the one referencing the /var/log/mysql/mysql-bin.000348 file) from the /var/log/mysql/mysql-bin.index file. Note that in this case I knew the last log file contained no updates that mattered, so it really was no loss.
vim /var/log/mysql/mysql-bin.index
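
(The same edit can also be done non-interactively. A sketch, assuming GNU sed, and taking a backup of the index file first:)

cp /var/log/mysql/mysql-bin.index /var/log/mysql/mysql-bin.index.bak
# delete the last line of the binlog index
sed -i '$d' /var/log/mysql/mysql-bin.index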

MySQL server would no longer start, instead giving a “Fatal error: Can’t open and lock privilege tables: Can’t find file: ‘host’ (errno: 2)” message in the logs. For me, the problem was that the /var/lib/mysql/mysql/host.MYI file was missing. What fixed it: Repair the host table.
mysqld_safe --skip-grant-tables &
mysql
mysql> use mysql
mysql> REPAIR TABLE host USE_FRM;
mysql> exit

Try to reset the privilege tables with useful starting data, to fix this error when starting the mysql client: error: ‘Access denied for user ‘debian-sys-maint’@’localhost’ (using password: YES)’
mysql_fix_privilege_tables
mysqladmin shutdown

To fix these errors in /var/log/syslog :
[ERROR] /usr/sbin/mysqld:Fatal error: Can’t open and lock privilege tables: Table ‘./mysql/db’ is marked as crashed and should be repaired
[ERROR] /usr/sbin/mysqld: Table ‘./mysql/db’ is marked as crashed and should be repaired

cd /var/lib/mysql/mysql
myisamchk db
myisamchk *.MYI

To fix these warnings in the syslog:
myisamchk: warning: Table is marked as crashed
MyISAM-table ‘db.MYI’ is usable but should be fixed

myisamchk -r db
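
Putting those together, the general check-then-repair pattern looks roughly like this (a sketch only; mysqld should be stopped, or the tables flushed and locked, before pointing myisamchk at its files):

/etc/init.d/mysql stop
cd /var/lib/mysql/mysql
# report which MyISAM tables are damaged
myisamchk --check *.MYI
# repair an individual damaged table
myisamchk --recover db.MYI
/etc/init.d/mysql start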

To fix this error in /var/log/syslog :
[ERROR] /usr/sbin/mysqld: Incorrect information in file: ‘./mysql/tables_priv.frm’
This did not work: repair table tables_priv USE_FRM;
Cheated: just copied /var/lib/mysql/mysql/tables_priv.* from another working machine.
chmod -x,o-r,g+w tables_priv.*
chown mysql.mysql tables_priv.*
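
(The copy itself is just a straight file transfer of the table’s .frm/.MYD/.MYI files. Something like the following would do it, where “workingbox” is a placeholder for a machine running the same MySQL version with an intact mysql database; ideally mysqld is stopped on both ends while the files are copied:)

# "workingbox" is a placeholder hostname, not a real machine
scp workingbox:/var/lib/mysql/mysql/tables_priv.* /var/lib/mysql/mysql/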

Recurrence of this error:
ERROR 1045 (28000): Access denied for user ‘debian-sys-maint’@’localhost’ (using password: YES)
And a new one:
Access denied for user ‘root’@’localhost’ (using password: NO)
… and at this point a “desc user;” showed that the user table’s file must have ended up with a doubly-claimed inode shared with another table during the fsck, as it had a completely different schema from the user table found on another machine.
Cheated again: just copied /var/lib/mysql/mysql/user.* over from another working machine.
chown mysql.mysql user.*
/etc/init.d/mysql start

Then to fix: ERROR 1045 (28000): Access denied for user ‘debian-sys-maint’@’localhost’ (using password: YES)
cat /etc/mysql/debian.cnf
Copy the “password” field’s value for the “debian-sys-maint” user to the clipboard.
mysql
mysql> use mysql
mysql> GRANT ALL PRIVILEGES ON *.* TO 'debian-sys-maint'@'localhost' IDENTIFIED BY 'insert_password_copied_above_from_clipboard' WITH GRANT OPTION;
mysql> exit

To fix this warning in /var/log/syslog: WARNING: mysqlcheck has found corrupt tables
Force a check of all tables:
mysqlcheck -A
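
mysqlcheck can also attempt repairs in the same pass. A sketch, using the debian-sys-maint credentials file mentioned above (since mysqlcheck needs a user with sufficient privileges):

# check all databases and auto-repair any corrupt MyISAM tables it finds
mysqlcheck --defaults-file=/etc/mysql/debian.cnf --all-databases --auto-repair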

Fix for this error when granting privileges: ERROR 126 (HY000): Incorrect key file for table ‘./mysql/db.MYI’; try to repair it
mysql
mysql> GRANT ALL PRIVILEGES ON dbname.* to 'dbuser'@'localhost' IDENTIFIED BY "fakefake";
ERROR 126 (HY000): Incorrect key file for table './mysql/db.MYI'; try to repair it
mysql> use mysql
mysql> REPAIR TABLE db USE_FRM;
+----------+--------+----------+----------------------------------------------------+
| Table    | Op     | Msg_type | Msg_text                                           |
+----------+--------+----------+----------------------------------------------------+
| mysql.db | repair | info     | Wrong bytesec: 255- 37- 32 at 0; Skipped           |
| mysql.db | repair | info     | Found block that points outside data file at 424   |
....
| mysql.db | repair | info     | Found block that points outside data file at 24960 |
| mysql.db | repair | status   | OK                                                 |
+----------+--------+----------+----------------------------------------------------+
151 rows in set (0.01 sec)
mysql> GRANT ALL PRIVILEGES ON dbname.* to 'dbuser'@'localhost' IDENTIFIED BY "fakefake";
Query OK, 0 rows affected (0.00 sec)
mysql> exit
mysqladmin shutdown
/etc/init.d/mysql start

To fix these errors in the syslog on mysqld startup:
/etc/mysql/debian-start[4592]: ERROR 1017 (HY000) at line 116: Can’t find file: ‘columns_priv’ (errno: 2)
/etc/mysql/debian-start[4592]: ERROR 1017 (HY000) at line 516: Can’t find file: ‘proc’ (errno: 2)

mysql
mysql> use mysql
mysql> REPAIR TABLE proc USE_FRM;
mysql> REPAIR TABLE columns_priv USE_FRM;
mysql> exit

… and after all the above, the box powers on, and stays on, the disk errors are gone, the mysqld service starts cleanly, and from a quick cursory glance, the data still looks okay.

Migrating email from Outlook to Evolution: Linux’s final frontier

At various times in Linux’s history, various things about Linux have really sucked:

  • Getting hardware to work used to really suck, and you used to have to patch the kernel and recompile your own kernel… and then the kernel got a lot better, and the hardware support got a lot better, and I haven’t had to recompile a kernel in years, and I’m happier because it largely “just works”.
  • Setting up printers used to really suck, with stuffing around with printcap files, and printer configurations, and desperately trying to get it to work … and then cups and printer detection improved, and now it’s generally all painless and point-and-click to install a printer, and I’m happier because it largely “just works”.

  • Setting up X-windows used to really suck, with editing X config files and mucking around with modelines… and then monitor detection got better, and I haven’t had to do anything with an X config file in years, and I’m happier because it largely “just works”.

  • Getting your Microsoft Office documents migrated from Windows to Linux used to really suck … and then Open Office came along, and it does a fairly good job of importing Office’s documents, and I’m happier because it largely “just works”.

This week I came to realize that there’s one last frontier remaining, where Linux still really sucks. And that frontier is migrating from a Windows graphical email client (Outlook in this case) to a Linux graphical email client (Evolution in this case). It does not “just work” … not at all. This blog entry will now explain why I say this.

Earlier this week, my main hard disk on my main desktop machine (a Windows machine) died, with a horrible repetitive clicking and grinding sound. By sheer random good luck, I had backed up all my data onto an external hard disk about one hour before this happened, plus I had a brand new machine ready to go on which I was considering trying Linux anyway. It looked like the stars were in alignment: Linux on the desktop of my main machine, here we come!

Installing and configuring Ubuntu was totally painless. (If you care, the exact steps followed are here: http://nickj.org/Ubuntu_8.04.1_desktop_setup_steps ). And there are a lot of things that I’m really liking with this new operating system, including the following:

  • Complete hardware support: My old Windows install could not detect & use all the cores of my CPU (Windows 2000 Pro does not detect / use a Quad core CPU, whereas Ubuntu 8.04.1 does). In Windows, I needed special drivers for my sound card and my mouse and my keyboard and video card, all of which had to be manually installed or downloaded. In Ubuntu 8.04.1, all my hardware “just works”, or it offers to install restricted drivers (for the video card), and then it “just works”. Nice.
  • The ease of installing and uninstalling software (synaptic + aptitude). Having a well-integrated package manager for installing and uninstalling everything is most pleasant.
  • The way when I play a video, and don’t have the right codec installed, it will offer to download and install the needed codecs for me, and once it’s finished (which typically takes all of 20 seconds), it will play the video. That’s really nice, and sure beats having to manually work out the right video codec to install. Nice one.

But the migration of data from Outlook 2000 to Evolution was shockingly, appallingly bad. There was a lot of misinformation on the web about approaches that should work, but which actually had data loss, and in the end, it took me four full frustrating days to get most of my personal data moved across. The whole migration was so painful that it has left me quite annoyed. When I hear people bandy around phrases like “the year of the Linux desktop”, it simply makes me think: Dream on! This is not the year of the Linux desktop. This is not even the fucking decade of the Linux desktop. Come back in 2011, at the earliest. And when I hear Evolution described as “an Outlook killer”, I can only laugh. To be an “XYZ killer”, you have to do everything that XYZ did, but better, AND you have to be able to import XYZ’s data. Microsoft Word was a WordPerfect killer, because it did what WordPerfect did, but it did it better (in a WYSIWYG way), and it imported WordPerfect’s data. Same for Excel versus Lotus 1-2-3. Same for Firefox versus Internet Explorer. But Evolution does NOT import data from Outlook (for any meaningful definition of the word “import”), so BY DEFINITION it simply cannot be an Outlook killer. And that’s before even getting to the fact that Evolution is not as feature-rich, nor as mature, nor as user-friendly, nor as bug-free as Outlook.

Understand that I’m no Microsoft apologist – I will happily use FOSS, if it’s as good or better.

However, if you are a masochist, if you enjoy pushing hot needles under your fingernails, then it is possible to make the transition. I evidently fall into this category, because I was too stupid or too stubborn to just give up. So here’s how you too can do the same, but I’m warning you straight up, it is not pretty, and it is not easy, and it is not quick.

And for anyone who says it is easy, allow me to enumerate some of the relevant facts:

  • I have/had 4 PST files, not one, like most people (i.e. the data is spread across 4 files, because Outlook barfs when a PST’s size approaches 2 GB, so I needed to spread it out across multiple files to prevent this).
  • I have/had around 10 years worth of data. All of that data, every single bit of it, needs to come with me. This point is non-negotiable.
  • I use/used all the features of Outlook, apart from Journaling. That’s Email, Calendar, Contacts, Notes, and Tasks. Five categories of data, every single one of which I need. More details about each:
  1. Email. A total of between half a million and 600,000 emails, spread across 692 nested email folders. These folders are categorized in a hierarchy to keep like mail grouped together. That’s 10 years of work/personal/hobby emails, sent and received and drafts, and this includes a collection of 40,000 spam emails received, kept to help with Bayesian training of spam versus ham.
  2. Contacts (roughly 550 contacts, some of which are just an email address, and some of which have complete details, spread across 9 nested folders).
  3. Notes (being used to store check-lists, passwords, etc., with 400 notes spread across 4 nested folders).
  4. To-do lists (being used to store information, check-lists, and list of bugs or wish-list items in various bits of software that I maintain, with 410 tasks spread across 26 nested folders).
  5. Calendar items (10 years of past events, and additional future events, recording both what happened, and predicted dates and deadlines for things that are going to happen).

So that’s the background. Here’s what does not work for migrating this data from Outlook to Evolution:

  1. Does not work: Export from Outlook, or getting Outlook to export its own data into some industry-standard file format. Ha! Have you ever looked at the File -> Import and Export section of Outlook? Remember, Microsoft are arrogant monopolist pricks, who have no vested interest in helping you move to anything else. So, we get just two export options, both of which are basically useless. Next option please.
  2. Does not work: Getting Evolution to read the PST files and import the data directly. It just doesn’t. Various feature requests for this have been open for the past 6 or 7 years, without any visible sign of progress. Move along, nothing to see here.
  3. Does not work: Readpst, which is an Ubuntu package, and which is apparently derived from libpst: On the very first PST I tried this with, it gave a series of warnings about NULL pointers, gave a series of messages indicating that it wasn’t going to transfer everything anyway, and then proceeded to segfault. Clearly that’s not going to work.
  4. Does not work: Moving data from Outlook to Outlook Express, and then moving from Outlook Express into something else. Outlook Express only imports data from the main PST (ignoring the other files), and it loses data (converts all Contacts to mail items, converts all Tasks to mail items, losing many or most emails in the process). That’s right folks, even two Microsoft teams, who presumably work in the same building, can’t get their own email products to import data correctly from each other. Forget this.
  5. Does not work: Import into Thunderbird, which uses MBOX format, and then move the MBOX files onto the Linux box, and point Evolution at those. To start this, you install Thunderbird, and when you run it for the first time, choose “import from Outlook”, which will import the address book and your mail. I had high hopes for this option – it was very easy to use, and it seemed to work great … at first. However, it has a major problem: severe data-loss. Here’s an example: I have a folder that contains every bit of spam email I was ever sent. It’s useful for training spam detectors, and, as I found out, it’s also incredibly useful for testing migration tools for data integrity (because spammers send all kinds of weird formats, weird attachments, they ignore standards with impunity, etc.). In short, spam makes the perfect test case. This spam folder has 40,877 pieces of spam mail. How many bits of email do you think Thunderbird imported? The answer is 494 mail messages. That is a 99% data loss rate. Amazing. And there was not a single warning, not a single error – just completely silent 99% data loss. Now, I don’t care about losing my spam, but I do care very much about losing real data, and I had zero faith in Thunderbird at this point to migrate my data without data loss. So, ditch Thunderbird.
  6. Does not work: Migrate from Outlook PST to IMAP: connect Outlook to an IMAP server on your LAN, copy everything there, connect Evolution to the same server, and copy or move everything from the IMAP server into Evolution. In theory, this should work great, and with a few test folders, it does. But the issue here is one of scalability: it seems to work fine on the simple stuff, but falls apart on the bigger stuff. Moving a single email would work fine. A single folder would usually work fine. But moving a hierarchy of folders with half a gigabyte of email would cause Outlook to start copying data, and then after about 10 minutes it would usually just get stuck, and then about 20 minutes later it would give a dialog box saying that the copy operation had failed. As a result, this option is unusable if you have substantial data, due (I suspect) to an Outlook IMAP bug. Other versions may work fine, but Outlook 2000 was buggy in this regard – for me, it kept hanging and could not completely transfer all of my data – and therefore it was, unfortunately, unsuitable for migrating data.

So what does this leave? At this stage, I thought I was out of options, and was tempted to just give up on Linux, and stick with Windows. Outport would move some of my data, but it would not move email, which is the largest and most complicated single component that I needed to move. Eventually I found the answer: O2M (which is a US$10 commercial product) for moving email + calendar + contact data, and Outport for tasks + notes, and 2 custom scripts I had to write to massage the O2M and Outport data into the correct format. Disclaimer: I don’t have any financial interest in O2M, I don’t know the people involved, and so forth – it is simply the best tool that I could find for the job, and O2M does have problems too, but its problems are far less severe than the problems with the other methods.

Here is the link for the step-by-step details of how to migrate from Outlook 2000 to Evolution.

Decent browsers on mobile phones: Are we there yet?!

Brion Vibber summarises from OSCON on the future of browsers on mobile phones.

Some quick thoughts – capable smartphones are expensive (e.g. $350 to $1000), and the basic phones below that price point tend to be pretty limited and have small screens (but they’re cheap and fairly tough, so as an actual phone they work fine, but as an internet-enabled communication device, they suck).

The good news is that people turn over their phones relatively quickly (e.g. in Aus approx 11 million phones have been sold per year for the last few years to a population base of 21 million, so the average active phone lifespan is presumably around 1.9 years). So even if everyone bought only capable smartphones from this point onwards, it would take most of 2 years to get to sufficient market saturation that a phone with a capable browser could be assumed. But the fact is that people won’t all start buying smartphones (without a truly compelling reason to), and people who have smartphones won’t all sign up to mobile internet packages (it’s better in the US I think, but in Aus you usually have to pay extra for this, and you typically get an allowance of anything from 100 MB to several gigabytes per month of bandwidth, and if you go over that you get slapped hard with extra usage charges – I’ve heard up to $1 per megabyte, but that’s so scary I hope it’s not true). So yeah, it puts people off. Realistically, it’s probably 4 or more likely 5 years before this mess is sorted out and most people have a decent enough phone with a reasonable browser with mobile internet.

And for things like GPS, I say “BOO!” to only native apps being able to access that. GPS badly needs a standardised JavaScript interface that can do stuff like say “do you have GPS?” and get a boolean answer, “do you have a signal?” and get a boolean answer, and then ask “what is the long + lat?” and get back an array of two decimal numbers. When this is native and works and runs without throwing errors in all browsers (both on phones and on desktops), then it’s going to be fricken awesome (e.g. walk around and have your phone display the Wikipedia article for the nearest landmark, walk around the city/go skiing and see where your mates are on a map on your phone and so be able to easily meet up with them for lunch/coffee, go to a new city and get a tour on your phone that knows where you are and tells you the most interesting tourist highlights that are closest to your location and that you haven’t visited yet, and so on and so forth). When it happens it’s going to be heaven-on-a-stick, but getting there feels like it could be painful and slow.

NRMA feedback fail

It’s that time of year to renew my car registration, and buy the accompanying compulsory third-party insurance. So I tried the NRMA, and price-wise their quote was okay, but I wanted it mailed to me in the post so that I can pay it closer to when I actually need it. Trying to tell the NRMA this proved to be impossible:

… and then clicking the submit button gives this:

Feedback rejected

… forbidding all English punctuation – that’s a really nice touch! So I removed all commas, full stops, apostrophes, and question marks, leaving one continuous string of text, and clicked submit again. The result is this:

Wow, that’s impressively crap. Being that bad at listening to people’s feedback doesn’t just happen, it takes serious dedication and practice and commitment.

Response to “Where did all the PHP programmers go?”

Ok, I’ll bite in response to this “Where did all the PHP programmers go?” blog post:

What I cannot understand is why people with more than one Bachelor Degree in Computer Science recommend using bubble sort.

Sounds wrong but harmless, as you don’t write a sort implementation from scratch in PHP. You write the comparators used for the sort order, but the actual sort implementation is provided for you by the language. I presume it uses qsort internally, but don’t know for sure. I have a degree in CS, and I can scarcely even recall the bubble sort algorithm (or even most of the sort algorithms for that matter), for the simple reason that it doesn’t matter in the real world (in 99% of cases) for web developers using scripting languages. That may sound (gasp) shocking, but it’s true – PHP is not a performance-orientated language, and it’s a fairly high-level language with a decent library of native functions, so you don’t generally write sort algorithms (rather you use the library ones that are provided for you, unless you have an overwhelmingly good reason not to).

The question you need to ask is: are you running a Computer Science class on sorting algorithms, or are you looking for people who know PHP and can get your thing built?

“What is the difference between the stack (also known as FILO) and the queue (also known as pipe, also known as FIFO)?”

Maybe rephrase the question to “you want to store multiple bits of information in a data structure or an array or a collection of some sort. How would you add data to the beginning of that data structure, and how would you remove data from the end?”

I.e. focus less on the Computer Science theory, and more on the application of it.

“Using PHP programming language, create a list to store information about people. For each person you’ll need to store name, age, and gender. Populate the list with three sample records. Then, print out an alphabetically sorted list of names of all males in that list. Bonus points for not using the database.”

Here’s a trivial implementation just using arrays. I don’t claim it’s remotely pretty or elegant, and I wrote it just to see what’s involved in the above task:

<?php
error_reporting( E_STRICT | E_ALL );

function sort_by_name( $a, $b ) {
        // usort() expects a negative, zero, or positive integer from the comparator,
        // so use strcmp() rather than a boolean ">" comparison.
        return strcmp( $a['name'], $b['name'] );
}

function printMales( $array ) {
        foreach( $array as $person ) {
                if( $person['gender'] != 'male') continue;
                print "Name: " . $person['name'] . "\n";
        }
}

$people = array( array( 'name' => 'Bob'      , 'age' => 36, 'gender' => 'male'   ),
                 array( 'name' => 'Alice'    , 'age' => 23, 'gender' => 'female' ),
                 array( 'name' => 'Doug'     , 'age' => 63, 'gender' => 'male'   ),
                );

print "Before:\n";
print_r( $people );
usort( $people, 'sort_by_name' );
print "After:\n";
print_r( $people );
print "\n";
printMales( $people );

?>

But you know what? I had to look up the PHP manual for usort because I couldn’t recall off the top of my head whether it was “u_sort” or “usort”, and I couldn’t recall the parameters and their order. Also I had 3 trivial syntax errors that I fixed in 15 seconds. Now, I really hope for this pen-and-paper test that you are giving people access to the PHP manual, or, if you are not, that you are being very tolerant of minor syntactical errors or people who can’t recall whether the function name has an underscore, or who can’t recall the exact order of the parameters, and so forth. Because the question is: Is this a test of whether someone has memorized the entire PHP manual, or is this a test of whether people can do what you want? Because when they are working, you will give them access to the PHP manual – right?! If you want to distress people in the interview, then sure, treat it as a rote memory test of the PHP manual and Computer Science theory, and make it awkward if they get anything wrong – but if you want to solve the problem of finding people, then there has to be some leeway for recollection of technical trivia that you can find through Google in a few seconds.

Look, I’ve been in a similar situation of looking for PHP people to hire (the candidates were from China in this case), and the approach we used was to give them a test beforehand that they could do (in 24 hours of their own time), and then, if they looked okay, they could get called in for an interview. This allowed culling people who were very bad, or who gave code that didn’t run – as there really is very little excuse for code that’s invalid or that doesn’t work if you’ve got 24 hours and access to the internet and your own computer. Most of the people weren’t great, some were very bad, and some were okay. If it helps, that PHP test is here, and it’s only intended to be a very simple test.

OpenSSL Debian seeding problem

OpenSSL Debian seeding problem – what a mess – installing the update itself is trivial, but the sysadmin time is in having to chase down and remove and regenerate weak keys generated by multiple packages, which in turn can have propagated to multiple machines, generated any time over the last year and a bit on a Debian or Ubuntu system. Ouch. Most helpful resource for doing that is the SSLKeys page on the Debian wiki.
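
For the SSH side of the cleanup, most of the detection can be scripted. A rough sketch from memory (verify the exact steps against the SSLKeys page; ssh-vulnkey shipped with the updated openssh packages):

# scan the host keys and all users' authorized_keys for known-weak keys
ssh-vulnkey -a
# regenerate any flagged OpenSSH host keys
rm /etc/ssh/ssh_host_*_key /etc/ssh/ssh_host_*_key.pub
dpkg-reconfigure openssh-server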

Sydney BarCamp 3, day 1 notes

My quick notes from the first day of Sydney BarCamp 3 – apologies if they are quite terse:

  • Making computing cool – Let’s make everything objects, and hide file systems and devices from applications, with an object layer. Benefits in reducing all the glue everywhere when communicating data over the wire or between apps; Could also allow apps to be migrated from one machine to another; Could even have a login of standard apps that follows you everywhere via the cloud, including retained state from your last login, but without using something like Citrix.
  • Processing and the demo scene. Gave a background to the demos and the demoscene. Introduced processing, which is a Java-based tool, built by 2 guys who have been working on it for about 4 years. Artists are one of the target audiences. More info at http://processing.org.
  • Sydney free wireless project. Currently trying to work out what standard hardware to use for the city-wide mesh, now that there are concerns over Meraki becoming much less open and losing their way (who have introduced a more restrictive EULA and have made flashing the hardware much harder). Open mesh dashboard is an open fork from Meraki, but still need to sort out a reasonable cost for the hardware including shipping to Aus. Also want the mesh to interoperate with other meshes – e.g. want to be able to automatically connect this mesh and an OLPC mesh, if at all possible.
  • Spoke to someone using 3 mobile networking on their laptop – uses a PC card with HSDPA. Recommended it, $15 per month for 1 GB, or $49 for 5 GB, and the modem is ~$298, or free if you sign a 24 month contract. There is currently a price war going on between Vodafone, 3, etc. over mobile broadband, prices are improving.
  • Quotes: “The problem with Domain Specific Languages (DSLs) is that they are Domain Specific”. “The tipping point for data portability is the user expectation of having data-portability between web apps.”
  • Some lessons from a start-up biz:
  1. Advertising is useful. Measure it carefully.
  2. Tech roadmap is about PR – tells customers “what’s coming next” – you need one – not binding – “announce before you announce”.
  3. Take a punt on marketing. Hard work getting the word out about your product. You have 9 lives when marketing – one failure won’t kill you.
  4. Make mistakes properly. Failing is okay, but do it properly. Fail in spectacular fashion.
  5. Everything takes longer than you think. It’s true.
  6. Be unconventional.
  7. Q: What mistake cost the most time? A: Messing around with landing pages. Company wisdom is that you should make a lot of them and test to see what is most effective. Need a lot of volume to perform useful tests. A case of premature optimisation.
  8. Q: Do we need to talk a lot of lawyers and accountants at start-up? A: No, not when in the initial stages. However when you have worked out what your idea is, and have money coming in, then need to talk to both. But be aware of the risks.
  • grails – previously called “groovy on rails”. Person now working on getting http://memsavvy.com/ off the ground. Grails is based on Java. (Java, spring, hibernate and Apache app.) Grails currently has 63 plugins (one for adding search, one for web objects, etc.). Grails solves a technical problem. An out-of-the-box MVC system. Sky.com, using grails, serving 186m pages/month.
  • A business owner is 3 people: 1) Entrepreneur 2) Manager who keeps the biz afloat 3) Technician who built the product
  • “Start-up kitchen” is a start-up incubator. It provides a practical solution to continuous cash flows. Has an office in St Leonards. For start-up cash flows, you are hired in a part-time way (2 or 3 days a week) (work depends on the skill set that someone has; may be internal work; or external IT shop work for blue-chip clients), which gives you cash flow.
  • “Talking to rich guys”. (about what angel or VC people are looking for in a company). Investors want a biz capable of $100m in 4 to 5 years. In the valley there are lots of VCs. In Australia, not so much – want to do late stage buyouts and make money charging fees to a company. There is plenty of money available; there are just not enough REAL businesses that can make good use of that money. As a rule, investors don’t like software, or web apps. To get in front of a dozen to 50 rich people, need to have a good story (need a business, a real business). Most Australian angel investors are retired or semi-retired engineers who love gadgets. For the first 100,000 units want to manufacture locally. “IM” is an information memorandum – like a prospectus, but a lower standard (because it is not covered by regulations). Example: A company is looking for $1m. Angels want 35% ownership of the company, but will rarely get it. (Investment range of 200k to 500k is angels, and $1m + is small institutions). Watch out for fees – e.g. one guy wanted 250k in fees to raise 500k. Brains are the cheapest thing you can buy. E.g. “women on boards” who want a paid position on boards – e.g. 35k per annum, and for this they would have to go to 8 meetings per year, and are personally liable for the business if anything goes wrong. Women are much cheaper than blokes (there are institutionalised problems for women in business trying to get equal pay). Anything above this, pay cash-in-hand $100 per hour. To get money have to be able to give a good answer to “WIT FM?” for the investor – “What’s In It For Me?”
  • Good places to get stock photos for $1 or $2 a pop: istockphoto.com or luckyoliver.com
  • BarCamp Canberra is on in 2 weeks. (sat 19th April).
  • Sociability design. This is like usability design for applications – which is making the app as usable for your user as possible, so that it is pleasant and intuitive to use. Sociability design is making a socially useful system, such as social sites like LinkedIn, Facebook, and MySpace. There are parallels between usability – especially Jakob Nielsen’s 10 main types of usability – and the basics of how you make a pleasing social user experience. Table of comparisons. The speaker’s blog. The language used to describe relationships needs to be richer, whilst still being diplomatic.
  • Open coffee – a coffee meeting for people starting up. Runs every second Thursday.
  • Twitter – got a quick intro to this. 140 character microblogging / updates. Max of 240 free SMSes per week in Australia.
  • The bar opened, and I played 3 rounds of the Werewolves + Seekers + Healers + Villagers game (rules are here or here, we played with a healer), which was a fun social game. There were between 11 and 15 people at the start of each round. It just confirmed what I’ve always known – that I am really bad at deception – I was found out fairly quickly when I was a werewolf!

Location-aware wikis – the next big wiki thing?

There are some important changes coming in the next five years around how people will use wikis, specifically in conjunction with mobile devices. I’d like to publicly outline my thoughts on the background, the premise, and the potential.

Background

First some background. Around 4 or 5 years ago, most laptops started including local wireless and better power-saving as standard (i.e. greater portability of computing power). About 2 years ago, the number of laptops sold exceeded the number of desktop and server systems sold, and that trend has only continued since (i.e. greater ubiquity of portable computing power).

About 12 years ago, the first mobile phone I owned was a second-hand classic Motorola the size and weight of a small brick (it was too heavy to carry often, so mostly I left it in my car – it was similar to this, but a bit smaller – it was mobile, but not wearable, and the battery life was rubbish, maybe a few hours, and it could only do phone calls). About 9 years ago my phone was basic Nokia – it was much lighter, with battery life of a bit over 1 day, but it was still a bit heavy so it had a belt clip, and it could make calls and send SMS (i.e. very basic data). My current Nokia phone is about 4 years old, it’s cheap, it’s lightweight (85 grams), it has battery life of about a week, and it does WAP, but no Wi-Fi. So the trend lines are clear in retrospect for both laptops and mobiles, and looking ahead, they are converging: Greater portability; Greater computing power; Greater battery life; Greater access to mobile data; And mobile phones are basically becoming wearable mini-computers that you carry around in a pocket with you.

The premise

So far, this hasn’t impacted wikis too much, but I think we’re about to reach a tipping point where these trends do have a bigger impact on wikis – I would like to outline why, and what’s required for it to happen. In particular, lately a number of friends and family have independently upgraded to mobile phones with inbuilt GPS plus mobile Internet functionality. I think GPS + mobile Internet + wikis could be a game changer, and it could be a seriously kick-arse combination. But you need all 3 components for it to work.

Think about it – a wiki that has local information about your area, the best restaurants, the best sights and entertainment, all with genuine user-comments and guides and feedback and ratings. Everything in that wiki is geotagged – that’s part of the core purpose of the wiki. You “carry” the wiki with you in your pocket, on your phone, through your mobile Internet. And as you move around, the GPS shows you where you are, and what’s near to you that has got articles and that was good. Wander wherever you like, knowing that you’ll always have the best low-down on what’s good and what’s not, no matter where you are. Be a local anywhere.

Now the mobile phone manufacturers have already started to include some limited GPS software with “points of interest” on their phones – e.g. the Nokia Navigator 6110 will show you nearby ATMs, petrol stations, public bathrooms, etc. That’s great for facts for commodity destinations (e.g. most ATMs or Petrol stations are completely interchangeable). But what about restaurants – which ones are worth eating at, and are in your budget? Sights – which ones are actually worth seeing, according to the people that have been there? The current GPS software lacks depth in this regard, but worse, it lacks participation. This makes it broken.

There are audio tour guides starting to show up for cities – e.g. in Hong Kong you can purchase a SIM card which then gives you free over-the-phone access to a canned tour guide that you can listen to as you wander around a certain area of the city. But it’s basically scripted for you, and you don’t get to “edit” it to add your picks for those who come after you. Canned audio guides lack interactivity and participation.

There are some city-specific wikis (e.g. DavisWiki, ArborWiki), which have good depth about an area. But mostly they lack geotagging, and there’s bound to be some server-side software updates needed to make location-aware wikis work well on mobile phones. So currently the wikis we have about a specific location aren’t particularly usable from a mobile phone. They’re about a place, but they are not location-aware or portable. As a result, city-specific wikis have been a niche wiki application, but in a few years the number of wikis in this area will explode. I know that a number of entrepreneurs are interested in local wikis or the data stores behind them, and it’s an area that has a huge and largely untapped potential, but which to date has mostly been done well by transitory college students.

There are some sites (e.g. for New York) where you can get functionality something like what I’m describing (by scribbling notes on a map), but I suspect it’s not as deep or as broad or as structured as a wiki can be.

No, what you need is all 3 things together: The location-awareness of GPS, the depth and timeliness of being able to access a great big store of current information via the Internet, and the participation of wikis. But it will happen. I’m calling it – mark my words. And whoever does it first and does it best will probably make a bloody fortune.

This (a GPS phone) plus this (mobile Internet) on this (a local wiki) equals good

The problems

What’s holding it back currently is that advanced phones are expensive (e.g. about AU $850 for a Nokia N95, but there is at least one open-source phone which will have GPS, the OpenMoko, in development), not all phones have GPS (e.g. the lauded iPhone lacks GPS – what were Apple thinking? – I wouldn’t buy one of these until it has GPS if I were you), and mobile Internet is expensive and often usage-metered rather than flat rate. But those things will get fixed in time. The technology exists and works – it just needs to become widely distributed. Mobile Internet will become ubiquitous in phones, even the cheap ones. GPS will become ubiquitous in phones, even the cheap ones. And mobile Internet will get cheaper as demand for it increases and competition increases, or it will be overtaken by citywide mesh wireless networks. These things will happen, and the opportunity is very real. So it’s not an “if” but a “when”. I’m thinking maybe 5 years before it’s common to see people in the street doing this. But if you want to be there and be ready for that time in 5 years, you probably need to start building it now. But building it will probably be expensive, simply because the first one of anything non-trivial in software usually is expensive.

What will it look like? How will it work?

The first thing to realise is that if you’re walking around, you don’t normally want a lot of text. A 40-kilobyte Wikipedia article is a tad unwieldy to read on a 2.6″ screen whilst walking around in the full sunlight. What you want instead is a summary of information, possibly spoken by software instead of written text. A little bit of the right information at the right time: “Turn left here. Walk 50 metres. It’s nearly lunchtime – Excellent Portuguese Chicken on your right for $10”. Keep it simple, keep it short.

Now if people want more information at that point, then give it to them. “Hmm… Portuguese food… yum, sounds tasty… let’s quickly scan the menu and ratings… **click** **scroll** … okay, sold!”

Now it’s not a wiki unless you can then add your thoughts. So after your meal, you notice that the hours are slightly out of date, and correct them. Maybe you upload a photo of the shop or your dish (before you ate it!). And you add a rating (4 out of 5) and a quick note: “the chicken is succulent and tasty. Be sure to ask for garlic sauce on your chips – it tastes great!”

Another thing you could do is follow a planned route if you’re new to an area, for a “best of” tour. This is kind of like the Hong Kong idea, but because it’s a wiki it could evolve and be updated in a decentralised fashion. Similarly planning your own routes for later, and storing them on the wiki, would be good. And after you had done the route, if the wiki asked you whether you had any corrections or updates that you wanted to make, then that would be good.

There will also probably have to be a more traditional detailed way to view the wiki, like the standard Wikipedia Monobook skin. This would allow both mobile and desktop users to update and edit the site, whilst still allowing mobile users to have a more concise view of the information.

An important thing to note is that most of the content has to be created by locals. Someone on the other side of the planet can add skeleton entries for restaurants or parks or museums, such as names and addresses, but the valuable content, the user-generated stuff, has to come from ordinary users, on the ground, who know the place in question, have tried it, and have had some sort of reaction. So a low barrier to entry (much lower than the Wikipedia) is required to allow sufficient people to contribute feedback to allow it to work.

How to make it happen faster

The single best way to help make this happen faster is to build citywide free mesh wireless networks in your neighbourhood. The mobile Internet is the biggest stumbling block, and big telecoms are hugely resistant to change or dropping their prices unless forced to (basically, they’re pricks). GPS in phones is coming, and I see no sign that companies like Nokia are holding back; and wiki people generally don’t hold back, so that doesn’t worry me either. The wireless networking does a bit though. The answer may be to build a grassroots network, using a self-healing easy-deployment wireless mesh, such as Meraki is doing in San Francisco. (By the way, if anyone wants to start making one of these mesh networks in Sydney, let me know, I’d happily be involved in that).

Anyway, that’s it from me. Just remember: GPS phone + wireless Internet + local wikis = perfect storm. Ciao!

Comparing compression options for text input

If you’re compressing data for backups, you probably only care about 4 things:

  1. Integrity: The data that you get out must be the same as the data that you put in.
  2. Disk space used by the compressed file: The compressed file should be as small as possible.
  3. CPU or system time taken to compress the file: The compression should be as quick as possible.
  4. Memory usage whilst compressing the file.

Integrity is paramount (anything which fails this should be rejected outright). Memory usage is the least important, because any compression method that uses too much RAM will automatically be penalised for being slower (because swapping to disk is thousands of times slower than RAM).

So essentially it comes down to a trade-off of disk space versus time taken to compress. I looked at how a variety of compression tools available on a Debian Linux system compared: Bzip2, 7-zip, PPMd, RAR, LZMA, rzip, zip, and Dact. My test data was the data I was interested in storing: SQL database dump text files, being stored for archival purposes, and in this case I used a 1 Gigabyte SQL dump file, which would be typical of the input.

The graph of compression results, comparing CPU time taken versus disk space used, is below:
Compression comparison graph
Note: Dact performed very badly, taking 3.5 hours, and using as much disk space as Bzip2, so it has been omitted from the results.
Note: Zip performed badly – it was quick at 5 minutes, but at 160 MB it used too much disk space, so it has been omitted from the results.

What the results tell me is that rather than using bzip2, either RAR’s maximum compression (for a quick compression that’s pretty space-efficient), or 7-zip’s maximum compression (for a slow compression that’s very space-efficient), are both good options for large text inputs like SQL dumps.
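
For anyone wanting to reproduce this, the two recommended options boil down to invocations along these lines. This is a sketch only (assuming the rar and p7zip packages are installed, with “dump.sql” standing in for whatever file you are testing with); -m5 and -mx=9 are the respective maximum-compression settings:

# RAR, maximum compression (quicker, slightly larger output)
time rar a -m5 dump.sql.rar dump.sql
# 7-zip, maximum compression (slower, smallest output)
time 7z a -mx=9 dump.sql.7z dump.sql
# and verify the archives before trusting them as backups
rar t dump.sql.rar
7z t dump.sql.7z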

Google Developer Day 2007

Went to the Google Developer Day 2007 yesterday. It was held at 9 locations worldwide. The Sydney one was the second-largest one, with 700 developers attending.

To summarize the event down to a sound bite, it was about Google APIs and mashups. (and I’m sort of hoping I won’t hear the word “mashup” again for at least a week…)

Here are my notes, and I have indented the bits that seemed to me to potentially be relevant or useful to MediaWiki or the Wikipedia:

  • All their APIS are at http://code.google.com/apis/
  • GData / Google Data APIs – provides simple read/write access via HTTP, and authentication with Google. See also S3 for a similar idea (docs here).
  • Google Web Toolkit. Also known as GWT, and pronounced “gwit”. Converts Java code to JavaScript, in a cross-browser compatible way. AJAX development can be painful because of browser compatibility problems. GWT is one solution to this problem. Licensed under Apache 2.0. Develop in any Java IDE, but recommend Eclipse. Launched a year ago (almost exactly). Using Java as source language because of its strong typing.
  • Google Gears was presented by the creator of Greasemonkey. Gears is a browser plugin / extension, for IE + Firefox + Safari, that allows web apps to run offline. I.e. can extend AJAX applications to run offline, with access to persistent local SQL data storage. Means users don’t always need to be online, as there is access to persistent offline storage for caching data from the web, and for caching writes back to the web. Released under BSD license. Uses an API that they want to become a standard. Idea is to increase reliability, increase performance, more convenient, and for all the times people are offline (which is most of the time for most people). It’s an early release, with rough edges. Use local storage as a buffer, and there is a seamless online-offline transition. For the demo he disconnected from the net. What talks to what: UI <–> Local Db <–> Sync <–> XmlHttpRequest. Gears has 3 modules – LocalServer (starts apps), Database (SQLite local storage), and WorkerPool (provides non-blocking background JavaScript execution). WorkerPool was quite interesting to me – non-blocking execution, that overcomes a limitation of JS – different threads that don’t hog the CPU … really want the whole Firefox UI to use something like this, so that one CPU-hogging tab doesn’t cause the whole browser to choke.
    • Thoughts on how Gears could potentially be applied to MediaWiki: An offline browser and editor. Will sync your edits back when go online, or when the wiki recovers from a temporary failure. Could also cache some pages from the wiki (or all of them, on a small enough wiki) for future viewing. Basically, take your wiki with you, and have it shared – the best of both worlds.
  • Google Mapplets. Mapplets allows mashups inside of google maps, instead of being a google map inserted into a 3rd party web page. “Location is the great integrator for all information.” URL for preview. Can use KML or GeoRSS for exporting geographic information.
    • Thoughts on how to use this for the Wikipedia: Geotagging in more data could be good. E.g. Geotagging all free images.
  • Google Maps API overview. A lot of the maps.google.com development is done in Sydney. Talk involved showing lots of JS to centre google maps, pan maps, add routes, add markers, change marker zones, add custom controls, show / hide controls. A traffic overlay for showing road congestion is not available for Sydney yet, but will be available soon. Some applications of the Maps API: Walk Jog Run – see or plan walking or running routes – example; Set reminders for yourself ; Store and share bike routes.
    • Thoughts on applications for the Wikipedia: Perhaps a bot that tries to geolocate all the articles about locations in the world? Will take a freeform article name string, and convert to longitude + latitude, plus the certainty of the match (see page 16 of the talk for example of how to do this). Could get false matches – but could potentially be quite useful.
  • Google gadgets. Gadgets are XML content.
  • KML + Google Earth overview. KML = object model for the geographic representation of data. “80% of data has some locality, or connection to a specific point on the earth”. Googlebot searches and indexes KML. KML de facto standard, working towards making a real standard.
    • Already have a Wikipedia layer. It seems to be 3 months out of date, and based off of the data dumps though.

Misc stuff:

  • Google runs a Linux distro called Goobuntu (Google’s version of Ubuntu).
  • Summer of code – had “6338 applications from 3044 applicants for 102 open source projects and 1260 mentors selected 630 students from 456 schools in 90 countries”.
  • My friend Richard, one of the organisers of Sydney BarCamp, spoke with some of the Google guys, & they were quite enthusiastic about maybe hosting the second Sydney BarCamp at a new floor they’re adding to Google’s offices in late July or early August. If that works out, it could be good … although if it could not clash with Wikimania, that would be better!
  • Frustration expressed by many people about the way the Australian govt tries to sell us our own data (that our tax dollars paid for in the first place), restricting application of that data. Example: census data. Much prefer the public domain approach taken in the US.