Smithsonian turns to crowdsourcing for massive digitization project

By Hamish McKenzie, written on November 8, 2013

From The News Desk

There are 5 million plant specimens in the US Herbarium at the Natural History Museum’s Botany Department, one of the most extensive collections of plant life in the world. They all have labels. But only 1.3 million of those labels can be read by computers.

That’s where you come in.

Jason Shen and Sarah Allen, a pair of Presidential Innovation Fellows working with the Smithsonian Institution to improve its open data initiatives, have gone all Mechanical Turk on the esteemed knowledge network.

In a pilot project that is serving as a test run for other large Smithsonian scientific collections – accounting for a total of about 126 million specimens – the innovation fellows are crowdsourcing the transcription of scanned images of the labels.

To get involved, you don’t need to commit to a certain number of hours, or make yourself available at specific times. You just log into the Smithsonian’s recently established transcription site, select a project to work on, and start transcribing. Different volunteers can work on the same project at different times. When you’ve done your bit, you submit it for review, at which point a different volunteer comes in to check to see that you’ve done the transcription correctly.

So, for instance, you might get to look at specimens collected by Martin W. Gorman on his 1902 expedition to Alaska’s Lake Iliamna Region, and read his thoughts on his curious findings. If you’re the type to get excited by a bit of vintage Potentilla fruticosa, then this is your Disneyland.

It’s the sort of crowdsourcing initiative that has been going on for years in other corners of the Internet, but the Smithsonian is only just getting going. It has long thought of itself as a passer-on of knowledge – its mission is “the increase and diffusion of knowledge” – with the public as recipients rather than contributors, so the “let’s get everyone to help us with this gargantuan task” mentality has not been its default position. It does rely on a lot of volunteers to lead tours, maintain back rooms, and the like, but organizing knowledge is another thing.

Plus, it’s always been a huge hassle. The Smithsonian is essentially 19 different organizations, each of which has its own management system. It’s only in the last nine years that the various organizations’ databases have been tied together in a unified enterprise system that lets people view things institution-wide.

Shen, who founded a startup called Ridejoy, and Allen, who founded a startup called Mightyverse, are working with the Institution as part of a months-long posting that is typical of the year-old Presidential Innovation Fellows program.

Shen and Allen quietly launched the Smithsonian Transcription Center in August as part of a wider effort to digitize all of the Institution’s collections. The Herbarium effort is one of the most significant to date, but other projects have ranged from field notes of bird observations to letters written between 20th-century American artists. More than 1,400 volunteers have contributed to the projects to date, accounting for more than 18,000 transcriptions.

[Photo via Smithsonian Collections Blog]