The Internet Archive has uploaded around 2.6 million historical images to Flickr. The researcher who devised a simple technique to extract the images and automate the upload hopes to inspire other digital libraries to do the same.
The images run from the 16th century to 1922, putting them all in the public domain in the US. They mainly come from books that are out of print or rare and include many artworks where the original print is long gone.
The upload is the work of Kalev Leetaru of Georgetown University, who is on a fellowship sponsored by Flickr’s owner Yahoo.
He based his operation on the way the Internet Archive scanned around 600 million pages of old books to create electronic text versions. That project used an optical character recognition program not only to recognize text but also to filter out the parts of each page that contained no text: in other words, the images.
Leetaru set up a system which used the Internet Archive’s records to retrieve the discarded section from each scanned page and turn it into a JPEG file. The system also retrieved any caption text plus the paragraphs before and after the image. This text was then automatically mined to select tags, making the Flickr images searchable.
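To make the tagging step concrete, here is a minimal sketch in Python of how tags might be picked from a caption and the paragraphs around an image. It is not Leetaru's code, which the article does not describe in detail. The function names, the short stopword list and the choice to weight caption words more heavily are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = frozenset(
    "a an and are as at be by for from in is it of on or that the this to was were with".split()
)


def tokenize(text):
    """Lowercase the text and keep alphabetic words of 3+ letters that are not stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) >= 3 and w not in STOPWORDS]


def select_tags(caption, before, after, max_tags=5, caption_weight=3):
    """Rank candidate tags by weighted frequency (hypothetical scoring scheme).

    Words in the caption count `caption_weight` times each; words in the
    paragraphs before and after the image count once each.
    """
    counts = Counter()
    for word in tokenize(caption):
        counts[word] += caption_weight
    for word in tokenize(before) + tokenize(after):
        counts[word] += 1
    # Highest score first; ties are broken alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_tags]]


if __name__ == "__main__":
    tags = select_tags(
        caption="The old mill",
        before="The mill stood by the river.",
        after="A river ran past the mill.",
        max_tags=2,
    )
    print(tags)
```

A production system would likely go further, for example by recognizing names and places, but even simple frequency counting shows how text near an image can make it searchable.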
The next step is to complete the project, which will mean uploading a total of 12 million images. Leetaru has also called for Wikipedia to have a day of action on which volunteers search the archive to find suitable images for pages that currently lack illustration.
Leetaru says he’s happy to share his code with any library or other organization that has already digitized books or plans to do so.