Trying to digitize a magazine with Snapter
I tried out the Snapter demo recently. Snapter is an intriguing piece of software that aims to ease the digitization of books, magazines, whiteboard photos, etc. You photograph your page with a digital camera, feed it into Snapter, and boom, you are presented with a beautiful PDF… or that is the idea. In practice, things don’t seem to work so well.
I photographed a magazine (about 50 pages), then ran the photos through Snapter. The interface and workflow are still a little rough, but the more fundamental problem is the page detection. The program is supposed to detect the edges of the page. Using this information, it can then warp the image to deal with things like page curl or crooked photos. On most pages, it wasn’t even close, comically so. The page detection algorithm appears to just look for high contrast and forgets that the edge of a page is almost always a straight line, or close to one. To add insult to injury, the interface gives you the opportunity to correct the page detection by dragging the handles to the true edge. However, when I dragged the handles, they refused to move to where I dragged the cursor.
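To illustrate what I mean by remembering that a page edge is nearly a straight line, here is a minimal sketch in Python. It is not how Snapter works (I have no idea what it does internally). It assumes some earlier step has already produced candidate edge points, fits a line to them with the Theil-Sen estimator (median of pairwise slopes), and flags points that stray too far from that line. The sample points, the function names, and the 3-pixel tolerance are all made up for the example.

```python
from itertools import combinations
from statistics import median


def theil_sen(points):
    """Robustly fit x = slope * y + intercept to (x, y) edge candidates.

    The edge is treated as near-vertical, so x is modelled as a function of y.
    """
    slopes = [
        (x2 - x1) / (y2 - y1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if y1 != y2
    ]
    slope = median(slopes)
    intercept = median([x - slope * y for x, y in points])
    return slope, intercept


def split_outliers(points, slope, intercept, tolerance=3.0):
    """Separate points within `tolerance` pixels of the line from the rest."""
    inliers, outliers = [], []
    for x, y in points:
        close = abs(x - (slope * y + intercept)) <= tolerance
        (inliers if close else outliers).append((x, y))
    return inliers, outliers


# Hypothetical detections along a slightly tilted left page edge,
# with two bad detections where a high-contrast photo fooled the detector.
edge = [(100 + y / 8, y) for y in range(0, 160, 8)]
edge[5] = (160.0, 40)
edge[12] = (30.0, 96)

slope, intercept = theil_sen(edge)
inliers, outliers = split_outliers(edge, slope, intercept)
print(slope, intercept, outliers)
```

The two stray points barely influence the fit because the medians are dominated by the points that agree with each other, which is the behavior I would have expected from the page detection.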

The above photo is an example of Snapter doing a decent job of page detection thanks to the very simple layout of the pages, with high contrast between the page and the background. It has correctly found the center of the magazine (yellow vertical line). The red line has a minor blip on the lower left, and also didn’t quite find the left side. I imagine it would do a decent job on a book with no color and no photos or illustrations.

On this set of pages, Snapter fails miserably. These pages are trickier than above, but there are still distinct page edges. Snapter has made comical wavy lines desperately trying to find the edge. Attempts to drag the handles to the true page edges are mostly ignored and just end in frustration.
On top of these issues, Snapter is overpriced (for a consumer application) at $50 and only available for Windows. Here’s hoping for Snapter 3 soon.
Atiz announces new BookDrive DIY book digitizer
Atiz, that company that was sending me unsolicited email about their automated book scanner a while back, recently announced a new product for digitizing books, the “BookDrive DIY”. The design is very similar to Archive.org’s Scribe project. It cradles the book in a V shape and uses two overhead mounted Canon Digital SLR cameras. A human must manually turn the pages. The system, without the cameras or a computer, is $3500. It includes software for capturing and cropping photos. You are on your own for OCR. It isn’t clear to me how the “transparent plane” is raised to allow page turning.
Targeted Spam by Book Digitizer
I’ve now been spammed twice by the company Atiz, apparently in an attempt to generate interest and buzz. Spam isn’t unusual, but this spam is actually targeted and slightly customized. Atiz is a company selling an automated book digitizer. I presume they searched blogs mentioning competitors or book digitization, and sent emails to those individuals pushing their product.
And while I would rather not even mention their name due to their unsolicited email, the “BookDrive” does look pretty interesting. Instead of using a digital camera to photograph a book page, they basically are using a scanner head which automatically moves over the page, even adjusting for the curve.
Page turning is handled by an arm which has a sticky roller (I think) which they claim leaves no residue.
The oversized breadbox contraption connects to a Windows PC. There is a video of it in action available here (warning: fairly big MPEG video). The most glaring problem with the product is that it costs somewhere between 35,000 and 50,000 dollars.
More details on Internet Archive’s Scribe Book Scanner Project
Contrary to what I was led to believe in an email from the Director of Books at Internet Archive, the software for their book digitization project is indeed available to the public (under open licenses no less). I was surprised and excited to come across the Scribe project on Sourceforge today. The documentation is nearly nonexistent and the system looks like a mishmash of software tied together with glue and tape at the moment, but it is available… Now I just wish the Internet Archive would publicize that the software is available for other libraries to potentially use and improve.
Here are additional details that I’ve gleaned since my last post on the topic.
The “Scribe station”, as it is called, uses two Canon EOS 1Ds Mark II digital cameras, which I presume make up the majority of the cost of the station. The camera produces 16.7 megapixel photos for the low low price of $7100 each. One camera photographs the left page of the book, and the other photographs the right page. (Other systems I’ve seen are able to use one camera by the use of moving mirrors, or by moving the camera.)
In an interview, Brewster Kahle has said that they really didn’t want to build the system on their own, but they tried commercial solutions and weren’t satisfied with the results considering the cost. With their home grown station, he estimates that the cost is 10 cents per page. Not sure if that includes only the labor cost of someone flipping the pages, or if it includes the cost of the hardware and software. He also mentions in the interview that they’ve worked a little with Squid Labs in hopes of getting help on the development. You can hear an excerpt of the interview where he discusses the Scribe Station here (2.8 MB MP3). The entire interview is available here.
The station requires two Windows computers (the Canon software used to control the cameras presumably doesn’t support multiple cameras). The bulk of the system is written in Java, but PHP, cygwin, Imagemagick and a host of other dependencies are also used. While I don’t expect any other libraries to independently set up their own station using the IA’s work in the near future… if it continues to evolve, I can imagine hundreds or thousands of libraries digitizing books instead of just the handful that we have today. Suddenly the task of digitizing every public domain book wouldn’t be so daunting.
More details about the Internet Archive Book Digitization project in WSJ
The Wall Street Journal article, “Building an Online Library, One Volume at a Time”, has more details on the Internet Archive’s book digitization project, including a few more photos of the scanner, which apparently costs $20,000 to $40,000 each. Still looking forward to the day when every library has one and suddenly we can make a real dent in the digitization process.
Internet Archive’s book scanner
Jessamyn got to attend the Internet Archive’s latest presentation relating to the Open Content Alliance. Basically they announced that more organizations, including Microsoft, are on the bandwagon of digitizing expired-copyright books. Jessamyn and others posted photos of the book scanning machine. I was intrigued as it looked like a custom built system.
I’ve sent a bunch of questions about the scanner to Brewster and the archivist-talk yahoo mailing list. However, Jessamyn was able to provide a handful of details.
- It is in fact a custom-built system
- Page turning is done by a human. No robotic arms, vacuums, or puffs of air to flip the page.
- The glass platen is operated by a foot pedal
- Software deals with metadata like page number and copyright status
Not sure what is used to capture the image. I suspect a high-end digital camera. The fact that a human has to manually turn the pages is a big bummer. It likely makes things slower and more expensive. On the upside, it makes the system much simpler, and probably more reliable.
Above photo by “ioerror”. More photos of the event from Jessamyn here. For samples of scanned books check out the Open Library website.
Two Internet Archive Thoughts
I’ve been keeping an eye on the Internet Archive’s recent project to digitize books in collaboration with several other companies and academic institutions. Two thoughts, one related, and one not really.
Why does the project never seem to mention Project Gutenberg and its sub-project Distributed Proofreaders? I understand that the IA project is different in many ways, but it is a real bummer that PG doesn’t get props, and isn’t involved in the project. I would imagine they could provide some valuable insight, and with Distributed Proofreaders, some valuable resources.
My other thought is one that has been rattling around in my head for a while. Internet Archive is an incredibly cool project. Kahle’s vision is to build a modern day Library of Alexandria to preserve important cultural knowledge. In his presentations he always mentions the need to store the data in multiple locations around the world to avoid the Library of Alexandria’s fate (destroyed by means that are not quite clear). It turns out that culturally significant works are already being preserved, and not just in one place, but in tens of thousands of copies all over the globe. Is it librarians doing this work? Nope, it is the “pirates”. If any work has significant cultural value, pirates will make copies. In x hundred years, when the IA is a faint memory, researchers will come across some pirate’s ancient hard drive full of DivX movies. And of course, the pirates have a huge advantage over libraries, Internet Archive, Google Print, etc. They don’t have to go through the politics and details of copyright and the DMCA.
Digitization and OCR with a Digital Camera
A few weeks ago I did an experiment to see just how bad the results of Optical Character Recognition (OCR) on a digital photograph were. I finally got around to compiling the results. I found the results to be surprising.
The set up
I created a document containing a mix of 9 point characters (alphabet, numbers, and symbols). I styled some text with bold, italic, and underline. Most text was Times New Roman. Some lines were Helvetica. You can download a PDF of the original document here.
I scanned the document on a Canon CanoScan 9900 at various settings. The only scan that I’ve compiled results for was the bitonal 600 DPI TIFF.
I then mounted the document vertically on the wall with a white background. I photographed the document with my Canon Digital Rebel camera with a Canon 70-200mm f/4L lens mounted on a tripod. The photo was in raw format to maximize quality. I didn’t do very much experimentation with camera settings or alternative lenses. I did end up adjusting levels, brightness, and contrast in Photoshop.
I wasn’t interested in how accurate the OCR was. Rather, I was interested in how the OCR results would differ between a digital photo and a digital scan. So I just used the OCR software that was included with the scanner. It came with OmniPage LE for Windows. As far as OCR software goes, it is probably pretty bad.
I scored the OCR results by doing three tests on each document.
- Was the character correct?
- Was the styling (bold, italic, underline) correct?
- Was the font correct? (I based font correctness on whether the font was appropriately serif or sans-serif and ignored whether it was technically a different font.)
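The three checks above can be sketched in Python roughly like this. I did the scoring by hand, so this is only an illustration: the `Glyph` record and the `score` function are hypothetical names, and the sketch assumes the original and recognized characters have already been lined up one-to-one, which real OCR output does not guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    char: str
    styles: frozenset  # any of "bold", "italic", "underline"
    serif: bool        # True for Times New Roman, False for Helvetica


def score(expected, recognized):
    """Count mismatches per test; assumes the two sequences are aligned."""
    wrong = {"char": 0, "style": 0, "font": 0}
    for want, got in zip(expected, recognized):
        if want.char != got.char:
            wrong["char"] += 1
        if want.styles != got.styles:
            wrong["style"] += 1
        if want.serif != got.serif:  # only serif vs. sans-serif is compared
            wrong["font"] += 1
    return wrong


# Tiny made-up example: a lost bold, and a pipe read as a serif "1".
expected = [
    Glyph("a", frozenset(), True),
    Glyph("b", frozenset({"bold"}), True),
    Glyph("|", frozenset(), False),
]
recognized = [
    Glyph("a", frozenset(), True),
    Glyph("b", frozenset(), True),
    Glyph("1", frozenset(), True),
]
result = score(expected, recognized)
print(result)
```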
The Results
| | original document | digital photo (incorrect chars.) | 600 dpi bitonal scan (incorrect chars.) |
| --- | --- | --- | --- |
| # characters | 3144 | 67 | 72 |
| styled characters | 1184 | 479 | 701 |
| font | 2672 | 114 | 0 |
| line endings | 64 | 6 | 7 |
Surprisingly (to me) the OCR did a slightly better job on the digital photo. And even with crappy OCR software the character recognition was remarkably good. 67 incorrect characters comes to only a 2% error rate. Many of the characters that were missed were tricky things like a backtick (`) being interpreted as an apostrophe (’) or the pipe character being interpreted as a one (1).
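For anyone who wants the error rates for the other rows, here is a small Python sketch of the arithmetic. It assumes my reading of the table: the first column is the total count in the original document, and the other two columns are the number of incorrect items in each OCR result. The variable and function names are just for illustration.

```python
# Totals from the "original document" column.
totals = {
    "# characters": 3144,
    "styled characters": 1184,
    "font": 2672,
    "line endings": 64,
}

# Incorrect counts from the other two columns.
incorrect = {
    "digital photo": {
        "# characters": 67,
        "styled characters": 479,
        "font": 114,
        "line endings": 6,
    },
    "600 dpi bitonal scan": {
        "# characters": 72,
        "styled characters": 701,
        "font": 0,
        "line endings": 7,
    },
}


def error_rate(wrong, total):
    """Percentage of items that were recognized incorrectly."""
    return 100.0 * wrong / total


rates = {
    source: {row: round(error_rate(wrong, totals[row]), 1) for row, wrong in rows.items()}
    for source, rows in incorrect.items()
}

for source, rows in rates.items():
    for row, rate in rows.items():
        print(f"{source:22} {row:18} {rate:5.1f}%")
```

The character rows come out at roughly 2.1% for the photo and 2.3% for the scan, which is where the 2% figure comes from.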
Conclusion
This was just a quick basic test. The conditions were fairly ideal (perfectly flat paper). A bound book would probably result in crappier results. Open Source software that corrects for the curvature of the page in a bound book would be really handy. However, 9pt text is pretty small. I don’t know what the average font size of books is, but it probably doesn’t get much smaller than 9 point.
Konica Minolta PS7000
The Konica Minolta PS7000 is designed to digitize books, but unlike the fancy Kirtas APT BookScan 1200, this one doesn’t have the automatic vacuum powered page turner. It does not appear to be specifically designed for performing OCR. It does 600 DPI at A3 size (11.7 x 16.5 in) in black and white. No color or grayscale. It also does curvature correction. It has been out for several years. According to a post on the ImageLib mailing list from 1999, it costs between $15,000 and $20,000. For curiosity’s sake, the hunt continues…
Robotic Book Scanning
The Kirtas APT BookScan 1200 is a robot that can automatically “scan” books at a rate of 1200 pages per hour. According to the specification page, it uses a 16.6 megapixel camera to digitize the pages. A vacuum is used to gently turn the pages. The downside is that it costs 120,000 to 150,000 dollars. It is apparently the cheapest high-speed method for digitizing books without unbinding them. Definitely check out the demo. Kirtas also claims to have applied for several patents, but I haven’t been able to dig anything up at the USPTO website.
Interestingly, according to the rumors on the internets, Google is using something completely different in their library digitization project.
The technology that Google is using is nothing like the Kirtas and is entirely their own. I’m not able to provide any details on the nature of the technology, but because we did provide extensive review going into the project, and have occasional checks of systems and processes, I know it well enough (i.e., very well) to say that it’s nothing like the Kirtas.
When I learned that the BookScan 1200 was so expensive, I got curious how hard it would be to build something similar. I haven’t made it very far, but I did start an experiment by scanning a page of text, then taking a photo with my 6 megapixel digital camera and comparing the results. Hopefully I’ll have some time to compile the results and write something up soon. I never expected very good results, but basically wanted to see how good or bad it was.