Scanning success!

Yesterday I “scanned” two entire magazines – Prairie Wool Companion #16 and Weaver’s #44. The OCR software, while not perfect, is pretty darned good! Here are “before” and “after” shots of a small section with no weaving content (which I think is minor enough to fall under “fair use”):

As you can see, it’s not perfect – there are a few misspellings and a few cropped photos – but it’s good enough to use as a searchable PDF, and because there’s an option to save it as a PDF/A file type, I’m not worried about OCR errors making it unreadable. As saved, the file keeps the original image PLUS the searchable text, all on the same page. (The searchable version is overlaid on the original, but invisible, so all you see is the image with whatever you searched for highlighted.) By reducing the image quality slightly, I halved the file size, resulting in a 25 MB file for the entirety of Weaver’s #44 – making it small enough to place in Evernote, my online note-taking software. (The advantage of putting it in Evernote is that I can search all my magazine issues at once.)

Here is how I did it:

I set up a camera on a copy stand (basically a device that holds the camera so that the lens points directly downwards; think of it as a tripod, except vertical instead of horizontal).
I put an open magazine underneath the camera, and positioned both camera and magazine so the two visible pages occupied the entire field of view.
I placed a pane of double-strength window glass over the magazine to flatten the pages.
120W halogen floodlights, placed on either side at a shallow angle to avoid reflection from the glass, provided the light.
I used a remote control to trigger the camera, so I wouldn’t jostle the camera body (and blur the photo) while pressing the button.
The resulting images were loaded onto the computer and processed (an entire issue’s worth at once) using ABBYY FineReader 10.0, the best consumer-affordable OCR software available. Processing the images hogged the CPU for about 20 minutes for each issue – definitely intensive work!
I then saved the document in PDF/A format, with image quality set to medium (screen quality, not print quality).
Voila! A searchable .pdf of an entire magazine issue.

There was quite a bit of camera-fiddling to make sure I had the exposure and focus right – and I think I will go back and fiddle some more before doing the entire “run” of magazines. At the moment, the images are yellower than I’d like; I’m pretty sure there’s a camera setting to compensate.
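
For pages that have already been photographed, a yellow cast could in principle also be reduced afterwards in software. The sketch below is only an illustration of that idea, not what I actually did: it assumes a simple "gray-world" correction, which scales the red, green and blue channels so that their averages match. That assumption is plausible for a page that is mostly white paper, but it can go wrong on pages dominated by one color. The function name `gray_world_balance` and the plain list-of-tuples pixel format are made up for the example; a real workflow would more likely use an image editor or an imaging library.

```python
def gray_world_balance(pixels):
    """Rescale R, G, B so the three channel averages become equal.

    `pixels` is a list of (r, g, b) tuples with values 0-255.
    This is a hypothetical helper for illustration only.
    """
    n = len(pixels)
    if n == 0:
        return []

    # Average brightness of each channel across the whole image.
    means = [sum(p[c] for p in pixels) / n for c in range(3)]

    # Target: the overall average of the three channel means.
    gray = sum(means) / 3

    # Per-channel gain; leave a channel untouched if it is entirely zero.
    gains = [gray / m if m > 0 else 1.0 for m in means]

    # Apply the gains, rounding and clamping to the valid 0-255 range.
    return [
        tuple(min(255, round(p[c] * gains[c])) for c in range(3))
        for p in pixels
    ]


# Two yellowish "paper" pixels (strong red and green, weak blue).
yellowish = [(200, 180, 120), (100, 90, 60)]
balanced = gray_world_balance(yellowish)
```

Getting the color right in the camera is still the better fix, since a correction like this can only rescale the color information that was captured.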

There are a few other things that I’m considering doing, like going through each magazine to “clean up” the text-recognition. Not to correct misspellings – I haven’t got the time or patience to proofread every single page – but to delete advertisements and similar material that might result in extraneous words. For example, when I search for “temple”, I don’t want to see all the ads that contain the word “temple” (there might be a lot of them); I want to see articles that contain the word. So the cleanup will take time, but is worthwhile, I think.
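
To make the goal of that cleanup concrete, here is a rough sketch of the effect I'm after. It is not how the OCR software or Evernote works internally; it assumes, purely for illustration, that the recognized text of each page is held in a dictionary keyed by page number and that the ad pages have been noted by hand. The names `search_articles`, `pages` and `ad_pages` are all hypothetical.

```python
def search_articles(pages, term, ad_pages=frozenset()):
    """Return the page numbers whose text contains `term`,
    skipping any page listed in `ad_pages`.

    `pages` maps page number -> recognized text (hypothetical format).
    The match is case-insensitive and matches substrings.
    """
    needle = term.lower()
    return [
        number
        for number, text in sorted(pages.items())
        if number not in ad_pages and needle in text.lower()
    ]


# A tiny made-up "issue": one article page, one ad page, one unrelated page.
issue = {
    1: "Using a temple keeps the weaving width even.",
    2: "SALE: temples and shuttles, order today!",
    3: "Warping basics for beginners.",
}

with_ads = search_articles(issue, "temple")
without_ads = search_articles(issue, "temple", ad_pages={2})
```

With the ad page excluded, only the article page comes back, which is the behavior the cleanup is meant to achieve.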

Meanwhile, of course, I continue to crank away on my new job (first day went well, lots of information to digest), and will work some more on the weaving later this morning. I need to finish debugging the warp, and then I can start weaving samples and taking measurements, so I can knit up a blank accurately.

All in good time…!

Comments

terri says

June 21, 2011 at 2:52 pm

Glad to hear your first day went well–enjoy the new job!

Julie Sohns says

June 22, 2011 at 5:24 am

The camera setting you want to check for color correction is White Balance. If you set the White Balance to match the type of lights you are using you should lose the yellow color cast.
