Sunday, December 1, 2013

Lost docs

If I had the time and the toner, I'd print everything onto archival paper. I thought, I wrote, I archived -- often in Microsoft Word, saving the file of course rather than printing it. The joke, as I like to tell it, is that cuneiform tablets > hard drives. I hadn't thought of the archival disaster that awaited these .doc files. How would I read them in the future? Or anyone? No copy of Word 97 on the shelf...

Joking aside, I'm glad I found this post by Peter Hansteen during a day's BSD wanderings. Forthwith I made it a mission to preserve what I could of the old files.

I tried Antiword and Apache Tika, but neither could extract the text from every file. Enter the primitive tool that outlasts them all: strings.

for i in ./WordDocs/doc/*.doc; do strings "$i" > "$i.txt"; done
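
For anyone repeating this on a messier pile of files, here is a slightly more defensive sketch of the same loop. The function name extract_doc_text and the six-character minimum are my own illustrative choices, not anything from the tools above. It quotes filenames so spaces survive, and it skips quietly when a directory holds no .doc files.

```shell
#!/bin/sh
# Sketch only: extract_doc_text is a hypothetical helper name.
# Writes the printable text of each .doc file in a directory to <name>.doc.txt.
extract_doc_text() {
    dir=$1
    for f in "$dir"/*.doc; do
        # If nothing matched, the pattern stays literal; skip it.
        [ -e "$f" ] || continue
        # -n 6 drops runs shorter than six characters (an arbitrary
        # threshold, chosen here to cut down on binary noise).
        strings -n 6 "$f" > "$f.txt"
    done
}

extract_doc_text ./WordDocs/doc
```

Some Word files store their text as 16-bit characters, which a plain strings pass can miss. GNU strings has an -e l option for 16-bit little-endian text; whether your system's strings supports it is worth checking in its manual page before relying on it.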

As the manual page for strings notes:
The algorithm for identifying strings is extremely primitive.
That said, plain text is better than no text. Hmm, there's a line.