Adventures in Blog Recovery
A few months ago something awful happened to a coworker of mine, Sean. Due to unexpected circumstances he lost three years of blog posts and was unable to recover them. As you can probably imagine, this was crushing to him.
News of this loss traveled fast around the office, and my coworker Keith decided to jump into action. He noticed that Google still had the majority of Sean’s site cached, so in a dramatic effort to save the blog he organized a mass download of Google’s cache. Apart from Google mistaking us for a bot a few times, the download was a success. Knowing that more data is always good, Keith also wrote a script that used Yahoo’s BOSS API to download their cached copies as well. In the end, Keith had gathered over 60 MB of cached data from Sean’s lost blog.
This is where I come into the picture. Conveniently enough, I take great pleasure in writing scripts to parse data. Knowing this, Keith came to me with the data and the simple task of recovering as many of Sean’s posts as possible.
The first major challenge was that Sean had used a few different templates over the years, meaning that not all posts were marked up the same way. Luckily, two of the templates used the same markup for the posts, which made things much easier.
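To give a rough idea of how the template problem can be handled, here is a minimal Python sketch that tries one pattern per template until something matches. The wrapper markup, `POST_BLOCK_PATTERNS`, and `extract_post_block` are all made up for illustration; they are not Sean’s real templates or my original code.

```python
import re

# Hypothetical wrappers, one per template family. The real markup
# from the cached pages is not shown in this post.
POST_BLOCK_PATTERNS = [
    re.compile(r'<div class="post"[^>]*>(.*?)<!-- /post -->', re.S),
    re.compile(r'<article class="entry"[^>]*>(.*?)</article>', re.S),
]


def extract_post_block(page_html):
    """Return the inner HTML of the first post wrapper found, or None."""
    for pattern in POST_BLOCK_PATTERNS:
        match = pattern.search(page_html)
        if match:
            return match.group(1).strip()
    return None
```

Templates that share the same markup need only one entry in the list, which is why two of them using identical markup saved so much work.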
Once I had the whole post text available, I was able to write a few simple regular expressions to parse out the post date, title, categories, and the actual post content.
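The parsing step looks roughly like the Python sketch below. The class names, the date format, and the `parse_post` helper are assumptions made up for this example, since the actual markup depends on the template; treat it as an illustration of the approach rather than the code I ran.

```python
import re
from datetime import datetime

# Hypothetical markup for a single post; the real templates differ.
SAMPLE_POST = """
<div class="post">
  <h2 class="title">Hello World</h2>
  <span class="date">March 3, 2008</span>
  <a rel="category tag" href="/category/news">News</a>
  <a rel="category tag" href="/category/code">Code</a>
  <div class="entry"><p>First post body.</p></div>
</div>
"""

TITLE_RE = re.compile(r'<h2 class="title">(.*?)</h2>', re.S)
DATE_RE = re.compile(r'<span class="date">(.*?)</span>', re.S)
CATEGORY_RE = re.compile(r'<a rel="category tag"[^>]*>(.*?)</a>', re.S)
# Non-greedy match up to the first closing div, so this only works
# when the post body contains no nested <div> elements.
CONTENT_RE = re.compile(r'<div class="entry">(.*?)</div>', re.S)


def parse_post(html):
    """Return a dict of post fields, or None if a required field is missing."""
    title = TITLE_RE.search(html)
    date = DATE_RE.search(html)
    content = CONTENT_RE.search(html)
    if not (title and date and content):
        return None
    return {
        "title": title.group(1).strip(),
        # Assumes English month names such as "March 3, 2008".
        "date": datetime.strptime(date.group(1).strip(), "%B %d, %Y"),
        "categories": [c.strip() for c in CATEGORY_RE.findall(html)],
        "content": content.group(1).strip(),
    }


post = parse_post(SAMPLE_POST)
```

Regular expressions are fragile against arbitrary HTML, but for a fixed set of templates generated by the same blog software they are usually good enough.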
The last step was to figure out the best way to insert the data into Sean’s new WordPress blog. I came across WordPress’s WXR (WordPress eXtended RSS) format and decided to run with it.
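For anyone curious what generating WXR involves, here is a minimal Python sketch. WXR is an RSS 2.0 document with extra WordPress-specific elements; the element names and namespace URIs below are my best recollection of the format and should be checked against an export from the target WordPress version. The `build_wxr` function and the shape of the post dicts are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespace URIs as I recall them from WordPress exports (an assumption;
# older WordPress versions used a different export version number).
NS = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "wp": "http://wordpress.org/export/1.2/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, tag):
    """Build a namespace-qualified tag name for ElementTree."""
    return "{%s}%s" % (NS[prefix], tag)


def build_wxr(posts, site_title="Recovered blog"):
    """Serialize a list of post dicts into a minimal WXR-style document."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, q("wp", "wxr_version")).text = "1.2"
    for post in posts:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = post["title"]
        ET.SubElement(item, q("content", "encoded")).text = post["content"]
        ET.SubElement(item, q("wp", "post_date")).text = post["date"].strftime(
            "%Y-%m-%d %H:%M:%S"
        )
        ET.SubElement(item, q("wp", "status")).text = "publish"
        ET.SubElement(item, q("wp", "post_type")).text = "post"
        for name in post["categories"]:
            slug = name.lower().replace(" ", "-")
            category = ET.SubElement(
                item, "category", {"domain": "category", "nicename": slug}
            )
            category.text = name
    return ET.tostring(rss, encoding="unicode")
```

ElementTree escapes the post HTML inside `content:encoded` rather than wrapping it in CDATA as WordPress’s own exports do; the two forms are equivalent to an XML parser.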
In the end I was able to recover 183 posts. Sean was ecstatic with the results, so the endeavor was well worth it.
Sean has also allowed me to post the dataset along with my code, in case anyone is interested in taking a look. The file is available here and in 7-Zip format. Please note that my code is uncommented and probably messy.





