Labor Day Link Fixup

Tagged as lisp

Written on 2017-09-04 11:18:00

Ancient History

For whatever reason, yesterday seemed a good day to decommission Linode 18032 that I bought way back in February of 2009 and had been using to run ever since. I bought in June 2007 in the wake of my first serious breakup (with Sonya Z) but ran it out of my parents basement up until getting the linode.

Ever since, though my knowledge increased, I didn't bother revamping the linode other than one reinstall with Ubuntu 12.04 when I first migrated my blog away from Wordpress to Coleslaw. I've been gradually moving my online presence towards and making plans to get back to blogging in earnest the past few weeks. But even though was made with (semi-workable) ansible roles, was still a big hand rolled minefield with years of digital detritus to boot. Nothing like a 3 day weekend to clean out your digital woodshed.

A Fresh Start

After deleting a lot of crap, I moved the statically hosted documents over to, added an nginx vhost, swapped the DNS over, and nuked the old linode. There was one last thing to do though. When I moved over to coleslaw from Wordpress years ago, I didn't bother fixing up the old links. So I have a bunch of ".post" documents in my blog's git repo that reference expired wordpress links to other posts (like "blog/?p=1234") instead of the slug coleslaw would assign the post.

I guess I didn't care enough at the time or was too focused on coleslaw itself to worry about "legacy content". But I found a wordpress XML backup and figured I might as well fix up the dead links today while I was at it. All while rolling my eyes at dealing with XML.

Link Fixup

Since coleslaw is designed to be backed by git as a content store, I started by grepping through the blog posts to get a list of all the old wordpress post IDs I linked to.

grep -Eo "\?p=(\d+)" *.post | cut -d "?" -f 2

Armed with that, I could tackle digging into the wordpress XML backup to map the post IDs to coleslaw generated titles. Shinmera has been writing stupid amounts of good lisp code over the past few years including Plump, an XML parser. I never completely gelled with CXML back in the day so I figured I'd give plump a go. The following assumes you have a decent lisp installed (I heartily recommend SBCL) and quicklisp.

(ql:quickload 'coleslaw)
(ql:quickload 'plump)

;; Make sure to use a pathname (the #p) for this, plump will treat a plain string as XML to be parsed, not a file.
(defvar *wp-xml* #p"/Users/brit/projects/linode-retirement/wordpress.2010-09-14.xml")

(defvar *doc* (plump:parse *wp-xml*))

I was actually stumped here for a bit because I used a string instead of a pathname as the argument to PARSE and it took me a few minutes querying and getting no results before I looked in the slime inspector and realized the doc hadn't parsed as expected. Once I had this though, it was pretty straightforward to build a hash of IDs to titles...

(defvar *posts* (plump-dom:get-elements-by-tag-name *doc* "item"))

;; For those wondering, :test #'equal is there so we can use string keys. Read Practical Common Lisp or google to learn more.
(defvar *post-id-map* (make-hash-table :test #'equal))

(defun extract (key node)
  (let ((value (first (plump-dom:get-elements-by-tag-name node key))))
    (plump:text value)))

;; Yes, I'm using a private coleslaw function here but I wrote coleslaw so ... uh ... do what I want! 
;; And in case you were wondering, handler-case is lisp's try/catch equivalent and I'm pretty much doing "rescue nil" here.
(defun fixed-title (title)
  (handler-case (coleslaw::slugify title)
    (simple-error () :junk)))
(loop for post in *posts*
      do (let ((title (extract "title" post))
               (id (extract "wp:post_id" post)))
           (setf (gethash id *post-id-map*) (fixed-title title))))

And now we're good to update the existing posts to have proper relative links to the content we actually want.

(ql:quickload 'cl-ppcre)

;; Can you tell I did all this in one repl session and just copy pasted it into this blog post?
(coleslaw::load-config "/Users/brit/projects/improvedmeans/")

;; Cute sidenote, since ppcre turns this regex into to a tree of closures which SBCL compiles, this can be #'disassemble-d.
;; Also, it's pretty fast.
(defvar *wp-url-regex*
  (cl-ppcre:create-scanner "w{0,3}?\.?redlinernotes\.com/blog/\\?p=(\\d+)"))

(defstruct counts
  (invalid-slug   0 :type fixnum)
  (post-not-found 0 :type fixnum)
  (updated-url    0 :type fixnum))

(defvar *results* (make-counts))

(defun invalid-slug-p (slug)
  (or (null slug)
      (every #'digit-char-p slug)))

(defun mismatch-p (slug)
  (let ((key (make-pathname :directory '(:relative "posts")
                            :name slug :type "html")))
    (null (gethash key coleslaw::*site*))))

(defun slug-for-match (text start end match-start match-end reg-start reg-end)
  (declare (ignore start end))
  (let* ((id (subseq text (aref reg-start 0) (aref reg-end 0)))
         (match (subseq text match-start match-end))
         (slug (gethash id *post-id-map*))
         (new-url (concatenate 'string "" slug ".html")))
    (cond ((invalid-slug-p slug)
           (incf (counts-invalid-slug *results*))
           (format t "Couldn't find valid slug for post id: ~d~%" id)
          ((mismatch-p slug)
           (incf (counts-post-not-found *results*))
           (format t "Not found in site content: ~A~%" slug)
           (incf (counts-updated-url *results*))
           (format t "Replacing ~A with ~A~%" match new-url)

(coleslaw::do-files (path "/Users/brit/projects/improvedmeans/" "post")
  (let* ((text (alexandria:read-file-into-string path))
         (updated-post (cl-ppcre:regex-replace-all *wp-url-regex* text #'slug-for-match)))
    (with-open-file (out path :direction :output :if-exists :supersede)
      (format out "~A~%" updated-post))))

(format t "~%~%===RESULTS===~%")
(dolist (type '(invalid-slug post-not-found updated-url))
  (let ((symb (alexandria:symbolicate 'counts- type)))
    (format t "~A: ~D Posts~%" type (funcall symb *results*))))

And there you go. A few links are still broken but things are generally improved, I'm down to 1 linode instead of 2, and I had a bit of fun on a lazy Sunday.





comments powered by Disqus

Unless otherwise credited all material Creative Commons License by Brit Butler