Publishing org-roam notes with pandoc

Using org-roam has helped me organize my thoughts and jot down whatever comes to mind in the moment, freeing my feeble mind to care for what's most important in the day. As I've been using it to take more notes, I'd like some of those notes (like this one) to become blog posts.

I've found pandoc to be a really good way to export, mainly for reasons of simplicity. The only issue with using org-roam and pandoc together is that org-roam's internal links don't translate to pandoc html pages. That's where pandoc's filters come into the picture.

Exporting a single page

To test exporting a single page with custom css, I've saved bettermotherfuckingwebsite.com's css declarations in a file aptly called style.css and use pandoc to export a single page.

$ pandoc -f org -t html5 --css=style.css --standalone note.org -o note.html

And lo and behold, it already looks the way I want it to. But a proper website needs a header and a footer, so we create two files, header.html and footer.html, and add them to the final page.

$ pandoc -f org -t html5 --css=style.css --include-before-body=header.html --include-after-body=footer.html --standalone note.org -o note.html

And now we have each page following a proper website template with header, body, and footer.

Sprinkle of internal links

To include links properly, I'll be using pandoc's lua filters to look up the link in org-roam's sqlite database and modify pandoc's AST to replace the id:xxx link with a proper href.

Before I can decide what filter to write, I need to see pandoc's generated AST.

$ pandoc --standalone -t native note.org

which gives me the following output

Pandoc (Meta {unMeta = fromList [("title",MetaInlines [Str "The",Space,Str "Grand",Space,Str "Unified",Space,Str "Theory",Space,Str "of",Space,Str "Everything"])]})
[Header 1 ("setting-up-org-roam",[],[]) [Str "Setting",Space,Str "up",Space,Code ("",[],[]) "org-roam"]
...

What we're interested in is

Link ("",[],[]) [Str "school"] ("id:e0e3eed4-d1ec-4e76-9244-cfbf22ba5a6f","")

According to the module documentation and Text.Pandoc.Definition, this is a Link element with no attributes, link text of "school", and a target of "id:…". As a first experiment, a Lua filter can replace every link with its raw target string:

function Link(elem)
   return pandoc.Str(elem.target)
end

Switching gears to python

org-roam stores note references with IDs in an sqlite database that by default sits under $HOME/.emacs.d/org-roam.db. To access this, I'd need the sql extension for lua which is not installed on many systems. Python has both json and sqlite as part of its batteries-included standard library, so I'll use that instead.

We could use pandoc's JSON API and write a filter that parses, modifies, and prints JSON ourselves. But there's a better way! The pandocfilters and panflute modules for Python take care of the plumbing for us. Both are available on PyPI, which means they can be installed easily with pip. I've chosen to work with panflute for no particular reason.
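For a sense of the plumbing those modules hide, here is a minimal sketch of the raw JSON approach using only the standard library. It assumes the usual JSON shape of a Link node, {"t": "Link", "c": [attributes, inlines, [url, title]]}. The names rewrite_links, run_filter, and demo_resolve are made up for illustration, and the sample document (including its API version) is hand-written rather than real pandoc output.

```python
import io
import json


def rewrite_links(node, resolve):
    """Walk a pandoc JSON AST in place and rewrite the target of every Link.

    `resolve` is a hypothetical callback: it receives the current URL and
    returns the replacement URL, or None to leave the link alone.
    """
    if isinstance(node, list):
        for child in node:
            rewrite_links(child, resolve)
    elif isinstance(node, dict):
        if node.get("t") == "Link":
            # A Link's content is [attributes, inline text, [url, title]].
            target = node["c"][2]
            new_url = resolve(target[0])
            if new_url is not None:
                target[0] = new_url
        for child in node.values():
            rewrite_links(child, resolve)
    return node


def run_filter(stream_in, stream_out, resolve):
    """Read a JSON document, rewrite its links, and write it back out."""
    doc = json.load(stream_in)
    rewrite_links(doc, resolve)
    json.dump(doc, stream_out)


def demo_resolve(url):
    # Placeholder mapping; the real lookup needs the org-roam database.
    return "school.html" if url.startswith("id:") else None


# Hand-written stand-in for what pandoc would send on stdin.
sample = {
    "pandoc-api-version": [1, 22],
    "meta": {},
    "blocks": [{"t": "Para", "c": [
        {"t": "Link", "c": [["", [], []],
                            [{"t": "Str", "c": "school"}],
                            ["id:e0e3eed4-d1ec-4e76-9244-cfbf22ba5a6f", ""]]},
    ]}],
}
out = io.StringIO()
run_filter(io.StringIO(json.dumps(sample)), out, demo_resolve)
result = json.loads(out.getvalue())
print(result["blocks"][0]["c"][0]["c"][2][0])
```

As an actual filter, the last step would be run_filter(sys.stdin, sys.stdout, resolve), since pandoc pipes the document through the filter's standard input and output.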

The filters can be used with the --filter argument.

$ pandoc -f org -t html5 --standalone --filter myfilter.py note.org -o note.html

So the final command will be

$ pandoc -f org -t html5 --css=style.css --include-before-body=header.html --include-after-body=footer.html --standalone --filter myfilter.py note.org -o note.html

Filtering effectively

I've named the filter sanitize_links.py.

#!/usr/bin/env python3

import os
import sqlite3
import urllib.parse

import panflute as pf

#### CHANGE THESE ####
ORG_ROAM_DB_PATH = "~/.emacs.d/org-roam.db"
#### END CHANGE ####

db = None

def sanitize_link(elem, doc):
    # Only rewrite org-roam "id:" links; leave everything else untouched.
    if not isinstance(elem, pf.Link):
        return None

    if not elem.url.startswith("id:"):
        return None

    file_id = elem.url[len("id:"):]

    # The database stores values wrapped in double quotes, so quote the key too.
    cur = db.cursor()
    cur.execute("select file from nodes where id = ?", (f'"{file_id}"',))
    data = cur.fetchone()
    if data is None:
        # Unknown ID: keep the link as it is rather than crash.
        return None

    # Strip the surrounding quotes, then keep only the file name without ".org".
    file_name = urllib.parse.quote(os.path.splitext(os.path.basename(data[0][1:-1]))[0])

    elem.url = f"{file_name}.html"
    return elem

def main(doc=None):
    return pf.run_filter(sanitize_link, doc=doc)

if __name__ == "__main__":
    # expanduser resolves the leading "~"; abspath alone would not.
    db = sqlite3.connect(os.path.expanduser(ORG_ROAM_DB_PATH))
    main()
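The lookup inside the filter can be tried without pandoc or Emacs by pointing it at a throwaway in-memory database. This is only a sketch: html_name_for is a hypothetical helper, the row is made up, and the table layout (a nodes table whose id and file values are wrapped in double quotes) is taken from the query above rather than from org-roam's documentation.

```python
import os
import sqlite3
import urllib.parse


def html_name_for(conn, url):
    """Return the HTML file name for an `id:` link, or None if unresolved."""
    if not url.startswith("id:"):
        return None
    node_id = url[len("id:"):]
    # Values are stored wrapped in double quotes, so the key is quoted too.
    row = conn.execute(
        "select file from nodes where id = ?", (f'"{node_id}"',)
    ).fetchone()
    if row is None:
        return None
    path = row[0].strip('"')
    stem = os.path.splitext(os.path.basename(path))[0]
    return urllib.parse.quote(stem) + ".html"


# Throwaway database with a made-up row, standing in for org-roam.db.
conn = sqlite3.connect(":memory:")
conn.execute("create table nodes (id, file, title)")
conn.execute(
    "insert into nodes values (?, ?, ?)",
    ('"e0e3eed4-d1ec-4e76-9244-cfbf22ba5a6f"',
     '"/home/me/notes/20210101120000-school notes.org"',
     '"school"'),
)

print(html_name_for(conn, "id:e0e3eed4-d1ec-4e76-9244-cfbf22ba5a6f"))
print(html_name_for(conn, "id:does-not-exist"))
print(html_name_for(conn, "https://example.com"))
```

Passing the ID as a query parameter instead of formatting it into the SQL string keeps an odd link target from breaking the query.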

A note on versions!

I'm using Ubuntu 20.04 LTS, which means some of the packages are outdated. Older pandoc versions don't seem to have great error messages, which makes debugging difficult. Since updating pandoc with the packages from its release page, I've had better luck.

It's worth noting that the python3-pandocfilters package in the repos is also outdated, so installing with pip is recommended.

Publishing the right files

Some of my notes are to be published, but some I'd like to keep private. To do that, I have set up my notes to have a tag of "publish" for ones I want to, well, publish, by adding it to filetags.

#+filetags: publish

Then my build.sh script filters files that have a publish tag. Here's the entirety of my build.sh script. A Makefile would be more appropriate.

#!/bin/sh

CSS=org.css

mkdir -p html/
rm -f html/*

for note in $(grep -iRE '^#\+filetags:.*publish' --color=never --files-with-matches); do
    echo "processing ${note}"
    pandoc -s -t html5 -f org --css="$CSS" --include-before-body=header.html --include-after-body=footer.html --filter sanitize_links.py "$note" -o html/"$(echo "$note" | sed -e 's/\.org$/.html/')"
done

index_file=$(grep -iR -l 'grand unified theory of everything' html | head -n 1)
echo "setting index file"
cp "$index_file" html/index.html

echo "copying $CSS"
cp "$CSS" html/

The output files go to the html directory, and I publish by simply rsync'ing them to my public directory. Here's the one-liner for upload.sh.

#!/bin/sh
rsync --progress html/* server:/srv/www/

Now it's time to add the publish tag to this file! With this setup, every time I add a new post, all I need to do is add a link to it to the homepage and run ./build.sh && ./upload.sh.

Footnotes

I change the title of my notes frequently, which means the filename and title go out of sync. To prevent this, I have come to appreciate having date and IDs as filenames. Here's a one-liner that converts the default filenames to "<date>-<id>.org" format.

for f in *.org; do mv "$f" "$(echo "$f" | grep -Po '^\d+')-$(grep -m1 ID "$f" | tr -s '\t ' ' ' | cut -d' ' -f2).org"; done
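To preview what the rename would produce before touching any files, the same naming rule can be sketched as a pure Python function. new_filename is a hypothetical helper, and it assumes the file name starts with the date digits and that the note has a property line of the form ":ID: <id>".

```python
import re


def new_filename(old_name, text):
    """Build "<date>-<id>.org" from a file name and the note's contents."""
    date = re.match(r"\d+", old_name)
    node_id = re.search(r"^:ID:\s+(\S+)", text, re.MULTILINE)
    if date is None or node_id is None:
        return None  # leave files that don't fit the pattern alone
    return f"{date.group(0)}-{node_id.group(1)}.org"


# Made-up note contents for illustration.
note = (
    ":PROPERTIES:\n"
    ":ID:       e0e3eed4-d1ec-4e76-9244-cfbf22ba5a6f\n"
    ":END:\n"
    "#+title: school\n"
)
print(new_filename("20210101120000-school.org", note))
```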

Having said that, my notes are now only named by date and time to make it easier for org-roam to generate filenames.