Me making a blog

Writing practices the ability to collect and present evidence, so I wanted to make a blog. Timidity about posting anything online drove me to write my own static site generator for the site, in hopes I'd never finish it and hence never have to write anything.

I've had difficulties keeping previous blogs maintained using Jekyll, Hugo and Sphinx. While these are great tools used by many blogs, they didn't seem to fit me, and provided much more configuration and power than I really needed. I also failed to make writing a routine at the time, so when I'd come back months later, I'd spend time and energy figuring out how to rebuild the blog.

Overview

The general concept: take an input directory of content and produce the contents of a website in another directory.

The static site generators I've run across usually take Markdown files as input. Sphinx was the only exception I tried; it uses reStructuredText, though it also supports Markdown.

The builder needs to (1) recursively traverse a directory and (2) read and write files. The desire to minimize dependencies drove this project. It boiled down to:

This blog is generated from Markdown files using a homemade Markdown parser and static site generator. If at some point I no longer have a C++ compiler available, I can port all my code to another language, which I've done for other hobby projects before. Writing everything myself provided a great learning experience about parsing and working with graph structures.
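The two builder requirements above (walk a directory, map inputs to outputs) can be sketched with nothing but the C++17 standard library. This is a minimal illustration, not the generator's actual code: `collectMarkdownFiles` and `outputPathFor` are hypothetical names, and the `.md` to `.html` mapping is an assumption.

```cpp
#include <filesystem>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: recursively gathers every ".md" file under inputRoot.
std::vector<fs::path> collectMarkdownFiles(const fs::path& inputRoot) {
    std::vector<fs::path> files;
    for (const auto& entry : fs::recursive_directory_iterator(inputRoot)) {
        if (entry.is_regular_file() && entry.path().extension() == ".md") {
            files.push_back(entry.path());
        }
    }
    return files;
}

// Hypothetical helper: mirrors a source path into the output tree and
// swaps the extension (assumed here to be ".md" -> ".html").
fs::path outputPathFor(const fs::path& inputRoot,
                       const fs::path& outputRoot,
                       const fs::path& source) {
    fs::path target = outputRoot / source.lexically_relative(inputRoot);
    target.replace_extension(".html");
    return target;
}
```

With these, `outputPathFor("content", "site", "content/posts/a.md")` yields `site/posts/a.html`, so the output tree keeps the same layout as the input tree.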

While my Markdown parser isn't an industrial-grade parser for general use, it was edifying to write. I learned a lot of new things about Markdown that I didn't know before, like how Link Reference Definitions can go anywhere. It's also been exciting to learn more about modern HTML and CSS since I'm not a web dev.

Doing the same project multiple times helps teach me how to structure code, so I wrote parts of the blog generator three times, each version more advanced than the last. The first, in Rust, translated a single Markdown file to HTML and included a simple syntax highlighter for Ada. The second, in Ada, converted entire trees of Markdown files into HTML. The third and current version is in C++.

Neither the Rust nor the Ada version handled nested blocks. Ignoring nested quote blocks and nested lists, and limiting emphasis delimiters to single or double * or _, severely limits the complexity of Markdown inline parsing, though I eventually broke those barriers in my current generator. If you're looking for a simple project, a simplified Markdown parser written from scratch with these limitations seems doable quickly.

The Many Flavors of Markdown

Markdown libraries exist for a variety of languages. However, using one would be too easy and then I'd have to think up some content.

Markdown evolved beyond the original Daring Fireball Markdown. Being initially released without a formal specification, being extremely useful, and being somewhat easy to implement means that it seems like there are more flavors of Markdown than flavors in a Coca-Cola Freestyle machine.

A CommonMark spec exists which helped me understand Markdown's concepts in a more formal way. I found this spec very readable, and its terminology and plethora of examples helped me think much more clearly about parsing. It also describes a parsing strategy, though I used a different method. GitHub Flavored Markdown is another common variant and also has a spec.

Block Parsing Approach

The CommonMark spec describes a document as a sequence of characters, which form lines, which in turn build up "container blocks" (like lists) and "leaf blocks." That seems straightforward enough: group characters into lines and then successively feed those lines into the "parse markdown" machine until it has digested them into a tree of blocks.

This text:

    > a quote
    > > a quote inside a quote
    > > > a pathological quote inside a quote inside a quote

turns into:

    BlockQuote
      Paragraph
        String "a quote"
      BlockQuote
        Paragraph
          String "a quote inside a quote"
        BlockQuote
          Paragraph
            String "a pathological quote inside a quote inside a quote"

The parser keeps a cursor into the block tree, indicating which blocks are open. Initially I tried a literal cursor (i.e. a pointer) into the tree, though I eventually simplified this to a stack of shared pointers to the open blocks, ending at the lowest open block, similar to a call stack.

OPEN: [Document][BlockQuote][BlockQuote][BlockQuote][Paragraph]

Line stack

When the parser receives a line, it can close any number of these blocks, so I keep a separate stack of the current path of the line. As long as this subset of blocks for the line matches that of the opened block stack then the line is being parsed down the tree to the current document location. If there's a difference, the stack of opened blocks gets trimmed so it agrees with the line stack.
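The trimming step can be sketched as a shared-prefix comparison between the two stacks. This is a simplified illustration under a few assumptions: the stacks hold plain `BlockType` values rather than shared pointers to blocks, and `trimOpenBlocks` is a hypothetical name, not a function from the generator.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

enum class BlockType { Document, BlockQuote, Paragraph, Heading };

// Hypothetical helper: keeps only the leading part of `open` that agrees
// with `line`, and returns how many open blocks were closed.
std::size_t trimOpenBlocks(std::vector<BlockType>& open,
                           const std::vector<BlockType>& line) {
    const std::size_t limit = std::min(open.size(), line.size());
    std::size_t shared = 0;
    while (shared < limit && open[shared] == line[shared]) {
        ++shared;
    }
    const std::size_t closed = open.size() - shared;
    open.resize(shared);  // everything past the shared prefix is closed
    return closed;
}
```

For an open stack of Document, three BlockQuotes and a Paragraph, and a line stack of Document, three BlockQuotes and a Heading, the shared prefix is four blocks long, so only the Paragraph is closed.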

    OPEN: [Document][BlockQuote][BlockQuote][BlockQuote][Paragraph]
    LINE: [Document]
    "> > ## A HEADING INSIDE THE QUOTE! MWAHAHAHA" IS A BlockQuote Line
    OPEN: [Document][BlockQuote][BlockQuote][BlockQuote][Paragraph]
    LINE: [Document][BlockQuote]
    "> ## A HEADING INSIDE THE QUOTE! MWAHAHAHA" IS A BlockQuote Line
    OPEN: [Document][BlockQuote][BlockQuote][BlockQuote][Paragraph]
    LINE: [Document][BlockQuote][BlockQuote]
    "## A HEADING INSIDE THE QUOTE! MWAHAHAHA" IS A BlockQuote Line
    OPEN: [Document][BlockQuote][BlockQuote][BlockQuote][Paragraph]
    LINE: [Document][BlockQuote][BlockQuote][BlockQuote]
    "A HEADING INSIDE THE QUOTE! MWAHAHAHA" IS A AtxHeader2
    Adding child: "AtxHeader2" with "A HEADING INSIDE THE QUOTE! MWAHAHAHA" to BlockQuote
    OPEN: [Document][BlockQuote][BlockQuote][BlockQuote]
    LINE: [Document][BlockQuote][BlockQuote][BlockQuote]
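The marker-peeling visible in the trace above can be sketched as a small loop. This is an illustration only: `peelBlockQuotes` is a hypothetical name, and it assumes the CommonMark convention that a block quote marker is a `>` preceded by at most three spaces and optionally followed by one space.

```cpp
#include <cstddef>
#include <string_view>
#include <utility>

// Hypothetical helper: strips leading block quote markers from a line and
// returns {number of markers removed, remaining text}.
std::pair<int, std::string_view> peelBlockQuotes(std::string_view line) {
    int depth = 0;
    while (true) {
        std::size_t i = 0;
        // Up to three spaces of indentation may precede the marker.
        while (i < line.size() && i < 3 && line[i] == ' ') {
            ++i;
        }
        if (i >= line.size() || line[i] != '>') {
            break;  // no further marker; leave the rest untouched
        }
        ++i;
        // A single space after '>' belongs to the marker.
        if (i < line.size() && line[i] == ' ') {
            ++i;
        }
        line.remove_prefix(i);
        ++depth;
    }
    return {depth, line};
}
```

Each peeled marker corresponds to one BlockQuote pushed onto the line stack; whatever text remains is then classified as a heading, paragraph, and so on.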

This strategy works well, especially considering that a line can have nested elements, like a block quote containing a heading. However, nested bullets proved tricky to deal with and required special handling with this method.

Visitors

My parser builds an in-memory tree representation of blocks which can be processed by visitor types. This allows interesting uses, like building a list of references on the page and generating a page outline with jumps to headings. The HTML generation itself is simply a visitor that produces HTML elements at various points when entering, leaving, and visiting block children.

It was overkill to use a concept for visitors, but I haven't gotten to play with them much, so it was a fun detour to avoid finishing the blog generator and writing content.

    template <typename T>
    concept BlockVisitor = requires(T t, std::shared_ptr<Block>& block) {
        { t.enter(block) };
        { t.exit(block) };
        { t.visit(block) } -> std::convertible_to<VisitResult>;
    };
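To make the enter/visit/exit shape concrete, here is a small sketch of a visitor that collects heading text for a page outline. Everything in it is assumed for illustration: the `Block` layout, the `VisitResult` values, the `walk` function, and `OutlineCollector` are hypothetical, and it uses a plain template instead of the concept so it also compiles as C++17.

```cpp
#include <memory>
#include <string>
#include <vector>

enum class BlockType { Document, BlockQuote, Heading, Paragraph };
enum class VisitResult { Continue, SkipChildren };  // hypothetical values

// Hypothetical block layout: a type, some text, and child blocks.
struct Block {
    BlockType type;
    std::string text;
    std::vector<std::shared_ptr<Block>> children;
};

// Depth-first walk: enter, visit, then children (unless skipped), then exit.
template <typename Visitor>
void walk(const std::shared_ptr<Block>& block, Visitor& visitor) {
    visitor.enter(block);
    if (visitor.visit(block) == VisitResult::Continue) {
        for (const auto& child : block->children) {
            walk(child, visitor);
        }
    }
    visitor.exit(block);
}

// Collects the text of every heading in document order.
struct OutlineCollector {
    std::vector<std::string> entries;

    void enter(const std::shared_ptr<Block>&) {}  // nothing to do on entry
    void exit(const std::shared_ptr<Block>&) {}   // nothing to do on exit
    VisitResult visit(const std::shared_ptr<Block>& block) {
        if (block->type == BlockType::Heading) {
            entries.push_back(block->text);
            return VisitResult::SkipChildren;
        }
        return VisitResult::Continue;
    }
};
```

An HTML-producing visitor would follow the same pattern, emitting an opening tag in `enter` and a closing tag in `exit`.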

Difficulties

Writers don't often discuss difficulties, but I'd like to mention them. I got stumped thinking about block quotes, lists and list items for a while. It seems like a normal lex and parse issue, where the tokens are lines and the various block types are the syntactical elements you need to parse. The idea of the "line stack" originated separately from the "open block stack" as a way to resolve the difficulties of dealing with block quotes, and then special handling for bullets was introduced to allow continuation lines and nested bullet lists.

    // Remove open blocks which don't match the given block stack.
    void Context::closeUnrelatedBlocks(BlockType lineType)
    {
        // Continuation line.
        if (isTopBlock(BlockType::ListItem) && lineType != BlockType::BulletedList) {
            return;
        }

        // Close all currently open bullets to the right of this bullet's indent.
        if (lineType == BlockType::BulletedList) {
            const auto indent = lineIndent;
            auto furtherBullet = std::find_if(openBlocks.begin(), openBlocks.end(),
                [indent](const auto& block) -> bool {
                    return block->type == BlockType::BulletedList && block->indent > indent;
                });
            if (furtherBullet != openBlocks.end()) {
                // Close bullets to the right.
                openBlocks.resize(static_cast<uint64_t>(std::distance(openBlocks.begin(), furtherBullet) - 1));

                // Reset line stack to this location.
                clearLineStack();
                for (const auto& block : openBlocks) {
                    pushLineStack(block->type);
                }
            }
            return;
        }
        //...

There's a certain point in a program where you try a bunch of things that don't work, and then slowly it seems that little decisions start to magically make things come together or drift apart. The first few ideas I have about architecture usually end up convoluted or unclear. Only after a few cases get implemented, do the patterns emerge for me, and I know whether to keep or change the current program direction. This used to trouble me, and yet it feels comforting since I now expect it and try to enjoy the ride. Only by failing or getting stuck a bunch of times do I finally start to see through the trees to the finish line.

Inline formatting

CommonMark's description of inline parsing is quite interesting: it uses a delimiter stack and a doubly-linked list to find the appropriate insert locations.

This drives the question: Do I want to make a blog, or do I want to implement the CommonMark spec?

Inline formatting is more difficult than you'd expect. I started implementing the algorithm from CommonMark, but went with a simpler idea using three passes: the first parses delimiter groups, the second matches delimiters, and the third prints the line. It's a lot more work, but it was a lot easier to reason through.
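A stripped-down version of the three-pass idea might look like the sketch below. It is not the generator's code: the names (`Token`, `tokenize`, `matchDelimiters`, `render`, `formatInline`) are hypothetical, it only pairs runs of one or two identical `*` or `_` characters, and it ignores CommonMark's flanking rules and HTML escaping.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

struct Token {
    std::string text;
    bool isDelimiter = false;
    int match = -1;      // index of the paired token, or -1 if unmatched
    bool opens = false;  // true for the opening half of a pair
};

// Pass 1: split the line into plain text and runs of '*' or '_'.
std::vector<Token> tokenize(std::string_view line) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < line.size()) {
        const char c = line[i];
        const bool delim = (c == '*' || c == '_');
        std::size_t j = i;
        if (delim) {
            while (j < line.size() && line[j] == c) ++j;
        } else {
            while (j < line.size() && line[j] != '*' && line[j] != '_') ++j;
        }
        tokens.push_back({std::string(line.substr(i, j - i)), delim});
        i = j;
    }
    return tokens;
}

// Pass 2: pair each run with the nearest earlier unmatched identical run.
void matchDelimiters(std::vector<Token>& tokens) {
    std::vector<int> open;  // indices of unmatched delimiter runs
    for (int i = 0; i < static_cast<int>(tokens.size()); ++i) {
        if (!tokens[i].isDelimiter || tokens[i].text.size() > 2) continue;
        int found = -1;
        for (int k = static_cast<int>(open.size()) - 1; k >= 0; --k) {
            if (tokens[open[k]].text == tokens[i].text) { found = k; break; }
        }
        if (found < 0) { open.push_back(i); continue; }
        const int opener = open[found];
        tokens[opener].match = i;
        tokens[opener].opens = true;
        tokens[i].match = opener;
        // Drop the opener and anything opened after it, so pairs never cross.
        open.resize(static_cast<std::size_t>(found));
    }
}

// Pass 3: print text as-is, matched runs as tags, unmatched runs literally.
std::string render(const std::vector<Token>& tokens) {
    std::string html;
    for (const Token& t : tokens) {
        if (!t.isDelimiter || t.match < 0) { html += t.text; continue; }
        html += t.opens ? "<" : "</";
        html += (t.text.size() == 2) ? "strong" : "em";
        html += ">";
    }
    return html;
}

std::string formatInline(std::string_view line) {
    std::vector<Token> tokens = tokenize(line);
    matchDelimiters(tokens);
    return render(tokens);
}
```

Keeping the passes separate means each one can be inspected on its own: the token list after pass 1, the pairings after pass 2, and only then the output.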

Images

I got basic inline formatting and links set up, but wanted to add images. I already have direct links like []() and named links like [][] set up, so I needed to handle the ! beforehand for images.

Here's an example.

This is an image: ![image](image_url)

I realized I can treat ! like a delimiter in the inline formatter and then fall through into the link handling code, treating ! like a flag to emit an image or a plain link. The fall-through results in double printing the [ so I set a flag and try again on the next round trip. After fixing an issue with unmatched terminating brackets not printing correctly, it's time to move on.
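The "! is just a flag in front of a link" idea can be shown in isolation. The sketch below is a standalone illustration, not the delimiter-based code described above: `emitLinkOrImage` is a hypothetical name, it only handles the direct `[label](target)` form, and it does no escaping of the label or target.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: tries to read "[label](target)" or "![label](target)"
// starting at `pos`. On success it appends HTML to `out`, advances `pos`
// past the closing ')', and returns true. On failure nothing is modified.
bool emitLinkOrImage(std::string_view line, std::size_t& pos, std::string& out) {
    std::size_t i = pos;
    const bool isImage = i < line.size() && line[i] == '!';
    if (isImage) ++i;  // the '!' only flips a flag; the rest is link parsing
    if (i >= line.size() || line[i] != '[') return false;

    const std::size_t labelEnd = line.find(']', i + 1);
    if (labelEnd == std::string_view::npos) return false;
    if (labelEnd + 1 >= line.size() || line[labelEnd + 1] != '(') return false;

    const std::size_t targetEnd = line.find(')', labelEnd + 2);
    if (targetEnd == std::string_view::npos) return false;

    const std::string label(line.substr(i + 1, labelEnd - i - 1));
    const std::string target(line.substr(labelEnd + 2, targetEnd - labelEnd - 2));

    out += isImage ? ("<img src=\"" + target + "\" alt=\"" + label + "\">")
                   : ("<a href=\"" + target + "\">" + label + "</a>");
    pos = targetEnd + 1;
    return true;
}
```

When the function returns false, the caller can print the character at `pos` literally and move on, which is one way to keep unmatched brackets visible in the output.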

Markdown extras

Having my own Markdown parser means that I can customize it as I want.

Metadata ("Front Matter")

There's an optional snippet of metadata allowed in my Markdown files. It's not much, but it allows me to customize each post, and lets me do things like add a generated table of contents and list of references.

    ---
    title: Me making a blog
    description: Writing a static site generator and parsing Markdown!
    publish: true
    tags: blog,c++
    bibliography: true
    date: 23 March, 2025
    ---

This fit really well into the visitor interface I set up for blocks. If the document is empty, it creates a "Metadata" block that stores all strings between a starting and ending "thematic break" (the fancy name for ---), and then the metadata collector parses each string child of the Metadata block.
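Parsing the collected strings can be as simple as splitting each one at its first colon. The sketch below assumes that format and works directly on raw lines rather than on a Metadata block; `parseFrontMatter` and `trim` are hypothetical names.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Removes leading and trailing spaces and tabs.
std::string trim(const std::string& s) {
    const std::size_t first = s.find_first_not_of(" \t");
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: reads "key: value" lines between an opening "---" on
// the first line and the next "---". Lines without a colon are skipped. If
// the closing "---" is missing, every remaining line is treated as metadata.
std::map<std::string, std::string> parseFrontMatter(const std::vector<std::string>& lines) {
    std::map<std::string, std::string> metadata;
    if (lines.empty() || trim(lines[0]) != "---") return metadata;
    for (std::size_t i = 1; i < lines.size(); ++i) {
        if (trim(lines[i]) == "---") break;
        const std::size_t colon = lines[i].find(':');
        if (colon == std::string::npos) continue;
        // Split at the first colon only, so values may contain colons.
        metadata[trim(lines[i].substr(0, colon))] = trim(lines[i].substr(colon + 1));
    }
    return metadata;
}
```

Values stay as strings here; interpreting `publish: true` as a boolean or splitting `tags` on commas would be a separate step.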

Syntax highlighting

With a fenced code block, the info string after the first fence can select the appropriate highlighter. A simple lexer for each language, tied into some common backend functions for highlighting element types, is enough to get syntax highlighting. There's no automatic language detection, and I have to write support for every language I want to use, but there's also no JavaScript or dependencies involved.

type Example is new Integer;

I made a general C family formatter so I can swap out keywords to get most of the way to decent formatting for languages. It will need more improvements later, but it works well enough for now.

    package main

    import "core:fmt"

    main :: proc() {
        // odin code
    }

    void Foo::example() const {
        // ...
    }
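The keyword-swapping idea can be sketched as one function that takes the keyword set as a parameter. This is a deliberately tiny illustration, not the formatter described above: `highlightKeywords` and the `kw` CSS class are hypothetical, and it knows nothing about comments, strings, or numeric literals.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <string_view>

// Hypothetical helper: wraps every identifier found in `keywords` in a
// <span class="kw"> and escapes the HTML special characters it passes through.
std::string highlightKeywords(std::string_view code, const std::set<std::string>& keywords) {
    const auto isWordChar = [](char ch) {
        return std::isalnum(static_cast<unsigned char>(ch)) != 0 || ch == '_';
    };
    std::string html;
    std::size_t i = 0;
    while (i < code.size()) {
        const char c = code[i];
        if (isWordChar(c) && std::isdigit(static_cast<unsigned char>(c)) == 0) {
            // Read a whole identifier, then decide whether it is a keyword.
            std::size_t j = i;
            while (j < code.size() && isWordChar(code[j])) ++j;
            const std::string word(code.substr(i, j - i));
            if (keywords.count(word) > 0) {
                html += "<span class=\"kw\">" + word + "</span>";
            } else {
                html += word;
            }
            i = j;
        } else {
            switch (c) {
                case '<': html += "&lt;"; break;
                case '>': html += "&gt;"; break;
                case '&': html += "&amp;"; break;
                default: html += c; break;
            }
            ++i;
        }
    }
    return html;
}
```

Supporting another language in this sketch means passing a different keyword set, which could be looked up from the fenced code block's info string.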

Running Locally

Docusaurus lets you continuously host a site locally while editing and see the updates very quickly, so I wanted a quick way to iterate with my blog generator as well.

Implementing an HTTP server at some point sounds fun, but for now, I view the site locally with Python's built-in http.server. This isn't recommended for production, but Python is often available and it's a simple way to test things locally.

python3 -m http.server

Reducing file saves

The build writes all outputs initially to memory. It then reads each output location and only writes to disk if the output has changed.
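A compare-before-write step like this can be sketched in a few lines of standard C++. The function name `writeIfChanged` is hypothetical, and comparing whole file contents in memory is an assumption about how the check is done.

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: writes `contents` to `path` only when the file is
// missing or differs. Returns true if the file was written, false if it was
// already up to date or the write failed.
bool writeIfChanged(const std::filesystem::path& path, const std::string& contents) {
    {
        std::ifstream existing(path, std::ios::binary);
        if (existing) {
            const std::string current((std::istreambuf_iterator<char>(existing)),
                                      std::istreambuf_iterator<char>());
            if (current == contents) {
                return false;  // identical output: leave the file untouched
            }
        }
    }
    if (path.has_parent_path()) {
        std::filesystem::create_directories(path.parent_path());
    }
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out.write(contents.data(), static_cast<std::streamsize>(contents.size()));
    out.flush();
    return static_cast<bool>(out);
}
```

Skipping identical writes also leaves file modification times alone, which helps anything downstream that watches the output directory.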

The gist for continuous local updates is to monitor the source directory of the blog and when I make a file change to rerun the build step.

I've used inotify on Linux for this in the past, but on Windows I needed to use DirectoryChangeNotifications.

This worked quite well with my check-before-write step, since the build only writes files whose output actually changed. A neat side effect is that I can move link reference definitions around within my Markdown files; that triggers a rebuild, but nothing gets rewritten since the output doesn't change.

Continuous mode provided quick feedback when doing style updates, simplifying the process of testing out coloring changes.

Summary

This was a great learning experience, and I enjoyed the project much more than I expected.

References