Rewriting my site generator

Lately I've been rewriting the site generator behind my blog. The old one was really finicky, had a terrible parsing strategy, and wasn't what I actually wanted long term. Because of how I wrote it, adding new features was getting hard. Plus, every time I started it up, NPM gave me a bunch of security vulnerability alerts since my XML parser was outdated. But the next version of the parser had a completely different API for parsing and traversing XML documents, so upgrading would have required a rewrite just to get rid of those printouts. Additionally, I've been rewriting all my posts from Markdown to LaTeX so that I can print them out nicely, and I didn't want to keep maintaining two versions of my blog (one for print and one for web).
Plus, I've been looking at moving to a different hosting solution to get page view counts without having to pay extra every month (CloudFlare's paid analytics gives way more detail than I need, and I don't want to track info I don't need). That said, I'm still planning on using some of CloudFlare's solutions, just with a different host. With some of the other hosting solutions I'm looking at, I'd have to build my own VM images and get off of CloudFlare's automatic build process, so I needed something simpler than what I currently have. I wanted the ability to get a single executable so that it'd be easy to make a slim VM for simply creating the output. Also, I really wanted to simplify a lot of the features I had, which would let me start changing the page layout (such as adding a sidebar or search). To do that, I'd first need to rewrite my site generator and change how everything works.
So, I rewrote my site generator.
My first step was to figure out what sort of document format I wanted. LaTeX is great, but I was struggling to find a LaTeX to HTML converter/generator that I liked. Also, it's very much designed for PDFs and print, so a lot of concepts don't translate well (such as footnotes). I took a look at Typst, but they're only doing PDFs right now and I don't feel like waiting for HTML support. I took a look at DocBook, and it's a very messy system - plus it's all XML based, and I want to get rid of the little XML I have, not add to it. After being dissatisfied with my options, I decided to write my own. Since I was already porting everything to LaTeX, I decided I'd have it be LaTeX-like and only worry about the subset I need. For anything that's PDF/print specific, I decided to create my own LaTeX macro wrapper. I could then key off of my wrapper to make a more web-native experience (like having an inline link instead of a footnote to a URL).
This plan actually worked quite well. In fact, this current post is the first post written solely in the new generator. Also, by the time this post is published, all of my other posts will have been converted over. That's not to say that the project was without its challenges (there were plenty). But it works - at least for my use cases. To celebrate, I wanted to talk about my experience.
For my technologies, I wanted something that would compile to a single executable that I could run on a server. I also wanted to still have LaTeX math get precompiled instead of relying solely on a client-side renderer. That combination greatly restricted my options. TypeScript and JavaScript have the best LaTeX math rendering engines which aren't a full LaTeX implementation (mostly because math people want to share math on web pages and not just PDFs). Most of them are client-side focused, but a fair number have been ported to run on NodeJS, including MathJax, KaTeX, and LaTeX.js. Unfortunately, I've had some serious issues trying to bundle NodeJS apps into a single executable.
Fortunately, NodeJS isn't the only option when it comes to server-side JavaScript and TypeScript. Deno and Bun exist as well. I've used Bun before, and it's really nice. Plus, it supports building a single executable - and not only that, it supports cross compilation as well[1]! And, unlike Deno 1.0 which did a lot of weird things, Bun really strives to be NodeJS compatible. So that's what I chose (sorry Deno 2.0, but I just had enough issues with Deno 1.0 that I went with Bun this time).
As for parsing, I just built a simple LL(1) parser that mostly handles correct syntax (LaTeX parsing is really hard[2]). I only worried about the subset I was dealing with, and added in a lot of special exceptions to the parser (and sometimes even the tokenizer) just to get what I needed parsed. Since I have over 20 blog posts by now - some of which are really long and really complex - I had a pretty good idea of what sort of functionality I needed and didn't need. I also made it easy enough to modify in the future to add additional commands/macros/environments. Plus, I did add a lot of asserts that would crash if I violated them, so once I hit the point that I could process my existing posts, my test suite just became building everything all over again.
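To give a rough idea of what "LL(1)" means here, this is a minimal sketch of a parser that decides what to do by looking only at the next token. It is not the generator's actual code: the token and node shapes (`LatexToken`, `DocNode`) and the restriction to plain text plus `\begin`/`\end` environments are assumptions made purely for illustration.

```typescript
// Hypothetical token and node shapes for a tiny LaTeX-like subset.
type LatexToken =
  | { kind: "text"; value: string }
  | { kind: "begin"; env: string }
  | { kind: "end"; env: string };

type DocNode =
  | { kind: "text"; value: string }
  | { kind: "env"; name: string; children: DocNode[] };

// Parses a token list, choosing each step from the next token only.
function parseTokens(tokens: LatexToken[]): DocNode[] {
  let pos = 0;

  // Reads nodes until the matching \end{closing}, or until the end of
  // input when called at the top level (closing === null).
  function parseNodes(closing: string | null): DocNode[] {
    const nodes: DocNode[] = [];
    while (pos < tokens.length) {
      const tok = tokens[pos];
      if (tok.kind === "end") {
        if (tok.env !== closing) {
          throw new Error(`unexpected \\end{${tok.env}}`);
        }
        return nodes; // the caller consumes the \end token
      }
      pos++;
      if (tok.kind === "text") {
        nodes.push({ kind: "text", value: tok.value });
      } else {
        const children = parseNodes(tok.env);
        pos++; // consume the matching \end
        nodes.push({ kind: "env", name: tok.env, children });
      }
    }
    if (closing !== null) {
      throw new Error(`missing \\end{${closing}}`);
    }
    return nodes;
  }

  return parseNodes(null);
}
```

Because there is no backtracking, every decision has to be made from that one token, which is exactly where real LaTeX starts to cause trouble.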
That said, I did make a few big oopsies with the parser. The first is that I don't have backtracking - which is kinda actually needed to get LaTeX parsing working properly. So, I have a bunch of hacks around command arguments and stuff. Which is part of the second oopsie.
LaTeX command arguments can hold other commands, but I didn't code for that originally since I didn't think I'd need it. Also, I didn't want to tokenize the commands' arguments, since some of the commands I was using don't actually parse their arguments. So I had the tokenizer consume the start and end of a command at once, and then I could choose to parse the contents later. However, this approach meant that I couldn't have nested commands: I would end up with multiple opening curly braces, but I would terminate at the first closing curly brace (i.e. \outer{\inner{a}} would be parsed as \outer{\inner{a}). And then, about a month into the project (and 15 posts ported over), it turned out I did need nested commands. So, I put a hacky fix in the tokenizer and not the parser, since doing it in the parser would break all of the commands where I can't parse the contents. Not great, but it works well enough.
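One common way to handle nesting at the tokenizer level is to count brace depth instead of stopping at the first closing brace. The sketch below shows that idea only; it is not the actual fix in the generator, and `CommandToken` and `readCommand` are hypothetical names. It assumes command names are plain letters and that every command takes exactly one braced argument.

```typescript
// Hypothetical token shape: the argument stays a raw string so that each
// command can decide later whether to parse it.
interface CommandToken {
  name: string;   // command name without the backslash
  rawArg: string; // text between the outermost braces, unparsed
  end: number;    // index just past the closing brace
}

// Reads `\name{...}` starting at `start`, counting brace depth so that
// a nested command does not end the argument early.
function readCommand(src: string, start: number): CommandToken {
  if (src.charAt(start) !== "\\") {
    throw new Error(`expected a backslash at offset ${start}`);
  }
  let i = start + 1;
  while (i < src.length && /[A-Za-z]/.test(src.charAt(i))) i++;
  const name = src.slice(start + 1, i);
  if (src.charAt(i) !== "{") {
    throw new Error(`expected "{" after \\${name} at offset ${i}`);
  }
  const argStart = i + 1;
  let depth = 1;
  i = argStart;
  while (i < src.length) {
    const ch = src.charAt(i);
    if (ch === "\\") {
      i += 2; // skip the backslash and the next character, e.g. \{ or \}
      continue;
    }
    if (ch === "{") depth++;
    if (ch === "}") depth--;
    if (depth === 0) {
      return { name, rawArg: src.slice(argStart, i), end: i + 1 };
    }
    i++;
  }
  throw new Error(`unterminated argument for \\${name}`);
}
```

With this, `\outer{\inner{a}}` yields one token whose raw argument is `\inner{a}`, and that argument can be handed back to `readCommand` later if the command wants it parsed. Commands whose contents may hold unbalanced braces would still need their own special case.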
Another big issue I had was printing parsing error messages. When I've made parsers in the past, I usually have pointers to the string, so I can calculate my original position by doing pointer arithmetic. However, in JavaScript I don't have that. My first attempt was to simply keep track of a count. But, I messed up since commands don't always parse until much later, so I'd end up with incorrect numbers since I was running the parser a second time on a substring and all my counts got reset. I sort-of got it fixed, but it's still very unreliable.
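One way to keep positions stable when re-parsing a substring is to carry a base offset along with it, and only convert to a line and column when an error is reported. This is a minimal sketch of that approach, not what the generator currently does; `SourceCursor`, `subCursor`, `absoluteOffset`, and `positionAt` are hypothetical names.

```typescript
// Line and column are both 1-based.
interface SourcePosition {
  line: number;
  column: number;
}

// Hypothetical cursor: `base` is where `text` begins in the original file.
interface SourceCursor {
  text: string;
  base: number;
  pos: number;
}

// Creates a cursor over text[start, end) that remembers its absolute origin.
function subCursor(parent: SourceCursor, start: number, end: number): SourceCursor {
  return {
    text: parent.text.slice(start, end),
    base: parent.base + start,
    pos: 0,
  };
}

// Offset into the original file, even when the cursor covers a substring.
function absoluteOffset(cursor: SourceCursor): number {
  return cursor.base + cursor.pos;
}

// Converts an absolute offset into the original file to a line and column.
function positionAt(source: string, offset: number): SourcePosition {
  let line = 1;
  let lineStart = 0;
  for (let i = 0; i < offset && i < source.length; i++) {
    if (source.charAt(i) === "\n") {
      line++;
      lineStart = i + 1;
    }
  }
  return { line, column: offset - lineStart + 1 };
}
```

The count never resets, because a nested parse starts from `subCursor(parent, start, end)` rather than from a bare string, and an error site reports `positionAt(originalSource, absoluteOffset(cursor))`.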
One other oopsie I made (but I believe I got it fixed) was that I would often have an assertion where I should have thrown a parse error. This made it so when I had invalid input, my program would crash and I wouldn't know why, or even which file caused the issue. I got most errors as errors now rather than assertions, but it was pretty annoying.
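The split between the two kinds of failure might look something like the sketch below, where bad input raises an error that names the file and assertions are kept for bugs in the generator itself. All names here (`ParseError`, `invariant`, `checkBraces`, `buildAll`) are hypothetical, and `checkBraces` is only a stand-in for a real parse step.

```typescript
// Hypothetical error type for bad input; it carries the file name so the
// report says which post caused the problem.
class ParseError extends Error {
  file: string;
  offset: number;

  constructor(message: string, file: string, offset: number) {
    super(`${file}: offset ${offset}: ${message}`);
    this.name = "ParseError";
    this.file = file;
    this.offset = offset;
  }
}

// Assertions stay reserved for generator bugs, not for bad input.
function invariant(condition: boolean, message: string): asserts condition {
  if (!condition) throw new Error(`internal error: ${message}`);
}

// Stand-in for a real parse step: rejects unbalanced braces.
function checkBraces(source: string, file: string): void {
  let depth = 0;
  for (let i = 0; i < source.length; i++) {
    const ch = source.charAt(i);
    if (ch === "{") depth++;
    if (ch === "}") depth--;
    if (depth < 0) throw new ParseError('unexpected "}"', file, i);
  }
  if (depth > 0) throw new ParseError('missing "}"', file, source.length);
  invariant(depth === 0, "brace depth must be zero here");
}

// Builds every file and collects readable messages for bad input.
function buildAll(files: Record<string, string>): string[] {
  const problems: string[] = [];
  for (const [file, source] of Object.entries(files)) {
    try {
      checkBraces(source, file);
    } catch (err) {
      if (err instanceof ParseError) {
        problems.push(err.message);
      } else {
        throw err; // a real bug: let it crash loudly
      }
    }
  }
  return problems;
}
```

The point of the split is that a typo in a post produces a message naming the file, while a violated invariant still crashes the build.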
I also have other quality-of-life issues, like no iterative compilation, no warnings, stuff like that.
Finally, I did end up with more 3rd party JavaScript and CSS. Originally, I tried to use MathJax to process math commands since MathJax does compile to an SVG. Unfortunately, due to how MathJax loads I can't include it in a Bun executable, which sucks. So I had to switch over to KaTeX. KaTeX doesn't compile to an SVG, instead it compiles to HTML (and MathML) tags which then require some CSS (and sometimes JavaScript) to get properly rendered. KaTeX can be self-hosted pretty easily, so I'm at least doing that.
Also, I removed all of my old XML. My XML was keeping track of "metadata" (what site was it, descriptions for posts, when posts were made, etc). However, I found that a lot of it could either be inferred from my content, or I could use my custom tags to annotate that information. Also, anything that couldn't be inferred could easily be put in a JSON file as well.
Overall, I really like the outcome. It's a nice breath of fresh air, and it's a lot easier for me to maintain.
Bibliography
- [1] "Single-file executable," Bun. https://bun.sh/docs/bundler/executables (accessed Jan. 18, 2025)
- [2] "Limitations when parsing TeX as a context-free grammar," LaTeX.js. https://latex.js.org/limitations.html#when-parsing-tex-as-a-context-free-grammar (accessed Feb. 20, 2025)