Creating Documents

When I was rewriting my blog, I was trying to decide what format to put the posts in. I wanted a format that works well for both print and web, and that can create HTML and PDF files. Additionally, I have a hard "self-host" requirement - whatever I do I need to be able to do on my own hardware and not a remote server. I did some research on different document formats and came across three main categories (though I'm sure there are more): markup, typesetting, and GUI-based.
Each of these categories offers distinct advantages and disadvantages, and there are multiple formats for each. I evaluated a lot of them to figure out what I wanted to use, and ended up going with a reduced typesetting format. Here I explain what I learned, and also why I went with my preferred format.
Markup
Markup-based languages decouple the document structure and semantics from how the document is rendered. This gives the document itself very little control over how it should be rendered, but this lack of control makes it much easier to support multiple different styles of rendering.
The two common examples of markup formats are Markdown and HTML. In both formats, the document writer will write markup to suggest what the semantics should be. For instance, in Markdown asterisks and underscores are used to emphasize text, whereas in HTML the <b> and <i> tags are used instead. Likewise, Markdown uses multiple newlines to differentiate paragraphs while HTML uses the <p> tag. In both cases, they rely on an external styling standard, such as CSS, to dictate how the document should be displayed (e.g. font sizes, fonts used, etc.). This decoupling allows for documents to be easily displayed differently depending on the device or medium (e.g. different print styles than screen styles, different layouts on mobile vs desktop, etc.).
Previously I had been using Markdown, but Markdown is very limited and inconsistent. First, most Markdown renderers have widely different opinions about how to render Markdown, and some of them add non-standardized extensions. These extensions are added to overcome Markdown's inherent limitations, especially when it comes to lack of semantics or manually having to maintain links. Unfortunately, even without those extensions, Markdown document renderers are just incompatible enough that sometimes a page will break when switching renderers. Most of these differences are minor enough that it's generally okay (most people who don't use screen readers can't tell the difference between a <b> and a <strong>). However, for some more advanced features (e.g. automatic syntax highlighting), these differences are very breaking. In my case, I have posts where I quote from many different niche or newer languages, including languages which currently don't have syntax highlighting in some popular code highlighting JS frameworks. I also need to typeset math equations, which is something Markdown doesn't support natively, so I'd be relying on extensions. This greatly limits which Markdown renderers I can use, and it means I cannot "just switch" when my renderer of choice is obsoleted or changes licenses.
Additionally, Markdown allows embedded HTML, which really limits how it can be rendered and hurts the ability to make pretty prints and PDFs. This is also an area of even more Markdown renderer nightmares, since they all handle HTML differently (some only allow specific tags, some disallow attributes, some don't allow HTML, and others allow everything). HTML also complicates the PDF pipeline since I'd need to use an HTML-to-PDF converter - which I don't like since they tend to generate subpar PDFs and have incompatibilities between versions (I've spent way too much time debugging HTML-to-PDF converter issues).
HTML has a lot of the same issues as Markdown, with one additional note. Sometimes, HTML documents will split their letters across pages when printed. This is often a CSS issue, but it's one that's hard to track down, and it was a bug with my old generator. Additionally, HTML is just really bad to parse because it's a "parsers should fix issues but parsers need to figure out what that means" standard. Because of the issues with HTML (and the awful HTML-to-PDF pipelines), HTML as a source format is just a hard no.
One other notable (but more niche) format in this category is DocBook. DocBook is primarily made for publishing technical books and materials. It's an XML-based format which relies on external styling (e.g. CSS, XSL, XQuery) for displaying and rendering the final document. By having rendering decoupled from the document itself, it allows for creating both print books as well as digital books (e.g. EPUB, MOBI, online HTML, etc). DocBook leans heavily into semantics, with the ability to create annotations for automatically generating help documentation (e.g. auto-generate indexes, glossaries, etc.). DocBook looks interesting and useful, but it was very much overkill for what I needed, and I didn't really want to be writing a lot of XML for my personal blog (I previously had some XML and that was the most annoying aspect to edit manually, so I was trying to remove it). Also, from what I can tell I'd need to set up a document rendering pipeline and probably write a lot of my own XSL/XQuery to get it to do what I want.
For now, because XML isn't what I want to use for this project and because of the time needed to get spun up in DocBook, I'm not going to use it. It looks promising, but I want my first use of it to be for a project where I'd be less frustrated with DocBook being a complete wash than for fixing my personal blog, especially when my current renderer is breaking constantly and making it hard to publish posts.
There were other markup formats I looked at, but none of note, so we'll transition into typesetting formats.
Typesetting
Typesetting formats couple the document structure with how it's rendered, which gives a lot of control over how a document should be rendered at the cost of there being only "one correct way" to render it. LaTeX and TeX are by far the most notable typesetting formats out there. The sheer number of types of documents that can be made is insane. Additionally, there is a ton of control given to the document author to position elements however they want. Plus there is a great package ecosystem with CTAN.
Some of my favorite packages are Beamer, which allows for creating slideshows; Listings, which does code highlighting (and it can load source files into the document with listings-ext); and fncychap, which makes chapters look nice. There's a very good StackExchange community with lots of questions and answers, as well as other good package recommendations.
LaTeX also allows for really nice document splitting. My favorite way is to declare sub-documents as "subfiles" with the subfiles package. This allows me to treat each sub-document as its own document with local overrides while still inheriting from the global document packages and macros.
However, there are some notable downsides to LaTeX. LaTeX is very much designed for making PDFs intended for print. There's been a lot of effort put into making LaTeX-to-HTML converters, but with mixed success. Additionally, LaTeX is difficult to parse. This makes it extremely difficult to make additional LaTeX-to-anything-else tooling (which is why such tools have limitations and issues). It also means that, realistically, any tooling I build cannot be "true LaTeX": I don't have the time or resources that the existing incomplete tooling has, so there's no way I'll be able to come close to what's already out there.
Fortunately, there is some competition in the typesetting space. The most notable competition is Typst. Typst is much newer and is trying to compete with LaTeX. They have open-sourced their compiler, so it made it onto my list. They also have both HTML and PDF exporting. HTML export now exists as an experimental feature (first released Feb. 19, 2025), but it wasn't available when I was working on this (back in January 2025). Since there was no HTML export at the time, I chose not to use Typst. If Typst had been a little further along or I had started this project a little later, then I probably would have gone with Typst and self-hosted the compiler. The main factor here was bad timing rather than a low-quality product. I have great hopes for Typst in the future, and will be keeping an eye on it.
GUI-based
The last category is a little different than the others. It's a format that's not designed for manual editing, but is rather designed for use by a specific graphical program to edit and manage the document as it's being rendered. Under the hood it may use either typesetting or markup (or both), but the actual details are hidden away behind the application used to perform the edits. Good examples are Google Docs, Microsoft Office 365, and LibreOffice.
Google Docs and Microsoft 365 are out of the question for me. I only want stuff I can run on my computer indefinitely without having to pay recurring fees (a one-time fee is okay so long as I get a perpetual license). So, both Google Docs and Microsoft 365 are out.
This leaves the non-365 Microsoft Office (i.e. Office 2019, 2024, etc.), and open source alternatives such as LibreOffice. I don't like either of them.
Let's tackle Office. I've done math equation typesetting in Word before, and it really sucks. The biggest problem is it slows the software down to a crawl after only a few equations. Because of this, I stopped using Office in college and started using LaTeX instead. I've also tried working on big documents and using document splitting features, and Word just stops responding at a certain size. Finally, the HTML export from Office is hot garbage, and I'd never use it on anything I tied my name to.
LibreOffice is fine, but I'm not a big fan of the interface and I struggle finding where the basic Word functionality is. I also have a lot of hesitation to use it for anything major or try to use advanced features (like math typesetting or document splitting) simply because some less-advanced features don't work intuitively. Also, shortcuts are either different or don't exist for some features, so I end up using the mouse more.
In short, this category was a total bust for me.
What I did
I ended up choosing to go with a subset of LaTeX that I could fairly easily parse. This allowed me to write my own HTML generator for the stuff I used, while using full LaTeX for my PDF output. Each post would be a subfile that could then be composed together to be a single PDF for my PDF output.
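To give a feel for what "a subset of LaTeX that I could fairly easily parse" can look like, here is a minimal sketch in Python. It is not my actual generator: the supported commands (emph and textbf), the function names, and the paragraph handling are all assumptions picked for illustration. It handles one-argument inline commands with nesting and treats blank lines as paragraph breaks.

```python
import html

# Hypothetical allowlist: one-argument LaTeX commands mapped to HTML tags.
INLINE_TAGS = {"emph": "em", "textbf": "strong"}

def render_inline(src: str, pos: int = 0, stop_at_brace: bool = False) -> tuple[str, int]:
    """Render inline text starting at pos; return (html, next position)."""
    out = []
    while pos < len(src):
        ch = src[pos]
        if ch == "}" and stop_at_brace:
            # End of the current command argument.
            return "".join(out), pos + 1
        if ch == "\\":
            # Read the command name (letters only).
            end = pos + 1
            while end < len(src) and src[end].isalpha():
                end += 1
            name = src[pos + 1:end]
            if name not in INLINE_TAGS or src[end:end + 1] != "{":
                raise ValueError(f"unsupported command: \\{name}")
            # Recurse so that nested commands inside the argument work.
            inner, pos = render_inline(src, end + 1, stop_at_brace=True)
            tag = INLINE_TAGS[name]
            out.append(f"<{tag}>{inner}</{tag}>")
        else:
            out.append(html.escape(ch))
            pos += 1
    if stop_at_brace:
        raise ValueError("missing closing brace")
    return "".join(out), pos

def render_post(src: str) -> str:
    """Split on blank lines and wrap each paragraph in <p> tags."""
    paragraphs = [p.strip() for p in src.split("\n\n") if p.strip()]
    return "\n".join(f"<p>{render_inline(p)[0]}</p>" for p in paragraphs)
```

Calling render_post on "Hello \emph{big \textbf{world}}!" gives "<p>Hello <em>big <strong>world</strong></em>!</p>". Anything outside the allowlist raises an error instead of being guessed at, which is the trade-off that keeps a parser like this small.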
This setup also allows me to define my own "semantic" elements, such as linkinline, which would be an HTML anchor in the HTML output or a footnote in the PDF output.
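As a rough illustration of one semantic element producing two different outputs, the sketch below models both results in Python. The function name, the argument order (URL first, then link text), and the use of \footnote with \url on the PDF side are assumptions for the example. In a real setup the PDF side would more likely be a LaTeX macro defined once in the root file, with only the HTML side living in the generator.

```python
import html

def render_linkinline(url: str, text: str, target: str) -> str:
    """Render a hypothetical linkinline element for the given output target."""
    if target == "html":
        # On the web a link is just an anchor around the text.
        return f'<a href="{html.escape(url, quote=True)}">{html.escape(text)}</a>'
    if target == "pdf":
        # In print the URL moves to a footnote; \url is assumed to be
        # provided by a package loaded in the root document.
        return f"{text}\\footnote{{\\url{{{url}}}}}"
    raise ValueError(f"unknown target: {target}")
```

The point of the design is that the post only says "this is a link"; each output format decides what a link should look like there.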
Additionally, I was able to put any package imports, PDF styles, etc. in a root "document.tex" file which would be ignored by my HTML renderer. This gave me full, complete control over my output when in PDF land without breaking (or worrying about breaking) my output for HTML land. Combine this with my "semantic" elements, and I was able to customize both my HTML and PDF outputs to fit my idea of how each format should operate.
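One way an HTML generator can ignore everything outside the post content is to read only what sits between \begin{document} and \end{document} in each subfile. The sketch below shows that idea; the function name is hypothetical, and it assumes each subfile contains exactly one document environment, which is how the subfiles package lays files out.

```python
import re

# Matches the content of the document environment, across multiple lines.
BODY = re.compile(r"\\begin\{document\}(.*?)\\end\{document\}", re.DOTALL)

def subfile_body(source: str) -> str:
    """Return the text inside the document environment, without the preamble."""
    match = BODY.search(source)
    if match is None:
        raise ValueError("no document environment found")
    return match.group(1).strip()

# Example subfile; the \documentclass line points back at the root file
# and is skipped entirely by the HTML side.
example = (
    "\\documentclass[document.tex]{subfiles}\n"
    "\\begin{document}\n"
    "My post text.\n"
    "\\end{document}\n"
)
body = subfile_body(example)
```

Here body ends up as "My post text.", so package imports and PDF styling in the root file never reach the HTML renderer.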
That said, there are some notable downsides to this approach. The first is that I'm extremely limited to what I can do inside my subfiles. I can't declare new commands or do advanced styling in there. For me, that's fine since I prefer doing all of the advanced stuff up-front rather than in with my content (I feel it's easier to maintain and proof-read that way).
Second, I am not parsing the LaTeX. Instead, I'm parsing a proprietary simplification of LaTeX. LaTeX needs a Turing-complete parser in order to parse (no CFGs here). For difficulty level, writing a LaTeX parser is comparable to writing a full C++ parser complete with a macro preprocessor from scratch. In practice, what this means is that I will occasionally end up with documents that are valid LaTeX documents, but which my HTML generator cannot parse. My solution is to either change my document or change my parser to accommodate that specific use case. After doing this process for more than 20 posts (some of which are really long and complicated), I got the parser hacks pretty ironed out, so now it's usually just a document tweak.
Third, my solution is highly specific to me and my needs. I'm only parsing elements that I use. I'm only parsing syntax that I use. I'm only creating output that I want. Because of this, it's never going to be a general purpose project meant for other people to use. This also means that I don't want to deal with bug reports because somebody doesn't use LaTeX the way I do. I don't want to deal with pull requests to add features that I don't (and won't) use, and I definitely don't want to maintain other people's contributions. This has a big implication for whether or not I ever open source my code. Culturally, open source software is expected to respond to bug reports and pull requests, and to maintain contributions. But since that goes against what I want for this project, it's out of the question.
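A cheap way to find out whether a post needs "a document tweak" is to scan it for commands the generator doesn't know about before trying to render it. The sketch below is an assumption about how such a check could look, not my actual code; the allowlist and function name are made up for illustration.

```python
import re

# Hypothetical allowlist of commands the HTML generator understands.
SUPPORTED = {"section", "emph", "textbf", "linkinline"}

# A backslash followed by letters. This is deliberately naive: it would
# also flag letters that directly follow an escaped backslash.
COMMAND = re.compile(r"\\([A-Za-z]+)")

def unsupported_commands(source: str) -> list[tuple[int, str]]:
    """Return (line number, command name) for every command not in SUPPORTED."""
    problems = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in COMMAND.finditer(line):
            if match.group(1) not in SUPPORTED:
                problems.append((lineno, match.group(1)))
    return problems
```

For a post whose third line is "\newcommand{\foo}{bar}", this reports both newcommand and foo on line 3. That is perfectly valid LaTeX, but it is outside the subset, so the fix is either to move the definition into the root file or to teach the parser about it.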
On the flip side, I feel like this generator is one of the more useful code projects I've written, and it may help someone else who wants to write a blog in something other than Markdown. I also really enjoy writing and building this project, and I want to share it. It's just that the open source model isn't the right fit for this project. At some point I'll probably make the source publicly available under a liberal license, but not open to contributions. Before I do that, I might fix some bugs (mostly around error reporting), but I may not and just release it as is - warts and all.