Thoughts on docs

This is a snapshot of what I think about writing and reading digital prose, on technical topics, with other people. Stuff like docs, a collaborative wiki, a knowledge base, or a blog.

In this post I will argue for

CommonMark stored in Git rendered to HTML
Separation of content from presentation
Reproducible process of build and development
Hard to break: version control, automated checks

Content and presentation

I will refer to source markup as the content, and to its visual rendering as the presentation. In WYSIWYG editors, like Notion or Google Docs, content and presentation are the same.

I believe this split is a useful tool. For instance, mathematics typeset in TeX is easy to edit and reproduce, while a graphical equation editor is much harder to use. I admit the split can make text harder to edit, especially for non-technical users. I believe the benefits in reproducibility and automation outweigh these costs.

Long-term reproducibility

I would like a repository of knowledge, developed collaboratively over many years, with evolving needs and conventions, that’s always readable as raw content and as “trivial” presentation, but continuously upgrades its preferred presentation.

I believe CommonMark is the only Markdown flavor likely to survive with no changes for a long time. There is also too much Markdown already written for Markdown to disappear altogether.

Most existing Markdown authoring frameworks, such as Hugo, Zola, or Docusaurus, depend on some combination of extensions like front matters, React components, or shortcodes. Some even require a special directory structure. Migrating between these frameworks can be hard. The framework shouldn’t matter that much in my opinion.

Configuration should be minimal. It’s easier to get started, and it’s harder to break. Eleventy does this well.

If storage is not an issue, record milestone releases in various formats like HTML, PDF or PNG.

Searchability

I would like a fast, fuzzy search over

Full text in content and/or presentation
Directory trees (hierarchical categories)
Tags (flat categories)
Git metadata (who, when)

Depending on size, search could run in the browser, rg and fzf on a local clone, or online with an external index.

Directory trees

Directory trees as URL paths are hard to maintain because each document must have exactly one place. It may be time-consuming to place it well, and the file may fit elsewhere over time, leading to stale links or stale structure.

/where/do/i/put/this

Tagging

Tag systems, like those in social networks, enable somewhat natural indexing of topics. I’m not convinced yet they are useful for docs.

Consider reusing _italic_ and/or __bold__. They already represent emphasis in Markdown, and can be filtered, searched, or indexed. I can mention them in content

Using _words_ as _tags_.

or hide them in HTML comments

<!-- _tags_ _words_ -->
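To check that emphasis really can be filtered and indexed, here is a minimal sketch that collects underscore-emphasized words as tags. The `extract_tags` name and the regular expression are my own assumptions, and the pattern is a simplification of CommonMark's emphasis rules (asterisk emphasis is ignored on purpose).

```python
import re

# Matches _word_ or __word__ delimited by underscores. Underscores inside
# identifiers like some_var_name are skipped by the word-boundary lookarounds.
TAG_PATTERN = re.compile(r"(?<!\w)(_{1,2})([^\s_][^_]*?)\1(?!\w)")


def extract_tags(markdown: str) -> list[str]:
    """Return emphasized words, lowercased, in order of first appearance."""
    seen: dict[str, None] = {}
    for match in TAG_PATTERN.finditer(markdown):
        seen.setdefault(match.group(2).strip().lower(), None)
    return list(seen)


if __name__ == "__main__":
    print(extract_tags("Using _words_ as _tags_."))
    print(extract_tags("<!-- _tags_ _words_ -->"))
```

Since the function only looks at text, it works the same on visible prose and on tags hidden in HTML comments.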

Version control

Git can record granular history of content and presentation. A tidy commit history is useful in finding authors (who to ask about some piece of knowledge). A version of docs is identified by a Git commit.

Commits should introduce related changes
Commit messages should be informative

The first commit should be a readable version, and all subsequent commits should be readable too. When making edits, try to remain in a narrow scope.

Use git rebase -i to refine local history. Refactors are good, but make sure it’s possible to filter these commits out of e.g. git blame, for instance by giving them a refactor(scope) header.
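One way to filter refactors out of blame is the `--ignore-revs-file` option of `git blame`, which takes a file with one full commit hash per line. The sketch below builds such a file from `git log --format="%H %s"` output. The function names and the exact header pattern are assumptions for illustration, not an established convention of this project.

```python
import re

# Conventional-commit style header, e.g. "refactor(docs): reflow lines".
REFACTOR_HEADER = re.compile(r"^refactor(\([^)]*\))?!?:")


def refactor_revisions(log_output: str) -> list[str]:
    """Pick commit hashes whose subject starts with a refactor header.

    `log_output` is expected in the shape produced by
    `git log --format="%H %s"`: one "<hash> <subject>" pair per line.
    """
    revisions = []
    for line in log_output.splitlines():
        commit_hash, _, subject = line.partition(" ")
        if REFACTOR_HEADER.match(subject):
            revisions.append(commit_hash)
    return revisions


def ignore_revs_file(log_output: str) -> str:
    """Build the contents of a file for `git blame --ignore-revs-file`."""
    return "".join(f"{rev}\n" for rev in refactor_revisions(log_output))
```

Writing the result to a file and running `git blame --ignore-revs-file <that file> <path>` should then attribute lines to the last non-refactor commit.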

For displaying different versions there are a few options to consider

Always present only the latest commit like docs.example.org/a/b/c
Hardcode commit in URL path like docs.example.org/v1/a/b/c
Softcode commit as URL parameter like docs.example.org/a/b/c?v=1
Forward to latest consistent version
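The hardcoded and softcoded options can share one resolver. This is a rough sketch that assumes version labels look like the `v1` and `?v=1` in the examples above; `split_version` and the `"latest"` fallback are hypothetical names, and a real setup would still need to map the label to a Git commit.

```python
from urllib.parse import parse_qs, urlsplit


def split_version(url: str, latest: str = "latest") -> tuple[str, str]:
    """Return (version, document path) for either URL style.

    Path style:      docs.example.org/v1/a/b/c
    Parameter style: docs.example.org/a/b/c?v=1
    No version:      docs.example.org/a/b/c (falls back to `latest`)
    """
    # urlsplit only recognises the host when the URL contains "//".
    parts = urlsplit(url if "//" in url else "//" + url)
    query_version = parse_qs(parts.query).get("v")
    if query_version:
        return query_version[0], parts.path
    head, _, rest = parts.path.lstrip("/").partition("/")
    if len(head) > 1 and head[0] == "v" and head[1:].isdigit():
        return head[1:], "/" + rest
    return latest, parts.path
```

Both versioned styles resolve to the same document path, so the choice between them is mostly about which URLs should stay stable.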

Editing

If content is managed as plain text files, any local text editor or IDE can be used for editing. Users can configure linters, formatters, autocomplete, snippets, macros and motions like textobjects or leaping.

A browser-first environment can be very useful.

Collaboration

Collaborating on prose, like on code, can happen asynchronously or synchronously. I believe there should be a process for both, but asynchronous should be the default.

Patches should be submitted and reviewed by individuals. Review commonly happens on forges like GitHub or GitLab, where the threads are not recorded in the Git repository. For reproducibility, I kind of wish it stayed in Git. There are solutions like Gerrit or git-appraise that manage reviews as Git refs through git-notes.

For live pairing, a real-time collaboration tool is useful.

Linting prose

The term was borrowed from the word lint, the tiny bits of fiber and fluff shed by clothing, as the command he wrote would act like a lint trap in a clothes dryer, capturing waste fibers while leaving whole fabrics intact.

Like software, prose can be linted in the pre-commit stage of CI/CD or live inside an editor.

This includes CommonMark formatters, spelling and grammar checkers, and even readability analysis.

Markdown formatting errors should block release, and there should be good options for autoformatting.
Prose linters should only warn the author. Dictionary, thesaurus, and LLMs may be considered.
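The split between blocking and non-blocking checks could look roughly like this. Both checks are stand-ins I made up to keep the sketch short (hard tabs as a formatting error, long lines as prose advice). A real setup would call an actual formatter and prose linter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    path: str
    line: int
    message: str
    blocking: bool  # True for formatting errors, False for prose advice


def check_hard_tabs(path: str, text: str) -> list[Finding]:
    """A stand-in formatting check: hard tabs are reported as errors."""
    return [
        Finding(path, number, "hard tab", blocking=True)
        for number, line in enumerate(text.splitlines(), start=1)
        if "\t" in line
    ]


def check_long_lines(path: str, text: str, limit: int = 30) -> list[Finding]:
    """A stand-in prose check: lines with many words only produce warnings."""
    return [
        Finding(path, number, f"more than {limit} words", blocking=False)
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line.split()) > limit
    ]


def exit_code(findings: list[Finding]) -> int:
    """Non-zero only when at least one blocking finding exists."""
    return 1 if any(finding.blocking for finding in findings) else 0
```

The point of the `blocking` flag is that warnings still reach the author, but only formatting errors can fail a release.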

Semantic linefeeds

Also known as semantic breaks, ventilated prose or visual-syntactic text formatting. Markup languages like HTML, Markdown, or TeX rewrite single new-lines to spaces by default. This allows the writer to split lines semantically.

If there be any truth in the remark,
the crisis at which we are arrived
may with propriety be regarded as the era
in which that decision is to be made;
and a wrong election of the part we shall act may,
in this view, deserve to be considered
as the general misfortune of mankind.

This convention encourages short, well-punctuated sentences, and was shown to increase reading comprehension and reduce eyestrain. By limiting unrelated changes, lines are meaningful for longer. Lines integrate well with other line-oriented software like Vim.

Reflowing to wrap at fixed column width or single-line paragraphs, albeit more common in practice, both break default Git (hunk) diffs.

Either way, consider a better diffing algorithm, like [Difftastic] or git diff --word-diff.
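Splitting lines semantically can be partly automated. The following is a naive sketch that breaks after clause-ending punctuation. The `semantic_linefeeds` name is hypothetical, and the rule will wrongly split after abbreviations like "e.g.", so I would treat its output as a suggestion rather than a formatter.

```python
import re

# Break after sentence-ending or clause-ending punctuation followed by a space.
BREAK_AFTER = re.compile(r"(?<=[.!?;:,])\s+")


def semantic_linefeeds(paragraph: str) -> str:
    """Reflow one paragraph so each clause sits on its own line."""
    flat = " ".join(paragraph.split())
    return "\n".join(BREAK_AFTER.split(flat))


if __name__ == "__main__":
    print(semantic_linefeeds("One idea,   then\nanother. A third!"))
```

Because single newlines render as spaces, collapsing the whitespace of the output gives back the same paragraph, so the presentation does not change.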

Reference links

Reference links move URLs out of the way. This is great for readability and Git.

Wrap a [link] in square brackets.
[Capitalisation] doesn't matter
and [spaces are allowed].

[link]: https://kszk.eu
[capitalisation]: /assets/example.pdf
[spaces are allowed]: /link/to/somewhere

When the same label is defined more than once, the first definition wins. That is, if I add a reference in content, it will override whatever comes later. This means I can safely append arbitrary references, for instance for all pages in some directory.

If a reference is missing, CommonMark renders the text literally, like [this]. It’s fairly readable, similar to IETF RFCs.
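CommonMark gives precedence to the first matching definition and matches labels case-insensitively with collapsed whitespace. Here is a small sketch of that lookup, useful for checking that appended references never shadow hand-written ones. The regular expression only handles simple one-line definitions without titles, and the function names are my own.

```python
import re

# One definition per line, e.g. "[link]: https://kszk.eu" (titles are ignored).
DEFINITION = re.compile(r"^ {0,3}\[([^\]]+)\]:\s*(\S+)", re.MULTILINE)


def normalize(label: str) -> str:
    """Case-fold and collapse whitespace, roughly as CommonMark matches labels."""
    return " ".join(label.casefold().split())


def collect_references(markdown: str) -> dict[str, str]:
    """Map normalized labels to URLs; the first definition of a label wins."""
    references: dict[str, str] = {}
    for label, url in DEFINITION.findall(markdown):
        references.setdefault(normalize(label), url)
    return references
```

Generated references can simply be concatenated to the end of a page, and anything already defined in the page keeps its original target.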

Policy on extensions

A syntax extension should only be considered if the presentation of its source remains readable in reference CommonMark.

That is, it should only improve the presentation, but not change the content significantly.

Front matters

Metadata is often stored in a custom YAML front matter. It’s hard to read in CommonMark. Additionally, schemas can be inconsistent, complicating upgrades or migrations to other frameworks.

---
title: Some title
slug: some-title
author: Bob Smith
date: 2023-10-10
---

# Some title

I, Bob, wrote this in October.

For titles, I believe the first top heading should be treated as the document title. Slugs should be derived from titles. Timestamps should come from Git.
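Deriving the title and slug needs very little code. This is a sketch under two assumptions: titles use ATX headings (`# Title`), and slugs are plain ASCII. It does not skip fenced code blocks, so a `# comment` inside an earlier code sample would be picked up by mistake.

```python
import re
import unicodedata
from typing import Optional

# A single "#" followed by whitespace, with an optional closing "#" sequence.
TOP_HEADING = re.compile(r"^ {0,3}#[ \t]+(.*?)(?:[ \t]+#+)?[ \t]*$")


def document_title(markdown: str) -> Optional[str]:
    """Return the text of the first top-level heading, if any."""
    for line in markdown.splitlines():
        match = TOP_HEADING.match(line)
        if match:
            return match.group(1)
    return None


def slugify(title: str) -> str:
    """Lowercase, strip accents, and join alphanumeric runs with hyphens."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_title = decomposed.encode("ascii", "ignore").decode("ascii")
    return "-".join(re.findall(r"[a-z0-9]+", ascii_title.lower()))
```

With these two functions, the `title` and `slug` fields of the front matter above become redundant.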

I believe all metadata that’s not in content, should be tracked entirely in Git. It may be slow on big old repositories, as git blame goes through every commit. At such scale, one can use incremental builds.
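Created and updated dates can be read from `git log` with the `%aI` format (author date, strict ISO 8601). The sketch below keeps the parsing separate from the Git call so it can be tested without a repository. The function names are assumptions, and it expects the file to have at least one commit.

```python
import subprocess
from datetime import datetime


def history_dates(log_output: str) -> tuple[datetime, datetime]:
    """Return (created, updated) from `git log --format=%aI` output."""
    stamps = [
        # Some Git versions print "Z" for UTC, which older Pythons reject.
        datetime.fromisoformat(line.replace("Z", "+00:00"))
        for line in log_output.split()
    ]
    # Author dates are not guaranteed to be ordered, so compare them directly.
    return min(stamps), max(stamps)


def git_dates(path: str) -> tuple[datetime, datetime]:
    """Ask Git for the author dates of every commit touching `path`."""
    result = subprocess.run(
        ["git", "log", "--follow", "--format=%aI", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return history_dates(result.stdout)
```

On a large repository the `git log` call per file is the slow part, which is where caching or incremental builds would come in.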

TeX

Raw TeX source code, often between dollar signs, is arguably very readable to its intended audience, as most readers are also writers in this case.

For consumption, it’s still very useful to render. One might stare at a proof for a long time, working with the same symbols on a piece of paper.

Let $\{X_i\}$ be a collection of groups
indexed by a directed set $I$.

For $i<j$ let
$\pi^{j \to i} \colon X_j \to X_i$
be a homomorphism such that
$\pi^{i\to i}$ is identity
and if $i<j<k$ then
$\pi^{j\to i}\circ \pi^{k\to j}=\pi^{k\to i}$.

Tables

Tables are unreadable in presentation if not supported, hard to edit without editor assist, and their shifting whitespace pollutes Git diffs.

On the other hand, they look good in code, and may not change much. They are of course excellent for comparisons, before and after, paper results, etc.

As a compromise, I could write them in fenced blocks, agree on a type like table, and render them in presentation.

|One|Two|Three|
|---|---|-----|
| 1 | 2 |  3  |
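Rendering such a block in presentation is not much work. Here is a minimal sketch that turns the body of a fenced `table` block into HTML, assuming the first line is the header, the second is the separator, and cells contain no escaped pipes. `table_to_html` is a hypothetical helper, not part of any Markdown library.

```python
from html import escape


def table_to_html(source: str) -> str:
    """Render a pipe table (header, separator, body rows) as an HTML table."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in source.strip().splitlines()
    ]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator

    def render(cells: list[str], tag: str) -> str:
        inner = "".join(f"<{tag}>{escape(cell)}</{tag}>" for cell in cells)
        return f"<tr>{inner}</tr>"

    body_html = "".join(render(row, "td") for row in body)
    return f"<table>{render(header, 'th')}{body_html}</table>"
```

Without this step, the block falls back to a plain code block, which keeps the columns aligned and readable in reference CommonMark.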

Footnotes, sidenotes

Footnotes and sidenotes add a side-channel for communication. Paradoxically, this can be good for linearity and focus, as it signals to the reader that they are less important and can be deferred.

Here[^1] is a footnote.

[^1]: This is a footnote.

Reference links already can kinda do this. We can specify a page fragment as the sidenote identifier, and the content can go into the link title.

Look [here].

[here]: #some-id "This
    will show up
    only as alt."

Admonitions

Another form of “communication sidechannel” are in-text “notification” paragraphs. Again, many Markdown implementations have their own way of doing this.

:::warning
This may render with a ⚠️
:::

[!WARNING]
Another one of those

A more CommonMark-friendly way could be to put it in a comment

<!-- ⚠️ One more warning -->

and render it in presentation. No special presentation, no admonition. To always be seen, some people use blockquotes

> ⚠️ Third warning

which renders like this

⚠️ Third warning

They nest, and they look good in plaintext and CommonMark. Unicode is everywhere now. I like this.

Unfortunately, if it renders into <blockquote> this is not really semantically correct, and may be problematic for accessibility.

There are so many other things that can break in our fallback render scenario; it may be okay to assume some intent in presentation.

If possible, I would use the comment trick instead, unless the admonition must always be seen, in which case perhaps it shouldn’t be an admonition?
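The comment trick could be implemented as a preprocessing step before Markdown rendering. This sketch assumes warnings are single-line comments that start with the warning sign. The `<aside class="warning">` output is just one plausible choice of markup, not something CommonMark prescribes.

```python
import re
from html import escape

WARNING_SIGN = "\u26a0\ufe0f"  # the warning emoji, with its variation selector

# An HTML comment on a line of its own whose body starts with the warning sign.
WARNING_COMMENT = re.compile(r"^<!--\s*\u26a0\ufe0f?\s*(.*?)\s*-->$", re.MULTILINE)


def render_warnings(markdown: str) -> str:
    """Replace warning comments with a visible <aside> element."""

    def replace(match: re.Match) -> str:
        text = escape(match.group(1))
        return f'<aside class="warning">{WARNING_SIGN} {text}</aside>'

    return WARNING_COMMENT.sub(replace, markdown)
```

Other comments are left alone, so a renderer without this step simply shows nothing, which is the intended fallback.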

Rendering fenced blocks

A snippet of Mermaid or Graphviz source may replace itself with its SVG render.

This is a bit different to TeX rendering, as nobody can imagine these from source.

While content would be as unreadable as ![](/assets/diagram.svg), it’s another language to depend on, not a feature that ships with the browser.

On the other hand, images are hard to edit, one needs to know how to generate a new one, so there is an indirect dependency there as well.

It’s easiest to sketch a raster. I can also draw and edit SVGs in Graphite or https://draw.io.
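Replacing a fenced block with its render can be sketched as a lookup from the info string to a renderer function. Everything here is an assumption for illustration: the `render_fences` name, the backtick-only fence pattern, and the idea that a renderer is a plain function from source to SVG. A real renderer might shell out to a tool such as Graphviz.

```python
import re
from typing import Callable

TICKS = "`" * 3  # built indirectly so this example can sit inside a fence itself
FENCE = re.compile(
    rf"^{TICKS}(\w+)[ \t]*\n(.*?)^{TICKS}[ \t]*$", re.MULTILINE | re.DOTALL
)


def render_fences(markdown: str, renderers: dict[str, Callable[[str], str]]) -> str:
    """Replace fenced blocks whose info string has a renderer; keep the rest."""

    def replace(match: re.Match) -> str:
        language, body = match.group(1), match.group(2)
        renderer = renderers.get(language)
        return renderer(body) if renderer else match.group(0)

    return FENCE.sub(replace, markdown)
```

Blocks without a registered renderer stay as ordinary code blocks, so the source remains readable wherever the tool is missing.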

Bibliography

There are extensions like citeproc but I like to instead use reference links.

They can link to the publisher’s website, DOI or arXiv, or internal notes.

Quantisation tends to outperform pruning.
See [kuzmin23] for details.

[kuzmin23]: https://arxiv.org/pdf/2307.02973.pdf

The links can be specified in-text, or auto-generated from a BibTeX file or pages in /bibliography.

[angelopoulos22]: /bibliography/angelopoulos22.html

Depending on context, [kuzmin23] can be rendered in presentation as

Kuzmin et al. (2023)
(Kuzmin et al., 2023)
Kuzmin, Nagel, van Baalen, Behboodi, Blankevoort (2023)
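The three renderings above can come from one small function. This is a sketch with made-up style names (`textual`, `parenthetical`, `full`). It shortens every multi-author list to "et al.", which is a simplification of most citation styles.

```python
def cite(authors: list[str], year: int, style: str = "textual") -> str:
    """Format a citation from author surnames in one of three styles."""
    if style == "full":
        return f"{', '.join(authors)} ({year})"
    short = authors[0] if len(authors) == 1 else f"{authors[0]} et al."
    if style == "parenthetical":
        return f"({short}, {year})"
    return f"{short} ({year})"


if __name__ == "__main__":
    kuzmin = ["Kuzmin", "Nagel", "van Baalen", "Behboodi", "Blankevoort"]
    for style in ("textual", "parenthetical", "full"):
        print(cite(kuzmin, 2023, style))
```

The author list and year would come from the BibTeX entry or the bibliography page behind the reference link.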

Without presentation support, it’s still readable. Square brackets in text are already in use as punctuation to isolate text from its surroundings.

Typography

It seems that serious typography shouldn’t target HTML, but I believe adding much new syntax to content can be bad for reproducibility and writing experience.

Otherwise, rewriting some quotes and semantic CSS can improve presentation without needing changes to content. Practical Typography is a great resource for rules. See Pollen for details on limitations of web publishing.

Diataxis

A model of how docs are consumed, and how they should be structured.

2d plane with 4 quadrants

how-to for problem solving
references for looking up information
tutorials for learning new things
explanations for narrow depth

A few more categories could work for modelling how docs are produced

drafts for early iterations and experiments
bibliography for notes on publications
changelog as described in hsiao_2023 for accounting for major changes to the project; adrs or rfcs could work here as well

SECI

A model of knowledge creation from nonaka_1994. SECI stands for four subsequent phases of knowledge.

spiral

Socialisation is broadcast and capture of tacit knowledge
Externalisation is saving an explicit record of it
Combination is integration with other explicit records
Internalisation is reading everything together and reflecting

Knowledge is refined as it cycles between tacit and explicit. Docs as code map well to this framework

“S” is talking to people, adding features, fixing bugs
“E” is copy to local, branch, write, commit, push
“C” is integration, human review, merge, deploy
“I” is searching, clicking and reading everything

Editor assist can speed up “E”. Much in “C” can be automated. “I” and “S” are not controlled.

Programming as theory building

naur_85 argues that programming is more about development of a collaborative insight/theory than writing down the program itself.

Tools for thought

Tools for thought.

Just-in-time over just-in-case

hillmer_2016 proposed to focus on the intersection of

What software can do
What users wish to accomplish
What users can’t figure out

They also suggest collecting some metrics to

Unpublish low-traffic
Polish high-traffic
Understand reading behavior

ADR

Architectural Decision Records are notes on design decisions that significantly alter architecture.

A sequence of ADRs can also be great documentation, useful for onboarding new people to the project. For that purpose, consider a single ARCHITECTURE.md.