Bozhidar Batsov: Building Emacs Major Modes with TreeSitter: Lessons Learned

Over the past year I’ve been spending a lot of time building TreeSitter-powered major modes for Emacs – clojure-ts-mode (as co-maintainer), neocaml (from scratch), and asciidoc-mode (also from scratch). Between the three projects I’ve accumulated enough battle scars to write about the experience. This post distills the key lessons for anyone thinking about writing a TreeSitter-based major mode, or curious about what it’s actually like.

Why TreeSitter?

Before TreeSitter, Emacs font-locking was done with regular expressions and indentation was handled by ad-hoc engines (SMIE, custom indent functions, or pure regex heuristics). This works, but it has well-known problems:

- Regex-based font-locking is fragile. Regexes can’t parse nested structures, so they either under-match (missing valid code) or over-match (highlighting inside strings and comments). Every edge case is another regex, and the patterns become increasingly unreadable over time.
- Indentation engines are complex. SMIE (the generic indentation engine for non-TreeSitter modes) requires defining operator precedence grammars for the language, which is hard to get right. Custom indentation functions tend to grow into large, brittle state machines. Tuareg’s indentation code, for example, is thousands of lines long.

TreeSitter changes the game because you get a full, incremental, error-tolerant syntax tree for free. Font-locking becomes “match this AST pattern, apply this face” and indentation becomes “if the parent node is X, indent by Y”. The rules are declarative, composable, and much easier to reason about than regex chains.

In practice, neocaml’s entire font-lock and indentation logic fits in about 350 lines of Elisp. The equivalent in tuareg is spread across thousands of lines. That’s the real selling point: simpler, more maintainable code that handles more edge cases correctly.

Challenges

That said, TreeSitter in Emacs is not a silver bullet.
Here’s what I ran into.

Every grammar is different

TreeSitter grammars are written by different authors with different philosophies. The tree-sitter-ocaml grammar provides a rich, detailed AST with named fields. The tree-sitter-clojure grammar, by contrast, deliberately keeps things minimal – it only models syntax, not semantics, because Clojure’s macro system makes static semantic analysis unreliable.[1] This means font-locking def forms in Clojure requires predicate matching on symbol text, while in OCaml you can directly match let_binding nodes with named fields.

You can’t learn “how to write TreeSitter queries” generically – you need to learn each grammar individually. The best tools for this are treesit-explore-mode (to visualize the full parse tree) and treesit-inspect-mode (to see the node at point). Use them constantly.

Grammar quality varies wildly

You’re dependent on someone else providing the grammar, and quality is all over the map. The OCaml grammar is mature and well-maintained – it’s hosted under the official tree-sitter GitHub org. The Clojure grammar is small and stable by design. But not every language is so lucky.

asciidoc-mode uses a third-party AsciiDoc grammar that employs a dual-parser architecture – one parser for block-level structure (headings, lists, code blocks) and another for inline formatting (bold, italic, links). This is the same approach used by Emacs’s built-in markdown-ts-mode, and it makes sense for markup languages where block and inline syntax are largely independent.

The problem is that the two parsers run independently on the same text, and they can disagree. The inline parser misinterprets * and ** list markers as emphasis delimiters, creating spurious bold spans that swallow subsequent inline content.
The workaround is to use :override t on all block-level font-lock rules so they win over the incorrect inline faces:

    ;; Block-level rules use :override t so block-level faces win over
    ;; spurious inline emphasis nodes (the inline parser misreads `*'
    ;; list markers as emphasis delimiters).
    :language 'asciidoc
    :override t
    :feature 'list
    '((ordered_list_marker) @font-lock-constant-face
      (unordered_list_marker) @font-lock-constant-face)

This doesn’t fix inline elements consumed by the spurious emphasis – that requires an upstream grammar fix. When you hit grammar-level issues like this, you either fix them yourself (which means diving into the grammar’s JavaScript source and C toolchain) or you live with workarounds. Either way, it’s a reminder that your mode is only as good as the grammar underneath it.

Getting the font-locking right in asciidoc-mode was probably the most challenging part of all three projects, precisely because of these grammar quirks. I also ran into a subtle treesit behavior: the default font-lock mode (:override nil) skips an entire captured range if any position within it already has a face. So if you capture a parent node like (inline_macro) and a child was already fontified, the whole thing gets skipped silently. The fix is to capture specific child nodes instead:

    ;; BAD: entire node gets skipped if any child is already fontified
    ;; (inline_macro) @font-lock-function-call-face

    ;; GOOD: capture specific children
    (inline_macro (macro_name) @font-lock-function-call-face)
    (inline_macro (target) @font-lock-string-face)

These issues took a lot of trial and error to diagnose. The lesson: budget extra time for font-locking when working with less mature grammars.

Grammar versions and breaking changes

Grammars evolve, and breaking changes happen. clojure-ts-mode switched from the stable grammar to the experimental branch because the stable version had metadata nodes as children of other nodes, which caused forward-sexp and kill-sexp to behave incorrectly.
The experimental grammar makes metadata standalone nodes, fixing the navigation issues but requiring all queries to be updated.

neocaml pins to v0.24.0 of the OCaml grammar. If you don’t pin versions, a grammar update can silently break your font-locking or indentation.

The takeaway: always pin your grammar version, and include a mechanism to detect outdated grammars. clojure-ts-mode tests a query that changed between versions to detect incompatible grammars at startup.

Grammar delivery

Users shouldn’t have to manually clone repos and compile C code to use your mode. Both neocaml and clojure-ts-mode include grammar recipes:

    (defconst neocaml-grammar-recipes
      '((ocaml "https://github.com/tree-sitter/tree-sitter-ocaml"
               "v0.24.0"
               "grammars/ocaml/src")
        (ocaml-interface "https://github.com/tree-sitter/tree-sitter-ocaml"
                         "v0.24.0"
                         "grammars/interface/src")))

On first use, the mode checks treesit-language-available-p and offers to install missing grammars via treesit-install-language-grammar. This works, but requires a C compiler and Git on the user’s machine, which is not ideal.[2]

The Emacs TreeSitter APIs are a moving target

The TreeSitter support in Emacs has been improving steadily, but each version has its quirks:

- Emacs 29 introduced TreeSitter support but lacked several APIs. For instance, treesit-thing-settings (used for structured navigation) doesn’t exist – you need a fallback:

      ;; Fallback for Emacs 29 (no treesit-thing-settings)
      (unless (boundp 'treesit-thing-settings)
        (setq-local forward-sexp-function #'neocaml-forward-sexp))

- Emacs 30 added treesit-thing-settings, sentence navigation, and better indentation support. But it also had a bug in treesit-range-settings offsets (#77848) that broke embedded parsers, and another in treesit-transpose-sexps that required clojure-ts-mode to disable its TreeSitter-aware version.
- Emacs 31 has a bug in treesit-forward-comment where an off-by-one error causes uncomment-region to leave ` *)` behind on multi-line OCaml comments.
I had to skip the affected test with a version check:

      (when (>= emacs-major-version 31)
        (signal 'buttercup-pending
                "Emacs 31 treesit-forward-comment bug (off-by-one)"))

The lesson: test your mode against multiple Emacs versions, and be prepared to write version-specific workarounds. CI that runs against Emacs 29, 30, and snapshot is essential.

No .scm file support (yet)

Most TreeSitter grammars ship with .scm query files for syntax highlighting (highlights.scm) and indentation (indents.scm). Editors like Neovim and Helix use these directly. Emacs doesn’t – you have to manually translate the .scm patterns into treesit-font-lock-rules calls and treesit-simple-indent-rules entries in Elisp.

This is tedious and error-prone. You end up maintaining a parallel set of queries that can drift from upstream. Emacs 31 will introduce define-treesit-generic-mode, which will make it possible to use .scm files for font-locking and should help significantly. But for now, you’re hand-coding everything.

Tips and tricks

Debugging font-locking

When a face isn’t being applied where you expect:

- Use treesit-inspect-mode to verify the node type at point matches your query.
- Set treesit--font-lock-verbose to t to see which rules are firing.
- Check the font-lock feature level – your rule might be in level 4 while the user has the default level 3. The features are assigned to levels via treesit-font-lock-feature-list.
- Remember that rule order matters. Without :override, an earlier rule that already fontified a region will prevent later rules from applying. This can be intentional (e.g. builtin types at level 3 take precedence over generic types) or a source of bugs.

Use the font-lock levels wisely

TreeSitter modes define four levels of font-locking via treesit-font-lock-feature-list, and the default level in Emacs is 3. It’s tempting to pile everything into levels 1–3 so users see maximum highlighting out of the box, but resist the urge.
When every token on the screen has a different color, code starts looking like a Christmas tree and the important things – keywords, definitions, types – stop standing out.

Less is more here. Here’s how neocaml distributes features across levels:

    (setq-local treesit-font-lock-feature-list
                '((comment definition)
                  (keyword string number)
                  (attribute builtin constant type)
                  (operator bracket delimiter variable function)))

And clojure-ts-mode follows the same philosophy:

    (setq-local treesit-font-lock-feature-list
                '((comment definition)
                  (keyword string char symbol builtin type)
                  (constant number quote metadata doc regex)
                  (bracket deref function tagged-literals)))

The pattern is the same: essentials first, progressively more detail at higher levels. This way the default experience (level 3) is clean and readable, and users who want the full rainbow can bump treesit-font-lock-level to 4. Better yet, they can use treesit-font-lock-recompute-features to cherry-pick individual features regardless of level:

    ;; Enable 'function' (level 4) without enabling all of level 4
    (treesit-font-lock-recompute-features '(function) nil)

    ;; Disable 'bracket' even if the user's level would include it
    (treesit-font-lock-recompute-features nil '(bracket))

This gives users fine-grained control without requiring mode authors to anticipate every preference.

Debugging indentation

Indentation issues are harder to diagnose because they depend on tree structure, rule ordering, and anchor resolution:

- Set treesit--indent-verbose to t – this logs which rule matched for each line, what anchor was computed, and the final column.
- Use treesit-explore-mode to understand the parent chain. The key question is always: “what is the parent node, and which rule matches it?”
- Watch out for the empty-line problem: when the cursor is on a blank line, TreeSitter has no node at point. The indentation engine falls back to the root compilation_unit node as the parent, which typically matches the top-level rule and gives column 0.
In neocaml I solved this with a no-node rule that looks at the previous line’s last token to decide indentation:

      (no-node prev-line neocaml--empty-line-offset)

Build a comprehensive test suite

This is the single most important piece of advice. Font-lock and indentation are easy to break accidentally, and manual testing doesn’t scale. Both projects use Buttercup (a BDD testing framework for Emacs) with custom test macros.

Font-lock tests insert code into a buffer, run font-lock-ensure, and assert that specific character ranges have the expected face:

    (when-fontifying-it "fontifies let-bound functions"
      ("let greet name = ..."
       (5 9 font-lock-function-name-face)))

Indentation tests insert code, run indent-region, and assert the result matches the expected indentation:

    (when-indenting-it "indents a match expression"
      "match x with"
      "| 0 -> \"zero\""
      "| n -> string_of_int n")

Integration tests load real source files and verify that both font-locking and indentation survive indent-region on the full file. This catches interactions between rules that unit tests miss.

neocaml has 200+ automated tests and clojure-ts-mode has even more. Investing in test infrastructure early pays off enormously – I can refactor indentation rules with confidence because the suite catches regressions immediately.

A personal story on testing ROI

When I became the maintainer of clojure-mode many years ago, I really struggled with making changes. There were no font-lock or indentation tests, so every change was a leap of faith – you’d fix one thing and break three others without knowing until someone filed a bug report. I spent years working on a testing approach I was happy with, alongside many great contributors, and the return on investment was massive.

The same approach – almost the same test macros – carried over directly to clojure-ts-mode when we built the TreeSitter version. And later I reused the pattern again in neocaml and asciidoc-mode.
One investment in testing infrastructure, four projects benefiting from it.

I know that automated tests, for whatever reason, never gained much traction in the Emacs community. Many popular packages have no tests at all. I hope stories like this convince you that investing in tests is really important and pays off – not just for the project where you write them, but for every project you build after.

Pre-compile queries

This one comes from clojure-ts-mode but applies broadly: compiling TreeSitter queries at runtime is expensive. If you’re building queries dynamically (e.g. with treesit-font-lock-rules called at mode init time), consider pre-compiling them as defconst values. This made a noticeable difference in clojure-ts-mode’s startup time.

A note on naming

The Emacs community has settled on a -ts-mode suffix convention for TreeSitter-based modes: python-ts-mode, c-ts-mode, ruby-ts-mode, and so on. This makes sense when both a legacy mode and a TreeSitter mode coexist in Emacs core – users need to choose between them. But I think the convention is being applied too broadly, and I’m afraid the resulting name fragmentation will haunt the community for years.

For new packages that don’t have a legacy counterpart, the -ts-mode suffix is unnecessary. I named my packages neocaml (not ocaml-ts-mode) and asciidoc-mode (not adoc-ts-mode) because there was no prior neocaml-mode or asciidoc-mode to disambiguate from. The -ts- infix is an implementation detail that shouldn’t leak into the user-facing name. Will we rename everything again when TreeSitter becomes the default and the non-TS variants are removed?

Be bolder with naming. If you’re building something new, give it a name that makes sense on its own merits, not one that encodes the parsing technology in the package name.

The road ahead

I think the full transition to TreeSitter in the Emacs community will take 3–5 years, optimistically. There are hundreds of major modes out there, many maintained by a single person in their spare time.
Converting a mode from regex to TreeSitter isn’t just a mechanical translation – you need to understand the grammar, rewrite font-lock and indentation rules, handle version compatibility, and build a new test suite. That’s a lot of work.

Interestingly, this might be one area where agentic coding tools can genuinely help. The structure of TreeSitter-based major modes is fairly uniform: grammar recipes, font-lock rules, indentation rules, navigation settings, imenu. If you give an AI agent a grammar and a reference to a high-quality mode like clojure-ts-mode, it could probably scaffold a reasonable new mode fairly quickly. The hard parts – debugging grammar quirks, handling edge cases, getting indentation just right – would still need human attention, but the boilerplate could be automated.

Still, knowing the Emacs community, I wouldn’t be surprised if a full migration never actually completes. Many old-school modes work perfectly fine, their maintainers have no interest in TreeSitter, and “if it ain’t broke, don’t fix it” is a powerful force. And that’s okay – diversity of approaches is part of what makes Emacs Emacs.

Closing thoughts

TreeSitter is genuinely great for building Emacs major modes. The code is simpler, the results are more accurate, and incremental parsing means everything stays fast even on large files. I wouldn’t go back to regex-based font-locking willingly.

But it’s not magical. Grammars are inconsistent across languages, the Emacs APIs are still maturing, you can’t reuse .scm files (yet), and you’ll hit version-specific bugs that require tedious workarounds. The testing story is better than with regex modes – tree structures are more predictable than regex matches – but you still need a solid test suite to avoid regressions.

If you’re thinking about writing a TreeSitter-based major mode, do it. The ecosystem needs more of them, and the experience of working with syntax trees instead of regexes is genuinely enjoyable.
Just go in with realistic expectations, pin your grammar versions, test against multiple Emacs releases, and build your test suite early.

Anyway, I wish there had been an article like this one when I was starting out with clojure-ts-mode and neocaml, so there you have it. I hope that the lessons I’ve learned along the way will help build better modes with TreeSitter down the road.

That’s all I have for you today. Keep hacking!

[1] See the excellent scope discussion in the tree-sitter-clojure repo for the rationale.

[2] There’s ongoing discussion in the Emacs community about distributing pre-compiled grammar binaries, but nothing concrete yet.
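
Until pre-compiled grammars are an option, the check-and-offer flow from the grammar delivery section could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual neocaml or clojure-ts-mode code: every demo- name is hypothetical, and it assumes Emacs 29 or later built with tree-sitter support. The only treesit pieces used are treesit-available-p, treesit-language-available-p, treesit-install-language-grammar, and the treesit-language-source-alist variable.

```elisp
;;; -*- lexical-binding: t; -*-
(require 'seq)
(require 'treesit)

;; Hypothetical recipes, in the same (LANG URL REVISION SOURCE-DIR)
;; shape as the neocaml recipes quoted in the post.
(defconst demo-grammar-recipes
  '((ocaml "https://github.com/tree-sitter/tree-sitter-ocaml"
           "v0.24.0"
           "grammars/ocaml/src")
    (ocaml-interface "https://github.com/tree-sitter/tree-sitter-ocaml"
                     "v0.24.0"
                     "grammars/interface/src")))

(defun demo-missing-grammars (recipes)
  "Return the entries of RECIPES whose grammar cannot be loaded."
  (seq-remove (lambda (recipe)
                (treesit-language-available-p (car recipe)))
              recipes))

(defun demo-ensure-grammars ()
  "Offer to install each grammar in `demo-grammar-recipes' that is missing."
  (interactive)
  (unless (treesit-available-p)
    (user-error "This Emacs was built without tree-sitter support"))
  (dolist (recipe (demo-missing-grammars demo-grammar-recipes))
    (when (y-or-n-p (format "Install tree-sitter grammar for %s? "
                            (car recipe)))
      ;; Bind the source alist locally so the pinned revision in the
      ;; recipe is the one that gets cloned and compiled.
      (let ((treesit-language-source-alist (list recipe)))
        (treesit-install-language-grammar (car recipe))))))
```

A mode could call demo-ensure-grammars from its setup function or simply expose it as a command. Keeping the "what is missing" check in its own function makes it easy to test without touching the network or the compiler.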