Wednesday, 31 March, 2021

Pauli Jaakkola

Technology

Structure and Interpretation of Malli Regex Schemas

Seqex schemas shipped with Malli 0.3.0. This is their story. A story with many twists and turns and technical details Malli users don't need to know.

IntroductionLink to Introduction

In December 2020 Tommi called on me to implement "regex schemas" for Malli. Now what do we even mean by "regex schemas"? Hasn't Malli had support for

(m/validate #"foo|bar" "foo") ;=> true

for quite a while now?

What I am talking about here are regular expressions of sequences and not just of strings. In an attempt to reduce confusion I will call them "seqexes" from now on, Seqexp having pioneered the concept before Clojure Spec was even a thing.

Most developers are probably familiar with regular expressions of strings and their basic forms. Here they are in (JVM) Clojure and Malli forms:

The empty string #"" or sequence [:cat]
Literal character #"a" or element schema :int
Concatenation #"ab" or [:cat :string :int]
Alternation #"a|b" or [:alt :string :int]
Kleene Star #"a*" or [:* :int]

The formalism also contains the empty set but regex and seqex engines do not because a regex/seqex/schema that never matches anything is not very useful in practice. Because Malli uses an S-expression or Hiccup-like syntax the empty sequence is just the concatenation of zero subexpressions [:cat].

I am quite the theoretical computer science person, which is why I was tasked with this feature. More pragmatic programmers might regard these additional operators as equally "basic":

Optional #"a?" or [:? :int]
Kleene Plus #"a+" or [:+ :int]
Repeat #"a{1, 5}" or [:repeat {:min 1, :max 5} :int]

even though they can often be implemented as syntactic sugar for the five fundamental constructs. There are many more regex operators in the wild but like Seqexp and Spec, Malli has a more minimal set of seqex operators. Many regex operators are text-specific while others are considered harmful. Seqexp does support lookaheads and non-greedy quantifiers, but even lookaheads may have been a mistake.

ExpressivityLink to Expressivity

Seqexes are useful for function schemas and they can handle quite complicated DSL:s too:

(m/parse
  [:schema {:registry {"hiccup" [:or
                                 [:catn
                                  [:name keyword?]
                                  [:props [:? [:map-of keyword? any?]]]
                                  [:children [:* [:schema [:ref "hiccup"]]]]]
                                 [:or nil? boolean? number? string?]]}}
   "hiccup"]
  [:div {:class [:foo :bar]}
   "Hello, world of data"])
;=> {:name :div, :props {:class [:foo :bar]}, :children ["Hello, world of data"]}

Macro arguments can of course be targeted as well as data DSL:s. Go for it!

Although Malli seqex schemas are actually limited to regular languages by design, so you can't do this:

(require '[clojure.spec.alpha :as s])

(s/def ::ws (s/* #(Character/isWhitespace %)))

(s/def ::sexpr
  (s/cat
    :lws ::ws
    :sexpr (s/alt
             :list (s/cat :lparen #(= \( %) :sexprs (s/* ::sexpr) :rparen #(= \) %))
             :symbol (s/cat :c #(Character/isJavaIdentifierStart %)
                            :cs (s/* #(Character/isJavaIdentifierPart %)))
             :int (s/+ #(Character/isDigit %)))
    :rws ::ws))

(s/valid? ::sexpr (seq "(foo (bar 1 2))"))
; => true

(s/valid? ::sexpr (seq "(foo (bar 1 2)"))
; => false

(Regular expressions are insufficient to handle recursively nested constructs like our lovely Lispy pairs of parentheses or HTML. That is also why we have Instaparse. And please don't do it with Spec either even if it happens to work as relying on emergent behaviour can be risky!)

RequirementsLink to Requirements

We have to extend the Malli schema functionalities to seqexes, which means:

Validation with validate; this is just language recognition. Language recognition is rarely sufficient but here we have it.
Getting validation errors with explain; similar to validate, but have to collect all errors from invalid seq elements as well as a seq that is too short or long.
Transformers (encode and decode); these are essentially parsing where the result just happens to be a seq of the same length with transformed elements (higher level encoder or decoder functions could of course slice and dice that seq later).
Destructuring and constructing sequences with parse and unparse; parse is well, parsing with seqexes. Full parsing with regular expressions is also a bit unusual; usually we just check whether the input matches and possibly extract some substrings of it. unparse is a kind of pretty printing but producing sequences instead of just strings and quite simple compared to the other functionalities. parse and unparse were actually added after the seqex schemas and are no-ops unless you use seqexes or :orn (which was only added with parse) or :multi (starting with Malli 0.4.0).

Possible Implementation StrategiesLink to Possible Implementation Strategies

There are three well-known approaches to implement regular expression matching:

Convert the regular expression to a deterministic finite automaton (DFA) (usually via a nondeterministic finite automaton (NFA) and the powerset construction) and run the DFA on the input. This is what foundational tools like grep and lex do.
Convert the regular expression to a NFA and run the NFA on the input. The Seqexp library uses the "PikeVM" variant of this, where the NFA is stored as a bytecode program for a highly multithreaded virtual machine (the threads are green and even more lightweight than general purpose green threads like running go blocks).
- One can also add an incrementally constructed DFA as a caching layer, sort of like a "DFA JIT". ANTLR 4's ALL(*) lookahead also works like this, except it has to simulate the augmented transition network (ATN) for a context-free or even context-sensitive grammar instead of just a NFA for a regular expression.
Use a backtracking parser similar to the typical implementation of parser combinators. This is how Perl, PCRE, Java etc. regular expressions work and must work because they have gone beyond regular languages with crazy stuff like backreferences. Caching can be added similarly to the memoizing Packrat parsing algorithms.

And then there is the forgotten jewel of Brzozowski's regular expression derivatives:

Derivatives can be used instead of NFA-conversion and the powerset construction to turn the regular expression into a DFA. This is used in the SML ml-ulex and Racket parser-tools/lex lexer generators.
It is also possible to do parsing by computing regex derivatives on the fly, which is how Clojure Spec regex operators are implemented.

Now suppose the length of the input is n and the length of the regex is r:

DFA matching takes $O(n)$ time and $O(1)$ space (just the index of the current state). However constructing the DFA takes $O(2^r)$ time and the size of the DFA is also $O(2^n)$ (although the usual case is much more reasonable of course).
NFA matching takes $O(rn)$ time and $O(r)$ space (e.g. the PikeVM needs $O(r)$ threads at most). Constructing the NFA only takes $O(r)$ time and so does storing it. The DFA cache gives speedup especially on longer inputs but does not change the bounds.
Backtracking takes $O(2^n)$ time and $O(n)$ space. Memoization can reduce the time to $O(n)$ while only increasing the space usage to $O(rn)$.
On the fly derivative matching takes $O(rn)$ time and $O(n^r)$ space. The optimizations from parsing with derivatives are surely applicable and might be able to reduce the space requirement to $O(r)$ but AFAIK Spec does not try to do that.

Obviously the backtracking strategy is a depth-first search. The other techniques are essentially breadth-first search. Typically depth-first search uses exponential time but linear space while breadth-first search uses linear time but exponential space. However because a string or even a general seq is quite a simple input there are tricks to make breadth-first search take very little space (converting to a DFA, deduplicating threads in the PikeVM) and the usual depth-first perf lifeline of memoization only requires linear space.

Overall the NFA approach seems like the winner; nice linear bounds without even the ahead of time DFA construction exponentiality issues. Actually since we are matching general seqs instead of strings the DFA approach isn't really applicable as it requires simulating running the NFA on every possible input string in advance. That is just fine on ASCII, tricky on UTF (but there are solutions) and a non-starter on every possible ~~Clojure datum~~ host object.

PikeVM Thread Deduplication Did not LastLink to PikeVM Thread Deduplication Did not Last

So I started implementing the regex operators based on the Seqexp code. We noticed that Seqexp is actually quite slow because its PikeVM engine is implemented in idiomatic Clojure with lots of intermediate sequences when ideally it would allocate just the thread table and no intermediate sequences at all. By writing Java in Clojure (or is it Rust in Clojure because I haven't written much Java at all...) I was able to make it 20x faster.

But then I realized that in Malli we need more than regular language recognition and subsequence extraction; at the time we already had transformers (encode/decode) and recently we added parse and unparse as well. Those produce sequences where potentially every element has been transformed somehow. The prefix of that sequence (or something from which it can be reconstructed) needs to be a part of the thread state, which makes the thread deduplication ineffective and we are back to $O(n^r)$ space usage. I saw no way around that, so had to pivot!

My perf Seqexp branch is still out there if anyone is interested.

Implementing Seqex ParsingLink to Implementing Seqex Parsing

So I went for (memoized) backtracking instead. Since the implementation is quite interesting both as a final achievement and as a journey I thought I'd explain it here. (Also, I do worry about the bus factor with this!)

Warning: Here Be Code (but, no Dragon Books!)

Introductions to parsing like to emphasize that the parser input is an abstract sequence of any "symbols of an alphabet" while in practise it is almost always a string or stream of characters or tokens. But our input really is any seq(uence) of any values, and it really does not affect this implementation much (but, cannot use DFA).

As usual with parsing combinators, our (sub)parsers are functions that take the input and return a tuple of the parsed prefix and the remaining input. Clojure vectors make bulky tuples (thus the attempts at clj-tuple and cambrian-collections) but we will later get rid of them rather incidentally. If the parser fails we just return :malli.core/invalid; explain has a separate implementation.

Parsing a single item consists of just using parse-item on the first element of the seq. parse-item should be a function that transforms the item or returns :malli.core/invalid:

(defn item-parser [parse-item]
  (fn [coll]
    (if (seq coll)
      (let [res (parse-item (first coll))]
        (if-not (miu/-invalid? res)
          [res (rest coll)]
          res))
      :malli.core/invalid)))

In practice the parse-item functions come from malli.core/-parser.

end-parser checks that the collection is empty. The old-school way to do this was to check for an 'EOF' sentinel or null byte, but with Clojure sequences we have empty?:

(defn end-parser [coll]
  (if (empty? coll)
    [nil coll]
    :malli.core/invalid))

As with many parsing libraries and parser generators, end-parser is needed to ensure that the entire input is consumed by the overall parser. Parsing just a prefix of the input makes the parsing functions composable, but the top level parser must not be satisfied with just a correct prefix -- that would lead to silent failures, the worst kind of bugs. Obviously all of the input must be validated when we have web-facing API:s; offering s/every in addition to s/coll-of is just another sign that Spec is not about web API:s.

The parser for :cat may look formidable, but really it is mostly just boring plumbing. The parsers are just run in sequence, the results collected into a vector with the remaining input threade through and any error returned immediately, also canceling the remaining parsers:

(defn cat-parser [& parsers]
  (reduce (fn [parser parser*]
            (fn [coll]
              (let [res (parser coll)]
                (if-not (miu/-invalid? res)
                  (let [[vs coll] res]
                    (let [res* (parser* coll)]
                      (if-not (miu/-invalid? res*)
                        (let [[v coll] res*]
                          [(conj vs v) coll])
                        res*)))
                  res))))
          (pure-parser []) parsers))

(And part of the bulk in cat-parser is just due to it being variadic (because we want :cat to be). In Haskell >>= and <*> are binary infix operators and the reduce (or, uh, foldr? foldl'?) part would not appear.)

cat-parser also uses a little utility pure-parser which just returns the fixed v and leaves the input alone:

(defn pure-parser [v] (fn [coll] [v coll]))

(In fancy terms, pure-parser 'embeds a pure value into the domain of parsing functions'; it is the 'monadic unit operation'.)

Complementarily, the parser for :alt tries the parsers for the alternatives in sequence and stops when it finds one that succeeds:

(defn alt-parser [& parsers]
  (reduce (fn [parser parser*]
            (fn [coll]
              (let [res (parser coll)]
                (if-not (miu/-invalid? res)
                  res
                  (parser* coll)))))
          parsers))

Note that the reduce in alt-parser and cat-parser is done at parser generation time. Combined with the JIT inlining the effect is similar to loop unrolling. This "compilation to closures" approach was pervasive in Malli already before the seqex schemas and is the reason to prefer e.g. m/validator over m/validate. The approach is familiar from Plumatic Schema, but is also described in SICP (section 4.1.7: Separating Syntactic Analysis from Execution), which mentions Jonathan Rees already using it in 1982! To get similarly specialized machine code for the PikeVM approach we would need an unlikely meta-tracing JIT.

For contrast, the parser for :* must loop at parsing time, because we do not know how many times its child schema will match:

(defn *-parser [parser]
  (letfn [(parser* [vs coll]
            (let [res (parser coll)]
              (if-not (miu/-invalid? res)
                (let [[v coll] res]
                  (recur (conj vs v) coll))
                [vs coll])))]
    (fn [coll] (parser* [] coll))))

Aside from the looping, it is similar to cat-parser in collecting the results to a vector and threading the remaining input through. But note how when the child parser fails, *-parser succeeds with the input and the previously collected vs; :* can just stop matching at any time or match nothing at all, so it never fails. (The overall parser can still fail thanks to the use of end-parser: consider e.g. (= false (m/validate [:* int?] ["foo"])).)

Since we have parser combinators, the parsers for :? and :+ can be derived in practice like in theory, by composing the combinators (Always Be Composing!):

(defn ?-parser [parser] (alt-parser parser (pure-parser nil)))

(defn +-parser [parser]
  (map-parser (fn [[v vs]] (cons v vs))
              (cat-parser parser (*-parser parser))))

+-parser also needs map-parser, which just parses with parser and, if successful, applies f to the parse result:

(defn map-parser [f parser]
  (fn [coll]
    (let [res (parser coll)]
      (if-not (miu/-invalid? res)
        (let [[v coll] res]
          [(f v) coll])
        res))))

Having map-parser technically makes our collection of parsing functions a functor (even if we are not extending any fancy Functor ~~typeclass~~ protocol). But don't worry, we won't make it a monad; that is only required for context-dependent parsing and we only want to handle regular grammars.

(Not that making it a monad would be any scarier, just change [] to ():

(defn flat-map-parser [f parser]
  (fn [coll]
    (let [res (parser coll)]
      (if-not (miu/-invalid? res)
        (let [[v coll] res]
          ((f v) coll))
        res))))

And a monad would also require pure-parser, but we had that already.)

In theory the :repeat parser could also be implemented with composition, but that would require duplicating the child parser max times and building some other closures besides. We prefer to implement the :repeat parser directly:

(defn repeat-parser [min max parser]
  (letfn [(compulsories [n vs coll]
            (if (< n min)
              (let [res (parser coll)]
                (if-not (miu/-invalid? res)
                  (let [[v coll] res]
                    (recur (inc n) (conj vs v) coll))
                  res))
              (optionals n vs coll)))

          (optionals [n vs coll]
            (if (< n max)
              (let [res (parser coll)]
                (if-not (miu/-invalid? res)
                  (let [[v coll] res]
                    (recur (inc n) (conj vs v) coll))
                  [vs coll]))
              [vs coll])]
    (fn [coll] (compulsories 0 [] coll))))

Quite similar to *-parser, but we also have a counter to deal with. Brings me back to Programming 101... after switching to FP I rarely get such juicy chances for off-by-one errors.

An NFA implementation like Seqexp has not choice but to add at least max nodes to the NFA. Well, the PikeVM used by Seqexp could be extended with a stack of registers for loops, but that would marr its elegance...

A GotchaLink to A Gotcha

Consider

(m/parse [:cat [:* pos?] [:= 4]] [2 4])

which boils down to

(let [parse (cat-parser (*-parser (item-parser (m/parser pos?)))
                        (item-parser (m/parser [:= 4])))]
  (parse [2 4]))

According to the semantics we have for seqex parsing, it should return [[2] 4]. But the result is :malli.core/invalid.

The problem is that not only is :* greedy (as it should be); it also commits to every consumption of a sequence element. Since (pos? 4) is true, [:* pos?] parses out [2 4], leaving [:= 4] with an empty seq when it wants [4]. For the overall parse to succeed, it needs to backtrack to when [:* pos?] only had [2] and exit the loop at that point, leaving [:= 4] with the input it needs. I think I have run into this already on my previous parser combinator adventures (of which I have had more than I care to admit).

CPS-Conversion and TrampoliningLink to CPS-Conversion and Trampolining

But how can the backtracking into :* be implemented? The (implicit) loop in *-parser cannot actually be restarted after it has been exited. We could give up on the 'compilation to closures' approach and make seqex parsing 'interpret' the schema tree instead. Then backtracking could be handled by resetting the current position in the schema, similar to Rob Pike's elegant C regex demo implementation which just rewinds a char *. But that would lose all the performance gained by precomputation when building the closures.

The solution is simple but potentially mindblowing: capture the continuation inside the loop. Since Clojure does not have any sort of call-with-current-continuation we have to convert the parsers to continuation passing style (CPS) ourselves to get the continuation. And because Clojure does not do general tail-call optimization either we also have to implement that by trampolining the CPS-converted code to replace the recur in *-parser. Buckle up, it is going to be a wild ride!

As in standard CPS, the converted parsers take a continuation parameter k that they can call to return or pass to other functions in tail calls. Usually continuations would have just one parameter, the return value. But we pass the parsed value and the remainder of the input as separate arguments, conveniently avoiding the vector allocations as promised. This threading through of the input coll resembles the handling of the state value in State monads, if you are familiar with those (and their implementation).

More unusually even for CPS the converted parsers also get a stack of fallback continuations. Calls to parsers may be pushed to that stack by 'parking' them. This is similar to how Prolog implementations mark backtracking points on their stacks. The Prolog connection is no accident; parsing can be conceptualized as a search on the grammar that is also guided by the input string and Prolog also uses a depth-first search.

*-parser parks epsilon on each iteration to make those available as restart locations (the line marked "remember fallback"):

(defn *-parser [parser]
  (letfn [(epsilon (fn [_ vs coll k] (k vs coll)))

          (parser* [stack vs coll k]
            (park*! stack epsilon vs coll k) ; remember fallback
            (parser stack coll
                    (fn [v coll] (park*! stack parser* (conj vs v) coll k))))] ; TCO
    (fn [stack coll k] (parser* stack [] coll k))))

This arrangement requires an external control loop to restart the call on top of the stack when parsing fails. We can also use that loop as a trampoline for TCO, as happens on the line marked "TCO".

The converted repeat does the same tricks, it just has more bookkeeping (as before):

(defn repeat-parser [min max parser]
  (letfn [(epsilon (fn [_ vs coll k] (k vs coll)))

          (compulsories [stack n vs coll k]
            (if (< n min)
              (parser stack coll
                      (fn [v coll] (park**! stack compulsories (inc n) (conj vs v) coll k))) ; TCO
              (optionals stack n vs coll k)))

          (optionals [stack n vs coll k]
            (if (< n max)
              (do
                (park*! stack epsilon vs coll k) ; remember fallback
                (parser stack coll
                        (fn [v coll] (park**! stack optionals (inc n) (conj vs v) coll k)))) ; TCO
              (k vs coll)))]
    (fn [stack coll k] (compulsories stack 0 [] coll k))))

We use parking for all backtracking, including alt-parser:

(defn alt-parser [& parsers]
  (reduce (fn [parser parser*]
            (fn [stack coll k]
              (park! stack parser* coll k) ; remember fallback
              (parser stack coll k)))
          parsers))

That is not strictly necessary at this point, but will be useful when we add caching later on.

The other basic parsers are just converted to our flavor of CPS, taking k and stack and pushing coll around:

(defn item-parser [parse-item]
  (fn [_ coll k]
    (when (seq coll)
      (let [res (parse-item (first coll))]
        (when-not (miu/-invalid? res)
          (k res (rest coll)))))))

(defn end-parser [_ coll k]
  (when (empty? coll)
    (k nil coll)))

(defn cat-parser [& parsers]
  (let [p (reduce (fn [rest-parser parser]
                    (fn [stack vs coll k]
                      (parser stack coll
                              (fn [v coll] (rest-parser stack (conj vs v) coll k)))))
                  (reverse parsers))]
    (fn [stack coll k] (p stack [] coll k))))

Since ?-parser and +-parser are just defined in terms of the other combinators, they don't need to change. But the pure-parser and map-parser that they use do need to be CPS-converted:

(defn pure-parser [v] (fn [_ coll k] (k v coll)))

(defn map-parser [f parser]
  (fn [stack coll k]
    (parser stack coll
            (fn [v coll] (k (f v) coll)))))

parser makes sure that the entire input is consumed by appending end-parser. It also contains the backtracking and trampolining loop, which just performs calls from the stack until one of them finishes successfully by calling the final success continuation which sets success and result or until the backtracking stack is empty:

(defn parser [p]
  (let [p (cat-parser p (end-parser))]
    (fn [coll]
      (if (sequential? coll)
        (let [stack (make-stack)
              success (volatile! false)
              result (volatile! nil)]
          (p stack coll (fn [v _ _] (vreset! success true) (vreset! result v)))
          (if @success
            (first @result)
            (loop []
              (if-some [thunk (pop-thunk! stack)]
                (do
                  (thunk)
                  (if @success
                    (first @result)
                    (recur)))
                :malli.core/invalid))))
        :malli.core/invalid))))

Before adding caching, the stack is just a java.util.ArrayDeque or a JS array:

(defn make-stack [] #?(:clj (ArrayDeque.), :cljs #js []))

(defn- empty-stack? [^ArrayDeque stack]
  #?(:clj (.isEmpty stack), :cljs (zero? (alength stack))))

(defn park! [^ArrayDeque stack parser coll k] (.push stack #(parser stack coll k)))

(defn pop-thunk! [^ArrayDeque stack] (when-not (empty-stack? stack) (.pop stack)))

We can use volatile! and mutable stacks freely because the parsers we create are single threaded. parse-item can (at least theoretically) be multi-threaded, but that cannot affect the parsing thread since we do not pass continuations into parse-item or anything like that. Besides, the parsers returned by parser are externally pure, just like into uses transients internally whenever possible.

MemoizationLink to Memoization

Opposite to the NFA, backtracking is very space-efficient but has exponential worst-case time behaviour. As is often the case, we can use caching to gain linear time behaviour at the expense of more space usage. Happily, our cache will only use linear space also, as previously promised.

We can avoid visiting backtracking points more than once by simply discarding the parsing function and its arguments instead of pushing them to the stack when we have already seen the particular combination of input and seqex positions. The position in the input is already captured by the remainder of the input coll but I added the input index pos everywhere because integers make much better hash map keys than arbitrary sequences. The position in the seqex could be implemented as a reference to the schema subtree but instead I found it convenient to just guarantee that every schema subtree gets its own parsing function which can be used as the key (by pointer equality and identity hash).

Perhaps surprisingly the counter in :repeat adds quite a bit of complication to this memoization. Separate iterations of :repeat with a different counter should be recognized as separate positions in the seqex, so the counter needs to be added to the hash key along with the seq index and the parsing function. Furthermore there can be nested :repeats so a stack of all in-progress :repeat counters is actually required. Oh well, at least the counters themselves are just integers.

Having dealt with all these algorithmic complications I was faced with a more industrial one; I wanted a mutable hash set for the memoization cache (the parsers and arguments going to the separate stack), but Javascript only got one in ES6. I did try to use transient persistent sets in volatile!s for #?(:cljs ... but the performance was completely unacceptable. Google Closure can polyfill ES6 sets with the right options, but those options can only be set in the project that actually does the compilation to JS, e.g. your frontend app. And Malli is so general (and divisive) that it is likely to end up in some applications as a transitive dependency only.

So I made my own mutable hash set in some 40 lines of Clojure. It uses quadratic probing with power of two sizes and triangular numbers, another neat trick that could fill another blog post. Amusingly it was actually slightly faster than java.util.HashSet on Java 8 and roughly equivalent with it on newer JDK:s so I switched to it on JVM as well. But more importantly Malli seqexes now had good performance on the ClojureScript side too.

You can see the actual Malli code for the final algorithm with all the complications from memoization and associated issues tackled and also all the variations required to implement validator, explainer, decoder and encoder in addition to parse. The unparse implementation is as straightforward for seqexes as for everything else and has nothing to do with all these parsing techniques.

PerformanceLink to Performance

As the README says and as you might have gathered by now:

Although a lot of effort has gone into making the seqex implementation fast

(require '[clojure.spec.alpha :as s])
(require '[criterium.core :as cc])

(let [valid? (partial s/valid? (s/* int?))]
  (cc/quick-bench (valid? (range 10)))) ; Execution time mean : 27µs

(let [valid? (m/validator [:* int?])]
  (cc/quick-bench (valid? (range 10)))) ; Execution time mean : 2.7µs

it is always better to use less general tools whenever possible:

(let [valid? (partial s/valid? (s/coll-of int?))]
  (cc/quick-bench (valid? (range 10)))) ; Execution time mean : 1.8µs

(let [valid? (m/validator [:sequential int?])]
  (cc/quick-bench (valid? (range 10)))) ; Execution time mean : 0.12µs

Going BeyondLink to Going Beyond

The implementation here is quite similar to GLL parsing except we don't have to care about continuations in the caching since regular languages don't include (nontail) recursion. https://epsil.github.io/gll/ is a nice explanation of GLL parser combinators and has links to papers etc. It also inspired Instaparse, which Engelberg had a presentation about at Clojure/West 2014.
If you caught the parsing bug Parsing Techniques is the hefty tome to scratch that itch / scare you away.

AssessmentLink to Assessment

The Malli seqex implementation was an exceptional project. It is not often that parsing theory is so directly applicable to a practical problem and there is no pre-existing parsing tool that can just be used (although Seqexp did come close!). In retrospect all the issues from pivot-level to merely annoying also made the experience greater; it is exhilarating when you are able to pull aces like CPS and custom hash tables (with bit-twiddling and all) from your sleeve instead of having to admit defeat.

I feel that the most masterful code comes from "hunting high and low", reaching for mindbending theory as well as the bits and JITs at hand. Of course that is also the recipe to find amazing ways to shoot yourself in the foot with C++. As with many other things, the hard-earned skill is to know the difference (and no, it is not just C++!).

Pauli Jaakkola

Wednesday, 31 March, 2021

Pauli Jaakkola

Technology

Structure and Interpretation of Malli Regex Schemas

Seqex schemas shipped with Malli 0.3.0. This is their story. A story with many twists and turns and technical details Malli users don't need to know.

IntroductionLink to Introduction

In December 2020 Tommi called on me to implement "regex schemas" for Malli. Now what do we even mean by "regex schemas"? Hasn't Malli had support for

(m/validate #"foo|bar" "foo") ;=> true

for quite a while now?

Most developers are probably familiar with regular expressions of strings and their basic forms. Here they are in (JVM) Clojure and Malli forms:

The empty string #"" or sequence [:cat]
Literal character #"a" or element schema :int
Concatenation #"ab" or [:cat :string :int]
Alternation #"a|b" or [:alt :string :int]
Kleene Star #"a*" or [:* :int]

I am quite the theoretical computer science person, which is why I was tasked with this feature. More pragmatic programmers might regard these additional operators as equally "basic":

Optional #"a?" or [:? :int]
Kleene Plus #"a+" or [:+ :int]
Repeat #"a{1, 5}" or [:repeat {:min 1, :max 5} :int]

ExpressivityLink to Expressivity

Seqexes are useful for function schemas and they can handle quite complicated DSL:s too:

(m/parse
  [:schema {:registry {"hiccup" [:or
                                 [:catn
                                  [:name keyword?]
                                  [:props [:? [:map-of keyword? any?]]]
                                  [:children [:* [:schema [:ref "hiccup"]]]]]
                                 [:or nil? boolean? number? string?]]}}
   "hiccup"]
  [:div {:class [:foo :bar]}
   "Hello, world of data"])
;=> {:name :div, :props {:class [:foo :bar]}, :children ["Hello, world of data"]}

Macro arguments can of course be targeted as well as data DSL:s. Go for it!

Although Malli seqex schemas are actually limited to regular languages by design, so you can't do this:

(require '[clojure.spec.alpha :as s])

(s/def ::ws (s/* #(Character/isWhitespace %)))

(s/def ::sexpr
  (s/cat
    :lws ::ws
    :sexpr (s/alt
             :list (s/cat :lparen #(= \( %) :sexprs (s/* ::sexpr) :rparen #(= \) %))
             :symbol (s/cat :c #(Character/isJavaIdentifierStart %)
                            :cs (s/* #(Character/isJavaIdentifierPart %)))
             :int (s/+ #(Character/isDigit %)))
    :rws ::ws))

(s/valid? ::sexpr (seq "(foo (bar 1 2))"))
; => true

(s/valid? ::sexpr (seq "(foo (bar 1 2)"))
; => false

RequirementsLink to Requirements

We have to extend the Malli schema functionalities to seqexes, which means:

Validation with validate; this is just language recognition. Language recognition is rarely sufficient but here we have it.
Getting validation errors with explain; similar to validate, but have to collect all errors from invalid seq elements as well as a seq that is too short or long.
Transformers (encode and decode); these are essentially parsing where the result just happens to be a seq of the same length with transformed elements (higher level encoder or decoder functions could of course slice and dice that seq later).
Destructuring and constructing sequences with parse and unparse; parse is well, parsing with seqexes. Full parsing with regular expressions is also a bit unusual; usually we just check whether the input matches and possibly extract some substrings of it. unparse is a kind of pretty printing but producing sequences instead of just strings and quite simple compared to the other functionalities. parse and unparse were actually added after the seqex schemas and are no-ops unless you use seqexes or :orn (which was only added with parse) or :multi (starting with Malli 0.4.0).

Possible Implementation StrategiesLink to Possible Implementation Strategies

There are three well-known approaches to implement regular expression matching:

Convert the regular expression to a deterministic finite automaton (DFA) (usually via a nondeterministic finite automaton (NFA) and the powerset construction) and run the DFA on the input. This is what foundational tools like grep and lex do.
Convert the regular expression to a NFA and run the NFA on the input. The Seqexp library uses the "PikeVM" variant of this, where the NFA is stored as a bytecode program for a highly multithreaded virtual machine (the threads are green and even more lightweight than general purpose green threads like running go blocks).
- One can also add an incrementally constructed DFA as a caching layer, sort of like a "DFA JIT". ANTLR 4's ALL(*) lookahead also works like this, except it has to simulate the augmented transition network (ATN) for a context-free or even context-sensitive grammar instead of just a NFA for a regular expression.
Use a backtracking parser similar to the typical implementation of parser combinators. This is how Perl, PCRE, Java etc. regular expressions work and must work because they have gone beyond regular languages with crazy stuff like backreferences. Caching can be added similarly to the memoizing Packrat parsing algorithms.

And then there is the forgotten jewel of Brzozowski's regular expression derivatives:

Derivatives can be used instead of NFA-conversion and the powerset construction to turn the regular expression into a DFA. This is used in the SML ml-ulex and Racket parser-tools/lex lexer generators.
It is also possible to do parsing by computing regex derivatives on the fly, which is how Clojure Spec regex operators are implemented.

Now suppose the length of the input is n and the length of the regex is r:

DFA matching takes $O(n)$ time and $O(1)$ space (just the index of the current state). However constructing the DFA takes $O(2^r)$ time and the size of the DFA is also $O(2^n)$ (although the usual case is much more reasonable of course).
NFA matching takes $O(rn)$ time and $O(r)$ space (e.g. the PikeVM needs $O(r)$ threads at most). Constructing the NFA only takes $O(r)$ time and so does storing it. The DFA cache gives speedup especially on longer inputs but does not change the bounds.
Backtracking takes $O(2^n)$ time and $O(n)$ space. Memoization can reduce the time to $O(n)$ while only increasing the space usage to $O(rn)$.
On the fly derivative matching takes $O(rn)$ time and $O(n^r)$ space. The optimizations from parsing with derivatives are surely applicable and might be able to reduce the space requirement to $O(r)$ but AFAIK Spec does not try to do that.

PikeVM Thread Deduplication Did not LastLink to PikeVM Thread Deduplication Did not Last

My perf Seqexp branch is still out there if anyone is interested.

Implementing Seqex ParsingLink to Implementing Seqex Parsing

Warning: Here Be Code (but, no Dragon Books!)

Parsing a single item consists of just using parse-item on the first element of the seq. parse-item should be a function that transforms the item or returns :malli.core/invalid:

(defn item-parser [parse-item]
  (fn [coll]
    (if (seq coll)
      (let [res (parse-item (first coll))]
        (if-not (miu/-invalid? res)
          [res (rest coll)]
          res))
      :malli.core/invalid)))

In practice the parse-item functions come from malli.core/-parser.

end-parser checks that the collection is empty. The old-school way to do this was to check for an 'EOF' sentinel or null byte, but with Clojure sequences we have empty?:

(defn end-parser [coll]
  (if (empty? coll)
    [nil coll]
    :malli.core/invalid))

(defn cat-parser [& parsers]
  (reduce (fn [parser parser*]
            (fn [coll]
              (let [res (parser coll)]
                (if-not (miu/-invalid? res)
                  (let [[vs coll] res]
                    (let [res* (parser* coll)]
                      (if-not (miu/-invalid? res*)
                        (let [[v coll] res*]
                          [(conj vs v) coll])
                        res*)))
                  res))))
          (pure-parser []) parsers))

cat-parser also uses a little utility pure-parser which just returns the fixed v and leaves the input alone:

(defn pure-parser [v] (fn [coll] [v coll]))

(In fancy terms, pure-parser 'embeds a pure value into the domain of parsing functions'; it is the 'monadic unit operation'.)

Complementarily, the parser for :alt tries the parsers for the alternatives in sequence and stops when it finds one that succeeds:

(defn alt-parser [& parsers]
  (reduce (fn [parser parser*]
            (fn [coll]
              (let [res (parser coll)]
                (if-not (miu/-invalid? res)
                  res
                  (parser* coll)))))
          parsers))

For contrast, the parser for :* must loop at parsing time, because we do not know how many times its child schema will match:

(defn *-parser [parser]
  (letfn [(parser* [vs coll]
            (let [res (parser coll)]
              (if-not (miu/-invalid? res)
                (let [[v coll] res]
                  (recur (conj vs v) coll))
                [vs coll])))]
    (fn [coll] (parser* [] coll))))

Since we have parser combinators, the parsers for :? and :+ can be derived in practice like in theory, by composing the combinators (Always Be Composing!):

(defn ?-parser [parser] (alt-parser parser (pure-parser nil)))

(defn +-parser [parser]
  (map-parser (fn [[v vs]] (cons v vs))
              (cat-parser parser (*-parser parser))))

+-parser also needs map-parser, which just parses with parser and, if successful, applies f to the parse result:

(defn map-parser [f parser]
  (fn [coll]
    (let [res (parser coll)]
      (if-not (miu/-invalid? res)
        (let [[v coll] res]
          [(f v) coll])
        res))))

(Not that making it a monad would be any scarier, just change [] to ():

(defn flat-map-parser [f parser]
  (fn [coll]
    (let [res (parser coll)]
      (if-not (miu/-invalid? res)
        (let [[v coll] res]
          ((f v) coll))
        res))))

And a monad would also require pure-parser, but we had that already.)

(defn repeat-parser [min max parser]
  (letfn [(compulsories [n vs coll]
            (if (< n min)
              (let [res (parser coll)]
                (if-not (miu/-invalid? res)
                  (let [[v coll] res]
                    (recur (inc n) (conj vs v) coll))
                  res))
              (optionals n vs coll)))

          (optionals [n vs coll]
            (if (< n max)
              (let [res (parser coll)]
                (if-not (miu/-invalid? res)
                  (let [[v coll] res]
                    (recur (inc n) (conj vs v) coll))
                  [vs coll]))
              [vs coll])]
    (fn [coll] (compulsories 0 [] coll))))

Quite similar to *-parser, but we also have a counter to deal with. Brings me back to Programming 101... after switching to FP I rarely get such juicy chances for off-by-one errors.

A GotchaLink to A Gotcha

Consider

(m/parse [:cat [:* pos?] [:= 4]] [2 4])

which boils down to

(let [parse (cat-parser (*-parser (item-parser (m/parser pos?)))
                        (item-parser (m/parser [:= 4])))]
  (parse [2 4]))

According to the semantics we have for seqex parsing, it should return [[2] 4]. But the result is :malli.core/invalid.

CPS-Conversion and TrampoliningLink to CPS-Conversion and Trampolining

*-parser parks epsilon on each iteration to make those available as restart locations (the line marked "remember fallback"):

(defn *-parser [parser]
  (letfn [(epsilon (fn [_ vs coll k] (k vs coll)))

          (parser* [stack vs coll k]
            (park*! stack epsilon vs coll k) ; remember fallback
            (parser stack coll
                    (fn [v coll] (park*! stack parser* (conj vs v) coll k))))] ; TCO
    (fn [stack coll k] (parser* stack [] coll k))))

The converted repeat does the same tricks, it just has more bookkeeping (as before):

(defn repeat-parser [min max parser]
  (letfn [(epsilon (fn [_ vs coll k] (k vs coll)))

          (compulsories [stack n vs coll k]
            (if (< n min)
              (parser stack coll
                      (fn [v coll] (park**! stack compulsories (inc n) (conj vs v) coll k))) ; TCO
              (optionals stack n vs coll k)))

          (optionals [stack n vs coll k]
            (if (< n max)
              (do
                (park*! stack epsilon vs coll k) ; remember fallback
                (parser stack coll
                        (fn [v coll] (park**! stack optionals (inc n) (conj vs v) coll k)))) ; TCO
              (k vs coll)))]
    (fn [stack coll k] (compulsories stack 0 [] coll k))))

We use parking for all backtracking, including alt-parser:

(defn alt-parser [& parsers]
  (reduce (fn [parser parser*]
            (fn [stack coll k]
              (park! stack parser* coll k) ; remember fallback
              (parser stack coll k)))
          parsers))

That is not strictly necessary at this point, but will be useful when we add caching later on.

The other basic parsers are just converted to our flavor of CPS, taking k and stack and pushing coll around:

(defn item-parser [parse-item]
  (fn [_ coll k]
    (when (seq coll)
      (let [res (parse-item (first coll))]
        (when-not (miu/-invalid? res)
          (k res (rest coll)))))))

(defn end-parser [_ coll k]
  (when (empty? coll)
    (k nil coll)))

(defn cat-parser [& parsers]
  (let [p (reduce (fn [rest-parser parser]
                    (fn [stack vs coll k]
                      (parser stack coll
                              (fn [v coll] (rest-parser stack (conj vs v) coll k)))))
                  (reverse parsers))]
    (fn [stack coll k] (p stack [] coll k))))

Since ?-parser and +-parser are just defined in terms of the other combinators, they don't need to change. But the pure-parser and map-parser that they use do need to be CPS-converted:

(defn pure-parser [v] (fn [_ coll k] (k v coll)))

(defn map-parser [f parser]
  (fn [stack coll k]
    (parser stack coll
            (fn [v coll] (k (f v) coll)))))

(defn parser [p]
  (let [p (cat-parser p (end-parser))]
    (fn [coll]
      (if (sequential? coll)
        (let [stack (make-stack)
              success (volatile! false)
              result (volatile! nil)]
          (p stack coll (fn [v _ _] (vreset! success true) (vreset! result v)))
          (if @success
            (first @result)
            (loop []
              (if-some [thunk (pop-thunk! stack)]
                (do
                  (thunk)
                  (if @success
                    (first @result)
                    (recur)))
                :malli.core/invalid))))
        :malli.core/invalid))))

Before adding caching, the stack is just a java.util.ArrayDeque or a JS array:

(defn make-stack [] #?(:clj (ArrayDeque.), :cljs #js []))

(defn- empty-stack? [^ArrayDeque stack]
  #?(:clj (.isEmpty stack), :cljs (zero? (alength stack))))

(defn park! [^ArrayDeque stack parser coll k] (.push stack #(parser stack coll k)))

(defn pop-thunk! [^ArrayDeque stack] (when-not (empty-stack? stack) (.pop stack)))

MemoizationLink to Memoization

PerformanceLink to Performance

As the README says and as you might have gathered by now:

Although a lot of effort has gone into making the seqex implementation fast

(require '[clojure.spec.alpha :as s])
(require '[criterium.core :as cc])

(let [valid? (partial s/valid? (s/* int?))]
  (cc/quick-bench (valid? (range 10)))) ; Execution time mean : 27µs

(let [valid? (m/validator [:* int?])]
  (cc/quick-bench (valid? (range 10)))) ; Execution time mean : 2.7µs

it is always better to use less general tools whenever possible:

(let [valid? (partial s/valid? (s/coll-of int?))]
  (cc/quick-bench (valid? (range 10)))) ; Execution time mean : 1.8µs

(let [valid? (m/validator [:sequential int?])]
  (cc/quick-bench (valid? (range 10)))) ; Execution time mean : 0.12µs

Going BeyondLink to Going Beyond

The implementation here is quite similar to GLL parsing except we don't have to care about continuations in the caching since regular languages don't include (nontail) recursion. https://epsil.github.io/gll/ is a nice explanation of GLL parser combinators and has links to papers etc. It also inspired Instaparse, which Engelberg had a presentation about at Clojure/West 2014.
If you caught the parsing bug Parsing Techniques is the hefty tome to scratch that itch / scare you away.

AssessmentLink to Assessment

Pauli Jaakkola

Wednesday, 31 March, 2021

Structure and Interpretation of Malli Regex Schemas

IntroductionLink to Introduction

ExpressivityLink to Expressivity

RequirementsLink to Requirements

Possible Implementation StrategiesLink to Possible Implementation Strategies

PikeVM Thread Deduplication Did not LastLink to PikeVM Thread Deduplication Did not Last

Implementing Seqex ParsingLink to Implementing Seqex Parsing

A GotchaLink to A Gotcha

CPS-Conversion and TrampoliningLink to CPS-Conversion and Trampolining

MemoizationLink to Memoization

PerformanceLink to Performance

Going BeyondLink to Going Beyond

AssessmentLink to Assessment

Related posts

From Agentic AI to Operational Intelligence

What SaaStr's AI Agent Deployment Actually Teaches Us

The AI Tools Work. The Hard Part Is Everything Around Them.

Structure and Interpretation of Malli Regex Schemas

IntroductionLink to Introduction

ExpressivityLink to Expressivity

RequirementsLink to Requirements

Possible Implementation StrategiesLink to Possible Implementation Strategies

PikeVM Thread Deduplication Did not LastLink to PikeVM Thread Deduplication Did not Last

Implementing Seqex ParsingLink to Implementing Seqex Parsing

A GotchaLink to A Gotcha

CPS-Conversion and TrampoliningLink to CPS-Conversion and Trampolining

MemoizationLink to Memoization

PerformanceLink to Performance

Going BeyondLink to Going Beyond

AssessmentLink to Assessment

Related posts

From Agentic AI to Operational Intelligence

What SaaStr's AI Agent Deployment Actually Teaches Us

The AI Tools Work. The Hard Part Is Everything Around Them.