Hand-written markup language parser in ClojureScript

Broadly speaking, there are two ways to write a parser:

Obviously, using parser generators usually allows for a faster time-to-market. I wanted, however, to write a parser by hand for these reasons:

  • To see how hard it is in ClojureScript
  • No external dependencies
  • No need to have a separate compilation step
  • Informal, i.e. no grammars involved

Markup language

I’m going to develop a parser for a markup language with the following rules:

  • Header
= Header 1
== Header 2
=== Header 3
  • List
- Unordered 11
  # Ordered 21
  # Ordered 22
- Unordered 12
  • Table
|>header1|header2|
|cell11|cell12|
|cell21|cell22|
  • Code with language specification
!clojure
(def a 1)
!
  • Inline code
Some `inline code` here
  • Bold, italics
*bold*
_italics_
_italics and *bold italics* words_
  • Horizontal line
---
  • Manual and automatic links
[http://example.com/|my http site]
http://example.com/

[ftp://example.com/|my ftp site]
ftp://example.com/

[ssh://example.com/|ssh link]
ssh://example.com/

[mailto:me@example.com|my mail]
mailto:me@example.com
  • Image
@/favicon.png|My favorite icon

Parser

The goals are:

  • Make sure all cases are covered by tests
  • Make it work for usual cases
  • Error handling is not the focus
  • Make a parser that produces an AST
  • Have an AST->HTML output producer

I’ll do it in several steps:

  • Make the line-based parser
  • Enhance it to combine blocks
  • Make the HTML output functionality

I did some research on this and I think the most similar methodology is one described in this paper: A Nanopass Framework for Compiler Education.

In this post, I’ll focus on the first part – parsing all of the above functionality on a per-line basis.

Individual line tests and parsers

Here are line-restricted parsers for each of the blocks:

Header

  • Test:
(deftest parse-header-line-test
  (is (= {:type :header
          :level 1
          :child {:type :unprc
                  :text "header"}}
         (parse-header-line \= "= header")))
  (is (= {:type :header
          :level 3
          :child {:type :unprc
                  :text "header"}}
         (parse-header-line \= "=== header"))))
  • Implementation
(defn parse-header-line [first-imp-ch line]
  (let [level (count (take-while #(= \= %1) line))]
    {:type :header
     :level level
     :child {:type :unprc
             :text (triml (.substring line level))}}))

Horizontal line

  • Test
(deftest parse-hor-line-line-test
  (is (= {:type :hor-line}
         (parse-hor-line-line \- "---"))))
  • Implementation
(defn parse-hor-line-line [first-imp-ch line]
  {:type :hor-line})

List item

  • Test
(deftest parse-list-line-test
  (is (= {:type :list-item
          :sub-type :unord
          :indent 0
          :child {:type :unprc
                  :text "unord"}}
         (parse-list-line \- "- unord")))
  (is (= {:type :list-item
          :sub-type :unord
          :indent 1
          :child {:type :unprc
                  :text "unord"}}
         (parse-list-line \- "  - unord")))
  (is (= {:type :list-item
          :sub-type :ord
          :indent 0
          :child {:type :unprc
                  :text "ord"}}
         (parse-list-line \# "# ord")))
  (is (= {:type :list-item
          :sub-type :ord
          :indent 2
          :child {:type :unprc
                  :text "ord"}}
         (parse-list-line \# "    # ord"))))
  • Implementation
[(defn parse-list-line [first-imp-ch line]
  {:type :list-item
   :sub-type (case first-imp-ch
               \- :unord
               \# :ord)
   :indent (/ (.indexOf line (int first-imp-ch)) 2)
   :child {:type :unprc
           :text (triml (.substring (triml line) 1))}})

Table row

  • Test
(deftest parse-table-line-test
  (is (= {:type :table-row
          :sub-type :data
          :cols [{:type :unprc
                  :text "c1"}
                 {:type :unprc
                  :text "c2"}]}
       (parse-table-line \| "|c1|c2|")))
  (is (= {:type :table-row
          :sub-type :header
          :cols [{:type :unprc
                  :text "h1"}
                 {:type :unprc
                  :text "h2"}]}
       (parse-table-line \| "|>h1|h2|"))))
  • Implementation
[(defn parse-table-line [first-imp-ch line]
  {:type :table-row
   :sub-type (case (second (triml line))
               \> :header
               :data)
   :cols (map (fn 
                {:type :unprc
                 :text text})
              (split (replace-first line #"^\|>?" "") #"\|"))})

Code rim

  • Test
(deftest parse-code-line-test
  (is (= {:type :code-rim
          :language ""}
         (parse-code-line \! "!")))
  (is (= {:type :code-rim
          :language "clojure"}
         (parse-code-line \! "! clojure"))))
  • Implementation
(defn parse-code-line [first-imp-ch line]
  {:type :code-rim
   :language (triml (.substring line 1))})

Image

  • Test
(deftest parse-image-line-test
  (is (let [line "@/favicon.png"]
        (= {:type :image
            :href "/favicon.png"
            :title "/favicon.png"}
           (parse-image-line (first line) line))))
  (is (let [line "@/favicon.png|My favorite png"]
        (= {:type :image
            :href "/favicon.png"
            :title "My favorite png"}
           (parse-image-line (first line) line)))))
  • Implementation
(defn parse-image-line [first-imp-ch line]
  (let [[href title] (split (.substring line 1) #"\|")
        good-title (if (blank? title) href title)]
    {:type :image
     :href href
     :title good-title}))

These are all quite readable.

Top level test

Here’s a top level test:

(deftest parse-test
  (let [lines ["= Header _italic *bold*_"
               "== Header 2"
               "---"
               "- Item 11"
               "  # Item `21`"
               "  # Item *22*"
               "- Item 12"
               "|>header1|header2|"
               "|c11|*c12*|"
               "|_c21_|*c _22_*|"
               "!clojure"
               "(def a 1)"
               "!"
               "http http://example.com/ _my website_"
               "mailto:me@example.com"
               "@/favicon.png|My favorite icon"]
        text (join "\n" lines)]
    (let [exp [{:type :header
                :level 1
                :child {:type :unprc
                        :text "Header _italic *bold*_"}}
               {:type :header
                :level 2
                :child {:type :unprc
                        :text "Header 2"}}
               {:type :hor-line}
               {:type :list-item
                :sub-type :unord
                :indent 0
                :child {:type :unprc
                        :text "Item 11"}}
               {:type :list-item
                :sub-type :ord
                :indent 1
                :child {:type :unprc
                        :text "Item `21`"}}
               {:type :list-item
                :sub-type :ord
                :indent 1
                :child {:type :unprc
                        :text "Item *22*"}}
               {:type :list-item
                :sub-type :unord
                :indent 0
                :child {:type :unprc
                        :text "Item 12"}}
               {:type :table-row
                :sub-type :header
                :cols [{:type :unprc
                        :text "header1"}
                       {:type :unprc
                        :text "header2"}]}
               {:type :table-row
                :sub-type :data
                :cols [{:type :unprc
                        :text "c11"}
                       {:type :unprc
                        :text "*c12*"}]}
               {:type :table-row
                :sub-type :data
                :cols [{:type :unprc
                        :text "_c21_"}
                       {:type :unprc
                        :text "*c _22_*"}]}
               {:type :code-rim
                :language "clojure"}
               {:type :unprc
                :text "(def a 1)"}
               {:type :code-rim
                :language ""}
               {:type :unprc
                :text "http http://example.com/ _my website_"}
               {:type :unprc
                :text "mailto:me@example.com"}
               {:type :image
                :href "/favicon.png"
                :title "My favorite icon"}]
          act (parse text)
          pairs (map vector exp act)]
      (doseq [[exp act] pairs]
        (is (= exp act))))))

It covers a lot of the above in one shot. You can see that so far it doesn’t work on some items, which are left as :unprc until the code for them is ready.

Top level parser

This is a top level parser:

(defn parse-line [line]
  (let [trimmed (triml line)
        first-imp-ch (first trimmed)]
    (case first-imp-ch
      \= (parse-header-line first-imp-ch line)
      (\- \#) (if (= "---" line)
                (parse-hor-line-line first-imp-ch line)
                (parse-list-line first-imp-ch line))
      \| (parse-table-line first-imp-ch line)
      \! (parse-code-line first-imp-ch line)
      \@ (parse-image-line first-imp-ch line)
      {:type :unprc
       :text line})))

(defn parse-lines [lines]
  (for [line lines]
    (parse-line line)))

(defn root-parser [lines]
  (parse-lines lines))

(defn parse [str]
  (root-parser (split str #"\n")))

Basically it will:

  • Split the given text into lines
  • Process each line one by one
  • The parse-line function is a router that routes to individual parsers for each of the above categories

Each line is substituted by a node describing it’s type:

  • Header
  • List item
  • Table row
  • Code rim (i.e. start / end of the code block)
  • Image
  • Unprocessed yet (:unprc symbol)

where each of these are parsed separately by the functions mentioned previously.

Conclusion and next steps

At the end, the code is a bit spread out (mostly tests), but I’m happy about the way it turned out.

With the above, the grand test passes fine, so all is good so far. Next time I’ll work on gluing these together, so e.g. a number of list items forms a list at the end and similar for other multi-line blocks. Also, I need to implement the things that are contained within line (such as bold and italic markup).