Hand-written markup language parser in ClojureScript

Broadly speaking, there are two ways to write a parser:

Obviously, using parser generators usually allows for a faster time-to-market. I wanted, however, to write a parser by hand for these reasons:

  • To see how hard it is in ClojureScript
  • No external dependencies
  • No need to have a separate compilation step
  • Informal, i.e. no grammars involved

Markup language

I’m going to develop a parser for a markup language with the following rules:

  • Header
= Header 1
== Header 2
=== Header 3
  • List
- Unordered 11
  # Ordered 21
  # Ordered 22
- Unordered 12
  • Table
|>header1|header2|
|cell11|cell12|
|cell21|cell22|
  • Code with language specification
!clojure
(def a 1)
!
  • Inline code
Some `inline code` here
  • Bold, italics
*bold*
_italics_
_italics and *bold italics* words_
  • Horizontal line
---
  • Manual and automatic links
[http://example.com/|my http site]
http://example.com/

[ftp://example.com/|my ftp site]
ftp://example.com/

[ssh://example.com/|ssh link]
ssh://example.com/

[mailto:me@example.com|my mail]
mailto:me@example.com
  • Image
@/favicon.png|My favorite icon

Parser

The goals are:

  • Make sure all cases are covered by tests
  • Make it work for usual cases
  • Error handling is not the focus
  • Make a parser that produces an AST
  • Have an AST->HTML output producer

I’ll do it in several steps:

  • Make the line-based parser
  • Enhance it to combine blocks
  • Make the HTML output functionality

I did some research on this and I think the most similar methodology is one described in this paper: A Nanopass Framework for Compiler Education.

In this post, I’ll focus on the first part – parsing all of the above functionality on a per-line basis.

Individual line tests and parsers

Here are line-restricted parsers for each of the blocks:

Header

  • Test:
(deftest parse-header-line-test
  (is (= {:type :header
          :level 1
          :child {:type :unprc
                  :text "header"}}
         (parse-header-line \= "= header")))
  (is (= {:type :header
          :level 3
          :child {:type :unprc
                  :text "header"}}
         (parse-header-line \= "=== header"))))
  • Implementation
(defn parse-header-line [first-imp-ch line]
  (let [level (count (take-while #(= \= %1) line))]
    {:type :header
     :level level
     :child {:type :unprc
             :text (triml (.substring line level))}}))

Horizontal line

  • Test
(deftest parse-hor-line-line-test
  (is (= {:type :hor-line}
         (parse-hor-line-line \- "---"))))
  • Implementation
(defn parse-hor-line-line [first-imp-ch line]
  {:type :hor-line})

List item

  • Test
(deftest parse-list-line-test
  (is (= {:type :list-item
          :sub-type :unord
          :indent 0
          :child {:type :unprc
                  :text "unord"}}
         (parse-list-line \- "- unord")))
  (is (= {:type :list-item
          :sub-type :unord
          :indent 1
          :child {:type :unprc
                  :text "unord"}}
         (parse-list-line \- "  - unord")))
  (is (= {:type :list-item
          :sub-type :ord
          :indent 0
          :child {:type :unprc
                  :text "ord"}}
         (parse-list-line \# "# ord")))
  (is (= {:type :list-item
          :sub-type :ord
          :indent 2
          :child {:type :unprc
                  :text "ord"}}
         (parse-list-line \# "    # ord"))))
  • Implementation
[(defn parse-list-line [first-imp-ch line]
  {:type :list-item
   :sub-type (case first-imp-ch
               \- :unord
               \# :ord)
   :indent (/ (.indexOf line (int first-imp-ch)) 2)
   :child {:type :unprc
           :text (triml (.substring (triml line) 1))}})

Table row

  • Test
(deftest parse-table-line-test
  (is (= {:type :table-row
          :sub-type :data
          :cols [{:type :unprc
                  :text "c1"}
                 {:type :unprc
                  :text "c2"}]}
       (parse-table-line \| "|c1|c2|")))
  (is (= {:type :table-row
          :sub-type :header
          :cols [{:type :unprc
                  :text "h1"}
                 {:type :unprc
                  :text "h2"}]}
       (parse-table-line \| "|>h1|h2|"))))
  • Implementation
[(defn parse-table-line [first-imp-ch line]
  {:type :table-row
   :sub-type (case (second (triml line))
               \> :header
               :data)
   :cols (map (fn 
                {:type :unprc
                 :text text})
              (split (replace-first line #"^\|>?" "") #"\|"))})

Code rim

  • Test
(deftest parse-code-line-test
  (is (= {:type :code-rim
          :language ""}
         (parse-code-line \! "!")))
  (is (= {:type :code-rim
          :language "clojure"}
         (parse-code-line \! "! clojure"))))
  • Implementation
(defn parse-code-line [first-imp-ch line]
  {:type :code-rim
   :language (triml (.substring line 1))})

Image

  • Test
(deftest parse-image-line-test
  (is (let [line "@/favicon.png"]
        (= {:type :image
            :href "/favicon.png"
            :title "/favicon.png"}
           (parse-image-line (first line) line))))
  (is (let [line "@/favicon.png|My favorite png"]
        (= {:type :image
            :href "/favicon.png"
            :title "My favorite png"}
           (parse-image-line (first line) line)))))
  • Implementation
(defn parse-image-line [first-imp-ch line]
  (let [[href title] (split (.substring line 1) #"\|")
        good-title (if (blank? title) href title)]
    {:type :image
     :href href
     :title good-title}))

These are all quite readable.

Top level test

Here’s a top level test:

(deftest parse-test
  (let [lines ["= Header _italic *bold*_"
               "== Header 2"
               "---"
               "- Item 11"
               "  # Item `21`"
               "  # Item *22*"
               "- Item 12"
               "|>header1|header2|"
               "|c11|*c12*|"
               "|_c21_|*c _22_*|"
               "!clojure"
               "(def a 1)"
               "!"
               "http http://example.com/ _my website_"
               "mailto:me@example.com"
               "@/favicon.png|My favorite icon"]
        text (join "\n" lines)]
    (let [exp [{:type :header
                :level 1
                :child {:type :unprc
                        :text "Header _italic *bold*_"}}
               {:type :header
                :level 2
                :child {:type :unprc
                        :text "Header 2"}}
               {:type :hor-line}
               {:type :list-item
                :sub-type :unord
                :indent 0
                :child {:type :unprc
                        :text "Item 11"}}
               {:type :list-item
                :sub-type :ord
                :indent 1
                :child {:type :unprc
                        :text "Item `21`"}}
               {:type :list-item
                :sub-type :ord
                :indent 1
                :child {:type :unprc
                        :text "Item *22*"}}
               {:type :list-item
                :sub-type :unord
                :indent 0
                :child {:type :unprc
                        :text "Item 12"}}
               {:type :table-row
                :sub-type :header
                :cols [{:type :unprc
                        :text "header1"}
                       {:type :unprc
                        :text "header2"}]}
               {:type :table-row
                :sub-type :data
                :cols [{:type :unprc
                        :text "c11"}
                       {:type :unprc
                        :text "*c12*"}]}
               {:type :table-row
                :sub-type :data
                :cols [{:type :unprc
                        :text "_c21_"}
                       {:type :unprc
                        :text "*c _22_*"}]}
               {:type :code-rim
                :language "clojure"}
               {:type :unprc
                :text "(def a 1)"}
               {:type :code-rim
                :language ""}
               {:type :unprc
                :text "http http://example.com/ _my website_"}
               {:type :unprc
                :text "mailto:me@example.com"}
               {:type :image
                :href "/favicon.png"
                :title "My favorite icon"}]
          act (parse text)
          pairs (map vector exp act)]
      (doseq [[exp act] pairs]
        (is (= exp act))))))

It covers a lot of the above in one shot. You can see that so far it doesn’t work on some items, which are left as :unprc until the code for them is ready.

Top level parser

This is a top level parser:

(defn parse-line [line]
  (let [trimmed (triml line)
        first-imp-ch (first trimmed)]
    (case first-imp-ch
      \= (parse-header-line first-imp-ch line)
      (\- \#) (if (= "---" line)
                (parse-hor-line-line first-imp-ch line)
                (parse-list-line first-imp-ch line))
      \| (parse-table-line first-imp-ch line)
      \! (parse-code-line first-imp-ch line)
      \@ (parse-image-line first-imp-ch line)
      {:type :unprc
       :text line})))

(defn parse-lines [lines]
  (for [line lines]
    (parse-line line)))

(defn root-parser [lines]
  (parse-lines lines))

(defn parse [str]
  (root-parser (split str #"\n")))

Basically it will:

  • Split the given text into lines
  • Process each line one by one
  • The parse-line function is a router that routes to individual parsers for each of the above categories

Each line is substituted by a node describing it’s type:

  • Header
  • List item
  • Table row
  • Code rim (i.e. start / end of the code block)
  • Image
  • Unprocessed yet (:unprc symbol)

where each of these are parsed separately by the functions mentioned previously.

Conclusion and next steps

At the end, the code is a bit spread out (mostly tests), but I’m happy about the way it turned out.

With the above, the grand test passes fine, so all is good so far. Next time I’ll work on gluing these together, so e.g. a number of list items forms a list at the end and similar for other multi-line blocks. Also, I need to implement the things that are contained within line (such as bold and italic markup).


Maze generator in ClojureScript

Tags: clojurescript, maze generation

Here’s a simple maze generation program in ClojureScript. I’m developing this using a relatively standard stack for ClojureScript development:

Maze generator

The generator is basically a depth-first graph traversal, which generates a random spanning tree out of a maze that’s looked at as a graph. I’ll dissect the code in smaller chunks.

(ns com.icyrock.clojurescript.prb.maze-gen
  (:require-macros [dommy.macros :refer [sel1]])
  (:require [clojure.string :refer [join]]
            [dommy.core :as dommy]))

Standard ns declaration with a couple of requires.

(def board-dim {:h 24 :w 48})
(def doors (atom #{}))
(def visited-rooms (atom #{}))

Define:

  • The board dimensions (:h and :w stand for height and width, respectively)
  • Set of doors
  • Set of visited rooms
(defn is-door [pt1 pt2]
  (let [curr-doors @doors]
    (or (contains? curr-doors [pt1 pt2])
        (contains? curr-doors [pt2 pt1]))))

(defn set-door [pt1 pt2]
  (swap! doors conj [pt1 pt2]))

(defn visit-room [room]
  (swap! visited-rooms conj room))

A few helper functions:

  • The first determines whether there’s a door between two points / rooms in the maze
  • The second actually sets the doors there (i.e. connects the two rooms)
  • The third marks the room visited, so we don’t visit it twice when traversing
(defn pending-rooms [[row col]]
  (remove nil?
          (for [[rd cd] [[-1 0] [0 1] [1 0] [0 -1]]]
            (let [trow (+ row rd)
                  tcol (+ col cd)
                  room [trow tcol]]
              (when (and (<= 0 trow)
                         (<= 0 tcol)
                         (< trow (:h board-dim))
                         (< tcol (:w board-dim))
                         (not (@visited-rooms room)))
                room)))))

This one is used to determine which way we can go. Given any room (defined as [row col] point of a maze), it will try the four possible directions clockwise ([-1 0] [0 1] [1 0] [0 -1]) – i.e.:

  • Previous row (direction up)
  • Next column (direction right)
  • Next row (direction bottom)
  • Previous column (direction left)

For each of these rooms, it will make sure each of them is not:

  • Out of the bounds of the maze (the first four checks in the and call)
  • Already visited

It will collect all the rooms using for, then remove all that are nil (meaning they do not satisfy the above conditions).

(def maze-stack (atom []))

(defn push-maze-stack [pt]
  (swap! maze-stack conj pt))

(defn pop-maze-stack []
  (let [val @maze-stack]
    (reset! maze-stack (rest val))
    (first val)))

This piece defines:

  • The maze stack atom (i.e. the stack of rooms visited used for depth-first search)
  • Function to push a room to the stack
  • Function to pop the room off the stack
(defn start-building-maze [start-room]
  (reset! doors #{})
  (reset! maze-stack [start-room])
  (reset! visited-rooms #{start-room}))

This one resets all the state to start a new maze building cycle.

(defn maze-building-step []
  (loop []
    (when-let [room (first @maze-stack)]
      (let [pending (pending-rooms room)]
        (if (empty? pending)
          (do
            (pop-maze-stack)
            (recur))
          (let [shuffled (shuffle pending)
                next-room (first shuffled)]
            (push-maze-stack next-room)
            (visit-room next-room)
            (set-door room next-room)))))))

Guts of the maze building:

  • Get the top room from the stack, if there are any
  • Get the pending rooms (all that can be visited given the constraints, as provided above) from that room
  • If no pending rooms, discard this room and recur into the loop
  • Otherwise, shuffle the pending rooms (to randomize the maze), pick the next room, push it to the stack, mark it visited and make a door between the current room and that one
(def content (sel1 :#content))

(defn make-gui []
  (dommy/replace-contents! content [:pre {:id "maze"}]))

(def doors-to-char 
  (into cljs.core.PersistentHashMap/EMPTY
        {[false false false false]  "⬞"
         [true  false false false]  "╵"
         [false true  false false]  "╶"
         [true  true  false false]  "└"
         [false false true  false]  "╷"
         [true  false true  false]  "│"
         [false true  true  false]  "┌"
         [true  true  true  false]  "├"
         [false false false true ]  "╴"
         [true  false false true ]  "┘"
         [false true  false true ] "─"
         [true  true  false true ] "┴"
         [false false true  true ] "┐"
         [true  false true  true ] "┤"
         [false true  true  true ] "┬"
         [true  true  true  true ] "┼"}))

(defn board-str []
  (join "\n"
        (for [row (range (:h board-dim))]
          (join
           (for [col (range (:w board-dim))]
             (let [top-door (is-door [row col] [(dec row) col])
                   right-door (is-door [row col] [row (inc col)])
                   bottom-door (is-door [row col] [(inc row) col])
                   left-door (is-door [row col] [row (dec col)])]
               (doors-to-char [top-door right-door bottom-door left-door])))))))

(defn print-board []
  (dommy/replace! (sel1 :#maze) [:pre {:id "maze"} (board-str)]))

GUI-related code:

  • content is the parent div that will contain the maze
  • make-gui will create a pre HTML element with ID maze and put it inside the above content div

I’m actually using this pre element to output the Unicode string representation of the maze. Thus the doors-to-char map used for character lookup. It’s build like this to speed it up – see this StackOverflow question and David’s answer for explanation.

The lookup map is simple – the key is a vector of [top-door right-door bottom-door left-door], where each of the items are true / false monikers signifying whether the door is present or not, and the value is an Unicode character representing the correct room “image” given the possible exits out of that room.

  • board-str is actually making the string representation by traversing all rows, then all cells within each of the rows and getting the correct character to represent the door
  • print-board is just a helper function to output this to the pre element we created in make-gui
(defn run-maze-gen []
  (let [runs 50]
    (if (not-any? nil? (repeatedly runs maze-building-step))
      (do
        (print-board)
        (js/setTimeout run-maze-gen 50))
      (do (print-board)
          (dommy/append! content [:pre "Finished"])))))

(defn main []
  (make-gui)
  (start-building-maze [(quot (:h board-dim) 2)
                        (quot (:w board-dim) 2)])
  (run-maze-gen))

Top-level functions:

  • run-maze-gen runs the maze building step in batches of 50 (let [runs 50]), prints the board after each, then sets a timer to 50ms to run these again until the whole maze is generated
  • main makes the GUI, resets the maze and sets the starting room to the middle of the maze (parameters to start-building-maze), then calls run-maze-gen to actually start building the maze.

The result is something like this:

sample-maze

Note that maze corridors here are actually the lines, not the empty space between them, which I guess is more often how a maze is represented.


ClojureScript development environment

Here’s what you need to set up a pretty nice ClojureScript development environment.

Editor

I’m using Emacs. I actually have it set up through Clojure before I started playing with ClojureScript. There’s a tutorial on the main site:

  • https://github.com/clojure/clojurescript/wiki/Emacs-%26-inferior-lisp-mode

where you can also find tutorials for other editors:

  • https://github.com/clojure/clojurescript/wiki#tools

Lein + lein-cljsbuild

Install Leiningen, make a new project:

$ lein new com.icyrock.clojurescript
Generating a project called com.icyrock.clojurescript based on the 'default' template.
To see other templates (app, lein plugin, etc), try `lein help new`.

and add lein-cljsbuild plugin to your project.clj:

(defproject com.icyrock.clojurescript "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.4.0"]]
  :plugins [[lein-cljsbuild "0.3.0"]]
  :cljsbuild {:builds [{:source-paths ["src-cljs"]
                        :compiler {:output-to "resources/public/js/cljsbuild-main.js"
                                   :optimizations :whitespace
                                   :pretty-print true}}]})

The above has some default configuration for the plugin:

  • The source folder for ClojureScript files is src-cljs
  • Turn some basic optimization and pretty-print the output
  • The output goes to resources/public/js/cljsbuild-main.js

Usually, you’d run cljsbuild in the background with:

$ lein cljsbuild auto

which will watch for the files in the source path for changes and then perform the necessary steps (compile, optimize, bundle) to get you the target .js file.

Ring

Ring is a Web server. It’s very nice for development, as it compiles the Clojure files on the fly. There’s also lein-ring which is a helper plugin to simplify usage when used with lein. Here’s the expanded project.clj:


(defproject com.icyrock.clojurescript "0.1.0-SNAPSHOT"
  :description "icyrock.com ClojureScript"
  :url "https://icyrock.com/clojure-script"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [ring/ring-core "1.1.8"]]
  :plugins [[lein-cljsbuild "0.3.0"]
            [lein-ring "0.8.3"]]
  :ring {:handler com.icyrock.clojure.ring-main/handler
         :port 3000}
  :cljsbuild {:builds [{:source-paths ["src-cljs"]
                        :compiler {:output-to "resources/public/js/cljsbuild-main.js"
                                   :optimizations :whitespace
                                   :pretty-print true}}]})

As per Ring’s Getting Started, create a sample handler in src/com/icyrock/clojure/ring_main.clj:

(ns com.icyrock.clojure.ring-main)

(defn handler [request]
  {:status 200
   :headers {"Content-Type" "text/html"}
   :body "Welcome to com.icyrock.clojure.ring-main" })

You can run the server now with:

$ lein ring server
2013-04-15 23:16:21.815:INFO:oejs.Server:jetty-7.6.1.v20120215
2013-04-15 23:16:21.966:INFO:oejs.AbstractConnector:Started SelectChannelConnector@0.0.0.0:3000
Started server on port 3000

which will open your browser. You can use lein ring server-headless if you don’t want it to open your browser.

Compojure

The next step is to add some file serving to our Web server. Compojure allows for that. It’s a Clojure routing DSL that sits on top of Ring. Add this as a dependency and change the ring handler:

(defproject com.icyrock.clojurescript "0.1.0-SNAPSHOT"
  :description "icyrock.com ClojureScript"
  :url "https://icyrock.com/clojure-script"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [ring/ring-core "1.1.8"]
                 [compojure "1.1.5"]]
  :plugins [[lein-cljsbuild "0.3.0"]
            [lein-ring "0.8.3"]]
  :ring {:handler com.icyrock.clojure.compojure-main/app
         :port 3000}
  :cljsbuild {:builds [{:source-paths ["src-cljs"]
                        :compiler {:output-to "resources/public/js/cljsbuild-main.js"
                                   :optimizations :whitespace
                                   :pretty-print true}}]})

then make a site router (see the routes.clj from the example project):

(ns com.icyrock.clojure.compojure-main
  (:require [compojure.core :refer :all]
            [hiccup.middleware :refer (wrap-base-url)]
            [compojure.route :as route]
            [compojure.handler :as handler]
            [compojure.response :as response]))

(defroutes main-routes
  (GET "/" [] "Welcome to com.icyrock.clojure.compojure-main")
  (route/resources "/")
  (route/not-found "Page not found"))

(def app
  (-> (handler/site main-routes)
      (wrap-base-url)))

The important part above is the (route/resources "/"). This allows for static resources to be served by ring. The default is to serve from the (first) public on the classpath. Lein accepts the parameter called :resources-path, which (as per this) defaults to resources. Thus, if we create a folder resources/public, the above would make ring serve all files from that folder under the root (i.e. /) of the site.

As an example:

$ mkdir -p resources/public
$ echo "I'm a resource" > resources/public/a-res.txt

Now do lein ring server and browse to http://localhost:3000/a-res.txt – you’ll get I'm a resource text, as expected.

Auto-recompiling ClojureScript

Let’s add:

  • jayq, which is a jQuery wrapper for ClojureScript
  • jQuery externs that allow Google Closure to recognize and not obfuscate jQuery calls (see this)
  • dommy, a small and fast ClojureScript templating engine

to our project.clj:

(defproject com.icyrock.clojurescript "0.1.0-SNAPSHOT"
  :description "icyrock.com ClojureScript"
  :url "https://icyrock.com/clojure-script"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [ring/ring-core "1.1.8"]
                 [compojure "1.1.5"]
                 [hiccup "1.0.3"]
                 [jayq "2.3.0"]
                 [prismatic/dommy "0.1.0"]]
  :plugins [[lein-cljsbuild "0.3.0"]
            [lein-ring "0.8.3"]]
  :ring {:handler com.icyrock.clojure.compojure-main/app
         :port 3000}
  :cljsbuild {:builds [{:source-paths ["src-cljs"]
                        :compiler {:output-to "resources/public/js/cljsbuild-main.js"
                                   :externs ["externs/jquery-1.9.1.js"]
                                   :optimizations :whitespace
                                   :pretty-print true}}]})

Now we can make a sample ClojureScript file. All these go to src-cljs. This file is within that folder under com/icyrock/clojurescript/prb/hello_world.cljs and the contents are simple:

(ns com.icyrock.clojurescript.prb.hello-world
    (:require [jayq.core :as $]
              [dommy.core :as dommy]
              [dommy.template :as tpl]))

(defn hello-world []
  (dommy/append! js/document.body [:h1 "Hello, world!"])
  (let [body ($/$ js/document.body)
        para (-> ($/$ "<p>")
                 ($/text "This is a sample paragraph"))
        link (tpl/node [:a {:href "https://icyrock.com/"} "icyrock.com"])]
    ($/append body para)
    ($/append body link)))

This is the main file that will be used to call child functions, just for separation:

(ns com.icyrock.clojurescript.main
  (:require [jayq.core :as $]
            [com.icyrock.clojurescript.prb.hello-world :refer [hello-world]]))

($/$ hello-world)

Let’s run lein-cljsbuild in another terminal in auto-mode:

$ lein cljsbuild auto
Compiling "resources/public/js/cljsbuild-main.js" from ["src-cljs"]...
Successfully compiled "resources/public/js/cljsbuild-main.js" in 12.54909577 seconds.

Add a simple HTML to resources/public/main.htm:

<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>com.icyrock.clojurescript.main</title>
    <script src="lib/jquery-1.9.1.min.js"></script>
    <script src="js/cljsbuild-main.js"></script>
  </head>
  <body>
  </body>
</html>

Download the required files as needed:

Browse to http://localhost:3000/main.htm and you’ll see something like this:

hello_world

Note that further updates to the .cljs files are automatically recompiled by Google Closure compiler, that is used by ClojureScript through lein-cljsbuild