home | lab | find me | science | publications | software | toolbox | site map

Using monads in Clojure: parse BibTeX and output HTML

Monads are high-order code flow elements. The same way you would use if/else, for loops or variable assignments (e.g. x = 10;), monads are another technique for organizing computations. Three detailed tutorials explain all the basics of monads in clojure: by Konrad Hinsen, Jim Duey and Andrew Brehaut. To make the most of these tutorials and to learn how to use monads, I wrote my own monad-based program, with the purpose of finding out whether I could write better clojure programs by using monads. The short answer is that the disadvantages outweight the advantages.

The three monad tutorials above mention and exemplify the use of a transformed state monad to implement a parser (think regular expressions). To test, and learn first-hand about the usage of monads in clojure I wrote a parser for part of the BibTeX specification. The specific practical goal was to build a program that would take in the list of BibTeX entries referring to scientific publications that cite or mention the image registration and neural circuit reconstruction software TrakEM2, and output a bullet list of bibliographic entries usable within HTML, which I needed. Here, I detail and comment on every aspect of the clojure program I wrote, for my own future reference regarding the use of the state monad in clojure. I share it here with the hope that you will find it useful.

The code for the program is available in this github gist, and depends on clojure-1.4 and clojure.algo.monads-1.1.

As a preliminary step I list the declaration of namespace, library imports, and enable warnings on reflexion and number autoboxing (two sources of performance issues in the JVM):

(ns my.parse.bib
  (:use [clojure.algo.monads :only [domonad with-monad state-t maybe-m fetch-state set-state m-seq m-plus m-result]])
  (:use [clojure.string :only [lower-case]]))

(set! *warn-on-reflection* true)

Then I define the parser monad. As Jim Duey points out, the parser monad is nothing else than the application of the state monad transformer to the maybe monad. In other words, code structured to represent a state machine where operations can fail and state can be backtracked, as Brehaut clearly spells out.

(def parser-m (state-t maybe-m))

The parser monad declared above is complete and does not require any further action from our part. But if we were to define it ourselves with defmonad, it would consist of four functions that would look like these:

Let's reiterate that the four functions above are not declared anywhere. You could, and then give them to defmonad to define the parser monad explicitly. But I understand it is the state monad transformer function applied to the maybe monad what generates the equivalent of these four functions by combining the two monads, and then name the whole thing parser-m. Once again:

(def parser-m (state-t maybe-m))

On the state monad and it's relation to a state machine: like Brehaut writes, "I like to think of the value as the register of the state machine, and state as the sole memory location." That cleared it for me. Every step of the state monad returns a vector of two elements, the accumulated values and the new state.

Then we need the functions to read the current state and to alter it. These functions (fetch-state and set-state) are already defined in clojure.algo.monads so we don't have to define them. They would look like this:

(defn fetch-state
  []
  (fn [state]
    [state state]))

(defn set-state
  [new-state]
  (fn [old-state] ; notice the argument is ignored
    [nil new-state]))

In other words, the fetch-state function places the current state in the "register" (following the analogy from Brehaut) so that it can be read, as well as in the "memory" (so that it can be passed on to the next operation); both places of the vector pair contain the state, and the state is not altered. In the set-state, the old-state is ignored (and could and is written as an underscore in the clojure.algo.monads library, to prevent its superfluous binding) and the new state is set in the "memory" part of the pair.

You may have noticed something in common across all functions defined so far: they all return a new function. Monads add a level of indirection to computations, to enable bindings of variables: as arguments to the outer function, for the inner function to see them as constants.

With the parser monad and the set/get state functions defined, we can now build the library of functions for parsing. These are defined in the context of the parser monad using the macro with-monad to spare us from writing "domonad parser-m" everytime; within a with-monad, we need only write "domonad". While shorter to write, one pays the price of incorrect line numbers in exception stack traces.

For the parser library, first I declare the set of core functions which interact with the value and state pair directly, and which Brehaut likes to call the primatives, the basic building blocks. It is assumed throughout this example using the parser monad that the state is a generic sequence of generic values. Later, we will use a java String, which clojure manipulates like a sequence.

(with-monad parser-m
  ; Primatives: the basic building blocks
  (defn get-one
    "Gets the next item from the input and returns it,
    while also updating the state to be the sequence starting
    at the next element."
    []
    (domonad [input (fetch-state)
              _ (set-state (next input))]
      (first input)))

  (def
    ^{:doc "Returns nil to signal that the end of the sequence has been reached,
    otherwise fails in a monadic way to indicate that the end can't be found here,
    and therefore the parsing has to backtrack to try something else or fail altogether."}
    eof
    (domonad [remaining (fetch-state)
              :when (nil? remaining)]
      nil)))

Notice that the domonad block acts as a let form, with each line creating a binding (using m-bind behind the curtains, in the domonad macro) that is visible in successive lines and in the body. Notice as well that get-one is defined as a function with defn, returning a function (the one created by the execution of the domonad block), whereas eof is defined as a binding, therefore being itself the function created by the domonad macro. This difference is crucial for understanding how these are invoked later.

With the above primative functions we build the matching functions, or parsers. These deal with matching individual elements of the sequence by using predicates (i.e. functions) provided to them as arguments. The matching function is the most generic, accepting any predicate that, when returning true after having been given the element as argument, then returns the element. The other four functions one, not-one, one-of, and not-one-of are built by invoking the matching function with predicates that use the equality operator.

(with-monad parser-m
  ; Basic, generic parsers, built on top of the primatives
  (defn matching
    "The most basic matching parser. Tests the next item
    in the sequence against the predicate provided. If true,
    returns the item, otherwise fails."
    [pred]
    (domonad [one (get-one)
              :when (pred one)]
      one))

  (defn one
    "Next element matches x exactly. What this function does
    is to invoke the matching function with a new anonymous function
    that uses the equality operator to compare x with the next element."
    [x]
    (matching #(= x %)))

  (defn not-one
    "Next element will not match x. Enforces that the element to match
    cannot be nil, which would get confused with the end of sequence."
    [x]
    (matching (fn [y]
                (and (not= x y)
                     (not= nil y)))))

  (defn one-of
    "Matches any one item in the provided set. This function invokes
    the matching function with the provided set as argument; the set
    works as a predicate, because sets in clojure are functions of
    their elements, that is, sets are also functions, which return
    the stored element when given an equal element as argument."
    [s]
    (matching s))

  (defn not-one-of
    "Matches any one item not in the provided set, and returns it.
    To make this work the set cannot be used directly as predicate
    of the matching function, but instead an anonymous function is
    constructed that returns true when the element is not part of the set.
    Enforces that the element to match cannot be nil, which would
    get confused with the end of the sequence."
    [s]
    (matching (fn [y]
                (and (nil? (s y))
                     (not= nil y))))))

With these basic matching functions now we define functions that enable sequences of matches. These are referred to as the combinators.

(with-monad parser-m
   ; Combinators
   (defn optional
     "Return a parser that makes the given parser optional.
     This is accomplished by using m-plus to combine two monadic
     functions: the one provided (the parser) and also (m-result nil)
     which signals a void, but valid monadic return value. If the
     parser doesn't match, then (m-result nil) is returned, signaling
     that the state machine did not advance to the next step."
     [parser]
     (m-plus parser (m-result nil)))

   (defn one-or-more
     "Matches the same parser one or more times until it fails,
     then it returns a sequence of the matched results. Given
     its recursive call, this function can overflow the stack
     when positively matching sequences longer than the possible
     stack depth.
     First the parser is used to match the first item in the sequence,
     and if successful, an optional recursive call is done to match
     further consecutive items. Finally a flattened sequence is returned
     with all matched items in order.
     Given that the parser-m is a modification of the maybe-m, the
     second operation will not be attempted unless the first operation
     succeeded."
     [parser]
     (domonad [r parser
               rs (optional (one-or-more parser))]
       (if rs
         (into [r] (flatten rs))
         [r])))

   (defn none-or-more
     "Matches the same parser zero or more times until it fails,
     then it returns a sequence of the matched results."
     [parser]
     (optional (one-or-more parser)))

   (defn skip-one-or-more
     "Matches the same parser one of more times until it fails,
     then it returns true. Or nil if it doesn't match at least once.
     Given its recursivity this function can overflow the stack.
     This function works like one-or-more, except that it doesn't
     bind neither return the matched values."
     [parser]
     (domonad [_ parser
               _ (optional (skip-one-or-more parser))]
       true))

   (defn skip-none-or-more
     "Matches the same parser zero or more times until it fails,
     then returns true."
     [parser]
     (optional (skip-one-or-more parser))))

Then we create the parser combinators, which combine multiple parsing operations together. The function match-one enables branching computations, i.e. backtracking the state machine to try out different parsers until one succeeds.

(with-monad parser-m
   ; Parser combinators
   (defn match-one
     "Match at least one of the given parsers, as evaluated in order,
     or else fail. What this function does is to return a nested
     set of functions of the state using m-plus. When executed,
     when one matches the chain stops and the current matched item
     or sequence of items is returned, according to the parser."
     [& parsers]
     (reduce m-plus parsers))
   
   (defn match-all
     "Match all given parsers, else fail. Returns a flattened sequence
     with all results. This is accomplished by generating a sequence of
     nested functions which, when invoked with the state as argument,
     thread the state altering it as each is invoked, while the results
     are accumulated in a sequence."
     [& parsers]
     (m-bind (m-seq parsers)
             (comp m-result flatten))))

All the above provide the necessary functions for a generic parsing library that operates on sequences. With these we can now build a specific parser. The example chosen here is that of parsing the entries of a BibTeX file.

First we create generic functions for matching text characters. Granted, these could have been formulated much better. If you have specific suggestions do please send me an email.

(with-monad parser-m
  (let [la "abcdefghijklmnopqrstuvwxyz"
        ua (set (.toUpperCase la))
        la (set la)
        letters (into la ua)
        ext-la "áéí&oactue;úàèìòùäëïöüâêîôûβçñμ" ; Add more at will
        ext-ua (into ua (.toUpperCase ext-la))
        ext-la (into la ext-la)
        ext-letters (into letters (into ext-la ext-ua))
        sp (set " \t\n\r")
        numbers (set "1234567890")
        symbols (set "'{}\:;,.()[]$_-#@%^&*+=!?<>/|~`") ; lacks a " in purpose
        non-syntax-symbols (set "':;,.()[]!@#$%^&*()-_+=<>?/|~`")]
    (def
      ^{:doc "Match any whitespace character."}
      whitespace (one-of sp))

    (def
      ^{:doc "Match any letter from the alphabet, in both lower and upper case."}
      letter (one-of letters))

    (def
      ^{:doc "Match any upper-case letter."}
      upper-case-letter (one-of ua))

    (def
      ^{:doc "Match any lower-case letter."}
      lower-case-letter (one-of la))

    (def
      ^{:doc "Match any letter of the alphabet, in both lower and upper case,
             and also letters with tilde, etc."}
      ext-letter (one-of ext-letters))

    (def
      ^{:doc "Match any upper-case letter of the alphabet including letters
             with tilde, umlauts, etc."}
      ext-upper-case-letter (one-of ext-ua))

    (def
      ^{:doc "Match any numeric character."}
      number (one-of numbers))

    (def
      ^{:doc "Match an allowed non-alphabetic character"}
      non-letter (one-of symbols))

    (def
      ^{:doc "Match a symbol that is not part of the supported latex syntax."}
      non-syntax-symbol (one-of non-syntax-symbols))

    (def
      ^{:doc "Match any amount of text with extended alphabetical or whitespace characters."}
      plain-text
      (one-or-more (match-one ext-letter
                              whitespace)))
    (def
      ^{:doc "Match any character allowed inside the text of a property of a BibTeX entry."}
      latex-char
      (match-one ext-letter
                 whitespace
                 number
                 (match-all (one \) (one \") letter) ; This is e.g. ä in latex: \"a
                 non-letter))))

With the above text character parsers, we can now build a parser for a property of a BibTeX entry, and then a parser for the whole entry. Here is an example of the simplest entry supported by the above parsers:

@article{Gerhard2011,
  author = "Stephan Gerhard and Alessandro Daducci and Alia Lemkaddem and Reto Meuli and Jean-Philippe Thiran and Patric Hagmann",
  year ="2011",
  title = "{The Connectome Viewer Toolkit: An Open Source Framework to Manage, Analyze, and Visualize Connectomes}",
  journal = "Frontiers in Neuroinformatics",
  issue = "5",
  pages ="3",
}

Below, the entry is constructed as a map of its alias and its map of properties. Notice that the parsers are invoked within a domonad macro that does all the work and enables writing only the desired bindings. In other words, the domonad macro provides the means for syntactic sugar, so that we never have to write all the plumbing, which is always the same. The entries function expresses the entire list contained in a BibTeX file, and pretty much all it does is merge the maps that detail each entry:

(with-monad parser-m
  (def
    ^{:doc "One property of a BibTex entry, returned as e.g. {:author \"Albert Cardona\"}"}
    property
    (domonad [_ (skip-none-or-more whitespace)
              prop-name (one-or-more letter)
              _ (match-all
                  (skip-none-or-more whitespace)
                  (one \=)
                  (skip-none-or-more whitespace)
                  (one \"))
              prop-value (none-or-more latex-char)
              _ (match-all
                  (one \")
                  (skip-none-or-more whitespace)
                  (one \,))]
      {(keyword (lower-case (apply str prop-name))) (apply str prop-value)})) 

  (def
    ^{:doc "One entry in the BibTeX file, returned as a one-entry map with
    the alias as key and the properties as a map, e.g. {Cardona2012 {:author
    \"Albert Cardona\" :year \"2012\" :journal \"Nature Methods\"}}"}
    entry
    (domonad [_ (skip-none-or-more whitespace)
              _ (one \@)
              kind (one-or-more letter) ; article, book, inproceedings, etc.
              _ (one \{)
              _ (none-or-more whitespace)
              alias (one-or-more (not-one \,))
              _ (one \,)
              ps (none-or-more property)
              _ (none-or-more whitespace)
              _ (one \})]
      {(apply str alias) (merge {:kind (lower-case (apply str kind))}
                                (apply merge ps))}))

   (def
     ^{:doc "A map of all entries in the BibTeX file, each with the alias
     as key and the map of properties as value."}
     entries
     (domonad [es (none-or-more entry)]
       (apply merge es))))

Notice that the three functions above are not defined as functions but as variables. The domonad macro will create all the plumbing and return a function representing a state machine, which, when invoked with the state as argument, advances and returns a vector pair containing the accumulated entries in the "register" (using again Brehaut's nomenclature) and whatever remaining, unmatched sequence of characters in the "memory" (the remaining state).

Here is an example of applying the entries function to a BibTeX file. First we bind the "state" to the whole content of the BibTeX file. Then we invoke the entries function with the state as argument; the returned result is a pair of the map of entries and the remaining state (nil, in this case) which we destructure. Then we take two entries from the es map and pretty-print them:

(ns my.bibtex.parser.test
  (:use [clojure.algo.monads :only [domonad with-monad state-t maybe-m fetch-state set-state m-seq m-plus m-result]])
  (:use [clojure.string :only [lower-case]])
  (:use [clojure.pprint :only [pprint]]))

(use '[clojure.pprint :only [pprint]])

(let [state (slurp (str (System/getProperty "user.home")
                        \/
                        "webs/trakem2/trakem2_citations.bib"))
      [es state] (entries (lazy-seq state))]
  (doseq [[k v] (take 2 es)]
    (println k) ; the alias, e.g. Sprecher2011
    (pprint v)  ; the map of properties of this entry
    (println)
    (.flush *out*))) ; the pretty-printer doesn't flush the *out* writer for some reason

Which prints:

Sprecher2011
{:author "S Sprecher and A Cardona and V Hartenstein",
 :year "2011",
 :title
 "The \emph{Drosophila} larval visual system: High-resolution analysis of a simple visual neuropil",
 :journal "Developmental Biology",
 :volume "358",
 :number "1",
 :pages "33-43",
 :kind "article"}

OReilly2011
{:author
 "Martin {O'Reilly} and Arnd Roth and Michael H\"ausser and Lewis D. Griffin",
 :year "2011",
 :title
 "{A circle-based method for detection of neural fibre cross-sections in classically stained 2D electron micrographs}",
 :journal "MIAAB",
 :kind "article"}

So: success! We have a working parser for BibTex files, at least for the format of BibTeX that I use.

What we have done: we created a parser for BibTeX files using the state monad transforming the maybe monad, which results in the so-called parser monad. The parser monad enables us to thread the state through a number of potential operations, altering the state as we go and backtracking to earlier points if an operation fails and there is another one to try. Unlike a parser built with regular expressions (which are also state machines), the parser we built offers us the full power of the clojure language at every step of the process. The price we pay is high: parsing is about 150 times slower. In the REPL:

>>> (let [input "  author = \"Cardona A and Saalfeld S\",\n"
      regex #"( |\t|\n|\r)*([a-zA-Z]+)( |\t|\n|\r)*=( |\t|\n|\r)*(.*)( |\t|\n|\r)*"]
  (let [s (re-find regex input)]
    (doseq [[i v] (zipmap (range (count s)) s)] (println i "=>" v)))
  (time
    (dotimes [i 1000]
      (property (lazy-seq input))))
  (time
    (dotimes [i 1000]
      (re-find regex input))))

"Elapsed time: 318.724443 msecs"
"Elapsed time: 2.126855 msecs"

While for my current purposes the speed difference doesn't matter, it is significant. There are a number of places where speed ups could be obtained. For example, the one-or-more and similarly the skip-one-or-more functions are recursive. Instead, one could formulate them with a loop, or using a strategy similar to that used for match-one where m-plus is used, but with a lazy sequence rather than with reduce. I do not know at this time whether these modifications would improve execution speed (suggestions welcome). A modification that would clearly improve speed would involve, as always, specialization: to make the primatives (the core functions) work not with generic sequences but with arrays of characters, advancing the index rather than a new, shorter sequence, at every step of the state machine. In such a setting, functions like one-or-more which return a sequence would return, instead, a pair of starting and ending indices.

The upside of generic code is that, if we figure out how to speed up the core set of generic parser functions, all parsers built using this library will get a boost. Same for fixing errors. That is the main gain and purpose of writing generic code.

Developing the parser functions above was a painful exercise. Given the use of the macros with-monad and domonad, clojure is unable to report meaningful line numbers from where exceptions are thrown. The macro with-monad is the worse offender, given that it generally encloses multiple domonad and an error in any of them gets reported as an error at the line where with-monad was used.

Beyond the line number being misreported, the high-order code within the monad macros result in one never being sure of exactly what an error means beyond "something is wrong". The only way in which I could progress towards correct code was by writing a test for every single function described above. These tests took the following form:

>>> ; Test: must print [[b a c a b b c] nil]
>>> (println
  ((domonad parser-m
     [a (one-or-more (one-of (set "abc")))
      _ eof]
     a) (lazy-seq "bacabbc")))

[[b a c a b b c] nil]

Notice above how the domonad can take as argument the name of the monad to use, rather than relying on the binding provided by with-monad. Also notice that domonad doesn't execute, but rather, returns a function which, when given the state as argument--a String, which clojure interprets as a sequence of characters--, executes and delivers a result in the form of a vector of two values representing the value/state pair. A nil indicates that the state has been consumed in full.

Notice as well that I use (lazy-seq "bacabbc") rather than using the String directly. In my tests, the former is faster than the latter by about 10% for large sequences. Both should be using internally a StringSeq, but clearly there is a difference (which I do not understand).

In summary:

  1. Working with monads means working with functions that take functions as arguments and return functions (welcome to lambda calculus), with all the plumbing hidden from view via macros that provide the syntactic sugar. This is all fine as long as you have a firm grasp of the concepts involved.
  2. Error messages related to code that uses monads are not useful neither point to the right lines of code; related to the extensive use of macros and clojure's poor error reporting. An example: perfectly looking code (regarding syntax highlighting) can give you an "ExceptionInInitializerError" which says only "there is an error somewhere in the with-monad block"--the cause of the cause of the Exception points at the line containing the with-monad parser-m. Forces you to wrap every definition individually, or add the parser-m as argument to every invocation of domonad.
  3. Cannot use macros that expand to monad-using functions within the domonad block. The macro match-string defined above works when used independently, but throws a compilation error when used within a domonad block. This issue, about which I remain clueless, forced verbose code.
  4. Code that uses monads can be dreadfully slow, and will certainly be in your first code iteration. It is understandable given the presence of deeply nested sets of functions and numerous recursive calls.
  5. Monads feel bolted on top of what clojure normally does, as opposed to in Haskell, where monads are all there is for binding values and syntactic sugar makes you not even aware of their existence.
  6. In the state monad every step of the state machine incurs in the creation of multiple objects, including the vector representing the value/state pair. The JVM may not be able to completely inline the actual operations; all these visits to the heap add up to a large cost.

Will I use monads in the future in production code? Would be arrogant to say that I won't given my preliminary experience detailed here. But the above points are real. There is a development barrier that is hard to overcome, and a significant performance cost.

One fix to clojure would go a long way towards enhancing usability: Better line reporting in exceptions thrown in code altered by a macro. Clearly this feature must be hard to implement or it would have been already long ago. Clojure users have reported this issue in numerous occasions.

Beyond these first impressions, realize we are only half-way towards our goal of creating a limited but working BibTeX parser and converter to HTML. We need now code to transform latex text into html:

; Set of functions to transform latex text into html text
(with-monad parser-m
  (defmacro match-string
    [text]
    `(match-all ~@(map one text)))

  (defn fn-latex-to-html
    "Cope with 'declare' not working for calls to the referred name inside domonad,
    so can't declare ahead 'latex-to-html'. Here, we find it at runtime."
    [state]
    (((ns-interns 'my.parse.bib5) 'latex-to-html) state))

  (def
    ^{:doc "If \emph{} is matched, returns a sequence of <i>text</i>"}
    emph
    (domonad [;_ (match-string "\emph{") ; Cannot use macros here: throws ExceptionInInitializationError
              ;_ `(match-all ~@(map one "\emph{")) ; Fails: Cons cannot be cast to IFn, logically
              _ (match-all (one \) (one \e) (one \m) (one \p) (one \h) (one \{))
              words fn-latex-to-html ; Cannot use latex-to-html inside domonad when declared prior to defined.
              _ (one \})]
      (flatten (map seq ["<i>" words "</i>"]))))

  (def
    ^{:doc "If \textbf{} is matched, returns a sequence of <b>text</b>"}
    textbf
    (domonad [_ (match-all (one \) (one \t) (one \e) (one \x) (one \t) (one \b) (one \f) (one \{))
              words fn-latex-to-html
              _ (one \})]
      (flatten (map seq ["<b>" words "</b>"]))))

  (def
    ^{:doc "If \\"a, or another vowel, is matched, returns ä etc." }
    umlaut
    (domonad [_ (one \\)
              _ (one \")
              vowel (one-of (set "aeiou"))]
      [\& vowel \u \m \l \;]))

  (def
    ^{:doc "If \'a, or another letter, is matched, returns á etc."}
    acute
    (domonad [_ (one \\)
              _ (one \')
              l letter]
      [\& l \a \c \u \t \e \;]))

  (def
    ^{:doc "If \`a, or another vowel, is matched, returns à etc."}
    grave
    (domonad [_ (one \\)
              _ (one \`)
              vowel (one-of (set "aeiou"))]
      [\& vowel \g \r \a \v \e \;]))

  (def
    ^{:doc "If \v{a}, or another letter, is matched, returns the letter alone.
    Simplification serves the purpose."}
    v-hat
    (domonad [_ (one \\)
              _ (one \v)
              _ (one \{)
              l letter
              _ (one \})]
      l))

  (def
    ^{:doc "Matches a TeX block like {one two} and simply removes the {}"}
    block
    (domonad [_ (one \{)
              words fn-latex-to-html ; can't use latex-to-html: doesn't work recursively, or can't see the ahead declaration or this one doesn't get resolved.
              _ (one \})]
      (flatten words)))

  (def
    ^{:doc "Returns a sequence of matched text which may include
           emph, textbf, etc. translated into their html equivalents.
           This function is recursive via the calls to it from other
           functions called within."}
    latex-to-html
    (none-or-more
      ; Order matters:
      (match-one block
                 ext-letter
                 whitespace
                 number
                 acute
                 grave
                 umlaut
                 v-hat
                 emph
                 textbf
                 non-syntax-symbol)))

  (def
    ^{:doc "A valid word in the author field."}
    author-word
    (domonad [word (match-one
                     (match-all ext-upper-case-letter
                                (none-or-more (not-one-of (set " \t\n\r"))))
                     block)
              _ (optional (match-one (one \.) (one \,)))]
      word))

  (def
    ^{:doc "Match for example 'Cardona, A.' or 'Albert Cardona' or 'Albert T. Cardona'
           or just about anything that starts with upper case and ends with optionally
           a whitespace or an 'and'."}
    one-author
    (domonad [_ (skip-none-or-more whitespace)
              first-word author-word
              rest-words (none-or-more (match-all (one-or-more whitespace)
                                                  author-word))
              _ (optional (match-all (one-or-more whitespace) (one \a) (one \n) (one \d) whitespace))]
      (apply str (flatten (if rest-words
                            [first-word rest-words]
                            first-word)))))

  (defn format-authors
    "Formats authors that all 'and' are replaced by a comma except the last.
    Returns a String."
    [^String author-field]
    ; domonad creates a function, invoked here with the author-field as argument.
    ; The function returns the vector pair of value and remaining state,
    ; from which the first, the value, is returned.
    (first
      ((domonad [authors (one-or-more one-author)]
        ; authors contains a sequence of String, one per author. String doesn't get flattened.
        (if (= 1 (count authors))
          (first authors)
          (str (clojure.string/join ", " (drop-last authors))
               " and "
               (last authors))))
         author-field)))

  (defn html
    "Translate latex text into its HTML equivalent. Considers only
    a small fraction of the possibilities."
    [^String latex]
    (let [f (domonad [t latex-to-html]
              (apply str (flatten t)))
          [value state] (f (seq latex))]
     value)))

The parser also must be aware of the different types of latex entries that I chose to support. The obvious choice is to use multimethods that dispatch on the :kind tag of every entry (check the entry function above). I chose the "apalike" style of formatting bibliographic references:

; Multimethods to interpret every entry and output as structured text
; Will dispatch on the kind of entry: article, book, inproceedings and incollection
(defmulti apalike
  (fn [x] (x :kind)))

(defmethod apalike "article"
  [entry]
  (str
    (if-let [v (entry :author)] (str (html (format-authors v)) \.) "")
    (if-let [v (entry :year)] (str \space v \.) "")
    (if-let [v (entry :title)] (str \space (html v) \.) "")
    (if-let [v (entry :journal)] (str \space v \space) "")
    (let [vol (entry :volume)
          num (entry :number)
          num (if (nil? num) (entry :issue) num)]
      (cond
        (and vol num) (str vol \( num \))
        (and vol (nil? num)) vol
        (and (nil? vol) num) num
        :else ""))
    (if-let [v (entry :pages)] (str \: v))
    \.))

(defn apalike-in
  "Serves both incollection and inproceedings"
  [entry]
  (str
    (if-let [v (entry :author)] (str (html v) \.) "")
    (if-let [v (entry :year)] (str \space v \.) "")
    (if-let [v (entry :title)] (str \space (html v) \.) "")
    (if-let [v (entry :booktitle)] (str " In: " (html v) \.) "")
    (if-let [v (entry :series)] (str \space (html v) \.) "")
    (if-let [v (entry :editor)] (str " Edited by: " (html v) \.) "")
    (if-let [v (entry :publisher)] (str \space (html v) \.) "")
    (let [vol (entry :volume)
          pages (entry :pages)]
      (cond
        (and vol pages) (str "Vol. " vol ", pages " pages \.)
        (and (nil? vol) pages) (str "Pages " pages \.)
        (and vol (nil? pages)) (str "Vol. " vol \.)
        :else ""))))

(defmethod apalike "inproceedings"
  [entry]
  (apalike-in entry))

(defmethod apalike "incollection"
  [entry]
  (apalike-in entry))

(defmethod apalike "book"
  [entry]
  (str
    (if-let [v (entry :author)] (str (html v) \.) "")
    (if-let [v (entry :year)] (str \space v \.) "")
    (if-let [v (entry :title)] (str \space (html v) \.) "")
    (if-let [v (entry :booktitle)] (str " In: " (html v) \.) "")
    (if-let [v (entry :editor)] (str " Edited by: " (html v) \.) "")
    (let [chapter (entry :chapter) ; chapter number
          pages (entry :pages)]
      (cond
        (and chapter pages) (str " Chapter " chapter ", pages " pages \.)
        (and (nil? chapter) pages) (str " Pages " pages \.)
        (and chapter (nil? pages)) (str " Chapter " chapter \.)
        :else ""))))

While there may be better strategies than to use an ordered list of if-let statements, stringing together the return values of these was the easiest and first approach I could think of that would also be safe to missing properties in a BibTeX entry.

For my purposes, I wanted the list of references sorted by year, done by the following function that takes the map of entries as argument:


(defn sort-by-year
  "Return a sorted collection made from the map of entries,
  where entries are sorted first by year and second by alias."
  [es]
  (vals
    (reduce
      (fn [s [k v]]
        (assoc s (str (v :year) k)))
      {}
      es)))

Finally, we call the above machinery to accomplish the goal: to read a BibTeX file and output a list of html-formatted entries in apalike style, sorted by year:

; Print out parsed and formatted BibTeX entries
(let [state (slurp "trakem2/trakem2_citations.bib")
      [es state] (entries state)]
  (println "<ul>")
  (doseq [[k v] es]
    (println (str "  <li>" (apalike v) "</li>")))
  (println "</ul>"))

The result (cropped) is the following:

All the code above is available in this gist in github. If you come up with improvements or have further insight into any parts of it, I would apprecite an email.



Last updated: 2012-09-18 10:15 New York time. Copyright Albert Cardona.