Norvig's Spell Corrector in OCaml

Tue 10 December 2013

Recently, I stumbled on an essay by Peter Norvig on how to write a simple but effective spelling corrector using Python, Bayes' Theorem, and a bit of Sir Arthur Conan Doyle.

At the bottom of the essay, Norvig provides links to implementations of the same algorithm in other programming languages. I noticed that OCaml was missing from the list, and since I needed a refresher in OCaml, I decided to give it a shot and wrote spellcorrector.ml.

In this post I will not review the algorithm and the math behind the program; both are explained very well on Peter Norvig's page, which you should read if you want to understand why the algorithm works.

In writing the code, I followed two guidelines:

  • I used only the standard OCaml libraries distributed with the compiler;
  • I tried to keep the code structure as similar as possible to Peter Norvig's to make it more accessible to those less familiar with OCaml.

I also performed some testing and profiling, with some disappointing results for a naive OCaml-er like me.

Implementation

Let's take a look at the implementation.

First, we create a new module, WordDict, by specializing the standard Hashtbl module for string keys (BTW, OCaml authors, tbl... srsly? :-)).

module MyHash = struct
   type t = string
   let equal : string -> string -> bool = (=)
   let hash : string -> int = Hashtbl.hash
end

module WordDict = Hashtbl.Make(MyHash)

Notice that WordDict is now an entirely new module with its own table type, and its tables are not interchangeable with plain Hashtbl values.
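
As a quick illustration (my addition, not part of the original program), this is how the specialized table is used; the values below are made up for the example:

(* Minimal usage sketch of the specialized hash table. *)
let () =
  let table = WordDict.create 16 in          (* initial size hint *)
  WordDict.replace table "holmes" 3;         (* bind "holmes" to 3 *)
  assert (WordDict.find table "holmes" = 3);
  assert (not (WordDict.mem table "watson"))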

Then we create a list of characters for our alphabet, and define two helper functions: one that creates a string from a single character, and one that reads the contents of a file into a string.

let alphabet = ['a';'b';'c';'d';'e';'f';'g';'h';'i';
                'j';'k';'l';'m';'n';'o';'p';'q';'r';
                's';'t';'u';'v';'w';'x';'y';'z']

(** Return a one-character [string] consisting of the given character **)
let string_of_char = String.make 1

(** Return a [String] representing the content of the 
    specified [filename] **)

let read_file filename =
  let channel = open_in filename in
  let channel_length = in_channel_length channel in
  let result = String.create channel_length in
  really_input channel result 0 channel_length;
  result

Next, we define a function, find_elem_one, that returns a default value for elements missing from the dictionary, a feature built into Python dictionaries: when an element is not found in the dictionary, we return the default value 1.

(** Return the integer value associated with the specified [elem] in the
    specified [dict] if present, or 1 if not present **)

let find_elem_one elem dict =
  try 
    WordDict.find dict elem
  with Not_found -> 1
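
For illustration (again, my addition), the helper mirrors the defaultdict(lambda: 1) used in Norvig's Python version:

(* Illustration only: missing keys default to 1, mirroring the
   defaultdict(lambda: 1) of the Python version. *)
let () =
  let d = WordDict.create 16 in
  WordDict.replace d "the" 42;
  assert (find_elem_one "the" d = 42);
  assert (find_elem_one "quux" d = 1)   (* absent key -> default 1 *)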

Here, we load the dictionary nwords from the file big.txt:

(** Dictionary that maps words in a text to the number of occurrences of
    those words in the file "big.txt" **)

let nwords =
  let words text = Str.split (Str.regexp "[^a-z]+") (String.lowercase text) in
  let dict = WordDict.create 0 in
  let train features =
    (* [replace] overwrites the previous binding; [add] would pile up one
       binding per occurrence of a word. *)
    List.iter
      (fun elem -> WordDict.replace dict elem ((find_elem_one elem dict) + 1))
      features 
  in 
  let () = train (words (read_file "big.txt")) in
  dict
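
To see what the model learned (my addition; the actual counts depend on big.txt):

(* Query the trained model; the count depends on the training corpus. *)
let () =
  Printf.printf "occurrences of \"the\": %d\n" (find_elem_one "the" nwords)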

Now, we define the function edits1, which returns all the words at a "typo" distance of 1 from a specified word. One thing that I found very clever is that all the edits are generated from the pairs of strings into which the original word can be split.

(** Return the list of all possible strings obtained from the specified
    [word] by:
    1. deleting any one character of [word];
    2. inserting any one character of the English alphabet in [word];
    3. replacing any one character in [word] with any character of the
       English alphabet;
    4. transposing any pair of adjacent characters in [word]. **)

let edits1 word = 
  let splits = 
    let rec all_splits word n =
      (* Pair the first [n] characters of [word] with the rest. *)
      let split = (Str.string_before word n, Str.string_after word n) in
      if n = 0 then [ split ]
      else split :: all_splits word (n - 1)
    in
    all_splits word (String.length word)
  in
  let replaces =
    let filtered =
      List.filter (fun (a, b) -> (String.length b) > 0) splits
    in
    let replace c =
      List.map
        (fun (a, b) -> a ^ (string_of_char c) ^ (Str.string_after b 1))
        filtered
    in
    List.concat (List.map replace alphabet)
  in
  let deletes =
    let filtered =
      List.filter (fun (a, b) -> (String.length b) >= 1) splits
    in
    List.map (fun (a, b) -> a ^ (Str.string_after b 1)) filtered
  in
  let transposes = 
    let filtered =
      List.filter (fun (a, b) -> (String.length b) > 1) splits
    in
    let transpose (a, b) = 
      a ^ string_of_char(b.[1]) 
      ^ string_of_char(b.[0]) 
      ^ (Str.string_after b 2)
    in
    List.map transpose filtered
  in
  let inserts =
    let insert c = List.map (fun (a, b) -> a ^ string_of_char c ^ b) splits in
    List.concat (List.map insert alphabet)
  in
  inserts @ replaces @ transposes @ deletes
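
As a quick sanity check (my addition), Norvig notes that a word of length n yields 54n + 25 candidates, duplicates included: n deletions, n - 1 transpositions, 26n replacements, and 26(n + 1) insertions. This implementation agrees:

(* Sanity check: a word of length n yields 54n + 25 edit candidates,
   duplicates included. For "something", n = 9, so 511 candidates. *)
let () =
  let word = "something" in
  let n = String.length word in
  assert (List.length (edits1 word) = 54 * n + 25)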

Next, we define the helper known, which keeps only the words present in our dictionary, followed by the function known_edits2, which returns the known words at a "typo" distance of 2 from the specified word.

(** Return the list obtained by removing from the specified [words] all the
    strings that are not part of an implementation-defined list of known
    words **)

let known words = 
  List.filter (fun word -> WordDict.mem nwords word) words

(** Return the list of known words within "typo" distance two of the
    specified [word], obtained by applying the function [edits1] once or
    twice **)

let known_edits2 word = 
  let rec dedup = function
    | a :: (b :: _ as t) -> if a = b then dedup t else a :: dedup t
    | smaller -> smaller
  in
  (* [dedup] only drops adjacent duplicates, so sort first. *)
  let uniq l = dedup (List.sort compare l) in
  let distance1 = uniq (edits1 word) in
  let distance2 = distance1 @ List.concat (List.map edits1 distance1) in
  uniq (known distance2)
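
A quick usage sketch (my addition; the exact candidates depend on the training corpus):

(* Count the known candidates within edit distance two of a word. *)
let () =
  let ws = known_edits2 "something" in
  Printf.printf "%d known words within distance 2 of \"something\"\n"
    (List.length ws)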

Finally, the function correct returns the word that, according to the dictionary nwords, is the most probable correction of the specified input.

Please pay attention to the implementation of the function correct below, keeping in mind that the original Python version reads:

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

Here's the OCaml version I wrote at first:

(** Return the string [c] that maximizes the probability that the specified
    [word] was written when [c] was intended, according to an
    implementation-defined dictionary **)

let correct word =
  let choose a b c d =
    match a, b, c with
    | _ :: _,      _,      _ -> a
    |      _, _ :: _,      _ -> b
    |      _,      _, _ :: _ -> c
    |      _,      _,      _ -> d
  in
  let candidates =
    choose (known [word]) (known (edits1 word)) (known_edits2 word) ([word])
  in

There's a (not-so-subtle) difference in behavior compared to the original Python code. See it? I didn't notice until I ran the tests... More (and the fix) later!

And now, the rest of the function correct: we pair each candidate with its frequency, and return the word with the highest frequency.

  let pairs =
    List.combine
      candidates
      (List.map (fun word -> find_elem_one word nwords) candidates)
  in
  let max pairs =
    List.fold_left
      (fun (a, b) (c, d) -> if d > b then (c, d) else (a, b))
      ("", 0)
      pairs
  in
  fst (max pairs)
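
A usage sketch (my addition; the expected outputs follow the examples in Norvig's essay and depend on big.txt):

(* Usage sketch; expected outputs are the examples from Norvig's essay. *)
let () =
  print_endline (correct "speling");    (* "spelling": distance 1 *)
  print_endline (correct "korrecter")   (* "corrector": distance 2 *)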

Performance

The complete source spellcorrector.ml includes the same tests as Norvig's version. The code was compiled with OCaml 4.01, both to bytecode and to native code, on a virtual machine running Debian unstable on an iMac i7. The tests reported below were also run natively on Mac OS X, showing pretty much identical timings for both OCaml and Python.

The commands used to compile are:

ocamlopt str.cmxa spellcorrector.ml -o spellcorrector-native
ocamlc str.cma spellcorrector.ml -o spellcorrector-bytecode

The test results were appalling: it took about 27 seconds for the OCaml (native) code to analyze the first input set, compared to the 4 seconds of the Python original. The bytecode version was drifting towards 50 seconds.

A quick run with ocamlopt -p, followed by an inspection of the gprof output, revealed a bottleneck in the string comparisons used by Map (a binary search tree). Switching to Hashtbl improved the situation dramatically, to a whopping 24 seconds!! Still light years behind Python.

What was I doing wrong? Except for the timing, all the results matched the Python output, and I was pretty sure I had written the same algorithm as the original version.

Increasing the size of the input, I observed a super-linear increase in test time, compared to the linear growth observable in the Python version. After a closer examination, I eventually realized that:

let correct word =
  let choose a b c d =
    match a, b, c with
    | _ :: _,      _,      _ -> a
    |      _, _ :: _,      _ -> b
    |      _,      _, _ :: _ -> c
    |      _,      _,      _ -> d
  in
  let candidates =
    choose (known [word]) (known (edits1 word)) (known_edits2 word) ([word])
  in

and

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

were not equivalent. The OCaml version did not implement the short-circuit evaluation of the Python version and was thus always evaluating the known_edits2 function.

Changing the code to:

let candidates =
  let empty list = match list with [] -> true | _ -> false in
  let c = (known [word]) in
  if not (empty c) then c
  else
    let c = (known (edits1 word)) in
    if not (empty c) then c
    else 
      let c = (known_edits2 word) in
      if not (empty c) then c
      else ([word])

brought the native code down to about 7 seconds. Still not as fast as I expected. Note how Python's syntax is much more convenient in this case.
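
For the record, here is a more concise alternative I could have written (a sketch of mine, not in the original program): a small operator that emulates Python's short-circuiting or over lists by taking its right-hand side lazily.

(* Sketch: emulate Python's short-circuiting "or" over lists. The
   right-hand side is lazy and forced only when the left side is empty. *)
let ( ||| ) left right =
  match left with [] -> Lazy.force right | l -> l

let candidates word =
  known [word]
  ||| lazy (known (edits1 word))
  ||| lazy (known_edits2 word)
  ||| lazy [word]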

Removing a redundant copy of the word dictionary in the test function brought the time down to around 6 seconds, still remarkably slower than interpreted Python.

I then ran the second test, with an input set almost twice as big, to get more accurate profiling results. The timing for this second set is about 10 seconds (I also doubled the input again to verify the consistency of the observations). According to gprof, about 20% of the time is spent doing garbage collection, while more than 60% is spent in the function correct, most of it in calls to the module Hashtbl (split between hashing and comparing strings).

Other experiments that did not meaningfully affect the timing include:

  • Changing the pairs into which words are split from tuples to records, as suggested by Jane Street.

  • Running a major garbage collection right before starting to measure the time, using Gc.full_major (a rough sketch of such a harness follows this list).

  • Transforming alphabet from a list of char to a list of string elements before the test, in order to avoid useless allocations.

  • Replacing the string operations inside edits1 with operations on a single Buffer, in order to avoid the creation of temporary strings. Note that this would be an unfair comparison w.r.t. Python, because the original code does not use string buffers.

  • Changing the implementation of uniq to use a Hashtbl.
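
For reference, the Gc.full_major experiment above looked roughly like this (an assumption of mine; the original test harness is not shown in this post):

(* Hypothetical timing harness. Gc.full_major () finishes the current major
   GC cycle and runs a full new one, so timing starts from a settled heap. *)
let time_correct words =
  Gc.full_major ();
  let t0 = Sys.time () in                       (* CPU time, stdlib only *)
  List.iter (fun w -> ignore (correct w)) words;
  Printf.printf "corrected %d words in %.2fs\n"
    (List.length words) (Sys.time () -. t0)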

Conclusions

This was a useful exercise to regain familiarity with the OCaml language, and to play with some fun machine learning.

Writing OCaml is fun and I enjoy using the language. The support offered by the standard library shipped with the compiler is basic but sufficient, although there are some inconsistencies in the interfaces (e.g., Hashtbl and Map take their input parameters in a different order). However, the performance of the code does not meet my expectations: judging from the previous analysis, a faster implementation of the Hashtbl module may help, or there could be a defect in the code. If you have any useful insights or clues about the performance disparity between the Python and OCaml versions of this program, please get in touch by commenting on this article or by emailing me directly at info@pacifico.cc!
