Monday, 4 November 2019

Making OCaml more like Excel

Excel, spreadsheet programming in general, is broadly considered a very accessible style of programming. Could we make functional programming more like Excel programming? Or, rather, how can we extract what is appealing about spreadsheet programming and graft it onto a functional programming language?

I will give an answer here. This is an answer rather than the answer because whether you accept it or not is contingent on whether you agree or not with what I will identify as appealing when it comes to Excel.

I will start with a concrete example, and I will conclude with some brief general remarks. The concrete example is running a simple (but not trivial) simulation in Excel, then showing the same simulation in TSD, our new OCaml language extension.

For the Excel simulation I used an existing (and popular) tutorial from SpreadsheetWEB. It is quite important that you at least browse this tutorial before moving on to the rest of this post.

The idea is to construct a "profit estimator" based on historical sales data, then use a Monte Carlo simulation to predict a range of possible future profits. This latter part is the most non-trivial part, relying on a feature called Data Table, part of the What-if Analysis toolset of Excel. As far as I can tell these features are not available in Office Libre or Mac OS's Numbers. I found it rather hacky, but it gets the job done. Another way of running Monte Carlo simulations is using VBA, as seen is this tutorial.

The simulation is conceptually organised as follows:

We have cells containing Cost and Price data, along with Historical Sales, which are a list of numbers. From the Historical Sales we compute Average and Standard Deviation, which are used to generate a Random Sale, using a Gaussian distribution with said average and standard deviation. Together with cost and price this will compute a Profit, as a random variable. Running this experiment 1,000 times we process the data and display it as a histogram. 

Creating a model

In TSD we create cells storing the Cost, Price, and Historical sales as follows:

let cost  = cell [%dfg  8.]
let price = cell [%dfg 12.]
let hs    = lift [122.; 138.; 163.; 161.; 139.; ...  
let historical_sales = cell [%dfg hs]

The let keyword introduces a variable declaration. The cell function creates a cell (think an Excel cell, but identified by name rather than coordinates). The %dfg tag indicates a "data-flow graph", i.e. the data stored in cells. Unlike the native data of the host programming language, data-flow consists not only of values but also of dependencies on other cells. Finally, the lift function creates a data-flow out of ordinary data -- in this case a list.

Up to this point the key difference between TSD and Excel is that cells are identified by names, and that any data type can be stored in a cell, including lists. This is different than Excel, where a list must be stored in a range of cells.

Next, we set up dependencies between cells:

let average_hs = [%dfg average historical_sales]
let std_dev_hs = [%dfg std_dev historical_sales]
let random_hs  = cell [%dfg gauss average_hs std_dev_hs]            
let profit     = [%dfg random_hs * (price - cost)]   

The random sale based on the Gaussian distribution with historical average and standard deviation is placed in a cell, but the average and the standard deviation themselves do not need to be placed in cells. They establish data-dependencies between the cell holding the historical data and the randomly generated sale, but do not require storage in a separate cell, although we could have. The functions average, std_dev, and gauss have standard implementations (not shown).

Using the model

We now have the model above, except for the Profit Histogram cell, which we will implement later. And we can experiment with it, Excel-style. Except that instead of the Excel GUI we use the OCaml interactive top-level environment (REPL). To load the TSD code and the PPX extension:

OCaml version 4.07.1

Findlib has been successfully loaded. Additional directives:
  #require "package";;      to load a package
  #list;;                   to list the available packages
  #camlp4o;;                to load camlp4 (standard syntax)
  #camlp4r;;                to load camlp4 (revised syntax)
  #predicates "p,q,...";;   to set these predicates
  Topfind.reset();;         to force that packages will be reloaded
  #thread;;                 to enable threads

# #ppx "tsd_ext";;
# #require "tsd";;
/Users/danghica/.opam/default/lib/tsd: added to search path
/Users/danghica/.opam/default/lib/tsd/tsd.cma: loaded
# open Tsd;;

After this we load the model:

# #use "" ;;
val average : (float list -> float) Tsd.graph = <abstr>
val std_dev : (float list -> float) Tsd.graph = <abstr>
val gauss : (float -> float -> float) Tsd.graph = <abstr>
val cost : float Tsd.graph = <abstr>
val price : float Tsd.graph = <abstr>
val hs : float list Tsd.graph = <abstr>
val historical_sales : float list Tsd.graph = <abstr>
val average_hs : float Tsd.graph = <abstr>
val std_dev_hs : float Tsd.graph = <abstr>
val random_hs : float Tsd.graph = <abstr>
val profit : float Tsd.graph = <abstr>

And we are ready to experiment. First, let's use the peek function to inspect what are the computed values in various cells:

#  peek average_hs;;
- : float = 148.95
# peek std_dev_hs;;
- : float = 25.443024584353175
# peek random_hs;;
- : float = 137.839327531968763
# peek profit;;
- : float = 551.357310127875

Now we can recalculate the randomly generated data using the command step, which recomputes the data-flow graph (DFG):

# step ();;
- : bool = true
# peek average_hs;;
- : float = 148.95
# peek std_dev_hs;;
- : float = 25.443024584353175
# peek random_hs;;
- : float = 154.58015036351415
# peek profit;;
- : float = 618.3206014540566

The function step returns true, which means some data was re-computed. This is because the random function call, buried inside the gauss cell generated a new value whenever the DFG is recomputed. As you would hope, the average and standard deviation stay the same, but the random sales and the profit are recomputed.

We can also change some cells and recompute:

# set price 120.;;
- : unit = ()
# step ();;
- : bool = true
# peek profit;;
- : float = 20345.062816955171

Note the function set which raised the Price cell to 120, leading to a much higher Profit value.

Monte Carlo simulation

This is where our idiom diverges from Excel's. We can take advantage of two useful features that Excel does not have: storing lists in cells and creating circular dependencies between cells:

let profit_series = let s = cell [%dfg []] in 
                    link s [%dfg hd profit s]; s

We first create a cell s which is initialised to the empty list, written as '[]' in OCaml. We then create a dependency by linking cell s to the DFG hd profit s. Which can be read as "store in s the list obtained from putting profit at the head of s". This is not quite the same as imperative assignment, because the command link does not simply change s. What it does it creates a dependency between it and an arbitrary DFG.

Now whenever we run our DFG a new value will be prepended to profit_series:

# peek profit_series ;;
- : float list = []
# step();;
- : bool = true
# peek profit_series ;;
- : float list = [363.304693159913768]
# step();;
- : bool = true
# peek profit_series ;;
- : float list = [206.299286780092984; 363.304693159913768]
# step();;
- : bool = true
# peek profit_series ;;
- : float list =

[335.831250389538866; 206.299286780092984; 363.304693159913768]

Finally, we can put it all together using a function that runs n experiments and presents the results as a histogram:

let experiment n =
  assign profit_series [];
  for i = 1 to n do
    step ()
  profit_series |> peek |> histogram 100. 10.

The assign command instantly changes the value of a cell but not its dependencies, resetting the experiment. Finally histogram x0 n is a function that presents the results as a histogram starting at x0 in buckets of size n. The x |> f syntax in OCaml is a pipeline-style operator for function application, the same as f(x)

The result is:

# experiment 1000;;
- : int list =
[0; 0; 0; 0; 0; 0; 1; 3; 3; 11; 12; 18; 36; 30; 38; 44; 51; 69; 83; 83; 93; 60; 55; 71; 49; 47; 34; 22; 18; 13; 12; 3; 1; 0; 0; 2; 1]


So what are the nice features of Excel we added to OCaml?

The first one is the use of a modal programming style. TSD has two modes: constructing the DFG and running the DFG. The second one is related to the "running" mode: it is transparent. Changes made to some cells are automatically propagated along the DFG.

There are other programming languages that use this idiom, such as TensorFlow or Zélus. TSD is somewhere on that spectrum. It is (a lot more) functional and safe than TensorFlow, but more imperative and less stream-oriented than Zélus. Another OCaml extension which is also somewhat similar is Incremental, with the caveat that it does not allow cycles in the DFG.

Making OCaml more like Excel

Excel, spreadsheet programming in general, is broadly considered a very accessible style of programming. Could we make functional programmin...