Monday 30 July 2018

Haskell: If monads are the solution, what is the problem?

Monads have exerted a curious fascination on programmers who turn to Haskell. The incomprehension of the newbies is matched in intensity only by the smugness of the experts. The official monadic doctrine, as espoused by the Haskell Wiki, is that

Monads in Haskell can be thought of as composable computation descriptions. The essence of monad is thus separation of composition timeline from the composed computation's execution timeline, as well as the ability of computation to implicitly carry extra data, as pertaining to the computation itself, in addition to its one (hence the name) output, that it will produce when run (or queried, or called upon). This lends monads to supplementing pure calculations with features like I/O, common environment, updatable state, etc.
Some of the words above, especially the meandering second sentence, seem to be written to baffle rather than inform. (It reminds me of what Orwell criticises here.) The lay interpretation is that monads are mathemagical constructs that allow you to add "computations" (also sometimes called "effects") to a "pure" programming language. This lay interpretation is largely wrong and the Haskell monadic dogma is rather overblown. In this post I will point out its inaccuracies, then explain the real reason why monads are required in Haskell to handle interactive computations. In the process maybe we will understand monads a bit better. Or, at least, maybe we will manage to demystify them.

Monads were introduced by Moggi in his seminal paper Notions of Computation and Monads as a mathematical (categorical) framework for something that he calls computations. The paper is one of the most influential papers in semantics of programming languages. It is mathematically brilliant, well written, and it changed the field. Here are Moggi's original examples of computations:
  • partiality, i.e. non-terminating computations
  • nondeterminism, i.e. computations that may produce different values
  • side-effects, such as reading and writing from state
  • exceptions, i.e. computations that may terminate abruptly
  • continuations, a powerful mechanism for manipulating flow of control 
  • interactive input and output
Note that partiality and exceptions are features of what passes for "pure" Haskell. Partial behaviour is what happens when you run a recursive function which never stops, and exceptional behaviour is what happens when you try to take the head of an empty list. Such programs are standard Haskell programs which produce no compiler errors or warnings. And, of course, they are computations in the sense that they do not produce values in the way that 1+7 or head [0] do.
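To make this concrete, here is a minimal sketch (the definitions are mine, for illustration) of both kinds of computation in plain, monad-free Haskell:

-- Partiality: a well-typed definition that never produces a value.
loop :: Int
loop = loop

-- An exception: also well-typed, but fails at run time.
crash :: Int
crash = head []

main :: IO ()
main = print crash   -- throws "Prelude.head: empty list"; 'print loop' would hang instead

Neither definition needs a monadic type, yet neither produces a value.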

So it turns out you don't need monads in the type system to have computations. Monads, as a mathematical concept, can always be used as part of the semantic model, but whether they transpire into the type system of the object language is a matter of design choice. Indeed, one of the most interesting contributions of Moggi's original paper is showing how a monadic meta-language can interpret a language with computations, and how different evaluation strategies require different translations of the (simple) types of the object language into monadic meta-language types.

And of course many functional programming languages have computations but do not require monads. The ML family of languages (OCaml, SML) has a rich set of computational features which are not reflected in the type system, as does the grandfather of modern functional languages, ALGOL 60, and more recent inventions such as Scala.

So that is a bit strange, isn't it? Some computations (partiality, exceptions) do not require language-level monads while others (side-effects, input-output) do. What is the reason for this disparity in Haskell? Is it just a matter of taste? Why do some (most) functional languages make all computations irrelevant to the typing discipline, whereas Haskell makes them relevant -- but only some? 

My answer has two parts. In the first part we will see why some Haskell computations need monads when other languages don't. In the second part we will understand why Haskell doesn't need to track all computations monadically.

First, let's see why Haskell needs monads.


Most functional languages (ML-family, Scala, etc.) have an "eager" (call-by-value)  evaluation strategy in which a function application f m is evaluated by first bringing the function f and the argument m to a "value" form and then performing the function call. This implies an "evaluation order" so that for example f is evaluated first, then m. The order is language-specific. In this case, if f and m are computations that interact with the world in some way (read or write from state or files) then all the interactions of f will happen before those of m. The schedule of interactions is predictable from the syntactic form of the application f m.  So consider for example assignment, where x:=m can be thought of simply as the infix syntactic form of assign x m. If an assignment is used then the arguments will be evaluated to values (e.g. x to some memory location, m to some integer value) in the language-specific order. The assignment (a computation) will be executed last. If x and m are just variables then we know that variables are bound to values, so they will produce no interactions. The only interaction produced by the term is the assignment itself.  

ALGOL 60 works differently, being a "call-by-name" (non-strict) language. This means that in an application f m the function f is evaluated but the argument m is not; it is passed as a thunk (think something like f (fun () -> m) in an ML language). If f and m are computations then no interactions happen in the application f m. To find the interactions you have to look deeper into the code and find the terms that force evaluation. To repeat the example above, assignment is one of those contexts in which evaluation must be forced, in order to obtain the location and the assigned value. So the behaviour is similar to the above. However, if x and m are just variables, the behaviour is different, because variables are bound to thunks rather than values. And these thunks will be executed. So in a term such as x:=m+m the interactions of x are executed once and those of m twice. The assignment itself, another interaction, will be performed at the end. The schedule of interactions in ALGOL 60 takes more getting used to than in eager languages, because some contexts force evaluation (e.g. arithmetic-logic operators, assignments, etc.) and some don't. Irregular behaviour is usually conducive to programmer mistakes, so one could speculate that this is a reason why call-by-name didn't catch on. (Note that Haskell also enjoys this sometimes baffling mix of forcing ("eager") and non-forcing ("lazy") behaviour in pattern matching. To have a principled mix of function-call strategies my colleague Paul Levy has proposed a calculus called call-by-push-value.)

However, in both call-by-value and call-by-name the schedule of interactions is syntactically identifiable by a local code inspection. What would happen if we added interactive computations to Haskell in the same way? Does an application f m trigger any interactions? No, because the calling mechanism is lazy. What about an assignment x:=m? Here is the crazy part: we wouldn't know just by looking at the assignment statement itself! In a lazy language the terms bound to x and m are evaluated at most once, when first needed, and their values are then memoized. And, by necessity, the interactions which are embedded in the computation are carried out during that evaluation. Once we obtain the value the interactions are lost. So in order to know whether an occurrence of m or x produces any interactions you have to inspect the entirety of the code and figure out whether it has been evaluated anywhere earlier or not. This is a difficult problem.
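A small sketch of this behaviour (the example is mine), using Debug.Trace to make a pretend "interaction" observable from pure code: because the value of noisy is memoized, the interaction happens at most once, at the first use, and cannot be recovered afterwards.

import Debug.Trace (trace)

-- 'trace' stands in for an interaction with the outside world (it prints its message)
noisy :: Int
noisy = trace "interaction!" 5

main :: IO ()
main = print (noisy + noisy)
-- prints "interaction!" only once, then 10: after the first evaluation the
-- memoized value of noisy is reused and the interaction is gone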

This is where monads rescue Haskell. The key relevant construct of the monad is what Moggi called the extension of a function f to a function f*, i.e. the way functions are to be "lifted" to be interpreted in the monadic setting. It is defined as f* c = let x <= c in f x. This helps because it forces the evaluation of argument c in a predictable way. That is all that the monads really need to do in Haskell: turn a language with an unpredictable evaluation strategy into a language with a predictable, sequential evaluation strategy. (The order of evaluation becomes even more apparent when the Haskell "do" notation is used.) This forcing action is emphasised in Moggi's paper: 
Remark 2.6 The let-constructor plays a fundamental role: operationally it corresponds to sequential evaluation of programs [...]
The way this is practically achieved in Haskell is to allow the combination of computations only through special "binding operators" (such as >>= and >>) whereas in languages with more predictable evaluation strategies any term may contain sub-terms which contribute computational interactions. Here is a typical implementation (also from the Haskell wiki), where you can see how action1 is explicitly scheduled to be performed before action2:

(>>=) :: IO a -> (a -> IO b) -> IO b
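-- (in this recipe IO a is modelled, for explanatory purposes, as a
--  state-passing function: something like type IO a = World -> (a, World))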
(action1 >>= action2) world0 =
   let (a, world1) = action1 world0
       (b, world2) = action2 a world1
   in (b, world2)

(An aside about the example: the let (x, y) = z pattern matching in Haskell is lazy, so the evaluation is not forced by the pattern matching, as one might reasonably expect, but presumably by the actions themselves. It looks like some monads defined using this recipe might not actually force sequential evaluation at all, so they may not be monads, technically speaking. Perhaps bang patterns should be used instead?)
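To see the same scheduling discipline from the programmer's side, here is a small sketch (the greeting program is made up for illustration) in which the two getLine interactions happen in exactly the order they are written, first in do notation and then with >>= spelled out:

greet :: IO ()
greet = do
  name <- getLine                        -- first interaction
  colour <- getLine                      -- second interaction
  putStrLn (name ++ " likes " ++ colour)

-- the same schedule, with the binding operator written out explicitly
greet' :: IO ()
greet' =
  getLine >>= \name ->
  getLine >>= \colour ->
  putStrLn (name ++ " likes " ++ colour)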

There is a bit more to the monad story though. Monads can be merely used as a functional design pattern which has nothing to do with effects -- for example the list monad or the option monad. These monads can be used in any functional languages if so desired. This is why conflating monads and effects is confusing.
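For instance, here is a small sketch (the map and function names are made up) that uses the Maybe monad purely as a pattern for chaining partial lookups, with no interaction with the outside world in sight:

import qualified Data.Map as Map

-- chaining two partial lookups: if either fails, the whole result is Nothing
bossOfBoss :: String -> Map.Map String String -> Maybe String
bossOfBoss name managers = do
  boss <- Map.lookup name managers
  Map.lookup boss managers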

Let us revisit our running example of an assignment operation, which you would write as x:=m in most languages. In Haskell you would need to write something like do v <- x; e <- m; assign v e, which first forces the terms x and then m to produce values v and e, exercising their interactions in the process, and then performs the actual assignment via assign. (The actual syntax may end up being more complicated due to the difficulties of combining computations by combining monads.) At any rate, whether such elaborations are to be considered an improvement is certainly a matter of personal taste.
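As a rough sketch of what such an elaboration can look like in practice (the example is mine), with Data.IORef's writeIORef standing in for the hypothetical assign:

import Data.IORef (IORef, writeIORef)

-- x is a computation producing a location, m a computation producing a value;
-- both must be run explicitly before the assignment itself can happen
assignExample :: IO (IORef Int) -> IO Int -> IO ()
assignExample x m = do
  v <- x            -- run x, exercising its interactions, to obtain a location
  e <- m            -- run m, exercising its interactions, to obtain a value
  writeIORef v e    -- finally, the assignment itself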

Why are partiality and exceptions invisible to the Haskell type system? 


The answer is a pragmatic one. Partial and exceptional computations produce at most one answer, whereas interactive (or non-deterministic) computations may produce multiple answers. Reading a variable x, for example, produces different values at different times. Or, to put it differently, m = m evaluates to true, if it evaluates at all, in the presence of partiality and exceptions, whereas m = m may evaluate to false in the presence of interactive computations. The fact that m = m is true-or-bust is what is called "referential transparency" in Haskell, and is a powerful property which allows terms to be treated as if they are values and not computations, thus enabling a wide range of compiler optimisations. (See a related observation in an earlier post.)
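A quick sketch of the difference (the counter example is mine): running an interactive computation m twice need not give the same answer, which is exactly the behaviour that referential transparency rules out.

import Data.IORef (newIORef, modifyIORef, readIORef)

main :: IO ()
main = do
  r <- newIORef (0 :: Int)
  let m = modifyIORef r (+ 1) >> readIORef r   -- an interactive computation
  a <- m
  b <- m
  print (a == b)   -- False: the first run yields 1, the second 2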

I say "as if" because they are not values but computations -- but if the computation is manifested then the whole program either terminates abruptly (exceptions) or never (partiality), so the wrong answer is never produced. This is the idea of "partial correctness", by the way. So these computations are in some sense "benign" and don't need to be tracked.  Because monads are used in association with certain kinds of interactive computations, and not for benign computations, in Haskell they can serve as a simple kind of "effect type system", which is I suppose a genuine but incidental bonus.

By contrast, languages such as Agda take purity very seriously, only allowing computations which are guaranteed to produce one (and only one) value. These languages require the static checking both of exhaustive pattern matching and of termination, which makes them harder to program in. But they are "like maths", or rather "like logic", in a formal sense (see the Curry-Howard correspondence) and can be used dually as programming languages or as proof assistants. 

To conclude, if I had control of the Haskell wiki, this is how I would start the page on Monads:
Monads in Haskell are used as a mechanism for scheduling evaluation, thus turning a language with a hard-to-predict evaluation strategy into a language with predictable, sequential interactions. This makes it possible to add interactive computations such as state and input-output to the language, noting that benign (non-interactive) computations are already a part of the language, transparent to the type system.

Monday 23 July 2018

Reflections of a PC Chair

We are pleased to inform you that your paper has been accepted for publication.
These are the words that you wait for as an academic researcher. But how do we get there? What is the process driving the decision of acceptance versus rejection? It is something known as "peer review". To paraphrase Churchill, it is a broken, frustrating, and error-prone process. It is just that a better one has yet to be invented.

I am an academic researcher in theoretical computer science. Our area of research is peculiar in the elevated importance it gives to conference publications as opposed to journal publications. This has never been a deliberate decision but is something that just happened and we seem to be culturally stuck in this model, for better or worse (probably worse rather than better).

I have been involved in several aspects of academic research service as programme committee member, journal editor, and in a couple of instances as programme committee chair for established conferences. As a PC member I learned many things about how the process works, and as a PC chair I have tried to improve it.

How the process usually works

Most conferences I participate in are hosted by an online system called EasyChair. It has a decidedly Web 1.0 flavour, neither elegant nor easy to use. But it has been used successfully in countless instances, so its reputation is cemented. One feature of EasyChair exerts a direct, subtle, and sometimes unconscious bias on the way the papers are ultimately ranked. I doubt that a lot of thought went into that feature, but it has become established by repeated usage. It is essentially a preliminary ranking based on "acceptance" scores ranging, for example, from +2 ("strong accept") to -2 ("strong reject").

We are by now quite used to these ratings of "strong accept", "weak accept", "weak reject", "strong reject", even though ultimately accept/reject is a binary decision. To refine things further, a reviewer is sometimes allowed to express a degree of confidence in their reviews, self-evaluating from "expert" (+4) down to "null" (0). These confidence levels are then used to compute a weighted sum of the scores of a paper, which is then used to produce a preliminary, evolving ranking of the papers.

The system works well enough to identify papers which are unanimously strong/weak accepts or strong/weak rejects, but a band of inconclusive results is often left in the middle, out of which papers must still be selected. It is generally accepted that at this point the scores stop working. The weighted average of a "weak reject" by an "expert" and a "strong accept" by an "informed outsider" is for all purposes a meaningless quantity. I have no evidence that this is so, but in a discussion I ran on my Facebook page, in which many senior academics participated, there seemed to be a clear consensus on this point. For these middling papers it is the role of PC discussions to sort out the final decision.

This idea seems nice in principle but it is unworkable in practice. The main reason is that each PC member is assigned a batch of papers to discuss and rank, so each PC member has a local view guided by their own batch of papers. On the other hand, the acceptance decision can only be made globally. If we accept 40% of the papers, that is a global ratio, and it does not mean that for each reviewer 40% of their own batch is accepted. The local picture can be very different. In a conference I recently chaired, this is the accept/reject breakdown per PC member:


Accept  Reject
7       6
5       7
6       6
7       6
5       8
5       8
7       5
9       4
3       10
7       6
4       8
3       9
7       5
1       6
5       7
4       8
6       7
3       10
4       9
7       6
3       10

Depending on what you have on your plate, a paper may seem relatively good (or bad), but that view is biased by the vagaries of distribution.

Moreover, the discussions are further biased by the fact that the preliminary rankings, based on the meaningless aggregation of reviewer accept/reject scores (possibly weighted by self-assessed expertise), are visible during the discussions. Even though the rankings are based on meaningless data, the position of a paper will inevitably mark it as a potential winner or loser, with the potential winners drawing the more energetic support of those who are willing to champion them.

I have seen this phenomenon happen in grant allocation panels as well. Preliminary rankings which are officially declared meaningless, and putatively used only to sort out the order in which grant proposals are discussed, have a crucial impact on final outcomes.

An alternative ranking system

In a conference I have recently chaired I made two changes.

The first change is that I separated scores based on technical merit (i.e. whether a paper is wrong or trivial) from scores based on more subjective factors (i.e. whether a paper is important or interesting). The second change is that no aggregate score was computed and no preliminary ranking was made available to PC members.

The technical scores were used in a first stage, to reject papers with potential problems. The last thing you want to do is to publish a paper that has mistakes, is trivial, or is merely an insignificant increment on existing work. Because of this, if any of the reviewers indicated a technical problem with a paper, and that indication went unchallenged by the other reviewers, the paper was rejected. That seems to be the right way to consider technical scores overall. If one of three reviewers finds a mistake, and that mistake stands, it does not matter that the others didn't spot it, even if they thought it was an otherwise excellent paper. The correct aggregate is the minimum, not the average.

The papers remaining after this stage were all acceptable in a technical sense. Their publication would not have been embarrassing to us. A case can be made that they should all have been accepted, but for logistical reasons outside of my control the total number of papers that could be accepted was much smaller.

If the first stage focussed on rejection, the second stage focussed on acceptance. From the technically acceptable papers, a ranking was made on the basis of expressed interest. Aggregating a subjective measure such as interest is not easy. Relying on discussions on a matter of taste is also not likely to be very helpful, because more often than not interest stems from familiarity with the topic. Few people are likely to become highly interested in papers outside their area, and those must be exceptionally interesting papers indeed.

Two ways of aggregating interest seem plausible: maximal interest vs. average interest. A test case would be a paper with a profile of +2 and -2 (like a lot / dislike a lot) versus +1 and 0 (like a little/indifferent). The first paper wins on maximal interest, the second one on average. Both of them make sense, although I am inclined to prefer the first one to the second one since it is likely to make for a more interesting talk. This should matter in a conference.
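As a toy illustration of the two aggregation choices on this test case (the function names, and the use of Haskell here, are mine and not part of any reviewing system):

aggMax, aggAvg :: [Double] -> Double
aggMax = maximum
aggAvg xs = sum xs / fromIntegral (length xs)

main :: IO ()
main = do
  print (aggMax [2, -2], aggAvg [2, -2])   -- (2.0, 0.0): wins on maximum, loses on average
  print (aggMax [1, 0],  aggAvg [1, 0])    -- (1.0, 0.5): loses on maximum, wins on average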

There is another aspect for which I preferred the maximal to the average. In the discussion stage the referees can adjust their interest scores following their reading of other reviews and discussions. By taking the average, in limit cases we give the referees an effective veto over what papers get accepted because a downward adjustment can easily take a paper out of contention. By taking the maximum we empower the reviewers who like the paper. In effect, the process is more acceptance-oriented rather than rejection-oriented. This aspect I liked a lot.

In the end, the outcomes of the two systems are not radically different, as one might expect. About 15% of the accepted papers under the maximal-interest system would have been rejected under the average-interest system, because one reviewer disliked them. They would have been replaced by less controversial papers (in terms of interest, because technically they are all already vetted).

The level of technical virtuosity of a paper is quite strongly correlated with the level of interest. The technically strong papers, as unanimously rated by their reviewers, which didn't make the programme amounted to only about 13% of the programme. A technically strong paper is not necessarily interesting, but it seems that it often is.

Conclusions

Was I happy with the resulting programme? Yes, in particular with the fact that several 'controversial' papers were accepted. One thing research should not do is regress to the mean, especially in matters of taste. There should be some room for those who want to stir things up -- provided their arguments are technically sound. I am happier to see those in than technically more complex papers which elicit little interest or dissent.

My main frustration is that the acceptance rate was artificially lowered by the total number of papers we could accept. Expanding the programme by 15% would have ensured that all papers considered interesting by at least one member of the PC were accepted. As it stands, all papers strongly liked by two or three members were accepted, and all papers nobody strongly liked were ultimately rejected. But still, there were several papers in between -- strongly liked by exactly one reviewer -- which should have been and could have been accepted. Instead, they had to be ranked and a line had to be drawn.

As an observer of the process, I cannot believe the claim that conferences with low acceptance ratios will accept "higher quality" papers. Quality of research is elusive at best, as a measure. The history of science is full of examples of papers which were rejected or ignored despite being not only correct but also revolutionary, or of discredited work which was widely accepted in its time. We can say with some confidence whether a paper is technically correct or not, but beyond that we venture into the realm of fashion, taste and aesthetics.

For the future I would make one further change. For technical vetting, which is laborious, perhaps 3 reviews are enough. But for gauging the level of interest, further reviewers, perhaps even the entire programme committee, could and should be asked to express preferences. A sample of 3 people to assess interest seems statistically too small to be conclusive. More people can include more papers in the pool of interesting papers, especially if the maximal aggregation is used.


A broader discussion can be had, and is ongoing in various quarters, on the suitability of the peer review process and the publication models of academic research in general and theoretical computer science in particular. But before wider, more revolutionary, change is undertaken, small tweaks to the existing process may reduce random sources of unconscious bias and lead to a more desirable distribution of accepted papers. Especially when the existing processes are not deliberately designed but merely the contingent outcome of existing IT support systems.

Tuesday 17 July 2018

Haskell ain't maths

A programming language is truly "pure" if all computations produce a value -- Agda is an example of such a language. For all its claims to the contrary, Haskell is not a pure language. Computations in Haskell may produce exceptions (try taking the head of the empty list) or nothing at all (in the case of computations that run forever). I don't hold that against Haskell; I just dislike the propaganda. I don't hold it against Haskell because maths itself is not pure in this sense, since particular operations can be undefined, for example division by zero. But could we say that Haskell is "as pure as maths"? In the (not very serious) post below I will argue to the contrary: Haskell is a lazy language whereas maths is an eager language, at least as commonly practised.

In math we have functions, as we do in programming. The functions of programming can be either lazy or eager, depending on the programming language. Are the functions of math lazy or eager?

In programming, the issue is quite important because it has a significant impact on one of the key equations of the lambda calculus, the so-called beta-law, which says that we can substitute (i.e. inline) an argument for the parameter of the function.

(fun x. M)N ≡ M[N/x]

This law lies at the heart of all compiler optimisations.

The equation is always valid in call-by-name languages such as Algol. In lazy languages such as Haskell it still holds in the presence of some effects, such as exceptions or non-termination, but it does not hold in the presence of stateful assignments. In an eager language such as OCaml it is also invalidated in the presence of non-termination or exceptions. It is therefore common in call-by-value languages to require that the argument (N above) be a value, i.e. a language constant, a variable, or a lambda expression, usually written as v.

(fun x.M)v ≡ M[v/x]

Of course, substitution is also an important rule of equational reasoning, which is a big part of mathematical calculation. But which rule do we use in math, the by-name (lazy) or the by-value (eager) substitution? And how can we tell?

Let's look more closely at when the beta law fails in an eager functional program. If the argument is not a value (N ≠ v) but is equivalent to a value (N ≡ v for some value v) then the rule holds. It only fails when the argument is not equivalent to a value, for example because it runs forever (it 'diverges'). An example of such a term is Ω = (fun x.x x)(fun x.x x). Or, more mundanely, Ω = while true do skip. An example of failure of the beta law is

(fun x.k) Ω ≢ k[Ω/x]

where k is a language constant. If we run the left-hand side as a program it will first run the argument Ω, which will result in a nonterminating computation. On the other hand, the right-hand side is just the term consisting of the constant k because substitution happens at a syntactic level. So its value is, trivially, k. The two cannot be equivalent.
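By contrast, under Haskell's lazy evaluation the two sides agree, because the argument is never forced. A minimal sketch (the definitions of omega and k are mine):

omega :: Int
omega = omega          -- never produces a value

k :: Int
k = 42

main :: IO ()
main = print ((\_ -> k) omega)   -- prints 42; an eager language would loop on the argument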

In maths we don't usually deal with terms such as Ω, but we can still have expressions that don't denote a value, for example 1/m when m is 0. Both Ω and 1/m are 'undefined' in this sense: they don't denote a value.

The following example will test whether maths is lazy or eager. Consider the constant function

c(x) = 1

The question is whether the equation

m · c(1/m) = 0

has a solution (m=0) or not.

I have not run any referenda on the matter, but from my own understanding of mathematical practice and from pub chats with mathematician pals I would say that m=0 is not considered a valid solution for the above. A function such as the constant function above is defined over values and not over terms.

Perhaps it could be otherwise, but it ain't. Meanwhile, in Haskell

let c(x) = 1 in let m = 0 in m * (c (1/m))

will always evaluate to 0 (or rather 0.0 to be more precise).

So Haskell is impure. And not even mathematically so. 
