## Monday, 12 November 2018

### Zippers for non-inductive types

Famous mathematician-philosoper Gian-Carlo Rota said that mathematical understanding is all about showing how things that seem different are essentially the same. If that is true, then functional programming is something that we understand quite well.

One of the most celebrated aspects of functional programming is the correspondence that shows that logical propositions and programming language types are essentially the same thing. It is known as the "Curry-Howard" correspondence. Less known but perhaps even more interesting that each common type corresponds to an algebraic operation. Let's look at all three in a table (the types are written in OCaml syntax):

 Type Constructor Logic Algebra void none $\bot$ 0 unit () $\top$ 1 'a * 'b (a, b) $A\wedge B$ $A\times B$ ('a, 'b) sum Left a Right b $A\vee B$ $A+B$ 'a -> 'b fun x -> b $A\supset B$ $B^A$

where type ('a, 'b) sum = Left of 'a | Right of 'b.

There are many tutorial introductions explaining this fascinating correspondence.

What is very cool is that the correspondence runs deep, so that, for example, if an algebraic equation holds then the corresponding types are isomorphic. For example, since $A\times (B+C)=A\times B+A\times C$ it follows that the types 'a * ('b, 'c) sum and ('a * 'b, 'a * 'c) sum are isomorphic. Indeed, the isomorphisms are:

let f = function (a, Left b)  -> Left (a, b)
| (a, Right c) -> Right (a, c);;
val f : 'a * ('b, 'c) sum -> ('a * 'b, 'a * 'c) sum = <fun>

let g = function Left (a, b)  -> (a, Left b)
| Right (a, c) -> (a, Right c);;
val g : ('a * 'b, 'a * 'c) sum -> 'a * ('b, 'c) sum = <fun>

The correspondence is really about types rather than about OCaml or any particular language. Limitations of the language may occasionally break the correspondences. For example, $0^A=1$ and indeed there is always a unique function void -> 'a just as () : unit is unique. But in OCaml, because of the lack of support for empty patterns, we cannot implement it. A language which has empty patterns and in which we can implement it is Agda.

The correspondence is broken in both directions, since in OCaml we can define

let contr : (unit -> void) =
let rec div x = div x in div ()

which has the type which corresponds to $\top\supset\bot$, which is logically inconsistent. This again cannot happen in Agda. Note that it can also happen in Haskell, so I wouldn't call Haskell a "pure" language, as it is also not faithful to these correspondences.

So these correspondences may or may not hold in an actual language. In what follows I will be even less rigorous. But rigour here would move us too far away from the clarity required to tell a good story. Apologies in advances, but this is merely a blog post. I hope a bit of licence will not brand the whole thing as "fake news". For rigour I refer the reader to the proper literature, for which I give a few entry points at the end.

Type isomorphisms are cool enough, but the correspondence runs even deeper than that. Consider for example the type of lists over some type A. If we enumerate the number of distinct lists of size n, noting that a list of size n is essentially a n-tuple of As, the number of such lists is $A^n$. So the type of lists, in general, is either an empty list, or a unit-length list, or a length-two list, or ... . So we have that:

$$L(A)=1+A+A^2+\cdots+A^n+\cdots = \sum_n A^n.$$

If we think of $L(A)$ as a function, then the above is its Taylor-series expansion, from which it follows that

$$L(A)=1+A+A^2+\cdots+A^n+\cdots = \sum_n A^n = \frac 1{1-A}.$$

We can arrive at the same result in a more direct way from the recursive definition of a list

$$L(A) = 1+A\times L(A)$$

which if we solve for $L(A)$ we get the same result! However, speaking of informality, the equation-solving route does not always apply. Consider the case of natural numbers, defined as type nat = Zero | Suc of nat. The corresponding equation is $N=1+N$, which has no solution.  But let us look the other way when faced with such complications.

For the types in the table at the beginning the intuitions can be quite direct, but the algebraic representation of lists, which involves division and subtraction, the direct intuitions are not obvious at all. This is rather fascinating!

Going deeper down this rabbit hole, the algebraic manipulation opens new possibilities. If types are functions $T(A)$, is the derivative $T'(A)=\frac{dT(A)}{dA}$ meaningful? Yes! The derivative corresponds to the zipper of a type. An element of a zipper is like an element of the original type but with a "highlighted" point, or a "hole".

For example, given a list

its zipper is a list with a highlighted node

which is equivalent to a pair of lists, the prefix and the suffix relative to the hole. So unlike in a list where we can move just in one direction ("right") in a list zipper we can move both left and right. This is what imperative people usually use a doubly linked list for! But mathematics tells us that a pair of lists suffices, a prefix (which is to be represented in reversed order) and a suffix. Moving left/right is simply a matter of shifting the head of the prefix/suffix onto the suffix/prefix.

But how does this relate to algebra? As we said, the type of the zipper is computed by the derivative:

$$L'(A)=0 + 1\times L(A) + A\times L'(A)$$

If we solve the equation we get

$$L'(A) = \frac {L(A)}{1-A}= L(A)\times \frac 1 {1-A}$$

But $L(A)= \frac 1 {1-A}$ so

$$L'(A) = L(A)\times L(A)$$

The zipper of a list is a pair of lists! You should have a heart hardened by years of system-level programming or hollowed by years of corporate middleware drivel not to be enraptured by how beautiful all this is.

And note that if you compute the Taylor expansion of $L'(A)$ you get the same result as computing the derivative of the Taylor expansion of $L(A)$, i.e.

$$L'(A)=\sum_n (1+n)\times a^n.$$

It all comes together superbly.

After this too-fast introduction let's get to the main course, which is answering this question:

Does the magic of algebra only apply to the inductive types we deal with in functional programming?

The answer might shock you, because it is NO!

Let us take the type of circular lists over some type $A$, which we represent graphically as

This is a typical non-recursive data-type, so we cannot retrieve its algebraic representation using an equation. But we can via Taylor series, by enumerating the number of distinct such structures of each size $n$. Some not-so-deep thinking leads us to the conclusion that since there are $A^n$ distinct lists of length $n$, there will be $\frac{A^n}n$ distinct circular lists: each time we move the head at the end of the list we get an equivalent list, and we must quotient by this equivalence. Thus

$$C(A)=\sum_n \frac{A^n}n = −\mathrm{log}(1−A),$$

by bringing the Taylor expansion to a closed form.

But, hey, how did I do that, go from that Taylor expansion to the original function? Via zippers! Because I know two things.

First, I know that the type of the zipper is the derivative of the original type. Second, I know that the zipper is like the original structure with a hole. And to me, that looks like a list:

So $C'(A)=\frac 1{1-A}$, which means that $C(A)=\int \frac 1{1-A}da=−\mathrm{log}(1−A)$. And I can easily verify that my intuition is right, because computing the Taylor expansion gives me $C(A)=\sum_n \frac{A^n}n$, which is consistent to the above.

Is this a fluke? No. Lets take another important non-inductive type, that of multisets. Again looking at the enumeration of distinct multisets in order to compute the Taylor series we get that

$$MSet(A)=\sum_n \frac{A^n}{n!}=e^A,$$

since there are $n!$ permutations to quotient by.

From this, since $\frac{\mathrm de^x}{\mathrm d x}=e^x$ it further follows that the zipper of a multiset is a multiset!

Also, we can get isomorphisms such as

$$MSet(A+B)=e^{A+B}=e^A\times e^B=MSet(A)\times MSet(B)$$

which says that a multiset of $A$s or $B$s is the same as a pairing a multiset of $A$s with a multiset of $B$s. The isomorphism going one way is partitioning by the obvious equivalence, and the other is multiset union.

• A bunch of papers by Conor McBride link
• Combinatorial Species and Tree-like Structures by Bergeron et. al. link
• ∂ for Data: Differentiating Data Structures by M. Abbott et. al. link
• The zipper by G. Huet link
• Seven trees in one by A. Blass link
• Derivatives of regular expressions by J.A. Brzowski link

## Monday, 30 July 2018

Monads have exerted a curious fascination on programmers who turn to Haskell. The incomprehension of the newbies is matched in intensity only by the smugness of the experts. The official monadic doctrine, as espoused by the Haskell Wiki, is that

Monads in Haskell can be thought of as composable computation descriptions. The essence of monad is thus separation of composition timeline from the composed computation's execution timeline, as well as the ability of computation to implicitly carry extra data, as pertaining to the computation itself, in addition to its one (hence the name) output, that it will produce when run (or queried, or called upon). This lends monads to supplementing pure calculations with features like I/O, common environment, updatable state, etc.
Some of the words above, especially the meandering second sentence, seem to be written to baffle rather than inform. (It reminds me of what Orwell criticises here.) The lay interpretation is that monads are mathemagical constructs that allow you to add "computations" (also sometimes called "effects") to a "pure" programming language. This lay interpretation is largely wrong and the Haskell monadic dogma is rather overblown. In this post I will point out its inaccuracies then explain the real reason why monads are required in Haskell to handle interactive computations. In the process maybe we will understand monads a bit better. Or, at least, maybe we will manage to demystify them.

Monads were introduced by Moggi in his seminal paper Notions of Computation and Monads as a mathematical (categorical) framework for something that he calls computations. The paper is one of the most influential papers in semantics of programming languages. It is mathematically brilliant, well written, and it changed the field. Here are Moggi's original examples of computations:
• partiality, i.e. non-terminating computations
• nondeterminism, i.e. computations that may produce different values
• side-effects, such as reading and writing from state
• exceptions, i.e. computations that may terminate abruptly
• continuations, a powerful mechanism for manipulating flow of control
• interactive input and output
Note that partiality and exceptions are features of what passes for "pure" Haskell. Partial behaviour is what happens when you run a recursive function which never stops, and exceptional behaviour when you try to take the head of an empty list. Such programs are standard Haskell programs which do not produce compiler errors or warnings. And, of course, they are computations in the sense that they do not produce values, as 1+7 or head[0] do.

So it turns out you don't need monads in the type system to have computations. Monads, as a mathematical concept, can be always used as a part of the semantic model, but whether they transpire into the type system of the object language is a matter of design choice. Indeed, one of the most interesting contributions of Moggi's original paper is showing how a monadic meta-language can interpret a language with computations, and how different evaluation strategies require different translations of the (simple) types of the object language into monadic meta-language types.

And of course many functional programming languages have computations but do not require monads. The ML family of languages  (OCaml, SML) have a rich set of computational features which are not reflected in the type system, and so do the grandfather of modern functional languages, ALGOL 60, or more recent  inventions such as Scala.

So that is a bit strange, isn't it? Some computations (partiality, exceptions) do not require language-level monads while others (side-effects, input-output) do. What is the reason for this disparity in Haskell? Is it just a matter of taste? Why do some (most) functional languages make all computations irrelevant to the typing discipline, whereas Haskell makes them relevant -- but only some?

My answer has two parts. In the first part we will see why some Haskell computations need monads when other languages don't. In the second part we will understand why Haskell doesn't need to track all computations monadically.

Most functional languages (ML-family, Scala, etc.) have an "eager" (call-by-value)  evaluation strategy in which a function application f m is evaluated by first bringing the function f and the argument m to a "value" form and then performing the function call. This implies an "evaluation order" so that for example f is evaluated first, then m. The order is language-specific. In this case, if f and m are computations that interact with the world in some way (read or write from state or files) then all the interactions of f will happen before those of m. The schedule of interactions is predictable from the syntactic form of the application f m.  So consider for example assignment, where x:=m can be thought of simply as the infix syntactic form of assign x m. If an assignment is used then the arguments will be evaluated to values (e.g. x to some memory location, m to some integer value) in the language-specific order. The assignment (a computation) will be executed last. If x and m are just variables then we know that variables are bound to values, so they will produce no interactions. The only interaction produced by the term is the assignment itself.

ALGOL 60 works differently, being a "call-by-name" (non-strict) language. This means that in an application f m the argument f is evaluated but m is not, and it's passed as a thunk (Think something like f (fun ()->m) in an ML language.). If f and m are computations then no interactions happen in the application f m. To find the interaction you have to look deeper into the code and find the terms that force evaluations. To repeat the example above, assignment is one of those context when an evaluation must be forced, in order to obtain the location and the assigned value. So the behaviour is similar to the above. However, if x and m are just variables, the behaviour is different because variables are bound to thunks rather than values. And these thunks will be executed. So in a term such as x:=m+m the interactions of x are executed once and those of m twice. The assignment itself, another interaction, will be performed at the end. The schedule of interactions in ALGOL 60 takes more getting used to than in eager languages, because some contexts force evaluation (e.g. arithmetic-logic operators, assignments, etc.) and some don't.  Irregular behaviour is usually conducive to programmer mistakes, so one could speculate that this is a reason why call-by-name didn't catch on. (Note that Haskell also enjoys this sometimes baffling mixing of forcing ("eager") and non-forcing ("lazy") behaviour in pattern matching. To have a principled mix of function call strategies my colleague Paul Levy has proposed a calculus called call-by-push-value.)

However, in both cases of call-by-value and call-by-name the schedule of interactions is syntactically identifiable by a local code inspection. What would happen if we added interactive computations to Haskell in the same way? Does an evaluation f m trigger any interactions? No, because the calling mechanism is lazy. What about an assignment x:=m? Here is the crazy part: we wouldn't know just by looking at the assignment statement itself! In a lazy language the terms bound to x and m are evaluated once, then their values are memoized. And, by necessity, the interactions which are embedded in the computation are carried out during the evaluation. Once we obtain the value the interactions are lost. So in order to know whether an occurrence of m or x produces any interactions you have to inspect the entirety of the code and figure out if it had been evaluated anywhere earlier or not. This is a difficult problem.

This is where monads rescue Haskell. The key relevant construct of the monad is what Moggi called the extension of a function f to a function f*, i.e. the way functions are to be "lifted" to be interpreted in the monadic setting. It is defined as f* c = let x <= c in f x. This helps because it forces the evaluation of argument c in a predictable way. That is all that the monads really need to do in Haskell: turn a language with an unpredictable evaluation strategy into a language with a predictable, sequential evaluation strategy. (The order of evaluation becomes even more apparent when the Haskell "do" notation is used.) This forcing action is emphasised in Moggi's paper:
Remark 2.6 The let-constructor plays a fundamental role: operationally it corresponds to sequential evaluation of programs [...]
The way this is practically achieved in Haskell is to allow the combination of computations only through special "binding operators" (such as >>= and >>) whereas in languages with more predictable evaluation strategies any term may contain sub-terms which contribute computational interactions. Here is a typical implementation (also from the Haskell wiki), where you can see how action1 is explicitly scheduled to be performed before action2:

(>>=) :: IO a -> (a -> IO b) -> IO b
(action1 >>= action2) world0 =
let (a, world1) = action1 world0
(b, world2) = action2 a world1
in (b, world2)

(An aside about the example: The let (x,y)=z pattern matching in Haskell is lazy, so the evaluation is not forced by the pattern matching, as one might reasonably expect, but presumably by the actions. It looks like some monads defined using this recipe might not actually even force sequential evaluation at all, so they may not be monads, technically speaking. Perhaps bang patterns should be used instead?)

There is a bit more to the monad story though. Monads can be merely used as a functional design pattern which has nothing to do with effects -- for example the list monad or the option monad. These monads can be used in any functional languages if so desired. This is why conflating monads and effects is confusing.

Let us revisit our running example of an assignment operation, which you would write as x:=m in most languages. In Haskell you would need to write something like do v <- x; e <-m; assign v e which first force terms x then m to produce values v then and exercise their interactions in the process, then produce the actual assignment via assign. (The actual syntax may end up being more complicated due to difficulties of mixing computations via mixing monads.) At any rate, whether such elaborations are to be considered an improvement is certainly a matter of personal taste.

#### Why are partiality and exceptions invisible to the Haskell type system?

The answer is a pragmatic one. Partial and exceptional computations produce at most one answer, whereas interactive (or non-deterministic) computations may produce multiple answers. Reading variable x for example produces different values at different times. Or, to put it otherwise, m = m evaluates to true, if it evaluates at all, in the presence of partiality and exceptions, whereas m = m may evaluate to false in the presence of interactive computations. The fact that m = m is true-or-bust is what is called "referential transparency" in Haskell, and is a powerful property which allows terms to be treated as if they are values and not computations, thus enabling a wide range of compiler optimisations. (See a related observation in an earlier post.)

I say "as if" because they are not values but computations -- but if the computation is manifested then the whole program either terminates abruptly (exceptions) or never (partiality), so the wrong answer is never produced. This is the idea of "partial correctness", by the way. So these computations are in some sense "benign" and don't need to be tracked.  Because monads are used in association with certain kinds of interactive computations, and not for benign computations, in Haskell they can serve as a simple kind of "effect type system", which is I suppose a genuine but incidental bonus.

By contrast, languages such as Agda take purity very seriously, only allowing computations which are guaranteed to produce one (and only one) value. These languages require the static checking both of exhaustive pattern matching and of termination, which makes them harder to program in. But they are "like maths", or rather "like logic", in a formal sense (see the Curry-Howard correspondence) and can be used dually as programming languages or as proof assistants.

To conclude, if I had control of the Haskell wiki, this is how I would start the page on Monads:
Monads in Haskell are used as a mechanism for scheduling evaluation, thus turning a language with a hard-to-predict evaluation strategy into a language with predictable, sequential, interactions. This makes it possible to add interactive computations such as state and input-output to the language, noting that benign (non-interactive) computations are already a part of the language, transparent to the type system.

## Monday, 23 July 2018

### Reflections of a PC Chair

We are please to inform you that your paper has been accepted for publication.
These are the words that you wait for as an academic researcher. But how do we get there? What is the process driving the decision of acceptance versus rejection? It is something known as "peer review". To paraphrase Churchill, it is a broken, frustrating, and error-prone process. Just that a better one has yet to be invented.

I am an academic researcher in theoretical computer science. Our area of research is peculiar in the elevated importance it gives to conference publications as opposed to journal publications. This has never been a deliberate decision but is something that just happened and we seem to be culturally stuck in this model, for better or worse (probably worse rather than better).

I have been involved in several aspects of academic research service as programme committee member, journal editor, and in a couple of instances as programme committee chair for established conferences. As a PC member I learned many things about how the process works, and as a PC chair I have tried to improve it.

#### How the process usually works

Most conferences I participate in are hosted by a an online system called EasyChair. It has a decidedly Web 1.0 flavour, neither elegant nor easy to use. But it has been used successfully in countless instances so its reputation is cemented. One feature of EasyChair exerts a direct, subtle, and sometimes unconscious bias on the way the papers are ultimately ranked. I doubt that a lot of thought went into that feature, but it has become established by repeated usage. It is essentially a preliminary ranking based on "acceptance" scores ranging for example from +2 ("strong accept") to -2 ("strong reject").

We are by now quite used to these ratings of "strong accept", "weak accept", "weak reject", "strong reject" even though of course ultimately accept/reject is a binary decision. To refine things further a reviewer is sometimes allowed to express a degree of confidence in their reviews, self-evaluating from "expert" (+4) down to "null" (0). These scores are then used to compute a weighted sum of all the scores of a paper, which are then used to do a preliminary, evolving, ranking of the papers.

The system works well enough to identify papers which are unanimously strong/weak accepts or strong/weak rejects, but a band of inconclusive results is often left in the middle, out of which papers must be selected. It is generally accepted that at this point the scores stop working. The weighted average of a "weak reject" by an "expert" and a "strong accept" by an "informed outsider" is for all purposes a meaningless quantity. I have no evidence that is so, but in a discussion I ran on my Facebook page, in which many senior academics participated, there seemed to be a clear consensus on this point. For these middling papers it is the role of PC discussions to sort out the final decision.

This idea seems nice in principle but it is unworkable in practice. The main reason is that each PC member is assigned a batch of papers to discuss and rank, so the PC member has a local view guided by their own batch of papers. On the other hand, the acceptance decision can only be done globally. If we accept 40% of the papers, that is a global ratio and it does not mean that for each reviewer 40% of their own batch is accepted. The local picture can be very different. In a conference I have recently chaired this is the accept / reject ratio per PC member:

Accept Reject
7
6
5
7
6
6
7
6
5
8
5
8
7
5
9
4
3
10
7
6
4
8
3
9
7
5
1
6
5
7
4
8
6
7
3
10
4
9
7
6
3
10

Depending on what you have on your plate, a paper may seem relatively good (or bad), but that view is biased by the vagaries of distribution.

Moreover, the discussions are further biased by the fact that the preliminary rankings based on the meaningless aggregation of reviewer accept/reject scores (possibly weighted by self-assessed expertise) is visible during the discussions. Even though the rankings are based on meaningless data, the position of a paper will inevitably mark it as a potential winner or loser, with the potential winners drawing the more energetic support of those who are willing to champion it.

I have seen this phenomenon happen, also in grant allocation panels. Preliminary rankings which are officially declared as meaningless, and putatively used only to sort out the order in which grant proposals are discussed, have a crucial impact on final outcomes.

#### An alternative ranking system

In a conference I have recently chaired I made two changes.

The first change is that I separated scores based on technical merit (i.e. whether a paper is wrong or trivial) from scores based on more subjective factors (i.e. whether a paper is important or interesting). The second thing is that no aggregate score was computed and no preliminary ranking was made available to PC members.

The technical scores were used in a first stage, to reject papers with potential problems. The last thing you want to do is to publish a paper that has mistakes, is trivial, or merely an insignificant increment on existing work. Because of this, if any of the reviewers indicated a technical problem with a paper, and that indication went unchallenged by the other reviewers, the paper was rejected. That seems to be the right way to consider technical scores overall. If one of three reviewers finds a mistake, and that mistake stands, it does not matter that the others didn't spot it, even if they thought that was an otherwise excellent paper. The correct aggregate is minimum not average.

The remaining papers after this stage were all in a techincal sense acceptable. Their publication would not have been embarrassing to us. A case can be made that they should have been accepted, but for logistic reasons outside of my control the total number of papers that could have been accepted was much smaller.

If the first stage focussed on rejection, the second stage focussed on acceptance. From the technically acceptable papers, a ranking was made on the basis of expressed interest. Aggregating a subjective measure, such as interest is not easy. Also, relying on discussions on a matter of taste is also not likely to be very helpful because more often than not interest is stemming out of familiarity with the topic. Few people are likely to become highly interested in papers out of their area, and those must be exceptionally interesting papers indeed.

Two ways of aggregating interest seem plausible: maximal interest vs. average interest. A test case would be a paper with a profile of +2 and -2 (like a lot / dislike a lot) versus +1 and 0 (like a little/indifferent). The first paper wins on maximal interest, the second one on average. Both of them make sense, although I am inclined to prefer the first one to the second one since it is likely to make for a more interesting talk. This should matter in a conference.

There is another aspect for which I preferred the maximal to the average. In the discussion stage the referees can adjust their interest scores following their reading of other reviews and discussions. By taking the average, in limit cases we give the referees an effective veto over what papers get accepted because a downward adjustment can easily take a paper out of contention. By taking the maximum we empower the reviewers who like the paper. In effect, the process is more acceptance-oriented rather than rejection-oriented. This aspect I liked a lot.

In the end, the outcomes of the two systems are not radically different, as one might expect. About 15% of the accepted papers under the maximal interest system would have been rejected under the average interest system, because one reviewer disliked it. They would have been replaced by less controversial papers (in terms of interest, because technically they are all already vetted).

The level of technical virtuosity of a paper is quite strongly correlated to the level of interest. The number of technically strong papers, as unanimously rated by their reviewers which didn't make the programme represented only about 13% of the programme. A technically strong paper is not necessarily interesting, but it seems that it often is.

### Conclusions

Was I happy with the resulting programme? Yes, in particular with the fact that several 'controversial' papers were accepted. One thing research should not do is regress to the mean, especially in matters of taste. There should be some room for those who want to stir things up -- provided their arguments are technically sound. I am more happy to see those in, rather than technically more complex papers which elicit little interest or dissent.

My main frustration is that the acceptance rate was artificially lowered by the total number of papers we could accept. Expanding the programme by 15% would have ensured that all papers considered interesting by at least one member of the PC were accepted. As it stands, all papers strongly liked by two or three members were accepted, and all papers nobody strongly liked were ultimately rejected. But still, there were several papers in between -- strongly liked by exactly one reviewer -- which should have been and could have been accepted. Instead, they had to be ranked and a line had to be drawn.

As an observer to the process, I cannot believe the claim that conferences with low acceptance ratios will accept "higher quality" papers. Quality of research is elusive at best, as a measure. The history of science is full of examples of papers which were rejected or ignored despite being not only correct but also revolutionary, or of discredited work which was widely accepted in its time. We can say with some confidence whether a paper is technically correct or not, but beyond that we venture into the realm of fashion, taste and aesthetics.

For the future I would make one further change. For technical vetting, which is laborious, perhaps 3 reviews are enough. But for gaging the level of interest further reviewers, perhaps the entire programme committee could and should be asked to express preferences. The sample of 3 people to assess interest seems statistically quite low to be conclusive. More people can include more papers into the pool of interesting papers, especially if the maximal aggregation is used.

A broader discussion can be had, and is ongoing in various quarters, on the suitability of the peer review process and the publication models of academic research in general and theoretical computer science in particular. But before wider, more revolutionary, change is undertaken, small tweaks to the existing process may reduce random sources of unconscious bias and lead to a more desirable distribution of accepted papers. Especially when the existing processes are not deliberately designed but merely the contingent outcome of existing IT support systems.

## Tuesday, 17 July 2018

A programming language is truly "pure" if all computations produce a value -- Agda is an example of such a language. For all its claims to the contrary, Haskell is not a pure language. Computations in Haskell may produce exceptions (try taking the head of the empty list) or nothing at all (in the case of computations that run forever). I don't hold that against Haskell, I just dislike the propaganda. I don't hold that against Haskell because math itself is not pure in this sense, since particular operations could be undefined, for example division by zero. But could we say that Haskell is "as pure as maths"? In the (not very serious) post below I will argue to the contrary: Haskell is a lazy language whereas maths is an eager language, at least as commonly practiced.

In math we have functions, as we do in programming. The functions of programming can be either lazy or eager, depending on the programming language. Are the functions of math lazy or eager?

In programming, the issue is quite important because it has a significant impact on one of the key equations of the lambda calculus, the so-called beta-law, which says that we can substitute (i.e. inline) an argument for the parameter of the function.

(fun x. M)N ≡ M[N/x]

This law lies at the heart of all compiler optimisations.

The equation is always valid in call-by-name languages such as Algol. In lazy languages such as Haskell it still holds in the presence of some effects such as exceptions or non-termination, but it does not hold in the presence of stateful assignments. In an eager languages such as OCaml it is also invalidated in the presence of non-termination or exceptions. It is therefore common that in call-by-value languages the argument (N above) must be a value, i.e. a language constant, a variable, or a lambda expression, usually written as v.

(fun x.M)v ≡ M[v/x]

Of course, substitution is also an important rule of equational reasoning, which is a big part of mathematical calculations. But which rule do we use in math, the by-name, lazy or the by-value (eager) substitution? And how can we tell?

Lets look closer at when the beta law fails in an eager functional program. If the argument is not a value (N≠v) but it is equivalent to a value (Nv) then the rule holds. It only fails when the argument is not equivalent to a value, for example because it runs forever (it 'diverges'). An example of such a term is  Ω =(fun x.x x)(fun x.x x). Or, more mundanely, Ω = while true do skip. An example of failure of the beta law is

(fun x.k) Ω ≢ k[Ω/x]

where k is a language constant. If we run the left-hand side as a program it will first run the argument Ω, which will result in a nonterminating computation. On the other hand, the right-hand side is just the term consisting of the constant k because substitution happens at a syntactic level. So its value is, trivially, k. The two cannot be equivalent.

In maths we don't usually deal with functions such as Ω but we can still have expressions that don't denote a value, such as for example 1/m when m is 0. Both Ω and 1/m are 'undefined' in this sense, that they don't denote a value.

The following example will test whether maths is lazy or eager. Consider the constant function

c(m) = 1

The question is whether the equation

mc(1/m) = 0

has a solution (m=0) or not.

I have not run any referenda on the matter, but from my own understanding of mathematical practice and from pub chats with mathematician pals I would say that m=0 is not considered a valid solution for the above. A function such as the constant function above is defined over values and not over terms.

Perhaps it could be otherwise, but it ain't. Meanwhile, in Haskell

let c(x) = 1 in let m = 0 in m * (c (1/m))

will always evaluate to 0 (or rather 0.0 to be more precise).

So Haskell is impure. And not even mathematically so.

## Tuesday, 6 March 2018

### Copying vs. sharing in functional languages

Debugging and profiling functional programs is a challenge, for at least two reasons.

The first reason is that functional programs have no obvious step-by-step flow of control as you would find in imperative programs -- there is no semicolon to induce explicitly a notion of "step". The flow of control is induced implicitly by the evaluation rules of the underlying lambda calculus, for example call-by-need in Haskell, or call-by-value (function-then-argument) in SML, or call-by-value mixing function-then-argument and argument-then-function in OCaml.

The second reason is that the memory footprint of a functional program is much more complex due to automatic copying of arguments and the creation of closures in the heap. Complex memory operations have a crucial impact on the efficiency of functional programs.

Debugging (and profiling) functional languages is also problematic because a low-level (assembly or bytecode) level of debugging is rather useless. Unlike in imperative languages, the conceptual gap between the functional language and the underlying concrete machine is too broad for such a representation to work. A lower-level, yet comprehensible, view of functional computation can be given by so-called 'abstract machines'.

Together with my student Koko Muroya, we have developed a new abstract machine for functional languages based on J.Y. Girard's "Geometry of Interaction" semantics of linear logic proofs. The idea of the machine is to represent the functional program as a graph, which is traversed by a token representing the flow of control. While the token traverses the graph it also causes the graph to reduce by rewriting. This machine is cost-accurate in the sense that it represents faithfully both the number of steps taken during the execution (for a particular evaluation strategy) and the memory footprint. Using a diagrammatic (2D) rather than textual (1D) representation is a convenient tool for describing the memory accurately, since it naturally distinguishes between shared and copied occurrences.

This, for example, is how the program (fun x -> x + x) 1 is represented:
We can observe the syntactically obvious elements of the graph (application "@", function "fun", addition "+", constant "1"). The function-definition subgraph (the lambda expression) is indicated by a dashed yellow line indicating that the subgraph is "copyable" only as a whole. What is distinctive about the GOI-based representation is the representation of variables as edges ("x") which can be shared via special nodes (in the diagram the sharing node is unlabelled). Also distinctive is the representation of abstraction as the connection of the abstracted variables to the fun (lambda) node.

Here is a step-by-step execution in this model. The red arrow indicates the flow of control, pointing initially at the root node of the program, which is the function application.

As we are in a call-by-value language, we are going to evaluate the argument first, which is a reasonable assumption -- evaluating the function first is also possible but more complex and less efficient in many cases. The flow of control advances to the constant "1" and then returns.

After evaluating the argument, the function is inspected next, in three steps. First the flow of control reaches the function, then it "opens" the function, then it enters the function.

At this stage the flow of control is aware that it has visited an application node "@" followed immediately by a function node "fun". This is a combination that can be reduced, by removing these nodes and linking the argument "1" directly to the edge corresponding to the variable "x":
The expression now is equivalent to "1+1", but such that the "1" is actually shared, not replicated (more like let x = 1 in x + x). But the rules of the language evaluation strategy, along with the further propagation of the flow of control will eventually lead to the replication of the argument:
The resulting graph is simply the syntactic tree of a simple expression, which is evaluated to the obvious constant value, in 5 extra stages, of which the last two are:
To avoid being overwhelmed by detail, in a more complex graph we may choose to hide the inside of functions which are not in the process of being evaluated. So, before evaluation the term above would be actually displayed as
Here is now a slightly more complex example in which we can see how closures are created and how they share information in the heap, namely during the execution of the term

let f = (fun x y -> x + y) (0 + 1) in (f 2) + (f 3)

The initial graph is
And here is the stage after a closure has been created via the partial function application (notice "x=1" capture by the closure):

Another interesting stage is the instantiation of the closure twice in the two applications, noting that the argument "x=1" is still shared between the two closures:

Now here is the execution visualiser in action on two more substantial examples, insert-sort vs. bubble-sort, and insert-sort vs. merge-sort. In both cases (sorting small lists) insert-sort wins. We can see how utterly horrible bubble-sort is for functional programming, noting the large amount of copying that happens in this program. We can also see how the constant-factor overheads for merge-sort are too large to be overcome by the asymptotic growth on such a small list.

The visualiser itself will be released soon -- watch this space. It has been implemented by Birmingham University Computer Science student Jack Hughes, as a final year project.

The graph-reduction theory behind this visualiser can be read in The Dynamic Geometry of Interaction Machine: A Token-Guided Graph Rewriter.

## Sunday, 22 October 2017

### The logic of machine learning: inductive and abductive reasoning

"The division of all inference into Abduction, Deduction, and Induction may almost be said to be the Key of Logic." C.S. Peirce

## Deduction, Induction, Abduction

C.S. Peirce, in his celebrated Illustrations of the Logic of Science, re-introduced three kinds of Aristotelian reasoning: deductive, inductive, and abductive. Deduction and induction are widely used in mathematics and computer science, and they have been thoroughly studied by philosophers of science and knowledge. Abduction, on the other hand, is more mysterious. Even the name “abduction” is controversial. Peirce claims that the word is a mis-translation of a corrupted text of Aristotle (“απαγωγη´”), and sometimes used “retroduction” or “hypothesis” to refer to it. But the name “abduction” seems to be the most common, so we will use it. According to Peirce the essence of deduction is the syllogism known as “Barbara”:

Rule: All men are mortal.
Case: Socrates is a man.
——————
Result: Socrates is a mortal.

Peirce calls all deduction "analytic" reasoning, the application of general rules to particular cases. Deduction always results in "apodeictic" knowledge: incontrovertible knowledge you can believe as strongly as you believe the premises. Peirce’s interesting formal experiment was to then permute the Rule, the Case, and the Result from this syllogism, resulting in two new patterns of inference which, he claims, play a key role in the logic of scientific discovery. The first one is induction:

Case: Socrates is a man.
Result: Socrates is a mortal.
——————
Rule: All men are mortal.

Here, from the specific we infer the general. Of course, as stated above the generalisation seems hasty, as only one specific case-study is generalised into a rule. But consider

Case: Socrates and Plato and Aristotle and Thales and Solon are men.
Result: Socrates and Plato and Aristotle and Thales and Solon mortal.
——————
Rule: All men are mortal.

The Case and Result could be extended to a list of billions, which would be quite convincing as an inductive argument. However, no matter how extensive the evidence, induction always involves a loss of certainty. According to Peirce, induction is an example of a synthetic and "ampliative" rule which generates new but uncertain knowledge. If a deduction can be believed, an inductively derived rule can only be presumed. (Also, one must not confuse this 'logical' induction, which is unsound, with 'mathematical' induction which is actually a sound deductive rule for introducing a universal quantifier.)

The other permutation of the statements is the rule of abductive inference or, has Peirce originally called it, “hypothesis”:

Result: Socrates is a mortal.
Rule: All men are mortal.
——————
Case: Socrates is a man.

This seems prima facie unsound and, indeed, Peirce acknowledges abduction as the weakest form of (synthetic) inference, and he gives a more convincing instance of abduction in a different example:

Result: Fossils of fish are found inland.
Rule: Fish live in the sea.
——————
Case: The inland used to be covered by the sea.

We can see that in the case of abduction the inference is clearly ampliative and the resulting knowledge has a serious question mark next to it. It is unwise to believe it, but we can surmise it. This is the word Peirce uses to describe the correct epistemological attitude regarding abductive inference. Unlike analytic inference, where conclusions can be believed a priori, synthetic inference gives us conclusions that can only be believed a posteriori, and even then always tentatively. This is why experiments play such a key role in science. They are the analytic test of a synthetic statement.

But the philosophical importance of abduction is greater still. Consider the following instance of abductive reasoning:

Rule: If the temperature is 20C then the thermometer reads 20C.
——————
Case: The temperature is 20C.

Peirce’s philosophy was directly opposed to Descartes’s extreme scepticism, and abductive reasoning is really the only way out of the quagmire of Cartesian doubt. We can never be totally sure whether the thermometer is working properly. Any instance of trusting our senses or instruments is an instance of abductive reasoning, and this is why we can only generally surmise the reality behind our perceptions. Whereas Descartes was paralysed by the fact that believing our senses can be questioned, Peirce just took it for what it was and moved on.

## A computational interpretation of abduction: machine learning

Formally, the three rules of inference could be written as:

A ⋀ (A → B) ⊢ B (Deduction)
A ⋀ B ⊢ A → B (Induction)
B ⋀ (A → B) ⊢ A (Abduction)

Using the Curry-Howard correspondence as a language design guide, we will arrive at some programming language constructs corresponding to these rules.

Deduction corresponds to producing B-data from A-data using a function A → B. Induction would correspond to creating a A → B function when we have some A-data and some B-data. And indeed, computationally we can (subject to some assumptions we will not dwell on in this informal discussion) create a look-up table from As to the Bs, which maybe will produce some default or approximate or interpolated/extrapolated value(s) when some new A-data is input. The process is clearly both ampliative, as new knowledge is created in the form of new input-output mappings, and tentative as those mappings may or may not be correct.

Abduction by contrast assumes the existence of some facts B and a mechanism of producing these facts A → B. As far as we are aware there is no established consensus as to what the As represent, so we make a proposal: the As are the parameters of the mechanism A → B of producing Bs, and abduction is a general process of choosing the “best” As to justify some given Bs. This is a machine-learning situation. Abduction has been often considered as “inference to the best explanation”, and our interpretation is consistent with this view if we consider the values of the parameters as the “explanation” of the facts.

Let us consider a simple example written in a generic functional syntax where the model is a linear map with parameters a and b. Compared to the rule above, the parameters are A = float × float and the “facts” are a model B = float → float

f : (float × float) → (float → float)
f (a, b) x = a ∗ x + b

A set of reference facts can be given as a look-up table data : B. The machine-learning situation
involves the production of an “optimal” set of parameters, relative to a pre-determined error (e.g.
least-squares) and using a generic optimisation algorithm (e.g. gradient descent):

(a0, b0) = abduct f data
f0 = f (a0, b0)

Note that a concrete, optimal model f0 can be now synthesised deductively from the parametrised model f and experimentally tested for accuracy. Since the optimisation algorithm is generic any model can be tuned via abduction, from the simplest (linear regression, as above) to models with thousands of parameters as found in a neural network.

## From abductive programming to programmable abduction

Since abduction can be carried out using generic search or optimisation algorithms, having a fixed built-in such procedure can be rather inconvenient and restrictive. Prolog is an example of a language in which an abduction-like algorithm is fixed. The idea is to make abduction itself programmable.

We propose a new functional programming construct the type of which is inspired by the abductive logical rule above. The rule is realised by identifying all provisional constants (parameters) in a term, collecting them, and producing a version of the term which is now explicitly parameterised. Concretely, we write let f @ p = t in t′ to mean that the term t is “abductively decoupled” into a function f and a collection of parameters p. Identifiers f and p are bound in t′.

We illustrate abductive programming with a typical machine-learning program. Provisional constants (parameters) are syntactically indicated using braces {−}. The model defines upper and lower boundaries for a set of points in the plain, taking any two arbitrary functions as boundaries. Concretely, it will use a line and a parabola. The parameters need to be tuned in order to best fit some reference data set:

let model low high x = (low x, high x)
let linr x = {1} ∗ x + {0}
let quad x = {4} ∗ x * x + {3} ∗ x + {2}
let m @ p = model linr quad
let p′ = optimise m loss data p
let model′ = m p′

The decoupling will extract a parameter vector p with values [0;1;2;3;4] and create a parameterised model m. Improved parameters p′ can be computed using the parameterised model and an elsewhere-defined generic optimisation and loss functions (e.g. gradient descent and least squares), reference data and the original parameter values. The improved concrete model is model′ which can be used as a standard function in the sequel. We leave it as a reader exercise to reimplement this program fragment in conventional (functional) styles and note the unpleasant accruing of complex boilerplate code required to manage parameters.

The idea we propose here is not entirely original. The transparent update of parameters via generic optimisation algorithms is one of the features that makes Google's TensorFlow so easy to use. What we wanted is to consider this feature as a first-class language feature, rather than an embedded DSL as is TensorFlow presented, under a Curry-Howard-like correspondence with logical inference. The abductive explanation seems to us plausible (and cute).

A longer discussion of abduction, its relation with machine learning, and more examples are given in our paper Abductive functional programming, a semantic approach [arXiv:1710.03984]. The paper has a link to a visualiser [link] which allows a step-by-step execution of an abductive functional program using a semantics in the style of the "Geometry of Interaction".

## Friday, 4 August 2017

### Can we machine-learn a programming language semantics?

In this post I will talk about whether denotational semantics of programming languages can be machine-learned and why that could be useful.

But let's set some background first.

Programming language semantics is a collection of mathematical theories about representing the meaning of programming languages in an abstract, machine-independent ways. The semantics of a language can be given either operationally, by a set of syntactic transformations which, loosely speaking, show how a program is evaluated, or can be given denotationally, by mapping any piece of code into a mathematical object, its "meaning". Operational specifications can be given to realistic languages and, as a method, is the workhorse of formal aspects of programming languages from verification to compilation. Denotational interpretations, on the other hand, are mathematically much more mathematically challenging to develop but are also more powerful. In particular, denotational models can be used to reason about equivalence of sub-programs by reducing it to the equality of their denotations. But perhaps the right phrasing is that denotational models would be useful, because they haven't been developed yet for actual programming languages, just for idealised versions of some programming languages.

When the operational and denotational models coincide, i.e. equivalence in one model coincides with equality in the other, the situation is said to be fully abstract, and it's generally considered a good thing. Full abstraction is a technically fascinating problem, but the notion of equivalence it models can be seen as somewhat contrived. I wrote about these issues earlier and I won't reprise the argument. But even with these caveats, it would certainly be neat to be able to say whether two program fragments are equivalent. This could have some nice application in program verification, in particular compiler optimisation verification.

But can we derive a denotational model of a programming language automatically, via some kind of machine learning? This question was posed to me by Martín Abadi, whom I had the pleasure to run into at a very nice research conference in Berlin a couple of years ago. Martín has been a leading light of programming language research for many years, and is now involved with the TensorFlow project at Google. He asked me this question for a particular reason. My research gravitates around a particular kind of denotational semantics called game semantics. I wrote about game semantics earlier, and I recommend these posts for an in-depth informal introduction.

What makes game semantics appealing from the point of view of machine learning is that it is mathematically elementary. Its basic notion is that of a move, i.e. a way in which a sub-program can interact with its broader context. More precisely, function calls and returns, either from the sub-program into the context or the other way: in total, four kinds. Moves are sequenced into plays, which represent the way a sub-program interacts with its larger context during a particular program execution. Finally, all these plays are grouped into sets called strategies, which then model that particular sub-program. So in the end, each sub-program is represented by a set of sequences -- a language. And we know that languages can be, to some extent, machine learned.

Luckily, we can conduct a controlled experiment using one of the numerous idealised languages for which game semantic models (fully abstract nonetheless) have been created. We can generate random plays (sequences) using the model and see whether they can be learned. To make a long story short, it turns out that yes we can learn these sets, for rather complex languages (higher-order functions, state, concurrency). And we can learn them rather well and rather efficiently, using LSTMs and TensorFlow. The full story is in the paper On the Learnability of Programming Language Semantics, which has recently been presented at the 10th Interaction and Concurrency Experience (June 21-22, 2017, Neuchâtel, Switzerland), in collaboration with Khulood Alyahya [PDF].

The work of creating such sequences for production languages is harder, but not impossible. Any sub-term to be modeled in a larger program must be instrumented to record its interactions with the context, in a way similar to a profiler, then the program must be executed on random inputs and the interaction data collected. And then other sub-terms in other programs must be instrumented in the same way. It is a lot of work but it all can be automated. And, considering our proof-of-concept experiment, there is a good chance that the resulting model is going to be accurate enough to be useful!

### Zippers for non-inductive types

Famous mathematician-philosoper Gian-Carlo Rota said that mathematical understanding is all about showing how things that seem different are...