Sunday, 22 October 2017

The logic of machine learning: inductive and abductive reasoning

"The division of all inference into Abduction, Deduction, and Induction may almost be said to be the Key of Logic." C.S. Peirce

Deduction, Induction, Abduction

C.S. Peirce, in his celebrated Illustrations of the Logic of Science, re-introduced three kinds of Aristotelian reasoning: deductive, inductive, and abductive. Deduction and induction are widely used in mathematics and computer science, and they have been thoroughly studied by philosophers of science and knowledge. Abduction, on the other hand, is more mysterious. Even the name “abduction” is controversial. Peirce claimed that the word is a mis-translation of a corrupted text of Aristotle (“ἀπαγωγή”), and he sometimes used “retroduction” or “hypothesis” instead. But the name “abduction” seems to be the most common, so we will use it. According to Peirce the essence of deduction is the syllogism known as “Barbara”:

Rule: All men are mortal.
Case: Socrates is a man.
Result: Socrates is a mortal.

Peirce calls all deduction "analytic" reasoning, the application of general rules to particular cases. Deduction always results in "apodeictic" knowledge: incontrovertible knowledge you can believe as strongly as you believe the premises. Peirce’s interesting formal experiment was to then permute the Rule, the Case, and the Result from this syllogism, resulting in two new patterns of inference which, he claims, play a key role in the logic of scientific discovery. The first one is induction:

Case: Socrates is a man.
Result: Socrates is a mortal.
Rule: All men are mortal.

Here, from the specific we infer the general. Of course, as stated above the generalisation seems hasty, as only one specific case-study is generalised into a rule. But consider

Case: Socrates and Plato and Aristotle and Thales and Solon are men.
Result: Socrates and Plato and Aristotle and Thales and Solon are mortal.
Rule: All men are mortal.

The Case and Result could be extended to a list of billions, which would be quite convincing as an inductive argument. However, no matter how extensive the evidence, induction always involves a loss of certainty. According to Peirce, induction is an example of a synthetic and "ampliative" rule which generates new but uncertain knowledge. If a deduction can be believed, an inductively derived rule can only be presumed. (Also, one must not confuse this 'logical' induction, which is unsound, with 'mathematical' induction which is actually a sound deductive rule for introducing a universal quantifier.)

The other permutation of the statements is the rule of abductive inference or, as Peirce originally called it, “hypothesis”:

Result: Socrates is a mortal.
Rule: All men are mortal.
Case: Socrates is a man.

This seems prima facie unsound, and indeed Peirce acknowledges abduction as the weakest form of (synthetic) inference. He gives a more convincing instance of abduction in a different example:

Result: Fossils of fish are found inland.
Rule: Fish live in the sea.
Case: The inland used to be covered by the sea.

We can see that in the case of abduction the inference is clearly ampliative and the resulting knowledge has a serious question mark next to it. It is unwise to believe it, but we can surmise it. This is the word Peirce uses to describe the correct epistemological attitude regarding abductive inference. Unlike analytic inference, where conclusions can be believed a priori, synthetic inference gives us conclusions that can only be believed a posteriori, and even then always tentatively. This is why experiments play such a key role in science. They are the analytic test of a synthetic statement.

But the philosophical importance of abduction is greater still. Consider the following instance of abductive reasoning:

Result: The thermometer reads 20C.
Rule: If the temperature is 20C then the thermometer reads 20C.
Case: The temperature is 20C.

Peirce’s philosophy was directly opposed to Descartes’s extreme scepticism, and abductive reasoning is really the only way out of the quagmire of Cartesian doubt. We can never be totally sure whether the thermometer is working properly. Any instance of trusting our senses or instruments is an instance of abductive reasoning, and this is why we can only generally surmise the reality behind our perceptions. Whereas Descartes was paralysed by the fact that trust in our senses can always be questioned, Peirce just took it for what it was and moved on.

A computational interpretation of abduction: machine learning

Formally, the three rules of inference could be written as:

A ∧ (A → B) ⊢ B (Deduction)
A ∧ B ⊢ A → B (Induction)
B ∧ (A → B) ⊢ A (Abduction)

Using the Curry-Howard correspondence as a language design guide, we will arrive at some programming language constructs corresponding to these rules. 

Deduction corresponds to producing B-data from A-data using a function A → B. Induction would correspond to creating an A → B function when we have some A-data and some B-data. And indeed, computationally we can (subject to some assumptions we will not dwell on in this informal discussion) create a look-up table from the As to the Bs, which may produce some default or approximate or interpolated/extrapolated value(s) when new A-data is input. The process is clearly both ampliative, as new knowledge is created in the form of new input-output mappings, and tentative, as those mappings may or may not be correct.
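To make this concrete, here is a minimal sketch in Python (the names and the choice of default value are our own) of induction as a look-up table:

```python
def induce(default, cases):
    """'Induction' as a look-up table: from paired A-data and B-data,
    synthesise a function A -> B. For inputs outside the observed cases
    the table falls back on a default value, which is exactly where the
    inferred rule is ampliative and tentative."""
    table = dict(cases)
    return lambda x: table.get(x, default)

# Example: generalising "Socrates is mortal, Plato is mortal"
# into a (presumed) rule about mortality.
mortal = induce(False, [("Socrates", True), ("Plato", True)])
```

Here `mortal("Socrates")` is True by observation, while `mortal("Zeus")` yields the default False: the synthesised rule may simply be wrong on inputs it has never seen.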

Abduction, by contrast, assumes the existence of some facts B and a mechanism A → B for producing these facts. As far as we are aware there is no established consensus as to what the As represent, so we make a proposal: the As are the parameters of the mechanism A → B for producing Bs, and abduction is a general process of choosing the “best” As to justify some given Bs. This is a machine-learning situation. Abduction has often been considered as “inference to the best explanation”, and our interpretation is consistent with this view if we consider the values of the parameters as the “explanation” of the facts.

Let us consider a simple example, written in a generic functional syntax, where the model is a linear map with parameters a and b. Compared to the rule above, the parameters are A = float × float and the “facts” are a model B = float → float:

f : (float × float) → (float → float)
f (a, b) x = a ∗ x + b

A set of reference facts can be given as a look-up table data : B. The machine-learning situation involves the production of an “optimal” set of parameters, relative to a pre-determined error (e.g. least-squares) and using a generic optimisation algorithm (e.g. gradient descent):

(a0, b0) = abduct f data
f0 = f (a0, b0)

Note that a concrete, optimal model f0 can be now synthesised deductively from the parametrised model f and experimentally tested for accuracy. Since the optimisation algorithm is generic any model can be tuned via abduction, from the simplest (linear regression, as above) to models with thousands of parameters as found in a neural network.
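As a sketch of what such a generic abduct might look like, here is a Python version specialised to the linear model above, using a least-squares loss and naive gradient descent with numerically estimated gradients (the starting point, step size, and iteration count are arbitrary choices of ours):

```python
def f(params):
    """The parametrised model from the text: A = (a, b), B = x -> a*x + b."""
    a, b = params
    return lambda x: a * x + b

def lsq(data, params):
    """Least-squares error of candidate parameters against reference data."""
    model = f(params)
    return sum((model(x) - y) ** 2 for (x, y) in data)

def abduct(data, steps=5000, rate=1e-2, eps=1e-6):
    """A crude 'abduct': numerically estimate the gradient of the loss
    and step against it a fixed number of times, starting from (0, 0)."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        l = lsq(data, (a, b))
        da = (lsq(data, (a + eps, b)) - l) / eps
        db = (lsq(data, (a, b + eps)) - l) / eps
        a, b = a - rate * da, b - rate * db
    return a, b
```

On data sampled from y = 2x + 1, for instance [(0, 1), (1, 3), (2, 5), (3, 7)], `abduct` returns parameters close to (2, 1), and `f(abduct(data))` plays the role of the concrete model f0.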

From abductive programming to programmable abduction

Since abduction can be carried out using generic search or optimisation algorithms, having such a procedure fixed and built into the language can be rather inconvenient and restrictive. Prolog is an example of a language in which an abduction-like algorithm is fixed. The idea is to make abduction itself programmable.

We propose a new functional programming construct the type of which is inspired by the abductive logical rule above. The rule is realised by identifying all provisional constants (parameters) in a term, collecting them, and producing a version of the term which is now explicitly parameterised. Concretely, we write let f @ p = t in t′ to mean that the term t is “abductively decoupled” into a function f and a collection of parameters p. Identifiers f and p are bound in t′.

We illustrate abductive programming with a typical machine-learning program. Provisional constants (parameters) are syntactically indicated using braces {−}. The model defines upper and lower boundaries for a set of points in the plane, taking any two arbitrary functions as boundaries; concretely, it will use a line and a parabola. The parameters need to be tuned in order to best fit some reference data set:

let model low high x = (low x, high x) 
let linr x = {1} ∗ x + {0} 
let quad x = {4} ∗ x * x + {3} ∗ x + {2} 
let m @ p = model linr quad
let p′ = optimise m loss data p 
let model′ = m p′

The decoupling will extract a parameter vector p with values [0;1;2;3;4] and create a parameterised model m. Improved parameters p′ can be computed using the parameterised model together with elsewhere-defined generic optimisation and loss functions (e.g. gradient descent and least squares), the reference data, and the original parameter values. The improved concrete model is model′, which can be used as a standard function in the sequel. We leave it as an exercise for the reader to reimplement this program fragment in a conventional (functional) style and note the unpleasant accrual of boilerplate code required to manage the parameters.
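To give a flavour of that exercise, here is a sketch in Python (the names and the parameter ordering are our own) of what the same fragment looks like when the decoupling must be done by hand: every provisional constant becomes an explicit argument, and the programmer must pack and unpack the parameter vector manually.

```python
def model(low, high):
    """The two-boundary model, as in the text."""
    return lambda x: (low(x), high(x))

def linr(a, b):
    """The line, with its provisional constants now explicit arguments."""
    return lambda x: a * x + b

def quad(c, d, e):
    """The parabola, likewise explicitly parameterised."""
    return lambda x: c * x * x + d * x + e

def m(p):
    """The hand-written 'decoupled' model: the flat parameter vector must
    be unpacked in exactly the right order, fragile boilerplate that the
    abductive construct would generate automatically."""
    b, a, e, d, c = p  # order matching the initial vector [0; 1; 2; 3; 4]
    return model(linr(a, b), quad(c, d, e))

# Initial parameter values, in the order m expects.
p = [0, 1, 2, 3, 4]
```

From here an `optimise(m, loss, data, p)` and `m(p_new)` would proceed as in the original fragment; the point is that `m` and `p` must be kept consistent by hand, and any change to a model's parameters ripples through the unpacking code.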

The idea we propose here is not entirely original. The transparent update of parameters via generic optimisation algorithms is one of the features that makes Google's TensorFlow so easy to use. What we wanted was to consider this feature as a first-class language construct, rather than as an embedded DSL, which is how TensorFlow is presented, and to place it under a Curry-Howard-like correspondence with logical inference. The abductive explanation seems to us plausible (and cute).

A longer discussion of abduction, its relation with machine learning, and more examples are given in our paper Abductive functional programming, a semantic approach [arXiv:1710.03984]. The paper has a link to a visualiser [link] which allows a step-by-step execution of an abductive functional program using a semantics in the style of the "Geometry of Interaction". 
