Dan Ghica's Personal Blog: Teaching Compilers

For the first time in my 15 years as a lecturer I got to teach Compilers. I have written compilers before and I have published a bunch of research papers on the topic, so I considered myself reasonably qualified to teach it. I had also taught Compilers Workshop before, a hands-on project oriented course on building a working compiler. So from my own experience and from observing students, I had a pretty good idea about where the challenges lie. So I wanted to teach a course that was aimed particularly at those challenges. I looked around quite hard for a textbook that would give some emphasis to these topics which I deemed to be crucial, but I failed to find one.

Here are some random musings on these topics.

Parsing

Compiler books spend an inordinate amount of time on the intricacies of parsing. As a subject in itself parsing is fascinating both from a theoretical and algorithmic perspective. As a conceptual hurdle in building a parser for a compiler, it is not a significant one. If the aim is to build a compiler, rather than a parser, then the reasonable thing to do is to use a parser generator. To use a parser generator the required concepts are that of a grammar and that of conflict. The intricacies of understanding the details of the parsing algorithm are irrelevant as they do not help designing elegant grammars nor even resolving conflicts. Like grammar design, resolving conflicts is an art, because there are no algorithms for resolving grammar conflicts. I suspect the reason parsing is such a popular element of compiler courses is that it's very testable. Putting a grammar into a particular form so it becomes parsable by a particular parser is the kind of technical drudgery that the student will forget as soon as they write their final exam, but it makes the job of writing a final exam easy. Parsing could be a valid stand-alone course for those interested in that kind of stuff.

Operational Semantics

The absence of OS from compilers textbook is to me baffling. Any language needs to be specified before it's implemented and OS is a reasonably straightforward way to achieve that. Instead, conventional compilers books only show how to compile basic language features that you find in a C compiler and their behaviour is "obvious". Even closures, essential for functional languages, are often ignored. But if you wanted to write a new compiler, presumably your language had some unique features? How do you specify those new features? Operational Semantics.

Abstract Machines

They are the nexus between OS and executable code. There is some beautiful (and reasonably elementary and accessible) theory about how AMs can be derived from OS. In the case of functional programming in particular AMs are unavoidable. But even in the case of OO languages, which have a complex method dispatch protocol, an AM is the best way to understand the OS in an execution-oriented way.

Types

Modern programming languages are all about types, yet compiler textbook spend a trivial amount of time on what a type system is and on type inference algorithms. Even the terminology for type inference and type checking enshrined in compiler textbook is bizarre, "semantic analysis". Students must be aware of the powerful combination of OS and Type Systems in order to guarantee various soundness-style runtime properties, even though the details of such proofs do not belong in the compilers course (but in a Programming Languages course). However, these concepts must be understood by any compiler designer and implementer.

Code generation

Instead of OS and AMs the conventional compiler book will jump straight into code generation, presented as a series of recipes for compiling this feature and that feature. Code generation is usually baffling because there is little understanding as to where those recipes come from (absent an AM model) and how we know they are correct (absent an OS). They are taken as obvious, I suppose, but they are not.

Front-end optimisations

So long as the code is still represented as some kind of syntax tree most compiler books do a good job at presenting optimisation. Things like constant folding and constant propagation are a good mix of interesting concepts and clever algorithms. Particularly interesting, both conceptually and algorithmically is Single Static Assignment and optimisations deriving from it. However, absent an OS it is difficult to justify that such transformations are meaning-preserving. Also, absent an AM with an execution-cost model it is difficult to justify that a particular transformation reduces the cost of execution. Both these aspects need to be taken as obviously correct, but it is not entirely clear that they always are.

Back-end optimisations

After code-generation optimisation becomes an extremely complex exercise. Things like register allocation and other memory and instruction management optimisations are so tied to the machine model that algorithms are fragile. We need to either give Mickey Mouse algorithms based on greatly simplified machine models or we stand to be drowned into a sea of processor and operating/runtime system details that change continuously due to evolving technologies. I think that at this level the student should only be made aware of the concepts, yet the compiler textbooks insist on technical detail -- register allocation via graph colouring is the darling here. I think this is also the kind of stuff that the student will instantly forget so there is no great reason to teach it, except that it makes for easy-to-grade exam questions.

Exotic architectures

Nowadays multicore, cloud, GPUs and FPGAs are no longer rare. They all raise interesting and pertinent challenges for compilers, yet they are all largely ignored. Technical details are complicated so they cannot be taught, but the concepts can be taught in a way that at least makes students aware of the challenges they will face once they leave the enchanted land of the compiler course and its vanilla Von Neumann architecture.

❡

In my own course I tried to address all these issues by:

teaching only basic concepts of parsing along with some hands-on skills on parser generators
introducing operational semantics, for higher-order functions and state
introducing types, type inference, and the basic idea of runtime safety
introducing abstract machines derived from the operational semantics and showing how they lead to efficient interpreters
deriving code from the abstract machines
conventional presentation of optimisations (both front-end and back-end)
a tour of exotic architecture and programming languages for the future

It was an incredible amount of work and the course notes are still quite rough -- too rough for public consumption -- but very satisfying. Most satisfying was the very positive reaction of my students who rated the course very highly.

4 comments:

Stefan Ciobaca23 April 2020 at 07:37
Great choices! Another recent development worth introducing into compiler lectures, imho, is incrementality, as described here: https://www.youtube.com/watch?v=wSdV1M7n4gQ.
Peter Sovietov23 April 2020 at 13:20
Thank you for your notes, it was a really interesting read! Here are some of my thoughts on the topic, if you don't mind.

Most of the modern compilers don't use parser generators due to issues like error checking, speed, integration with the main code etc. They use a recursive-decent technique with extensions like PEG, Pratt parser. It makes sense to teach students to implement these techniques in the form of parser combinators.

An interesting thing about parsing and finite automaton theory (FA) is that you may use them in the later compiler phases too: in instruction scheduling (see VLIW scheduling in LLVM, it's based on FA), in instruction selection (burs/burg-like methods), even in dataflow analysis.

Yes, it's really important to have standard formal semantics for your PL. But the problem here is that very few existing PLs (Scheme, ML) have it.

Moreover, you need both static and dynamic semantics here. And the additional problem is that we have a lot of different "semantics", and every year there is more. So which one should I select both as part of language standard and as part of compiler implementation?

For example, there is a very promising approach by Peter D. Mosses: https://arxiv.org/pdf/1912.10631.pdf

It's a very interesting idea to automatically generate a compiler from a formal semantics description. One may use partial evaluation, Futamura projections... but it's unclear if you may get an efficient/optimizing compiler as the result.

Yet another question is which abstract machine (AM) to choose. If we want to use it to formalize/automate code generation then it makes sense to take a look at your target machine first. Then your AM may have some specific characteristics: for example, be register-based, stack-based, or even data flow-based.

It's really important to talk about optimizations and exotic architectures (my favorite topics!), but in the end, the students may just get the impression that these topics are too messy, informal, and low-level. And, of course, it's not true, these topics are hottest, most evolving of all other compiler-related things. Maybe it makes sense just to leave them for another course instead?
Anonymous30 October 2020 at 05:51
Thanks for sharing! I found this insightful as I'm currently exploring the world of compilers on my free time. Just curious. Do you plan to make any part of your compiler course public in the future?

Dan Ghica's Personal Blog

Sunday, 19 April 2020

Teaching Compilers