Monday 23 July 2018

Reflections of a PC Chair

"We are pleased to inform you that your paper has been accepted for publication."
These are the words that you wait for as an academic researcher. But how do we get there? What is the process driving the decision of acceptance versus rejection? It is something known as "peer review". To paraphrase Churchill, it is a broken, frustrating, and error-prone process; it is just that a better one has yet to be invented.

I am an academic researcher in theoretical computer science. Our area of research is peculiar in the elevated importance it gives to conference publications as opposed to journal publications. This has never been a deliberate decision but is something that just happened and we seem to be culturally stuck in this model, for better or worse (probably worse rather than better).

I have been involved in several aspects of academic research service as programme committee member, journal editor, and in a couple of instances as programme committee chair for established conferences. As a PC member I learned many things about how the process works, and as a PC chair I have tried to improve it.

How the process usually works

Most conferences I participate in are hosted by an online system called EasyChair. It has a decidedly Web 1.0 flavour, neither elegant nor easy to use, but it has been used successfully in countless instances, so its reputation is cemented. One feature of EasyChair exerts a direct, subtle, and sometimes unconscious bias on the way the papers are ultimately ranked. I doubt that a lot of thought went into that feature, but it has become established by repeated usage. It is essentially a preliminary ranking based on "acceptance" scores ranging, for example, from +2 ("strong accept") to -2 ("strong reject").

We are by now quite used to these ratings of "strong accept", "weak accept", "weak reject", "strong reject", even though ultimately accept/reject is of course a binary decision. To refine things further a reviewer is sometimes allowed to express a degree of confidence in their review, self-evaluating from "expert" (+4) down to "null" (0). These confidence scores are then used to compute a weighted sum of all the acceptance scores of a paper, which is then used to produce a preliminary, evolving ranking of the papers.
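As a minimal sketch (assuming a simple confidence-weighted average, which may not be exactly what EasyChair computes), the aggregation looks something like this:

```python
# A sketch of a confidence-weighted average of acceptance scores. The exact
# formula EasyChair uses is an assumption here; the scales are the ones
# mentioned above (-2..+2 for acceptance, 0..4 for confidence).

def weighted_score(reviews):
    """reviews: list of (acceptance, confidence) pairs."""
    total_confidence = sum(conf for _, conf in reviews)
    if total_confidence == 0:
        return 0.0
    return sum(score * conf for score, conf in reviews) / total_confidence

# A "weak reject" (-1) from an expert (4) and a "strong accept" (+2) from an
# informed outsider (2) cancel out exactly: a number, but hardly a meaning.
print(weighted_score([(-1, 4), (+2, 2)]))  # 0.0
```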

The system works well enough to identify papers which are unanimously strong/weak accepts or strong/weak rejects, but a band of inconclusive results is often left in the middle, out of which papers must still be selected. It is generally accepted that at this point the scores stop working. The weighted average of a "weak reject" by an "expert" and a "strong accept" by an "informed outsider" is for all practical purposes a meaningless quantity. I have no hard evidence that this is so, but in a discussion I ran on my Facebook page, in which many senior academics participated, there seemed to be a clear consensus on this point. For these middling papers it is the role of the PC discussions to sort out the final decision.

This idea seems nice in principle but it is unworkable in practice. The main reason is that each PC member is assigned a batch of papers to discuss and rank, so each PC member has a local view guided by their own batch. The acceptance decision, on the other hand, can only be made globally. If we accept 40% of the papers, that is a global ratio; it does not mean that 40% of each reviewer's own batch is accepted. The local picture can be very different. In a conference I recently chaired, these were the accept/reject counts per PC member:


Accept  Reject
7       6
5       7
6       6
7       6
5       8
5       8
7       5
9       4
3       10
7       6
4       8
3       9
7       5
1       6
5       7
4       8
6       7
3       10
4       9
7       6
3       10

Depending on what you have on your plate, a paper may seem relatively good (or bad), but that view is biased by the vagaries of distribution.
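To make the spread concrete, here is a small computation over the counts in the table above: some PC members saw roughly two thirds of their batch accepted, others barely a seventh, even though the global ratio is fixed.

```python
# The per-PC-member accept/reject counts from the table above, and the
# "local" accept ratio each member sees, versus the global ratio over the
# same counts (about 0.42).

batches = [(7, 6), (5, 7), (6, 6), (7, 6), (5, 8), (5, 8), (7, 5),
           (9, 4), (3, 10), (7, 6), (4, 8), (3, 9), (7, 5), (1, 6),
           (5, 7), (4, 8), (6, 7), (3, 10), (4, 9), (7, 6), (3, 10)]

local = [acc / (acc + rej) for acc, rej in batches]
print(round(min(local), 2), round(max(local), 2))  # 0.14 0.69
```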

Moreover, the discussions are further biased by the fact that the preliminary rankings, based on the meaningless aggregation of reviewer accept/reject scores (possibly weighted by self-assessed expertise), remain visible during the discussions. Even though the rankings are based on meaningless data, the position of a paper will inevitably mark it as a potential winner or loser, with the potential winners drawing the more energetic support of those willing to champion them.

I have seen this phenomenon at work in grant allocation panels as well. Preliminary rankings which are officially declared to be meaningless, and putatively used only to set the order in which grant proposals are discussed, have a crucial impact on the final outcomes.

An alternative ranking system

In a conference I have recently chaired I made two changes.

The first change is that I separated scores based on technical merit (i.e. whether a paper is wrong or trivial) from scores based on more subjective factors (i.e. whether a paper is important or interesting). The second is that no aggregate score was computed and no preliminary ranking was made available to the PC members.

The technical scores were used in a first stage, to reject papers with potential problems. The last thing you want to do is to publish a paper that has mistakes, is trivial, or is merely an insignificant increment on existing work. Because of this, if any of the reviewers indicated a technical problem with a paper, and that indication went unchallenged by the other reviewers, the paper was rejected. That seems to be the right way to treat technical scores overall. If one of three reviewers finds a mistake, and that mistake stands, it does not matter that the others did not spot it, even if they thought it was an otherwise excellent paper. The correct aggregate is the minimum, not the average.
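A minimal sketch of this rule (the numeric scale and threshold below are illustrative assumptions; in practice the vetting was a matter of reviewer judgement and discussion, not a formula):

```python
# A sketch of the first-stage rule: technical scores aggregate by minimum,
# not average, so one unresolved objection sinks the paper. The numeric
# scale and threshold are illustrative assumptions.

def passes_technical_vetting(technical_scores, threshold=0):
    return min(technical_scores) >= threshold

print(passes_technical_vetting([+2, +2, -1]))  # False: one reviewer found a flaw
print(passes_technical_vetting([+1, 0, +1]))   # True: nobody flagged a problem
```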

The papers remaining after this stage were all acceptable in a technical sense. Their publication would not have been embarrassing to us. A case can be made that they should all have been accepted, but for logistical reasons outside of my control the total number of papers that could be accepted was much smaller.

If the first stage focussed on rejection, the second stage focussed on acceptance. From the technically acceptable papers, a ranking was made on the basis of expressed interest. Aggregating a subjective measure such as interest is not easy. Relying on discussions in a matter of taste is not likely to be very helpful either, because more often than not interest stems from familiarity with the topic. Few people are likely to become highly interested in papers outside their area, and those must be exceptionally interesting papers indeed.

Two ways of aggregating interest seem plausible: maximal interest versus average interest. A test case would be a paper with a profile of +2 and -2 (like a lot / dislike a lot) versus one with +1 and 0 (like a little / indifferent). The first paper wins on maximal interest, the second on average. Both rules make sense, although I am inclined to prefer the first, since it is likely to make for a more interesting talk, and that should matter in a conference.
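Spelling out the test case under both rules:

```python
# The test case from the text, under both aggregation rules, on the
# -2..+2 interest scale used above.

profiles = {"controversial": [+2, -2], "mildly liked": [+1, 0]}

for name, scores in profiles.items():
    print(name, "max:", max(scores), "average:", sum(scores) / len(scores))
# controversial wins on the maximum (2 vs 1);
# mildly liked wins on the average (0.5 vs 0.0)
```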

There is another aspect for which I preferred the maximum to the average. In the discussion stage the referees can adjust their interest scores after reading the other reviews and the discussion. By taking the average, in limit cases we give each referee an effective veto over which papers get accepted, because a downward adjustment can easily take a paper out of contention. By taking the maximum we empower the reviewers who like the paper. In effect, the process becomes acceptance-oriented rather than rejection-oriented. This aspect I liked a lot.

In the end, the outcomes of the two systems are, as one might expect, not radically different. About 15% of the papers accepted under the maximal-interest rule would have been rejected under the average-interest rule, because one reviewer disliked them. They would have been replaced by less controversial papers (in terms of interest; technically, all of them had already been vetted).

The level of technical virtuosity of a paper turns out to be quite strongly correlated with the level of interest. The papers unanimously rated as technically strong by their reviewers which nevertheless did not make the programme amounted to only about 13% of the programme. A technically strong paper is not necessarily interesting, but it seems that it often is.

Conclusions

Was I happy with the resulting programme? Yes, in particular with the fact that several 'controversial' papers were accepted. One thing research should not do is regress to the mean, especially in matters of taste. There should be some room for those who want to stir things up -- provided their arguments are technically sound. I am happier to see those in than technically more complex papers which elicit little interest or dissent.

My main frustration is that the acceptance rate was artificially lowered by the total number of papers we could accept. Expanding the programme by 15% would have ensured that all papers considered interesting by at least one member of the PC were accepted. As it stands, all papers strongly liked by two or three members were accepted, and all papers nobody strongly liked were ultimately rejected. But still, there were several papers in between -- strongly liked by exactly one reviewer -- which should have been and could have been accepted. Instead, they had to be ranked and a line had to be drawn.

As an observer of the process, I cannot believe the claim that conferences with low acceptance ratios accept "higher quality" papers. As a measure, the quality of research is elusive at best. The history of science is full of examples of papers which were rejected or ignored despite being not only correct but also revolutionary, or of discredited work which was widely accepted in its time. We can say with some confidence whether a paper is technically correct or not, but beyond that we venture into the realm of fashion, taste and aesthetics.

For the future I would make one further change. For technical vetting, which is laborious, perhaps three reviews are enough. But for gauging the level of interest, further reviewers -- perhaps the entire programme committee -- could and should be asked to express preferences. A sample of three people to assess interest seems too small to be conclusive. More reviewers would bring more papers into the pool of interesting ones, especially if the maximal aggregation is used.


A broader discussion can be had, and is ongoing in various quarters, on the suitability of the peer review process and the publication models of academic research in general and theoretical computer science in particular. But before wider, more revolutionary, change is undertaken, small tweaks to the existing process may reduce random sources of unconscious bias and lead to a more desirable distribution of accepted papers. Especially when the existing processes are not deliberately designed but merely the contingent outcome of existing IT support systems.

2 comments:

  1. "This idea seems nice in principle but it is unworkable in practice. The main reason is that each PC member is assigned a batch of papers to discuss and rank, so the PC member has a local view guided by their own batch of papers."

    I used to think so until recently, when somebody pointed out that I've been reading many papers in the past.

    For example, if you review for POPL, then you've been reading many POPL papers in the past, and that gives you a good baseline of what "POPL quality" is supposed to be. Judging quality of papers only relative to the review batch you have (what you call "local view") is indeed a mistake.

    Replies
    1. That is an excellent point and I am happy you made it!

      POPL is special in that it requires (or at least it did last time I was on the PC) its PC members to review all their assigned papers themselves. This ensures that all papers are read and evaluated by reviewers who have this reasonably broad perspective that you mention.

      However, most conferences (including the ones I have chaired) rely extensively on sub-refereeing. Case in point: 56% of the reviews were written by sub-referees. Unlike the experienced POPL committee members, sub-referees include many junior researchers who lack this broader perspective and make their judgement on an extremely narrow sample of one paper, with not much context. The PC members are reduced to assembling some kind of hearsay perspective of their papers. This is actually worse than what I wrote in the post; I didn't want to say it, but you made me :)

      POPL is also special because it is an old, established conference which publishes many papers, so a sense of a baseline does indeed exist. So your comment is correct, but it also describes an exceptional case.

