On the Ofqual algorithm for the allocation of A-level grades
The recent scandal surrounding the use of an algorithm by the Office of Qualifications and Examinations Regulation (Ofqual) to adjust students’ predicted A-level grades has highlighted a lack of consideration of basic data ethics by the government. In this brief report we outline, at a high level, the basic pipeline of the algorithm used, why this approach was not sound from a technical perspective, and why the entire notion of using an algorithm for this purpose raises significant ethical concerns.
The Ofqual process
For all students, teachers provided two pieces of information: a predicted grade, referred to as a centre assessment grade (CAG); and a ranking of students within each grade boundary by their expected performance. In cases where a particular school had more than fifteen students on a course the following process was followed by Ofqual:
1. Historic grades from the previous three years of exams at the school were used to produce a grade distribution based on past students studying the same course at the same school.
2. Considering all of the data available from previous years, an algorithm learned the connection between past attainment (GCSE grades) and future attainment (A-level grades).
3. Based on this learned connection, the algorithm “predicted” the grade distribution of past students at the school (e.g. the 2019 cohort from their 2017 GCSEs). This predicted grade distribution could then be compared to the actual grade distribution these past students achieved.
4. The performance of current students was then predicted using the same connection between past and future performance.
5. Ofqual’s data on past students is incomplete, so in this step Ofqual determined whether the school in question had good enough records of students’ past performance at GCSE level to be used in later steps.
6. If the school did not have data on the previous attainment at GCSE level of current students, the grade distribution of past students was used directly as the predicted grade distribution for the current cohort. If the school did have records of past attainment, the difference between the predicted distribution for the current cohort and that for the previous cohort was used to adjust the predicted grade distribution.
7. Students were then assigned a grade based on the ranking given by their teachers, such that the final distribution of grades awarded matched the predicted distribution.
8. These grades were then converted into marks based on the student’s rank within that grade boundary.
9. National grade boundaries were then adjusted based on the predicted grade distribution for the whole country. Students were then assigned a final grade based on where their given mark (from step 8) fell within the new grade boundaries.
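The allocation step, in which teacher rankings are mapped onto a predicted grade distribution, can be illustrated with a short Python sketch. This is not Ofqual's implementation: the function name allocate_grades, the student labels, the example shares and the rounding rule are all assumptions made for illustration. It shows how a student's grade is fixed by rank and by the predicted distribution alone, with the CAG playing no part.

```python
from typing import Dict, List


def allocate_grades(ranked_students: List[str],
                    grade_shares: Dict[str, float]) -> Dict[str, str]:
    """Assign grades in rank order so the cohort matches grade_shares.

    ranked_students: teacher ranking, best student first.
    grade_shares: predicted share of the cohort per grade, best grade
        first (dict insertion order is used); shares should sum to 1.
    """
    n = len(ranked_students)
    grades = list(grade_shares)
    allocation: Dict[str, str] = {}
    cumulative = 0.0
    start = 0
    for i, grade in enumerate(grades):
        cumulative += grade_shares[grade]
        # Assumption: cut-offs are rounded to whole students; the last
        # grade takes everyone who is left, guarding against float error.
        end = n if i == len(grades) - 1 else round(cumulative * n)
        for student in ranked_students[start:end]:
            allocation[student] = grade
        start = end
    return allocation


# Hypothetical cohort of ten students, ranked best first.
ranking = [f"student_{i}" for i in range(1, 11)]
predicted_shares = {"A": 0.2, "B": 0.5, "C": 0.3}
result = allocate_grades(ranking, predicted_shares)
```

In this made-up cohort the third-ranked student receives a B regardless of the grade their teacher predicted, because only two A grades are available under the predicted distribution.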
If a school has fewer than five students on a course, then steps 1-7 are skipped and the grades are just based on the CAG. If a school has between five and fifteen students on a course then a mix of the two approaches is used.
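The cohort-size rule described above can be written as a small Python function. The function name and the returned labels are hypothetical, and the sketch deliberately does not attempt to reproduce how the two approaches were mixed for mid-sized cohorts, since that detail is not given here.

```python
def grading_regime(cohort_size: int) -> str:
    """Return which approach applies to a course with cohort_size students.

    Thresholds follow the description in the text; the labels are
    illustrative names, not Ofqual terminology.
    """
    if cohort_size < 5:
        return "cag_only"           # grades based on the teacher's CAG
    if cohort_size <= 15:
        return "mixed"              # a mix of CAG and the statistical model
    return "statistical_model"      # CAG not used in the final grade


# Hypothetical course sizes, purely for illustration.
course_sizes = [3, 5, 12, 15, 16, 40]
regimes = [grading_regime(size) for size in course_sizes]
```

Written this way, it is easy to see that which regime a student falls under depends only on how many classmates take the same course.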
A spokesperson for Ofqual said, “Our standardisation model, for which we have published full details, does not distinguish between different types of centres and therefore contains no bias”. However, this is untrue: notice that in the case that a centre has over fifteen students on a course, the CAG is not taken into account at all in the final grade.
Contrast this with the lower cohort sizes, where the given grades are either wholly or partially based upon the teacher-determined CAG. In cases where CAGs are higher than the grades adjusted by the algorithm, the Ofqual process would overwhelmingly benefit students studying at private schools, due to the much smaller year group sizes (and hence lower probability of there being more than 15 students on a particular course). This inequality has been borne out in the data, as the number of A* grades rose by 2% at comprehensive schools, but by 4.5% at private schools.
Furthermore, when considering the past performance of a student, the Ofqual process only considered which ten-percent band (“decile”) the student fell in. This has the effect of reducing the impact of good past performance of an individual student at a poorly performing school. For example, if a student achieved a top-of-the-class mark at GCSE, the algorithm only saw that they were in the top 10% of students, making them indistinguishable from any other student in that band.
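The information lost by decile banding can be shown with a minimal Python sketch. It assumes, purely for illustration, that prior attainment is a single numeric score and that bands are formed by counting how many students scored higher; the function name decile_band and the scores are made up.

```python
from typing import List


def decile_band(score: int, all_scores: List[int]) -> int:
    """Return 1 for the top 10% of scores and 10 for the bottom 10%."""
    # Count the students who scored strictly higher than this one.
    higher = sum(1 for s in all_scores if s > score)
    # Integer arithmetic avoids floating-point edge cases at band limits.
    return higher * 10 // len(all_scores) + 1


# Hypothetical population of 100 students with scores 1..100.
population = list(range(1, 101))
top_student = decile_band(100, population)
tenth_best = decile_band(91, population)
```

Here the highest-scoring student and the student nine places below them both land in band 1, so a model that only sees the band cannot tell them apart.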
How the algorithm ‘cheated’
All predictions involving the use of algorithms are subject to validation tests, which measure the fitness of the method for application to real-world problems. To enable these key tests the dataset is split into two parts, the training and test datasets. The algorithm is taught how to solve the problem, in this case predicting A-level grades, using the training dataset. During this process of teaching the algorithm to solve the problem, the test dataset should never be used.
When an algorithm has been trained, it predicts the results, in this case A-level results, for the test dataset, which can be compared to their known results. This is a key aspect of evaluating whether the algorithm can effectively make predictions for new datasets, when the results are not known. This is often part of an iterative process of tweaking parameters or attempting different algorithms to improve the validation test results.
Figure 1: (a) The training data and test data are entirely separate; therefore, testing is more representative of a real-world application of the model. (b) The test data is part of the training data; the model has seen this data before, has already learned the patterns present in it, and will therefore appear unrealistically accurate.
For these reasons, it is critically important that the algorithm never sees the testing dataset while training is still happening. Otherwise the two datasets become merged and it becomes impossible to evaluate the performance of the model on new (unseen) data. Figure 1 shows how this ‘cheating’ occurs when test data is taken from the training data.
According to Tom Haines from the University of Bath, the Ofqual algorithm was tested on data which was part of the training dataset. This is a meaningless test which can give no further useful information about how the algorithm will perform in future when making predictions from data it has never seen before. The separation of training and testing data is an extremely basic principle in data science, so the failure in this case is very concerning.
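The effect of testing on training data can be demonstrated with a deliberately naive Python sketch. Nothing here reflects the Ofqual model: the MemorisingModel class, the made-up rows and labels, and the simple 15/5 split are all assumptions chosen to make the point visible. The model memorises its training rows, so it scores perfectly when "tested" on them and noticeably worse on rows it has never seen.

```python
from collections import Counter


class MemorisingModel:
    """Toy model that memorises training rows and guesses otherwise."""

    def fit(self, rows, labels):
        # Remember the exact label of every training row.
        self.lookup = dict(zip(rows, labels))
        # For unseen rows, fall back to the most common training label.
        self.fallback = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, row):
        return self.lookup.get(row, self.fallback)


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the known label."""
    hits = sum(model.predict(r) == y for r, y in zip(rows, labels))
    return hits / len(rows)


# Made-up data: 20 hypothetical student records with known grades.
rows = list(range(20))
labels = ["A" if r % 3 == 0 else "B" for r in rows]

# Keep the last five records away from the model during training.
train_rows, test_rows = rows[:15], rows[15:]
train_labels, test_labels = labels[:15], labels[15:]

model = MemorisingModel().fit(train_rows, train_labels)

# "Testing" on the training data: the model has seen every answer.
leaky_score = accuracy(model, train_rows, train_labels)
# Testing on held-out data: the only honest estimate of performance.
held_out_score = accuracy(model, test_rows, test_labels)

print(f"tested on training data: {leaky_score:.2f}")
print(f"tested on held-out data: {held_out_score:.2f}")
```

In practice the split would normally be randomised rather than taken from the end of the dataset; a fixed split is used here only to keep the example deterministic.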
The data from which algorithms learn patterns are referred to as features. In this case the features include the performance of past students from the same school (alongside the past performance of the student, the grade distribution of the country, etc.). The choice of features is a decision taken by Ofqual that directly affects the decisions the model learns to make, yet it is not clear from the available reports who had the final say on which data were selected. For example, the inclusion of a school’s past performance raises ethical issues surrounding ingrained socioeconomic biases, which can punish students at such schools unfairly (for another example of this, see Amazon’s now-defunct recruitment algorithm, which learned from institutional biases already present in the company to automatically reject many applications from women).
Because of these failures to follow data-science best practice, the Royal Statistical Society (RSS) has been called in to audit the Ofqual process. However, Ofqual have required the RSS to sign a non-disclosure agreement (NDA). This removes any transparency from the process of reviewing Ofqual’s failings.
Regardless of the technical mismanagement of the algorithm described previously, there is a more fundamental flaw in this approach. In order for an algorithm to take a decision based on some data, it must follow a series of computational processes which have been designed by a data scientist when the model was built.
This data scientist is therefore the one actually making decisions and is entirely unaccountable to the public their model is directly impacting. Due to the lack of accountability and oversight, it was entirely improper to use an algorithm for the purpose of making life-changing decisions for an entire cohort of students.