FAT Forensics: A Python Toolbox for Algorithmic Fairness, Accountability and Transparency

Today, artificial intelligence systems driven by machine learning algorithms can be in a position to make important, and sometimes legally binding, decisions about our everyday lives. In many cases, however, these systems and their actions are neither regulated nor certified. To help counter the potential harm that such algorithms can cause, we developed an open source toolbox that can analyse selected fairness, accountability and transparency aspects of the machine learning process - data (and their features), models and predictions - allowing these aspects to be reported automatically and objectively to relevant stakeholders. In this paper we describe the design, scope, usage and impact of this Python package, which is published under the 3-Clause BSD open source licence.


ALGORITHMIC FAIRNESS, ACCOUNTABILITY AND TRANSPARENCY WITH FAT FORENSICS
Open source software is the backbone of reproducible research, especially so in Artificial Intelligence (AI) and Machine Learning (ML), where changing the seed of a random number generator may cause a state-of-the-art solution to become a subpar predictive system. Despite numerous efforts to ensure that publications are accompanied by code, both the AI and ML fields struggle with a reproducibility crisis [8]. One way to address this problem is to promote publishing high-quality software used for scientific experiments under an open source licence, or to enforce this as part of the publishing process [28]. In spite of their importance, implementations are nonetheless commonly treated just as a research byproduct and often abandoned after publishing the findings based upon them. We call this phenomenon paperware, i.e., code whose main purpose is to see a paper towards publication rather than to implement any particular concept with thorough software engineering practice. Such an attitude results in standalone packages that often prove difficult to use due to the lack of documentation, testing, usage examples and (post-publication) maintenance, thereby impacting their reach, usability and, more broadly, the reproducibility of scientific findings. This state of affairs is especially problematic for AI and ML research with its fast-paced environment, lack of standards and far-ranging social implications.
Widespread reliability issues with ML systems have inspired a range of frameworks to assess and document them as well as report their quality, robustness and other (technical) properties through standardised mechanisms.
For example, researchers have suggested approaches to characterise data sets [4,7]; automated decision-making systems [19]; predictive models offered as a service accessible via an Application Programming Interface (API) [2]; ranking algorithms [29]; AI & ML explainability approaches [22]; and privacy aspects of applications that collect, process and share user data [9] to ensure their high quality, transparency, reliability and accountability. Such efforts are laudable; however, they may require the authors to understand the investigated system in detail, suffer from a limited scope or entail a time- and labour-intensive creation process, all of which can hinder their uptake or slow down the ML research and development cycle. Moreover, self-reporting - and the lack of external audits - means that some of their aspects may be subjective and hence misrepresent the true behaviour of the underlying system, whether intentionally or not. Certification, on the other hand, creates a need for external bodies, which seems difficult to achieve for all ML systems that somehow affect humans.
To help address such shortcomings in the fields of AI & ML Fairness, Accountability and Transparency (FAT), we designed and developed an open source Python package called FAT Forensics [25] - Table 1 lists the algorithms distributed in its latest release (version 0.1.1). It is intended as an interoperable framework to implement, test and deploy novel algorithms proposed by the FAT community as well as to facilitate their evaluation and comparison against state-of-the-art methods, therefore democratising access to these techniques. The toolbox is capable of analysing all facets of the data-driven predictive process - data (raw data and their features), models and predictions - in view of their FAT aspects. The common interface layer of the software (described in §2) makes it flexible enough to support workflows typical of academics and practitioners alike, and enables two modes of operation - research and deployment - that span diverse use cases such as prototyping, exploratory analytics, (numerical or visual) reporting and dashboarding, as well as inspection, monitoring and evaluation of FAT properties. Additionally, the package is backed by thorough and beginner-friendly documentation, which spans tutorials, examples, problem-oriented how-to guides and a comprehensive user guide. In the following section ( §2) we introduce our software and describe its architecture. Next, we present a number of possible use cases and benefits of having various FAT algorithms under a shared roof ( §3). We conclude the paper with an overview of the impact of our package to date and a discussion of the envisaged long-term benefits of FAT Forensics in view of our contributions ( §4). While this paper focuses on the wide-reaching advantages of our software, a complementary publication [25] offers a high-level overview of it, implementation details and a comparison to related packages.
The toolbox implements a number of popular FAT algorithms - with many more to come - under a coherent API, reusing many functional components across FAT tools and making them readily accessible to the community. The initial development is focused on tabular data and well-established predictive models (scikit-learn [16]), which will be followed by techniques capable of handling sensory data (images & text) and neural networks (TensorFlow [1] & PyTorch [15]). Additionally, we envisage that relevant software packages that are already prominent in the FAT community and that adhere to best software engineering practice can be "wrapped" by our toolbox under a common API to make them easily accessible and avoid re-implementing them.

DESIGN AND ARCHITECTURE
Algorithms included in FAT Forensics are designed and engineered to support two main application areas. The research mode, characterised by "data in - visualisations out", envisages the toolbox being loaded into an interactive Python session (e.g., a Jupyter Notebook) to support exploratory analysis, prototyping, development, evaluation and testing. This mode is intended for researchers, who could use it to propose new fairness metrics, compare them with existing solutions or inspect a new predictive system or data set (without the burden of setting up a dedicated software engineering workflow). Contributing these implementations of cutting-edge techniques to FAT Forensics will in turn make the package attractive for monitoring and auditing of data-driven systems - the second intended application domain. More specifically, the deployment mode, characterised by "data in - data out", allows the package to be incorporated into a data processing pipeline to provide (numerical) analytics, hence supporting any kind of automated reporting, dashboarding or certification (thus partially alleviating the issues with manual, error-prone and subjective characterisation of AI & ML components). This mode is intended for ML practitioners, who (by accessing the low-level API) may use it to monitor or evaluate a data-driven system; where continuous integration is used in software engineering to ensure high quality of the code, our toolbox could be employed to evaluate the FAT aspects of any component of an ML pipeline during its development and deployment.
A considerable portion of FAT software is developed to support research outputs, which often results in superfluous dependencies, data sets, predictive models and (interactive) visualisations being distributed with a code base that itself is accessible via a non-standard API. To mitigate these issues, FAT Forensics decouples the core FAT functionality from its possible presentation to the user and from experiment-specific resources. This abstraction of the software infrastructure is achieved by making minimal assumptions about the operational setting of these algorithms, therefore facilitating a common interface layer for key FAT functionality that focuses only on the interactions between data, models, predictions and users [25]. In this purview a predictive model is assumed to be a plain Python object with fit, predict and, optionally, predict_proba methods, which offers compatibility with scikit-learn [16] - the most popular Python ML toolbox - without explicitly depending on it, in addition to supporting any other predictor that can be represented in this way, e.g., TensorFlow, PyTorch or even a model hosted on the Internet and accessible via a web API. Similarly, a data set is assumed to be a two-dimensional NumPy array: either a classic or a structured array, with the latter bringing support for (string-based) categorical attributes. Since visualisations are a vital part of our first application mode (research), the software provides basic plotting functionality that is only enabled when the optional Matplotlib dependency is installed. In addition to relaxed input requirements, all of the techniques incorporated into the package are split into interoperable algorithmic building blocks that can be easily reused, even across FAT borders, to create new functionality - the versatility of this atomic-level decomposition is demonstrated in the following section. More details about the technical aspects of the software can be found in the FAT Forensics technical paper [25].
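The duck-typed interface described above can be illustrated with a minimal sketch. The `MajorityClassModel` below is a hypothetical stand-in for any real predictor (it merely returns the most frequent training label), and the structured array shows how a string-based categorical attribute is carried:

```python
import numpy as np

# Minimal duck-typed predictor matching the interface the toolbox expects:
# a plain Python object with fit, predict and (optionally) predict_proba.
# This majority-class baseline is a hypothetical stand-in for any predictor.
class MajorityClassModel:
    def fit(self, X, y):
        classes, counts = np.unique(y, return_counts=True)
        self.classes_ = classes
        self.majority_ = classes[np.argmax(counts)]
        return self

    def predict(self, X):
        # Predict the majority class for every instance.
        return np.full(X.shape[0], self.majority_)

    def predict_proba(self, X):
        # Probability 1 for the majority class, 0 elsewhere.
        proba = np.zeros((X.shape[0], self.classes_.size))
        proba[:, np.searchsorted(self.classes_, self.majority_)] = 1.0
        return proba

# A data set is just a 2-D NumPy array; a structured array (below) adds
# support for string-based categorical attributes.
data = np.array([(25, 'female'), (40, 'male'), (33, 'female')],
                dtype=[('age', int), ('gender', 'U10')])

model = MajorityClassModel().fit(np.zeros((4, 2)), np.array([0, 0, 0, 1]))
print(model.predict(data))  # [0 0 0]
```

Because only these three method names are assumed, any scikit-learn estimator, or a thin wrapper around a TensorFlow, PyTorch or web-API model, satisfies the same contract without the toolbox depending on those libraries.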

USE CASES
We present three distinct use cases to demonstrate how the software can be applied to analyse FAT aspects of real data, illustrating the diverse range of functionality enabled by its universal infrastructure. To this end, we employ the UCI Census Income (Adult) data set [10], which is popular in algorithmic fairness and transparency research. The data analysis that follows is representative of the research mode and is inspired by the tutorials included in the FAT Forensics documentation 1 ; it can be reproduced with a dedicated Jupyter Notebook 2 . To demonstrate the deployment mode, we provide a dashboard based on Plotly Dash, which facilitates interactive analysis of the same data set using FAT Forensics as the back end 3 . Feature Grouping. One of the core building blocks of FAT Forensics is a collection of functions to partition data based on (sets of) unique values of categorical features and threshold-based binning of numerical attributes. This algorithmic concept - in conjunction with any standard (predictive) performance metric derived from predicted and true labels - facilitates a number of FAT workflows. A variety of group-based (pairwise) fairness criteria, not limited to the ones implemented in the package, can be computed in this way by conditioning on protected features (attributes that may be used for discriminatory treatment, e.g., gender), allowing us to investigate the disparate impact of a predictive model based on group unaware, equal opportunity, equal accuracy or demographic parity metrics, among many others [6]. Since some of them are mutually incompatible [13], comparing them side-by-side can be beneficial.
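The grouping-plus-metric pattern can be sketched in plain NumPy; the `per_group_metrics` helper below is hypothetical (it is not the toolbox's own API) and illustrates conditioning on a protected feature before computing per-group quantities:

```python
import numpy as np

# Hypothetical sketch of the grouping-plus-metric pattern: instances are
# grouped by the unique values of a protected feature and a metric is
# computed for each group, ready for pairwise comparison.
def per_group_metrics(protected, y_true, y_pred):
    metrics = {}
    for group in np.unique(protected):
        mask = protected == group
        metrics[str(group)] = {
            # equal accuracy compares this value across groups
            'accuracy': float(np.mean(y_true[mask] == y_pred[mask])),
            # demographic parity compares the rate of positive predictions
            'positive_rate': float(np.mean(y_pred[mask] == 1)),
        }
    return metrics

groups = np.array(['a', 'a', 'b', 'b'])
metrics = per_group_metrics(groups,
                            np.array([1, 0, 1, 0]),   # true labels
                            np.array([1, 0, 0, 0]))   # predicted labels
```

Swapping the inner metric (accuracy, positive rate, true positive rate, and so on) yields the different group-based fairness criteria discussed above, which is precisely why a single grouping building block supports so many of them.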
For example, the Asian-Pac-Islander (Asi-Pac-Isl) and Other groups are subject to fairness disparity when equal accuracy and demographic parity are considered; Other and White sub-populations are also treated unfairly according to demographic parity; whereas equal opportunity does not exhibit any signs of disparate impact as shown in Figure 1.
The grouping functionality can also help to assess the accountability of data and models in a similar fashion. Data Density. A density estimate for the region in which a data point of interest is located (based on the distribution of training data) can be treated as a proxy for the confidence of its prediction [17], thus helping to judge its accountability and robustness, as dense regions should offer more accurate modelling. To this end, FAT Forensics implements a bespoke neighbour-based density estimator - its scores are between 0 and 1, where high values are assigned to instances from sparse regions since their n-th neighbour (a user-defined parameter) is relatively distant. As an illustration we estimate the density of Adult based on its first 1,000 instances and select four data points - two from a dense and two from a sparse region - to assess the robustness of their predictions. The former two receive density scores of 0 and are correctly predicted as ≤50K; the latter two are assigned density scores of 1, with one predicted correctly and the other misclassified as ≤50K. Upon closer inspection this data point has a relatively high value (99.99th percentile) of the fnlwgt feature (1,226,583), which is a clue to its high density score and incorrect prediction (see the aforementioned Jupyter Notebook for more details).
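The neighbour-based scoring idea can be sketched as follows; this is a simplified illustration under the assumption that the distance to a point's n-th nearest training instance is min-max scaled to [0, 1], not the toolbox's exact formula:

```python
import numpy as np

# Sketch of a neighbour-based density score: the distance to each point's
# n-th nearest training instance is min-max scaled to [0, 1], so sparse
# regions score close to 1 and dense regions close to 0.
def density_scores(train, points, n=3):
    raw = []
    for point in points:
        dists = np.sort(np.linalg.norm(train - point, axis=1))
        raw.append(dists[n - 1])  # distance to the n-th nearest neighbour
    raw = np.asarray(raw)
    lo, hi = raw.min(), raw.max()
    return (raw - lo) / (hi - lo) if hi > lo else np.zeros_like(raw)

# Four tightly packed training points plus one far-away outlier.
train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                  [10.0, 10.0]])
points = np.array([[0.05, 0.05],   # inside the dense cluster
                   [10.0, 10.0]])  # in the sparse region
print(density_scores(train, points))  # dense point -> 0.0, sparse -> 1.0
```

The parameter `n` mirrors the user-defined neighbour count mentioned above: larger values demand a bigger local population before a region counts as dense.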
In addition to engendering trust in predictions, a density estimate can help to assess the quality of exemplar explanations and to compute realistic counterfactuals [18], which can be used as a transparency tool and an individual fairness mechanism (by conditioning on protected attributes). Sourcing counterfactuals from sparse regions may yield explanations based on instances that are unlikely to occur in real life, e.g., prescribing a person to become 200 years old. Explaining the aforementioned misclassified data point taken from a sparse region provides explanations such as: (i) raising capital-gain from 0 to 25,000 predicts >50K (a sparse region with a density score of 1); and (ii) increasing capital-loss from 0 to 4,000 and decreasing fnlwgt from 1,226,583 to 430,985 predicts >50K (a dense region with a density score of 0.02).
While (i) prescribes a sensible action, preserving the unusually high value of fnlwgt makes it unlikely; (ii), on the other hand, decreases the value of this attribute - therefore placing the counterfactual in a dense region - and shows that being classified as >50K is possible even with a capital-loss of 4,000, casting even more suspicion on the unusually high original value of fnlwgt. Finally, no counterfactuals conditioned on protected attributes could be found for this instance, showing that its prediction is fair (again, see the aforementioned Jupyter Notebook for more details).
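The underlying search can be illustrated with a deliberately simplified, brute-force sketch over a single numerical feature (the toolbox's own counterfactual search is more general; the black box and candidate grid here are toy assumptions):

```python
import numpy as np

# Hypothetical brute-force counterfactual search for one numerical feature:
# candidate values are tried in order and the first one that flips the
# black-box prediction is returned as the counterfactual instance.
def one_feature_counterfactual(predict, instance, feature_idx, candidates):
    original = predict(instance[np.newaxis, :])[0]
    for value in candidates:
        altered = instance.copy()
        altered[feature_idx] = value
        if predict(altered[np.newaxis, :])[0] != original:
            return altered
    return None  # no counterfactual found among the candidates

# Toy black box: predicts the positive class when the first feature exceeds 5.
predict = lambda X: (X[:, 0] > 5).astype(int)
cf = one_feature_counterfactual(predict, np.array([0.0, 0.0]), 0,
                                np.linspace(0.0, 10.0, 21))
print(cf)  # feature 0 raised to 5.5 flips the prediction
```

Filtering the returned candidates by a density score, as described above, is what keeps such explanations realistic, and restricting `feature_idx` away from protected attributes gives the individual fairness check.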
Surrogate Modularity. Surrogate explainers are a popular interpretability technique that fits a transparent model in a selected neighbourhood to approximate and explain the predictive behaviour of the underlying black box in said region [3,20,26]. Given their high modularity, FAT Forensics implements their core building blocks via the bLIMEy meta-algorithm 4 - consisting of interpretable representation composition, data sampling and explanation generation steps - which allows the user to easily construct a bespoke surrogate that is suitable for the problem at hand, thus considerably improving the quality and faithfulness of the resulting explanations [21,26]. For example, an interpretable representation of tabular data can be built with quartile-based discretisation or a feature space partition extracted from a decision tree (the latter is more faithful [24]); data can be augmented with Gaussian or mixup [30] sampling (the latter offers a diverse and local sample [26]); and an explanation can be generated with a linear model or a decision tree (the former is limited to feature influence, whereas the latter provides a diverse range of insights such as rules and counterfactuals [21,23,24]). Such a surrogate explainer can either be local - by sampling data in the neighbourhood of a selected instance - or global - when the sample covers the entire data space. Specifically, consider the two local surrogates shown in Figure 4, where a tree-based explainer [23,24] is better able to approximate the decision boundary close to the selected instance.
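The three bLIMEy steps can be sketched for the linear-model variant in plain NumPy; this is a simplified illustration under stated assumptions (identity interpretable representation, Gaussian sampling, least-squares explanation generation), not the toolbox's actual implementation:

```python
import numpy as np

# Sketch of a local *linear* surrogate following the three bLIMEy steps.
def local_linear_surrogate(black_box, instance, n_samples=500, scale=1.0):
    rng = np.random.default_rng(42)
    # 1. interpretable representation: identity (features used as-is)
    # 2. data sampling: Gaussian neighbourhood around the explained instance
    sample = instance + rng.normal(0.0, scale, (n_samples, instance.size))
    labels = black_box(sample).astype(float)
    # 3. explanation generation: least-squares fit of a linear model
    design = np.column_stack([np.ones(n_samples), sample])
    coef, *_ = np.linalg.lstsq(design, labels, rcond=None)
    return coef[1:]  # per-feature influence on the local prediction

# Toy black box that only uses the first feature.
black_box = lambda X: (X[:, 0] > 0).astype(int)
influence = local_linear_surrogate(black_box, np.array([0.0, 0.0]))
# The first feature receives a clearly non-zero influence; the second,
# which the black box ignores, stays near zero.
```

Swapping step 3 for a shallow decision tree yields the rule- and counterfactual-style insights mentioned above, and replacing the Gaussian sampler with mixup changes the locality of the explanation; this is exactly the modularity that the meta-algorithm exposes.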

IMPACT OVERVIEW
While software is one of the primary drivers of progress in AI & ML research, its quality is often found lacking.
FAT Forensics offers a possible solution in the space of algorithmic fairness, accountability and transparency by facilitating the development, evaluation, comparison and deployment of FAT tools. Sharing a common functional base between implementations of FAT algorithms is one of many advantages of such a comprehensive package. Its versatility as well as support for the research and deployment operation modes make it appealing to members of academia and industry, especially as it supports investigating FAT aspects of an entire predictive pipeline: data, models and predictions.
This in turn ought to encourage the community to adopt the software and contribute their novel algorithms and bug fixes here (instead of releasing them as standalone code), thus exposing them to a wider audience in a robust and sustainable environment, enhancing the reproducibility of research in this space and orienting the package towards real-world use cases. By developing FAT tools on a modular level from the ground up, FAT Forensics ensures their robustness and accountability, in addition to shielding them from any errors that otherwise could have been introduced downstream. For example, LIME [20] - which is "wrapped" by Microsoft's Interpret [14] and Oracle's Skater [11] libraries - has known issues with the locality and coherence of its explanations [12,26], which inadvertently affect both of these packages. We therefore hope and expect that the software engineering best practice followed during the initial development of FAT Forensics (and maintained going forward) has helped us to create a sustainable package that is easy to extend and contribute to, serving the community for a long time to come.
Additionally, the modular design of the package facilitates conducting cutting-edge research. To date, the implementation of surrogate explainers available in FAT Forensics has allowed us to carefully study their capabilities and failure modes, leading to new findings, theories and transparency tools. bLIMEy - the surrogate meta-algorithm - is a case in point; its inception was inspired by identifying independent algorithmic modules, whose further investigation showed the importance of local sampling for tabular data and the effectiveness of decision trees as surrogate models [21,26]. One particular realisation of this explainer - LIMEtree - is based on multi-output regression trees and improves upon many shortcomings of surrogates by offering faithful, consistent, customisable and multi-class explanations of different types, including counterfactuals [23]. Diverse implementations of surrogate building blocks have also helped us to analyse the role and parameterisation of interpretable representations and to improve their robustness - they translate the low-level data representation used by predictive models into the human-comprehensible concepts underlying explanations and are the backbone of surrogates [24]. FAT Forensics has also been the foundation of a hands-on conference tutorial on ML explainability [27] as well as numerous lectures, summer school sessions, educational events and learning resources 5 .