Original software publication | Volume 12, 100250, May 2022

Cardinal, a metric-based Active learning framework

Open Access | Published: February 09, 2022 | DOI: https://doi.org/10.1016/j.simpa.2022.100250

      Highlights

      • Active learning uses a trained model to select samples to label to boost performance.
      • Metrics and insights can help a practitioner choose the best strategy.
      • Cardinal is a Python package that provides such metrics and supports research on new ones.
      • Cardinal’s experimental framework caches experiments and logs metrics.
      • Cardinal was used to design metrics that detect noisy samples and a strategy to cope with them.

      Abstract

      In active learning, a trained model is used to select samples to label in order to maximize its performance. Choosing the best sample selection strategy for a one-shot experiment is hard, but metrics have been shown to help, for example by detecting strategies that perform worse than random or by detecting and avoiding noisy samples. Cardinal is a Python framework that assists the practitioner in selecting a strategy using metrics and the researcher in developing those metrics. Cardinal caches experiments so that insights can be computed at no extra cost, keeps track of logged metrics, and provides extensive documentation. It also interfaces with other packages to use state-of-the-art strategies.

      Keywords

      Code metadata (Table 1)
      Current code version: 0.6
      Permanent link to code/repository used for this code version: https://github.com/SoftwareImpacts/SIMPAC-2021-174
      Permanent link to reproducible capsule: https://codeocean.com/capsule/2060952/tree/v1
      Legal code license: Apache License 2.0
      Code versioning system used: git
      Software code languages, tools, and services used: python
      Compilation requirements, operating environments & dependencies: matplotlib, scikit-learn, numpy, scipy, cardinal
      If available, link to developer documentation/manual: https://dataiku-research.github.io/cardinal/
      Support email for questions: [email protected]

      1. Cardinal in the Active Learning landscape

      Active Learning (AL) optimizes the labeling of large datasets when the available budget only suffices to annotate a fraction of them. (For disambiguation, note that this work refers to active learning in machine learning and not to the educational technique.) In this setting, we use the samples labeled so far to train a model and use that model to select a batch of unlabeled samples to send to an oracle for labeling. We add these newly labeled samples to the model's training set and repeat this loop until the budget is exhausted. Setting up such an experiment is challenging, especially in an industrial context: most available packages assume expertise from the practitioner, while industrial AL projects are often one-shot. In this context, we propose the Python package cardinal, which makes AL accessible to all profiles through a minimalistic framework backed by extensive documentation.
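      To make this loop concrete, the following minimal sketch runs pool-based active learning with scikit-learn only; it is independent of cardinal, the oracle is simulated by revealing held-back labels, and least-confident scoring is just one possible selection criterion.

      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression

      X, y = make_classification(n_samples=1000, random_state=0)
      labeled = list(range(20))                               # small seed set labeled upfront
      pool = [i for i in range(len(X)) if i not in labeled]   # unlabeled pool

      for _ in range(10):                                     # stop when the budget is exhausted
          model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
          proba = model.predict_proba(X[pool])
          uncertainty = 1.0 - proba.max(axis=1)               # least-confident scoring
          batch = [pool[i] for i in np.argsort(uncertainty)[-10:]]  # 10 most uncertain samples
          labeled.extend(batch)                               # "oracle" labels are revealed here
          pool = [i for i in pool if i not in set(batch)]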

      1.1 Prior work on Active Learning packages

      Three packages have historically dominated the AL landscape. modAL [1] is a minimalistic package providing a sklearn-compatible interface for query samplers. It offers tried-and-tested simple strategies, including historical ones and query-by-committee strategies that achieve good performance at the cost of training several models. For experiments, modAL provides an ActiveLearner object that wraps the estimator and the data to help design the main loop. ALiPy [2] is a package focused on experimentation. It provides classical AL strategies as well as research strategies developed in-house. ALiPy's AlExperiment object runs a classical AL loop with a fixed batch size and a bounded cost. It resumes failed experiments using checkpoints and allows the computation of various performance metrics. Libact [3] is a package oriented toward performance and production. High-speed methods coded in C are available through a Python interface. Like modAL, it provides simple bricks to build the AL loop.
      All the above packages focus on easing research on new query strategies, which implies running benchmarks on public datasets with well-known models. However, industrial practitioners willing to use AL can be overwhelmed by the number of available strategies, especially since reported uplifts are not consistent across experiments [4]. A straightforward solution would be to test all strategies against an independent labeled test set, but the cost of labeling this test set can be prohibitive. For this reason, we have focused our research on getting insights through metrics rather than on improving query strategies. This new approach comes with implementation needs that cardinal addresses.

      1.2 Cardinal’s objectives

      Cardinal provides metrics to guide a practitioner in choosing the best query strategy among existing ones, together with insights and best practices on how to interpret them. Metrics are preferably computed on an independent labeled test set, but they can also be estimated on a sizeable unlabeled test set using metric proxies. The primary purpose of cardinal is to offer a research-oriented, human-friendly experimental framework that eases research on such metrics.
      Cardinal does not aim at providing an exhaustive list of strategies. Its versatile design is meant to be used with strategies from other packages. We still supply exclusive strategies that are not implemented elsewhere, such as WKMeans [5], a query strategy combining a preselection based on an uncertainty score with a diversity-enforcing step using KMeans (a sketch of this two-step idea is given below), and IWKMeans [6], a variant of this strategy that is more robust to sample noise.
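      Cardinal ships its own implementation of WKMeans; the snippet below is only an illustrative reimplementation of the two-step idea (uncertainty preselection followed by a weighted KMeans diversity step) with scikit-learn, and does not reproduce cardinal's actual code.

      import numpy as np
      from sklearn.cluster import KMeans

      def two_step_kmeans_select(clf, X_pool, batch_size, beta=10):
          """Sketch of a WKMeans-like selection: preselect by uncertainty, diversify with KMeans."""
          proba = clf.predict_proba(X_pool)                    # clf is a fitted classifier
          top2 = np.partition(proba, -2, axis=1)[:, -2:]
          margin = top2[:, 1] - top2[:, 0]                     # small margin = high uncertainty
          preselected = np.argsort(margin)[: beta * batch_size]
          # Weighted KMeans on the preselected samples, weighting by uncertainty.
          km = KMeans(n_clusters=batch_size, n_init=10, random_state=0)
          km.fit(X_pool[preselected], sample_weight=1.0 - margin[preselected])
          selected = []
          for c in range(batch_size):                          # closest preselected sample per cluster
              members = np.where(km.labels_ == c)[0]
              if len(members) == 0:
                  continue
              dists = np.linalg.norm(X_pool[preselected][members] - km.cluster_centers_[c], axis=1)
              selected.append(preselected[members[np.argmin(dists)]])
          return np.asarray(selected)                          # indices into X_pool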

      1.3 Metrics and experiments made easy

      Cardinal has inherited the minimalism of modAL and also exposes a sklearn-like interface. Our query strategies expose a fit function to train the model and prepare the strategy, and a select_samples method to select the samples to send to the oracle for annotation. We chose not to reuse the predict semantics because the purpose is different and the result is a set of indices rather than labels. A typical cardinal script consists of a main experimental loop, possibly followed by several loops that compute metrics.
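      A minimal usage sketch of this interface is shown below; the MarginSampler name and its import path are assumptions taken from cardinal's documentation at the time of writing and may differ across versions.

      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.ensemble import RandomForestClassifier
      from cardinal.uncertainty import MarginSampler   # import path assumed, see the documentation

      X, y = make_classification(n_samples=500, random_state=0)
      labeled = np.arange(100)                         # indices labeled so far
      pool = np.arange(100, 500)                       # unlabeled pool

      sampler = MarginSampler(RandomForestClassifier(), batch_size=20)
      sampler.fit(X[labeled], y[labeled])              # trains the model and prepares the strategy
      batch = pool[sampler.select_samples(X[pool])]    # indices to send to the oracle, not labels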
      In the main experimental loop, see Lst. 1, samples are selected and the model is trained. Cardinal's Experimenter provides an iterator that controls the loop itself: if a failure happens in the middle of a run, cardinal skips the iterations that were already computed. A variable can either be cached, which only stores its last value so that an iteration can be resumed, or persisted, which allows retrieving it later when computing metrics. Data is stored in pickle or numpy format in a human-browsable directory tree, which makes debugging with small scripts or IPython sessions easy.
      Listing 1. Main experimental loop (code listing not reproduced here).
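      The sketch below only emulates the behavior Listing 1 is described to have, i.e. a resumable main loop that persists each batch selection to disk, using plain numpy; cardinal's Experimenter automates this bookkeeping, and the MarginSampler import path is again an assumption.

      import os
      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.ensemble import RandomForestClassifier
      from cardinal.uncertainty import MarginSampler   # import path assumed

      X, y = make_classification(n_samples=1000, random_state=0)
      labeled = list(range(50))
      pool = [i for i in range(len(X)) if i not in labeled]

      os.makedirs("cache", exist_ok=True)
      for it in range(10):
          batch_file = f"cache/batch_{it}.npy"
          if os.path.exists(batch_file):               # resume: reuse iterations already computed
              batch = np.load(batch_file)
          else:
              sampler = MarginSampler(RandomForestClassifier(), batch_size=20)
              sampler.fit(X[labeled], y[labeled])
              batch = np.asarray(pool)[sampler.select_samples(X[pool])]
              np.save(batch_file, batch)               # persist the selection for later metric loops
          labeled.extend(int(i) for i in batch)
          pool = [i for i in pool if i not in set(batch.tolist())]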
      Metrics computation can require access to costly derived data, such as the distance matrix between labeled and unlabeled samples at each iteration. With cardinal, each block is independent and loads the resources it needs, see Lst. 2. Our caching system optimizes memory usage and allows memory mapping to access large matrices directly from disk.
      Listing 2. Metric computation block (code listing not reproduced here).
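      The block below illustrates the pattern Listing 2 describes by reloading only the persisted selections from the previous sketch and computing one example metric (mean pairwise distance within each batch); cardinal's caching additionally supports memory-mapped access to large matrices, which this simplified version does not.

      import numpy as np
      from scipy.spatial.distance import pdist
      from sklearn.datasets import make_classification

      # Independent metric block: it regenerates the data and reloads only the persisted batches.
      X, _ = make_classification(n_samples=1000, random_state=0)
      diversity = [pdist(X[np.load(f"cache/batch_{it}.npy")]).mean() for it in range(10)]
      print(diversity)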
      In this design, each block loads only the data it requires and can be parallelized. At the core of this design is cardinal's ActiveLearningSplitter, which indexes the training and test sets as well as the labeled samples and the batch selected at each iteration. It gives easy access to all samples selected up to any iteration and therefore enables metrics that span several iterations.
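      The short class below is a much simplified stand-in, not cardinal's actual ActiveLearningSplitter API; it only shows the kind of bookkeeping the paragraph describes: a train/test split plus, for every sample, the iteration at which it was labeled.

      import numpy as np

      class SimpleALSplitter:
          """Simplified stand-in for the splitter described above (not cardinal's actual API)."""

          def __init__(self, n_samples, test_idx):
              self.test = np.asarray(test_idx)
              self.train = np.setdiff1d(np.arange(n_samples), self.test)
              self.batch_of = np.full(n_samples, -1)        # -1 means still unlabeled

          def add_batch(self, indices, iteration):
              self.batch_of[np.asarray(indices)] = iteration

          def selected_at(self, iteration):
              # All samples labeled up to and including a given iteration,
              # which is what metrics spanning several iterations need.
              return np.where((self.batch_of >= 0) & (self.batch_of <= iteration))[0]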
      Finally, cardinal guides users through their AL journey. We provide detailed examples showing how to use the package and how to get insights into AL experiments. We also provide notebooks to test properties of query sampling methods, such as robustness to bad initialization or noisy decision boundaries, see Fig. 1.

      2. Impact overview and achieved results

      Measuring batch diversity and training set representativeness are common practices in AL. However, these measures have so far been used as criteria to optimize [7] or to determine when to stop labeling [8], not as metrics to evaluate the samplers themselves.
      Cardinal paves the way for research on better metrics. In our first work [9], we proposed a rigorous framework to design and test metrics for AL experiments: we first compute metrics on controlled experiments where the ground truth is known, demonstrate the utility of each metric, and then propose proxies using unlabeled data when possible. In a second work [6], we identified the problem of noisy samples, proposed metrics to detect them, and showed that it is possible to build a more robust query strategy, as seen in Fig. 2.
      Cardinal has been developed by Dataiku, which distributes Data Science Studio (DSS), a collaborative data science platform. DSS includes a labeling solution offering active learning backed by cardinal. In particular, DSS does not impose a fixed batch size when performing AL, and the labeler can retrain the model on demand. A heuristic designed with cardinal warns the user when enough samples have been labeled and retraining should be triggered. A recent study on the lack of success of AL in industry [10] also mentioned that the problems addressed by cardinal are at the heart of industrial concerns.

      3. Perspectives

      Other packages have been released concomitantly with, or after, the initial release of cardinal: scikit-activeml, a minimalistic package close to modAL and cardinal; Baal, a package focused on Bayesian methods; and Distil, a package focused on practical applications and performance. Since our overarching goal is to select the best strategy among state-of-the-art ones, we plan to provide wrappers, examples, and benchmarks featuring strategies from those packages. We have not yet reached our goal of selecting the best query strategy based on metrics alone, but we have been able to identify strategies that perform worse than random on our tasks. We are open to collaboration with other research groups willing to explore metrics for AL.

      Declaration of Competing Interest

      The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

      Acknowledgment

      This research was funded by Dataiku, a software editor providing a data science platform to industrial customers.

      References

      [1] T. Danka, P. Horvath, modAL: A modular active learning framework for Python, 2018, arXiv preprint arXiv:1805.00979.
      [2] Y.-P. Tang, G.-X. Li, S.-J. Huang, ALiPy: Active learning in Python, 2019, arXiv preprint arXiv:1901.03802.
      [3] Y.-Y. Yang, S.-C. Lee, Y.-A. Chung, T.-E. Wu, S.-A. Chen, H.-T. Lin, Libact: Pool-based active learning in Python, 2017, arXiv preprint arXiv:1710.00379.
      [4] D. Kottke, A. Calma, D. Huseljic, G.M. Krempl, B. Sick, Challenges of reliable, realistic and comparable active learning evaluation, in: Proceedings of the Workshop and Tutorial on Interactive Adaptive Learning, 2017, pp. 2–14.
      [5] F. Zhdanov, Diverse mini-batch active learning, 2019, arXiv preprint arXiv:1901.05954.
      [6] A. Abraham, L. Dreyfus-Schmidt, Sample Noise Impact on Active Learning, in: IAL 2021 Workshop, ECML PKDD, 2021.
      [7] B. Du, Z. Wang, L. Zhang, L. Zhang, W. Liu, J. Shen, D. Tao, Exploring representativeness and informativeness for active learning, IEEE Trans. Cybern. 47 (2015) 14–26.
      [8] M. Ghayoomi, Using variance as a stopping criterion for active learning of frame assignment, in: Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing, 2010, pp. 1–9.
      [9] A. Abraham, L. Dreyfus-Schmidt, Rebuilding trust in active learning with actionable metrics, in: 2020 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, 2020.
      [10] S. Chabanet, H.B. El-Haouzi, P. Thomas, Coupling digital simulation and machine learning metamodel through an active learning approach in Industry 4.0 context, Comput. Ind. 133 (2021) 103529.