SIRAD: Secure Infrastructure for Research with Administrative Data

Open Access. Published: February 01, 2022. DOI: https://doi.org/10.1016/j.simpa.2022.100245

      Highlights

      • Government administrative data is valuable for social sciences research and policy insights.
      • Administrative data is sensitive and must be anonymized to maintain privacy and confidentiality.
      • SIRAD is a tool for anonymizing administrative data in a consistent way for research and insights.
      • To date, SIRAD has anonymized data for nine research studies and 75 policy memos.

      Abstract

      Governments collect large quantities of administrative records through the programs and services they administer. These data can generate important insights into new policy and evaluations of existing policy but are often difficult to access because they are siloed across government agencies and subject to privacy laws and regulations. We developed a data integration pipeline, SIRAD, that securely joins administrative data sets from multiple government agencies and replaces personally identifiable information with anonymous identifiers to maintain privacy while facilitating research, insights, and fact-based policy. Data processed with SIRAD have been used in nine research studies and 75 policy memos.

      Keywords

      Code metadata
      Table 1
      Current code version: 0.3.2
      Permanent link to code/repository used for this code version: https://github.com/SoftwareImpacts/SIMPAC-2022-4
      Permanent link to reproducible capsule: https://codeocean.com/capsule/0283272/tree/v1
      Legal code license: Modified 3-clause BSD
      Code versioning system used: Git
      Software code languages, tools and services used: Python, YAML
      Compilation requirements, operating environments and dependencies: Python >=3.7
      If available, link to developer documentation/manual: https://github.com/ripl-org/sirad/wiki
      Support email for questions: [email protected]

      Software metadata
      Table 1
      Current software version: 0.3.2
      Permanent link to executables of this version: https://github.com/ripl-org/sirad
      Software License: Modified 3-clause BSD
      Computing platform/Operating System: Linux, Windows, OS X
      Installation requirements & dependencies: Python >=3.7
      Link to user manual: https://github.com/ripl-org/sirad/wiki
      Support email for questions: [email protected]

      1. Background

      Secure Infrastructure for Research with Administrative Data (SIRAD) is a tool for building integrated data sets of anonymized administrative records for social science research and policy insights. One of the biggest challenges with using administrative data in research is identifying the same individual across data sets while preserving the highest level of confidentiality and anonymity possible [Connelly et al., 2016]. SIRAD solves this challenge by removing and replacing personally identifiable information (PII) with a global anonymized identifier, allowing researchers to securely join data on an individual from multiple tables without knowing the individual’s identity.
      Our motivation and initial use case for SIRAD was the creation of an integrated, anonymized data set, called RI 360, containing administrative data from nearly every government agency in the State of Rhode Island [Hastings, 2019; Hastings et al., 2019]. Although we initially investigated enterprise tools in the business intelligence space for data integration, the up-front cost of those solutions led us to instead develop an open and lightweight integration tool in Python using an agile approach, which became SIRAD. The core function of SIRAD is to hash sensitive identification numbers and separate out PII at load time.
      At least one other integration tool has been created with this same core functionality of anonymizing administrative data: the Coleridge Initiative’s Data Hashing Application, which is a component of the Administrative Data Research Facility [Coleridge Initiative, 2021]. However, at the time of writing, the Data Hashing Application is no longer publicly accessible, nor is it available as open source. Recent reviews of existing Extract Transform Load (ETL) tools compare 21 different tools (Supplementary Table 1) across 44 different features (Supplementary Table 2), yet none of the features identified provide for separation of PII or anonymization of sensitive data [Kholod et al., 2016; Biplob et al., 2018; Biswas et al., 2019; Patel and Patel, 2021]. This is consistent with our initial assessment that existing business intelligence tools would have to be adapted or extended, at additional cost, before they could be used to integrate and anonymize administrative data. Moreover, 11 of the 21 tools identified in these reviews are not available as open source, and commercial licensing would further increase this cost.
      Since its initial public release (version 0.1.2), SIRAD has been enhanced with parallel processing support, performance improvements, and robustness to common issues found in administrative data dumps (such as the presence of null bytes and non-ASCII characters). Updates are summarized as GitHub releases at https://github.com/ripl-org/sirad/releases.

      2. Functionality

      SIRAD constructs a global anonymous identifier, the sirad_id, that researchers can use to consistently join information about the same individual across tables without using PII. An automated script (which can be run by an unattended service account for increased confidentiality) constructs the sirad_id by concatenating all hashed Social Security number (SSN), first name, last name, and date of birth records into a single table, while maintaining an encrypted link to the source table and row for the records. If a record contains information on multiple individuals (such as a birth record that describes both the child and the parents), it is expanded into one row per individual. All names are cleaned to remove non-letter characters, and first names are converted to Soundex [Robert, US Patent No. US1261167] values. A sirad_id is assigned to every valid hashed SSN, and to every distinct combination of first name Soundex value, last name, and date of birth that cannot be matched to a single valid hashed SSN. For example, if multiple records match on first name Soundex value, last name, and date of birth, but only one record has a valid hashed SSN, then all of those records will inherit the sirad_id corresponding to the valid hashed SSN. However, if those records instead match to several valid hashed SSNs, then a distinct sirad_id is assigned to each valid hashed SSN as well as to the remaining unmatched combinations of first name Soundex, last name, and date of birth. Finally, records that are missing a valid hashed SSN and are also missing one of first name, last name, or date of birth are considered too ambiguous and are not assigned a sirad_id.
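      The assignment rules above can be sketched in Python. This is a minimal illustration under stated assumptions, not SIRAD's actual implementation: the record fields, the `soundex` helper (a simplified American Soundex), and the `name_key` and `assign_ids` functions are hypothetical names, hashed SSNs are represented as opaque strings with `None` standing for a missing or invalid value, and identifiers are simple sequential integers.

```python
import re
from collections import defaultdict

# Soundex digit for each consonant; vowels, H, W, and Y carry no digit.
_CODES = {c: d for d, letters in
          {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
           "4": "L", "5": "MN", "6": "R"}.items()
          for c in letters}

def soundex(name):
    """Simplified American Soundex of a name, after dropping non-letters."""
    letters = re.sub(r"[^A-Za-z]", "", name).upper()
    if not letters:
        return ""
    out, prev = letters[0], _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "HW":  # H and W do not separate equal codes
            prev = code
    return (out + "000")[:4]

def name_key(rec):
    """(first-name Soundex, cleaned last name, date of birth), or None if incomplete."""
    first, last, dob = rec.get("first"), rec.get("last"), rec.get("dob")
    if not (first and last and dob):
        return None
    return (soundex(first), re.sub(r"[^A-Za-z]", "", last).upper(), dob)

def assign_ids(records):
    """Return one identifier (or None) per record, following the rules in the text."""
    # Collect the distinct hashed SSNs observed with each name/date-of-birth key.
    ssns_for_key = defaultdict(set)
    for rec in records:
        key = name_key(rec)
        if key and rec.get("hashed_ssn"):
            ssns_for_key[key].add(rec["hashed_ssn"])

    ids = {}  # maps an SSN entity or a name-key entity to its identifier

    def id_for(entity):
        return ids.setdefault(entity, len(ids) + 1)

    result = []
    for rec in records:
        ssn, key = rec.get("hashed_ssn"), name_key(rec)
        if ssn:
            result.append(id_for(("ssn", ssn)))        # every valid hashed SSN gets an id
        elif key is None:
            result.append(None)                        # too ambiguous to identify
        elif len(ssns_for_key[key]) == 1:
            (only_ssn,) = ssns_for_key[key]
            result.append(id_for(("ssn", only_ssn)))   # inherit the single matching SSN's id
        else:
            result.append(id_for(("key", key)))        # zero or several SSNs: id per name key

    return result

records = [
    {"hashed_ssn": "h1", "first": "Jon",  "last": "Smith",  "dob": "1980-01-01"},
    {"hashed_ssn": None, "first": "John", "last": "Smith",  "dob": "1980-01-01"},
    {"hashed_ssn": None, "first": "Mary", "last": "O'Neil", "dob": "1975-05-05"},
    {"hashed_ssn": None, "first": "Mary", "last": None,     "dob": "1975-05-05"},
]
sirad_ids = assign_ids(records)
```

      In this toy input, the first two records share a first name Soundex value, last name, and date of birth, and only one of them carries a hashed SSN, so both receive the same identifier; the last record lacks both an SSN and a last name and is left unassigned.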
      SIRAD uses a simple layout file for each incoming data source to describe the metadata for each of its columns, specified in the YAML markup language. Supported input formats include comma-separated value (CSV), Excel, and fixed-width text files. The layout file describes the original column name, the type (e.g. date, string, or numeric), the date format (if applicable), a flag for whether a column is a sensitive numeric identifier that needs to be automatically hashed at load time, and another flag indicating whether the column contains PII and a standardized name for the PII values (such as first name, last name, or date of birth). The full feature set of the layout files is described in SIRAD’s documentation at https://github.com/ripl-org/sirad/wiki. To retain a history of data integration operations, the layout files can be managed with existing version control systems, such as git. In RI 360, data were updated quarterly and each quarter’s processed data were permanently archived and associated with a tagged git commit containing the corresponding layout files.
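      To make the role of the layout concrete, the sketch below represents the per-column metadata described above as a plain Python dictionary and uses it to split one input row into a de-identified data part and a PII part. It is illustrative only: the dictionary keys (`type`, `format`, `hash`, `pii`), the column names, the salted SHA-256 hash, and the `convert` and `split_row` functions are assumptions made for this example. They do not reproduce SIRAD's actual YAML syntax or hashing scheme, which are documented in the wiki.

```python
import hashlib
from datetime import datetime

# Hypothetical layout: one entry per source column (not SIRAD's real YAML schema).
LAYOUT = {
    "SSN":        {"type": "string", "hash": True, "pii": "ssn"},
    "FIRST_NAME": {"type": "string", "pii": "first_name"},
    "LAST_NAME":  {"type": "string", "pii": "last_name"},
    "BIRTH_DT":   {"type": "date", "format": "%m/%d/%Y", "pii": "dob"},
    "AGI":        {"type": "numeric"},
}

def hash_value(value, salt):
    """Salted SHA-256 hex digest; the hash function and salting are assumptions."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

def convert(value, meta):
    """Parse a raw string according to the column's declared type."""
    if value is None or value == "":
        return None
    if meta["type"] == "date":
        return datetime.strptime(value, meta["format"]).date().isoformat()
    if meta["type"] == "numeric":
        return float(value)
    return value

def split_row(row, layout, salt):
    """Split one raw row into (data, pii) dictionaries according to the layout."""
    data, pii = {}, {}
    for column, meta in layout.items():
        value = convert(row.get(column), meta)
        if meta.get("hash") and value is not None:
            value = hash_value(value, salt)  # the raw identifier is not kept
        if "pii" in meta:
            pii[meta["pii"]] = value         # kept apart, under a standardized name
        else:
            data[column.lower()] = value     # de-identified research data
    return data, pii

row = {"SSN": "123456789", "FIRST_NAME": "Jon", "LAST_NAME": "Smith",
       "BIRTH_DT": "01/31/1980", "AGI": "52000"}
data, pii = split_row(row, LAYOUT, salt="example-salt")
```

      Keeping the column metadata in a declarative structure like this is what makes it practical to track changes to the integration process in version control alongside each data release.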
      SIRAD is designed for an Extract Load Transform (ELT) approach to managing the flow of data [Dayal et al., 2009]. Loading the original data first without applying extensive transformations has several benefits when using administrative data for research and policy insights. First, it retains the provenance of data and the values in the database can be assumed to be the original information from the administrative system, not a derivative value created by a transformation process that may not be readily available to the researcher viewing the data. Second, transformations do not need to be defined upfront, which can be both time consuming and rigid, especially as research needs can evolve and change rapidly. The data are minimally transformed and made available to researchers within a short period of time. Finally, this approach is flexible for researchers; transformations can be created, changed or dropped within the database without requiring interventions by the data integration team, or time-consuming reprocessing steps.
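      As an illustration of the ELT pattern described above, the following sketch loads raw values unchanged into a database and only afterwards defines a transformation as a view. It uses Python's built-in `sqlite3` module with an in-memory database; the table, column, and view names and the wage values are hypothetical, and SIRAD's own storage layout may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: store the source values as received, keyed by the anonymous identifier.
conn.execute("CREATE TABLE wages_raw (sirad_id INTEGER, quarter TEXT, wages TEXT)")
conn.executemany(
    "INSERT INTO wages_raw VALUES (?, ?, ?)",
    [(1, "2021Q1", "10500.00"), (1, "2021Q2", "11250.50"), (2, "2021Q1", "8000")],
)

# Transform: defined later, inside the database, without reloading the data.
conn.execute("""
    CREATE VIEW wages_annual AS
    SELECT sirad_id,
           substr(quarter, 1, 4) AS year,
           SUM(CAST(wages AS REAL)) AS total_wages
    FROM wages_raw
    GROUP BY sirad_id, substr(quarter, 1, 4)
""")

rows = conn.execute(
    "SELECT sirad_id, year, total_wages FROM wages_annual ORDER BY sirad_id"
).fetchall()

# The view can be dropped or redefined at any time; the raw table is untouched.
conn.execute("DROP VIEW wages_annual")
raw_count = conn.execute("SELECT COUNT(*) FROM wages_raw").fetchone()[0]
```

      Because the transformation lives in a view, a researcher can revise or discard it without asking the data integration team to reprocess the source files.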

      3. Example usage

      A simple example with synthetic data, available from https://github.com/ripl-org/sirad-example, serves as both a regression test for development of SIRAD and a tutorial on how to apply the tool in practice. In this worked example, we investigate how adjusted gross income correlates with credit score. To answer this question, we simulate two administrative data sets: (1) IRS 1040 tax returns, identified by SSN, first/last name, and date of birth; and (2) credit history, identified by first/last name and date of birth. SIRAD matches records across the two data sets corresponding to the same individual. It then assigns an anonymized identifier, the sirad_id, to each matched individual, and creates a deidentified table for each data set where the SSNs, names, and dates of birth have been replaced with the sirad_id. Finally, we demonstrate an analysis that uses the sirad_id to join adjusted gross income from the tax returns table to credit scores in the credit history table.
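      The final analysis step can be pictured as follows. This sketch is not taken from the sirad-example repository: the two de-identified tables, their values, and the `join_on_sirad_id` and `pearson` helpers are made up for illustration, and the join is done with plain dictionaries rather than a database.

```python
from math import sqrt

# Hypothetical de-identified tables: PII has already been replaced by sirad_id.
tax_returns = [
    {"sirad_id": 1, "agi": 30000}, {"sirad_id": 2, "agi": 55000},
    {"sirad_id": 3, "agi": 80000}, {"sirad_id": 4, "agi": 120000},
]
credit_history = [
    {"sirad_id": 1, "credit_score": 620}, {"sirad_id": 2, "credit_score": 680},
    {"sirad_id": 4, "credit_score": 760}, {"sirad_id": 5, "credit_score": 700},
]

def join_on_sirad_id(left, right):
    """Inner join of two lists of rows on sirad_id (one row per id assumed)."""
    lookup = {row["sirad_id"]: row for row in right}
    return [{**row, **lookup[row["sirad_id"]]}
            for row in left if row["sirad_id"] in lookup]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Only individuals present in both tables survive the join.
joined = join_on_sirad_id(tax_returns, credit_history)
r = pearson([row["agi"] for row in joined],
            [row["credit_score"] for row in joined])
```

      The analyst never sees an SSN, name, or date of birth; the sirad_id is the only key needed to relate income to credit score.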

      4. Impact

      SIRAD was used to create the RI 360 database, which generated 75 policy memos and briefs for Rhode Island policymakers, working across a dozen high-priority policy areas identified in collaboration with the Office of the Governor [Hastings, 2019]. In addition, SIRAD has enabled nine research studies by joining and anonymizing the following types of administrative records:
      • Labor training, wage roll, and unemployment insurance (UI) records, to estimate the value-added to wages following enrollment in workforce training programs [Angell et al., 2021].
      • Supplemental Nutrition Assistance Program (SNAP) records, UI records, and transaction records from a large US grocery retailer, to demonstrate how a mental accounting model can explain SNAP benefit spending [Hastings and Shapiro, 2018] and to identify the effects of SNAP participation on the nutritional content of purchased food [Hastings, Kessler, and Shapiro, 2019].
      • Electricity billing data, to understand the importance of bill timing for low-income and aged households who rely on SNAP and Social Security benefits [Barrage et al., 2020].
      • Medicaid claims, social benefit program, wage roll, and incarceration records, to predict high-cost use of emergency departments that could be diverted to more appropriate care [Hastings and Howison, 2021].
      • Birth, Medicaid claims, social benefit program, and education records, to understand the impact of early-life interventions for very low birth weight children on later-life health and educational outcomes and social program expenditures [Chyn et al., 2021].
      • Medicaid claims, social benefit program, wage roll, incarceration, and criminal history records, to predict adverse outcomes that could result from opioid therapy before the initial opioid prescription is written [Hastings, Howison, and Inman, 2020].
      • Voter registration, voting history, social benefit program, wage roll, and driver’s license records, to examine the impact of a state photo ID law on voter turnout and registration [Esposito et al., 2019].
      • Child protective services and education records, to measure the impact of removing children from abusive and neglectful homes on educational outcomes [Bald et al., 2019].
      In all nine of these studies, the anonymization provided by SIRAD was an essential prerequisite to conducting research with administrative records. Eight of the nine studies further required the data integration capabilities of SIRAD to join data across two or more government agencies.

      5. Discussion

      SIRAD is a lightweight data pipeline that securely integrates siloed administrative data, constructs a global anonymous identifier that minimizes time spent on manually joining data, and allows government agencies to maintain confidentiality and privacy. Establishing a global anonymous identifier is important for research because agencies may use multiple sources of information to identify individuals in their records and may not consistently identify individuals across records even within the same agency. Without a universal identifier, the task of identifying unique individuals is difficult, and joining individual-level data across agencies becomes more complex as the number of agencies and records increases. This leads to several potential problems: researchers could end up spending more time identifying and joining data than actually performing analysis; that effort could be duplicated across projects; inconsistencies may arise if different projects take different approaches to joining data; and researchers may see individual identities alongside data on those individuals during a matching process, which lowers the degree of anonymity provided during database construction. SIRAD avoids these problems through its core functionality of constructing the sirad_id using an automated, deterministic process.
      A limitation of SIRAD for some use cases is that it was designed for data that are updated on a monthly or quarterly schedule. For example, state wage roll data is an important data set underlying most of the studies described in Section 4, but it is only submitted to the state by employers on a quarterly basis. Other data integration tools may be more appropriate for data that must be updated more frequently, especially in real time.
      In future work, we plan to apply SIRAD to administrative data in additional US states, following the successful demonstration of the RI 360 database in Rhode Island. SIRAD has already been deployed to process administrative data in Hawaii and New Jersey and is planned for deployment in Colorado.

      Declaration of Competing Interest

      The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

      Acknowledgments

      We thank Ted Lawless, John Ucles, and Preston White for their previous contributions to SIRAD.

      Funding

      This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

      Appendix A. Supplementary tables

      The following is the Supplementary material related to this article.

      References

        • Connelly R., Playford C.J., Gayle V., Dibben C. The role of administrative data in the big data revolution in social science research. Soc. Sci. Res. 2016; 59: 1-12. https://doi.org/10.1016/j.ssresearch.2016.04.015
        • Hastings J.S. Fact-based policy: How do state and local governments accomplish it? The Hamilton Project (Brookings) Policy Proposal 2019-01. 2019. https://www.brookings.edu/wp-content/uploads/2019/01/Hastings_PP_web_20190128.pdf (Accessed 28 December 2021)
        • Hastings J.S., Howison M., Lawless T., Ucles J., White P. Unlocking data to improve public policy. Commun. ACM. 2019; 62: 48-53. https://doi.org/10.1145/3335150
        • Coleridge Initiative. Data Hashing Application, ADRF User Guide. 2021. https://web.archive.org/web/20210414102424/https://coleridgeinitiative.org/adrf/documentation/adrf-overview/data-hashing-application/ (Accessed 14 April 2021)
        • Kholod I.I., Efimova M.S., Kulikov S.Y. Using ETL tools for developing a virtual data warehouse. in: 2016 XIX IEEE International Conference on Soft Computing and Measurements (SCM). IEEE, St. Petersburg, Russia, 2016: 351-354. https://doi.org/10.1109/SCM.2016.7519778
        • Biplob M.B., Sheraji G.A., Khan S.I. Comparison of different extraction transformation and loading tools for data warehousing. in: Proceedings of the 2018 International Conference on Innovations in Science Engineering and Technology. IEEE, Chittagong, Bangladesh, 2018: 262-267. https://doi.org/10.1109/ICISET.2018.8745574
        • Biswas N., Sarkar A., Mondal K.C. Empirical analysis of programmable ETL tools. in: Computational Intelligence, Communications, and Business Analytics, CICBA 2018. Communications in Computer and Information Science. Vol. 1031. Springer, Singapore, 2019: 267-277. https://doi.org/10.1007/978-981-13-8581-0_22
        • Patel M., Patel D.B. Progressive growth of ETL tools: A literature review of past to equip future. in: Rising Threats in Expert Applications and Solutions. Advances in Intelligent Systems and Computing. Vol. 1187. Springer, Singapore, 2021: 389-398. https://doi.org/10.1007/978-981-15-6014-9_45
        • Robert C.R. The Soundex coding system. US Patent No. US1261167.
        • Dayal U., Castellanos M., Simitsis A., Wilkinson K. Data integration flows for business intelligence. in: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology. ACM, New York, NY, USA, 2009. https://doi.org/10.1145/1516360.1516362
        • Angell M., Gold S., Hastings J.S., Howison M., Jensen S., Keleher N., Molitor D., Roberts A. Estimating value-added returns to labor training programs with causal machine learning. OSF Preprints. 2021. https://doi.org/10.31219/osf.io/thg23
        • Hastings J.S., Shapiro J.M. How are SNAP benefits spent? Evidence from a retail panel. Amer. Econ. Rev. 2018; 108: 3493-3540. https://doi.org/10.1257/aer.20170866
        • Hastings J.S., Kessler R., Shapiro J.M. The effect of SNAP on the composition of purchased foods: Evidence and implications. Am. Econ. J. Econ. Policy. 2019; 13: 277-315. https://doi.org/10.1257/pol.20190350
        • Barrage L., Chin I., Chyn E., Hastings J.S. The impact of bill receipt timing among low-income and aged households: New evidence from administrative electricity bill data. NBER Bull. Retire. Disabil. 2020. https://www.nber.org/brd/how-bill-timing-affects-low-income-and-aged-households (Accessed 5 January 2022)
        • Hastings J.S., Howison M. Predicting Divertible Medicaid Emergency Department Costs. OSF Preprints. 2021. https://doi.org/10.31219/osf.io/q36es
        • Chyn E., Gold S., Hastings J.S. The returns to early-life interventions for very low birth weight children. J. Health Econ. 2021; 75: 102400. https://doi.org/10.1016/j.jhealeco.2020.102400
        • Hastings J.S., Howison M., Inman S.E. Predicting high-risk opioid prescriptions before they are given. Proc. Natl. Acad. Sci. 2020; 117: 1917-1923. https://doi.org/10.1073/pnas.1905355117
        • Esposito F.M., Focanti D., Hastings J.S. Effects of Photo ID Laws on Registration and Turnout: Evidence from Rhode Island. NBER Working Paper No. 25503. 2019. https://www.nber.org/papers/w25503 (Accessed 5 January 2022)
        • Bald A., Chyn E., Hastings J.S., Machelett M. The Causal Impact of Removing Children from Abusive and Neglectful Homes. NBER Working Paper No. 25419. 2019. https://www.nber.org/papers/w25419 (Accessed 5 January 2022)