
Beta-launch: PolicyEngine's enhanced microdata for policy analysis

By integrating and calibrating multiple datasets, PolicyEngine makes world-class tax-benefit microsimulation modeling available to anyone.

By Max Ghenis and Nikhil Woodruff

March 25, 2024

Contents

Motivation

Methodology

Prepare PUF for training:

Create PUF-structured CPS file:

Reweight with gradient descent:

Results

How you can use it

Future development

Motivation

PolicyEngine's broad policy scope creates special opportunities and challenges for microdata. Federal tax analysts like the Joint Committee on Taxation and the Tax Foundation have generally focused on data to support federal tax reform. Here, open resources like Quantria Strategies' technical documentation and the Policy Simulation Library's open-source Tax-Data project have served an essential role, documenting and demonstrating how to calibrate datasets of tax returns to best represent the national landscape, and how to construct and integrate nonfilers from surveys to capture reforms that could make them liable for tax or eligible for refundable credits.

Adding benefit programs and state taxes to the model, and reporting on other outcomes like poverty, requires a different approach. Tax returns lack the detailed person-level characteristics, hierarchical data structures, and information on benefits needed to capture the impacts of these programs. For example, Maryland limits its state Child Tax Credit to children under six, or those under age 17 with a disability—an intersection of signals not available in tax returns.

After introducing this approach to microdata enhancement to PolicyEngine UK over two years ago, we've refined it and adapted it to the (much more complex) US context. We're excited to have it ready for use and further advances.

Methodology

We construct our enhanced data, which we call the Enhanced Current Population Survey, or ECPS, from three sources:

Current Population Survey March Supplement (CPS). The CPS is a monthly nationwide survey that estimates unemployment rates and other key monthly indicators. Each March, the Census Bureau expands the sample and set of questions, to collect more information on the prior year among respondents. The March Supplement, also called the Annual Social and Economic Supplement (ASEC), powers reports such as official poverty estimates. Census releases the microdata and its results each September, so the latest currently available represents calendar year 2022. The data is hierarchical, with entities for people, families, tax filing units, households, and "SPM units", referring to the Supplemental Poverty Measure and representing groups of cohabitating individuals who share resources. The 2022 ASEC includes 146,133 people across 88,978 households.
Internal Revenue Service Public Use File (PUF). The IRS makes a flat dataset with information on a sample of tax returns available to researchers who pay a fee and agree not to distribute the data. The most recent PUF represents tax year 2015, with 179 characteristics for 207,696 records, statistically altered to avoid disclosure. From 2016 forward, the IRS will instead release a synthetic PUF, in partnership with the Urban Institute.
Administrative totals. For instance, the total US population, income tax revenue, dividends, SNAP benefits, and so on. We collect 92 of these from historical sources and Congressional Budget Office (CBO) forecasts.

Our procedure maps to the above sequence. As shown in Figure 1, we first "age" the CPS and PUF to the current year (e.g., by growing PUF wages by the average growth of wages since 2015), then incorporate information from the PUF onto the CPS (while also preserving the original CPS), and finally reweight the data. This contrasts with the standard approach among tax analysts, which similarly ages the CPS and PUF but uses the PUF as a base file, appends nonfilers from the CPS, and then reweights.
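
The aging step can be pictured as scaling each record's dollar amounts by the ratio of an index between two years. The sketch below is only an illustration of that idea: the `WAGE_INDEX` values and the `age_amount` helper are hypothetical placeholders, not PolicyEngine's actual uprating factors or code.

```python
# Hypothetical average-wage index by year (illustrative values only).
WAGE_INDEX = {2015: 100.0, 2022: 130.0, 2024: 140.0}


def age_amount(amount: float, base_year: int, target_year: int) -> float:
    """Scale a dollar amount by the index growth between two years."""
    growth = WAGE_INDEX[target_year] / WAGE_INDEX[base_year]
    return amount * growth


# A $50,000 wage on a 2015 PUF record, aged to 2024 with the toy index.
aged_wage = age_amount(50_000, base_year=2015, target_year=2024)
print(aged_wage)
```

In practice each variable would be grown by its own index (wages by wage growth, dividends by dividend growth, and so on), but the mechanics are the same.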

Figure 1: PolicyEngine's data flow to create PUF-enhanced CPS file

Specifically, here are our steps:

Prepare PUF for training:

Select common variables: demographics, filing status, number of child dependents, and number of other dependents (NB: IRS caps dependent counts at 3 in PUF).
Impute PUF demographics from the 119,675 records with demographics to all 207,696 PUF records using quantile regression forests.
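
Quantile regression forests estimate the whole conditional distribution of a variable given its predictors, so an imputed value can be drawn from a random quantile of that distribution rather than set to its mean. The sketch below is not a quantile regression forest: it swaps the forest for a simple lookup of donor records that share the same predictor values, purely to illustrate the quantile-draw step. All function names, field names, and records are hypothetical.

```python
import random
from collections import defaultdict


def fit_cells(donors, predictors, target):
    """Group sorted donor values of `target` by their predictor values."""
    cells = defaultdict(list)
    for row in donors:
        key = tuple(row[p] for p in predictors)
        cells[key].append(row[target])
    return {key: sorted(values) for key, values in cells.items()}


def impute(recipients, cells, predictors, target, rng):
    """Fill `target` on each recipient with a random quantile of its cell."""
    imputed = []
    for row in recipients:
        values = cells[tuple(row[p] for p in predictors)]
        q = rng.random()  # uniform quantile in [0, 1)
        index = min(int(q * len(values)), len(values) - 1)
        imputed.append({**row, target: values[index]})
    return imputed


# Toy donor records (with the variable) and recipients (without it).
donors = [
    {"filing_status": "single", "dependents": 0, "age": 25},
    {"filing_status": "single", "dependents": 0, "age": 40},
    {"filing_status": "single", "dependents": 0, "age": 67},
    {"filing_status": "joint", "dependents": 2, "age": 35},
    {"filing_status": "joint", "dependents": 2, "age": 44},
]
recipients = [
    {"filing_status": "single", "dependents": 0},
    {"filing_status": "joint", "dependents": 2},
]

predictors = ["filing_status", "dependents"]
cells = fit_cells(donors, predictors, "age")
result = impute(recipients, cells, predictors, "age", random.Random(0))
```

Drawing a quantile instead of predicting the mean keeps the spread of the imputed variable closer to the donor data. A real forest replaces the exact-match cells with learned leaves, which lets it handle many predictors and continuous values.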

Create PUF-structured CPS file:

Aggregate CPS to tax unit to create a dataset with all common variables (includes transformations like capping dependent counts at 3).
Impute PUF variables to PUF-structured CPS file using quantile regression forests.
Break down the PUF-imputed CPS tax unit file by person.
Attach other CPS characteristics to the PUF-structured file.
Stack the PUF-based CPS with the original.
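
As a rough picture of the first step in this list, the sketch below rolls person records up to tax units and applies the same cap of 3 on dependent counts that the PUF uses. The record layout and field names are hypothetical and far simpler than the real CPS.

```python
from collections import defaultdict

MAX_DEPENDENTS = 3  # the PUF caps dependent counts at 3


def aggregate_to_tax_units(people):
    """Sum person-level wages and count child dependents per tax unit."""
    units = defaultdict(lambda: {"wages": 0.0, "child_dependents": 0})
    for person in people:
        unit = units[person["tax_unit_id"]]
        unit["wages"] += person["wages"]
        if person["is_child_dependent"]:
            unit["child_dependents"] += 1
    # Apply the PUF's cap so both files define the variable the same way.
    for unit in units.values():
        unit["child_dependents"] = min(unit["child_dependents"], MAX_DEPENDENTS)
    return dict(units)


# Toy person records: one filer with four children, one filer with none.
people = (
    [{"tax_unit_id": 1, "wages": 40_000.0, "is_child_dependent": False}]
    + [
        {"tax_unit_id": 1, "wages": 0.0, "is_child_dependent": True}
        for _ in range(4)
    ]
    + [{"tax_unit_id": 2, "wages": 25_000.0, "is_child_dependent": False}]
)

tax_units = aggregate_to_tax_units(people)
```

Matching the PUF's definitions before imputation matters because the model is trained on PUF values of these common variables and then applied to the CPS versions.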

Reweight with gradient descent:

Calculate the deviation between the dataset's aggregates and administrative totals for each of 92 targets (population, income component, benefit participation, etc.).
Construct a "loss function" that condenses these individual deviations into a single metric.
Apply gradient descent to iteratively adjust the household weights to minimize the loss function.
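
The three reweighting steps can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the survey-enhance implementation: the households, the targets, the mean squared relative error loss, the log-weight parametrization, and the learning rate are all choices made for the example.

```python
import math

# Toy households: (people, wages, SNAP benefits). Illustrative values only.
households = [
    (1, 30_000.0, 0.0),
    (2, 60_000.0, 0.0),
    (4, 20_000.0, 3_000.0),
    (3, 120_000.0, 0.0),
    (2, 15_000.0, 2_000.0),
]
# Hypothetical administrative totals for the same three quantities.
targets = (1_300.0, 26_000_000.0, 600_000.0)


def aggregates(weights):
    """Weighted sum of each column across households."""
    return [
        sum(w * row[j] for w, row in zip(weights, households))
        for j in range(len(targets))
    ]


def relative_errors(weights):
    """Deviation of each aggregate from its target, as a share of the target."""
    return [(a - t) / t for a, t in zip(aggregates(weights), targets)]


def loss(weights):
    """Mean squared relative error: one number summarizing all deviations."""
    errors = relative_errors(weights)
    return sum(e * e for e in errors) / len(errors)


def reweight(initial_weights, learning_rate=0.5, steps=10_000):
    """Minimize the loss by gradient descent on log-weights.

    Optimizing log-weights keeps every household weight positive.
    """
    log_w = [math.log(w) for w in initial_weights]
    for _ in range(steps):
        weights = [math.exp(u) for u in log_w]
        errors = relative_errors(weights)
        for i, row in enumerate(households):
            # d(loss)/d(log w_i) = w_i * mean_j(2 * e_j * x_ij / t_j)
            grad = weights[i] * sum(
                2 * e * x / t for e, x, t in zip(errors, row, targets)
            ) / len(targets)
            log_w[i] -= learning_rate * grad
    return [math.exp(u) for u in log_w]


initial = [100.0] * len(households)
final = reweight(initial)
print(loss(initial), loss(final))
```

Using relative rather than absolute deviations keeps a target measured in dollars from swamping one measured in people. With many more households than targets, lots of weight vectors fit equally well, and starting from the original survey weights keeps the solution close to them.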

We've built a new open-source Python package, survey-enhance, to streamline the usage of quantile regression forests for integration and gradient descent for reweighting. We have also submitted a paper to the International Journal of Microsimulation describing the methodology in greater detail (in the UK context), and comparing it to standard methods. For instance, we show in a holdout experiment that quantile regression forests significantly outperform statistical matching for data integration, and also show that the gradient descent approach outperforms percentile matching when correcting surveys for high-income representation.

Results

We've assessed our approach chiefly by comparing aggregates against administrative totals. We sum each of our 92 targets in the aged PUF, aged CPS, and ECPS for 2022 through 2025 (applying our rules engine for computed targets). The ECPS performs best on almost every target and comes within one percent of every target.

To see how our totals compare, explore our interactive dashboard (screenshot below).

PolicyEngine's calibration dashboard

We have additionally compared results from reforms, and we will launch a similar dashboard with those comparisons in the future.

How you can use it

You can use the ECPS in the PolicyEngine web app, in the policyengine-us Python package, or by downloading files.

To use the ECPS in the PolicyEngine web app, run a microsimulation as normal, and then toggle the Use Enhanced CPS switch in the right panel, or bottom on mobile. We will make this the default option after we exit beta, as it has been in PolicyEngine UK since we introduced the enhanced data.

Enhanced CPS switch

To use the ECPS in the policyengine-us Python package, specify it in the Microsimulation call as follows, then use it as you would otherwise.

python

from policyengine_us import Microsimulation

ecps = Microsimulation(dataset="enhanced_cps_2024")

To download the ECPS, visit the release page and download the h5 file.

Future development

We've received valuable feedback from many in the economics community throughout the development process, and consider this launch only the beginning. Our plans for future development fall into four broad categories:

More targets and validation. As of this writing, we have 39 open issues related to calibration, ranging from adding targets to better aligning our definitions with available targets.
Comparison to other models. For instance, we are currently working with the Policy Simulation Library to validate this dataset against their PUF-based Tax-Data project. We are also collecting estimates from other analysts, such as the Joint Committee on Taxation, Congressional Budget Office, Tax Policy Center, and ITEP, to compare projected results of reforms.
Integrating more data. Beyond the CPS and PUF, we would like to integrate information from other sources. For example, assets from the Survey of Consumer Finances would improve our modeling of asset limits in programs like SNAP and SSI; consumption from the Consumer Expenditure Survey would enable modeling of a hypothetical Value Added Tax or carbon tax (as our UK model has); and rent expenses from the American Community Survey would improve our modeling of state rental tax credits.
Finer geographic detail. Having calibrated our data to the US as a whole, we are equipped to replicate the process for smaller geographies. We will avoid the issues of small sample sizes by reweighting the full nationwide dataset to each local area's administrative targets—starting at the state level, then Congressional district and ultimately state legislative district or other locales.

We are setting new standards in policy simulations and evidence-based policymaking by enhancing the CPS with IRS tax records and reweighting. Our ongoing innovation and transparency align with PolicyEngine's mission to compute the impact of public policy accurately and objectively.

Max Ghenis
PolicyEngine's Co-founder and CEO

Nikhil Woodruff
PolicyEngine's Co-founder and CTO
