Project overview
A large amount of data concerning American households is collected by surveys. Individual surveys typically focus on a single topic – e.g. finances, housing, or health – and field independent and relatively small samples. As a result, data users are constrained by the particular variables and spatial resolution available in a single survey. The fusionACS data science platform (Ummel et al. 2024) addresses this problem by statistically “fusing” variables from disparate surveys to simulate a single, integrated, high-resolution survey.
Variables unique to “donor” surveys are simulated for American Community Survey (ACS) households and individuals, thereby generating probabilistic estimates of how ACS respondents might have answered a donor survey’s questionnaire. Respondent characteristics that are common to both the donor and the ACS (e.g. income, age, household size) – as well as spatial information that can be merged to both (e.g. characteristics of the local built environment) – are used as predictor variables in LightGBM machine learning models (Ke et al. 2017).
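The general idea of fusion can be illustrated with a deliberately simplified sketch. It is not the fusionACS method: the LightGBM models are replaced here by a plain random draw from donor records that share the same predictor values (a hot-deck style stand-in), and the function name `fuse`, the variable names, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def fuse(donor, recipients, predictors, target, seed=0):
    """Simulate `target` for each recipient record.

    Stand-in for a fitted model: each recipient receives a value drawn at
    random from donor records with identical predictor values.
    """
    rng = random.Random(seed)

    # Group observed donor values by their combination of shared predictors.
    cells = defaultdict(list)
    for rec in donor:
        key = tuple(rec[p] for p in predictors)
        cells[key].append(rec[target])

    # Fallback pool for recipients whose predictor combination has no donors.
    all_values = [rec[target] for rec in donor]

    fused = []
    for rec in recipients:
        key = tuple(rec[p] for p in predictors)
        pool = cells.get(key, all_values)
        # Copy the recipient record and attach one simulated value.
        fused.append({**rec, target: rng.choice(pool)})
    return fused


# Hypothetical toy data: a donor survey observes electricity use, the ACS does not.
donor = [
    {"income_band": "low", "hh_size": 1, "electricity_kwh": 4000},
    {"income_band": "low", "hh_size": 1, "electricity_kwh": 4500},
    {"income_band": "high", "hh_size": 4, "electricity_kwh": 12000},
    {"income_band": "high", "hh_size": 4, "electricity_kwh": 13500},
]
acs = [
    {"acs_id": 1, "income_band": "low", "hh_size": 1},
    {"acs_id": 2, "income_band": "high", "hh_size": 4},
]

# One call produces one simulated set of values (a single "implicate").
implicate = fuse(donor, acs, ["income_band", "hh_size"], "electricity_kwh")
```

Calling `fuse` again with a different `seed` would yield another plausible set of simulated values for the same ACS records, which is the sense in which the fused estimates are probabilistic.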
In 2025, an enhanced version of fusionACS was introduced that integrates UrbanPop, a synthetic population data product produced by Oak Ridge National Laboratory (Tuccillo et al. 2023). UrbanPop provides probabilistic estimates of the location of each ACS respondent household. The fusionACS + UrbanPop platform makes it possible to derive estimates at high spatial resolution – even at the level of individual census block groups – for any ACS or donor survey variable.
The fusionACS project seeks to maximize the amount of useful information that can be extracted from existing U.S. household survey data in order to help academics and policymakers answer research questions that would otherwise be impossible to address.
How to access
The fusionACS R package allows users to access and analyze a “pseudo-sample” of the complete fusionACS database stored within the Yale High Performance Computing (HPC) facility. The complete database – which comprises multiple fusion implicates and integrates the ORNL UrbanPop synthetic population – is prohibitively large for public dissemination and contains some data that cannot be released. The goal of the fusionACS package is to let users perform exploratory analysis and design, refine, and test analyses using the full suite of variables available in the private database.
The pseudo-sample contains a single record (observation) for each ACS respondent (households and persons) for the period 2015-2019. Each record includes actual/observed values for all variables in the ACS, as well as a single simulation of each variable fused from donor surveys. As a result, the pseudo-sample plausibly mimics how ACS respondents might have answered questions unique to the donor survey questionnaires.
The sample also includes a range of geographic variables – from census region down to individual block groups – for each ACS household. These are obtained by random sampling of the underlying UrbanPop data in such a way that all block groups nationwide are represented.
While the sample data cannot be used to derive “final” estimates or their associated margins of error, it can be used to do just about everything else a user might normally do during the analysis development and prototyping stages. And for analyses that are not hyper-specific (i.e. that involve a comparatively large number of observations), the estimates derived from the pseudo-sample will often be quite close to the values computed on the full database.
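One way to see why a single simulation supports point estimates but not margins of error is the standard pooling approach for multiply-imputed data (Rubin's rules), sketched below. Whether the full fusionACS database pools its implicates in exactly this way is an assumption; the function names, weights, values, and within-implicate variances are hypothetical.

```python
from statistics import mean, variance


def weighted_mean(values, weights):
    """Survey-weighted mean of one implicate."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def pool_implicates(estimates, within_variances):
    """Pool per-implicate estimates using Rubin's rules.

    Returns (point_estimate, total_variance). With a single implicate the
    between-implicate variance cannot be estimated, so total_variance is None.
    """
    m = len(estimates)
    point = mean(estimates)
    if m < 2:
        return point, None
    within = mean(within_variances)   # average sampling variance
    between = variance(estimates)     # variance across implicates (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return point, total


# Hypothetical household weights and three implicates of one fused variable.
weights = [100, 200, 100]
implicates = [
    [10, 20, 30],
    [12, 18, 30],
    [8, 22, 30],
]
estimates = [weighted_mean(vals, weights) for vals in implicates]

# Hypothetical sampling variances; in practice these would come from the survey design.
within_variances = [0.5, 0.5, 0.5]

full_point, full_var = pool_implicates(estimates, within_variances)

# Pseudo-sample analogue: only the first implicate is available.
pseudo_point, pseudo_var = pool_implicates(estimates[:1], within_variances[:1])
```

In this toy case the single-implicate point estimate happens to equal the pooled one, but the uncertainty contributed by the simulation itself (the between-implicate term) is only available when several implicates are analyzed together.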
Our hope is that it will eventually be possible to remotely execute a valid analysis – designed and tested locally – using the complete fusionACS database and return the full “production” results to the user.