Analyze fusionACS microdata
For fusionACS internal use only. Calculates point estimates and associated uncertainty (margin of error) for analyses using ACS and/or fused donor survey variables.
Efficiently computes means, medians, sums, proportions, and counts, optionally across population subgroups.
The use of native ACS weights or ORNL UrbanPop synthetic population weights is automatically determined given the requested geographic resolution.
Requires a local /fusionData directory in the working directory path with assumed file structure and conventions.
Usage
analyze_fusionACS(
analyses,
year,
respondent = "household",
by = NULL,
area = NULL,
fun = NULL,
M = Inf,
R = Inf,
cores = 1,
version_up = 2,
force_up = FALSE
)
Arguments
- analyses
List. Specifies the desired analyses. Each analysis is a formula. See Details and Examples.
- year
Integer. One or more years for which microdata are pooled to compute analyses (i.e. ACS recipient year). Currently defaults to year = 2015:2019 if the by variables indicate a sub-PUMA analysis requiring UrbanPop weights.
- respondent
Character. Should the analyses be computed using "household"- or "person"-level microdata?
- by
Character. Optional variable(s) that collectively define the set of population subgroups for which each analysis is computed. Can be a mix of geographic (e.g. census tract) and/or socio-demographic microdata variables (e.g. poverty status); the latter may be existing variables on disk or custom variables created on-the-fly via fun(). If NULL, analysis is done for the whole (national) sample.
- area
Call. Optional unquoted call specifying a geographic area within which to compute the analyses. Useful for restricting the study area to a manageable size.
- fun
Function. Optional function for creating custom microdata variables that cannot be accommodated in analyses. Must take data and (optionally) weight as the only function arguments and must return a data.frame with number of rows equal to nrow(data). See Details and Examples.
- M
Integer. The first M implicates are used. Set M = Inf to use all available implicates.
- R
Integer. The first R replicate weights are used. Set R = Inf to use all available replicate weights.
- cores
Integer. Number of cores used for multithreading in collapse package functions.
- version_up
Integer. Use version_up = 1 to access national, single-implicate weights. Use version_up = 2 to access 10-replicate weights for 17 metro areas.
- force_up
Logical. If TRUE, force use of UrbanPop weights even if the requested analysis can be done using native ACS weights.
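The contract described for fun can be checked on a toy data frame before running a full analysis. The following is a minimal sketch only: the toy values and the helper check_fun() are hypothetical and not part of fusionACS, and the variable names agep and hincp are borrowed from the Examples section.

```r
# Toy stand-in for the microdata that would be passed to fun() as 'data'
toy <- data.frame(agep = c(30, 70, 65), hincp = c(40000, 4000, 25000))

# A custom-variable function: takes 'data' and returns a data.frame
# with exactly nrow(data) rows
my_fun <- function(data) {
  data.frame(
    elderly = data$agep >= 65,
    low_income = data$hincp < 5000
  )
}

# Hypothetical helper (not part of fusionACS) that checks the documented contract
check_fun <- function(fun, data) {
  out <- fun(data)
  is.data.frame(out) && nrow(out) == nrow(data)
}

check_fun(my_fun, toy)
```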
Value
A tibble reporting analysis results, possibly across subgroups defined in by. The returned quantities include:
- lhs
Optional analysis name; the "left hand side" of the analysis formula.
- rhs
The "right hand side" of the analysis formula.
- type
Type of analysis: sum, mean, median, prop(ortion) or count.
- level
Factor levels for categorical analyses; NA otherwise.
- N
Mean number of valid microdata observations across all implicates and replicates; i.e. the sample size used to construct the estimate.
- est
Point estimate; mean estimate across all implicates and replicates.
- moe
Margin of error associated with the 90% confidence interval.
- se
Standard error of the estimate.
- df
Degrees of freedom used to calculate the margin of error.
- cv
Coefficient of variation; conventional scale-independent measure of estimate reliability. Calculated as:
100 * moe / 1.645 / est
- rshare
Share of
moe
attributable to replicate weight uncertainty (as opposed to uncertainty across implicates).
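To make the relationship between est, moe, and cv concrete, the sketch below applies the cv formula given above to made-up values. The numbers are purely illustrative and are not output from analyze_fusionACS().

```r
# Hypothetical point estimate and 90% margin of error (illustrative values only)
est <- 0.25
moe <- 0.02

# 90% confidence interval implied by the margin of error
ci <- c(lower = est - moe, upper = est + moe)

# Coefficient of variation, using the formula stated in the 'cv' entry
cv <- 100 * moe / 1.645 / est

ci
cv
```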
Details
Allowable geographic units of analysis specified in by are currently limited to: region, division, state, cbsa10, puma10, county10, cousubfp10 (county subdivision), zcta10 (zip code), tract10 (census tract), and bg10 (block group).
The final point estimates are the mean estimates across implicates. The final margin of error is derived from the pooled standard error across implicates, calculated using Rubin's pooling rules (1987). The within-implicate standard errors are calculated using the replicate weights.
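The pooling step can be sketched in base R as follows. This is an illustration of the standard Rubin (1987) formulas, not the package's internal code: the function name pool_rubin() is hypothetical, the within-implicate variances are assumed to have been computed already from the replicate weights, and the use of a t quantile for the 90% margin of error is an assumption. At least two implicates are needed for the between-implicate variance to be defined.

```r
# Hypothetical helper: pool M implicate-level estimates using Rubin's rules
# est        - numeric vector of point estimates, one per implicate
# var_within - numeric vector of within-implicate variances (squared standard errors)
pool_rubin <- function(est, var_within, level = 0.90) {
  M <- length(est)
  q_bar <- mean(est)               # pooled point estimate
  u_bar <- mean(var_within)        # mean within-implicate variance
  b <- var(est)                    # between-implicate variance
  total <- u_bar + (1 + 1 / M) * b # total variance
  df <- (M - 1) * (1 + u_bar / ((1 + 1 / M) * b))^2
  se <- sqrt(total)
  moe <- qt(1 - (1 - level) / 2, df) * se # assumed t-based margin of error
  list(est = q_bar, se = se, df = df, moe = moe)
}

# Toy input: three implicates with equal within-implicate variance
res <- pool_rubin(est = c(10, 12, 11), var_within = c(1, 1, 1))
res
```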
Each entry in the analyses list is a formula of the format Z ~ F(E), where Z is an optional, user-friendly name for the analysis, F is an allowable “outer function”, and E is an “inner expression” containing one or more microdata variables. For example:
mysum ~ mean(Var1 + Var2)
In this case, the outer function is mean(). Allowable outer functions are: mean(), sum(), median(), sd(), and var(). When the inner expression contains more than one variable, it is first evaluated and then F() is applied to the result. In this case, an internal variable X = Var1 + Var2 is generated across all observations, and then mean(X) is computed.
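The Z ~ F(E) structure can be inspected with base R alone. The sketch below shows one way to split a formula into its name, outer function, and inner expression, and to evaluate the inner expression before applying a weighted mean. It is not the package's internal parsing code; the toy data frame and its weight column w are hypothetical.

```r
f <- mysum ~ mean(Var1 + Var2)

lhs <- f[[2]]         # the analysis name Z: symbol 'mysum'
rhs <- f[[3]]         # the call F(E): mean(Var1 + Var2)
outer_fun <- rhs[[1]] # the outer function F: symbol 'mean'
inner <- rhs[[2]]     # the inner expression E: Var1 + Var2

# Toy microdata with a hypothetical observation weight 'w'
toy <- data.frame(Var1 = c(1, 2, 3), Var2 = c(10, 20, 30), w = c(1, 1, 2))

# Step 1: evaluate the inner expression to create the internal variable X
X <- eval(inner, toy)

# Step 2: apply the outer function (here a weighted mean) to X
result <- weighted.mean(X, toy$w)
result
```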
If no inner expression is desired, the analyses list can use the following convenient syntax to apply a single outer function to multiple variables:
mean = c("Var1", "Var2")
The inner expression can also utilize any function that takes variable names as arguments and returns a vector with the same length as the inputs. This is useful for defining complex operations in a separate function (e.g. microsimulation). For example:
myfun = function(Var1, Var2) {Var1 + Var2}
mysum ~ mean(myfun(Var1, Var2))
The use of sum() or mean() with an inner expression that returns a categorical vector automatically results in category-wise weighted counts and proportions, respectively. For example, the following analysis would fail if evaluated literally, since mean() expects numeric input but the inner expression returns character. Instead, it is interpreted as a request to return weighted proportions for each categorical outcome.
myprop ~ mean(ifelse(Var1 > 10 , 'Yes', 'No'))
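What "category-wise weighted counts and proportions" means can be illustrated with base R. The sketch below is a conceptual equivalent on toy data, not the package's implementation; the data frame and its weight column w are hypothetical.

```r
# Toy microdata with a hypothetical observation weight 'w'
toy <- data.frame(Var1 = c(5, 12, 20, 8), w = c(1, 2, 1, 4))

# The categorical inner expression from the example above
outcome <- ifelse(toy$Var1 > 10, "Yes", "No")

# sum() on a categorical result: weighted count per category
counts <- tapply(toy$w, outcome, sum)

# mean() on a categorical result: weighted proportion per category
props <- counts / sum(toy$w)

counts
props
```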
analyze_fusionACS() uses "fast" versions of the allowable outer functions, as provided by fast-statistical-functions in the collapse package. These functions are highly optimized for weighted, grouped calculations. In addition, the outer functions mean(), sum(), and median() benefit from platform-independent multithreading across columns when cores > 1. Analyses with numerical inner expressions are processed using a series of calls to collap with unique observation weights. Analyses with categorical inner expressions utilize a series of calls to fsum.
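As a rough illustration of what a weighted, grouped calculation looks like, the sketch below computes grouped weighted means in base R and, if the collapse package happens to be installed, compares them with collapse::fmean() using its g (grouping) and w (weights) arguments. The toy vectors are hypothetical, and this is not how analyze_fusionACS() structures its internal calls.

```r
# Toy values, grouping variable, and observation weights (all hypothetical)
x <- c(1, 2, 3, 4)
g <- c("a", "a", "b", "b")
w <- c(1, 3, 1, 1)

# Base R reference: weighted mean of x within each group
base_means <- sapply(split(seq_along(x), g),
                     function(i) weighted.mean(x[i], w[i]))
base_means

# Optional comparison with the fast grouped, weighted mean from collapse
if (requireNamespace("collapse", quietly = TRUE)) {
  fast_means <- collapse::fmean(x, g = g, w = w)
  print(all.equal(unname(fast_means), unname(base_means)))
}
```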
Examples
# Analysis using ACS native weights for year 2017, by PUMA, in South Atlantic Census Division
# Uses all available implicates and replicate weights
test <- analyze_fusionACS(analyses = list(high_burden ~ mean(dollarel / hincp > 0.05)),
                          year = 2017,
                          by = "puma10",
                          area = division == "South Atlantic")
# Analysis using UrbanPop 2015-2019 weights, by tract, in Utah (actually Salt Lake City metro given current UrbanPop data)
# Uses 5 (of possible 20) fusion implicates for RECS "dollarel" variable
# Uses 5 (of possible 10) UrbanPop replicate weights
test <- analyze_fusionACS(analyses = list(median_burden ~ median(dollarel / hincp)),
                          year = 2015:2019,
                          by = "tract10",
                          area = state_name == "Utah",
                          M = 5,
                          R = 5)
# User function to create custom variables from microdata
# Variables explicitly referenced in my_fun() are automatically loaded into 'data' within analyze_fusionACS()
# Variables returned by my_fun() may be used in 'by' or inner expressions of 'analyses'
my_fun <- function(data) {
  require(tidyverse, quietly = TRUE)
  data %>%
    mutate(elderly = agep >= 65,
           energy_expend = dollarel + dollarfo + dollarlp + dollarng,
           energy_burden = energy_expend / hincp,
           energy_burden = ifelse(hincp < 5000, NA, energy_burden)) %>%
    select(elderly, energy_burden, energy_expend)
}
# Analysis using UrbanPop 2015-2019 weights, by zip code and elderly head of household, in Atlanta CBSA
test <- analyze_fusionACS(analyses = list(energy_burden ~ mean(energy_burden),
                                          at_risk ~ mean(energy_burden > 0.075 | acequipm_pub == "No air conditioning")),
                          year = 2015:2019,
                          by = c("zcta10", "elderly"),
                          area = cbsa10 == "12060",
                          fun = my_fun,
                          M = 5,
                          R = 5)