Survey Data Analysis
Why do we need survey data analysis software?
Regular statistical software (that is not designed for survey data) analyzes
data as if the data were collected using simple random sampling. For
experimental and quasi-experimental designs, this is exactly what we want.
However, when surveys are conducted, a simple random sample is rarely collected.
Not only is it nearly impossible to do so, but it is not as efficient (both
financially and statistically) as other sampling methods. When any sampling
method other than simple random sampling is used, we need to use survey data
analysis software to take into account the differences between the design that
was used and simple random sampling. The sampling design affects the calculation
of the standard errors of the estimates. If you ignore the sampling design,
e.g., if you assume simple random sampling when another type of sampling design
was used, the standard errors will likely be underestimated, possibly leading to
results that seem to be statistically significant when, in fact, they are not.
The difference in point estimates and standard errors obtained using non-survey
software and survey software with the design properly specified will vary from
data set to data set, and even between variables within the same data set. While
it may be possible to get reasonably accurate results using non-survey software,
there is no practical way to know beforehand how far off the results from
non-survey software will be.
Sampling designs
Most people do not conduct their own surveys. Rather, they use survey data that
some agency or company collected and made available to the public. The
documentation must be read carefully to find out what kind of sampling design
was used to collect the data. This is very important because many of the
estimates and standard errors are calculated differently for the different
sampling designs. Hence, if you mis-specify the sampling design, the point
estimates and standard errors are likely to be wrong.
Below are some common features of many sampling designs.
Weights
: There are many types of weights that can be associated with a survey.
Perhaps the most common is the sampling weight, sometimes called a pweight,
which is the inverse of the probability of being included in the sample under
the sampling design (except for a certainty PSU). For an equal-probability
sample, the pweight is calculated as N/n, where N is the number of elements in
the population and n is the number of elements in the sample. For example, if a
population has 10 elements and 3 are sampled at random with replacement, then
the pweight would be 10/3 = 3.33. The sum of the pweights should equal the
population size, N.
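To make the arithmetic concrete, here is a minimal sketch in Python (not tied to any of the survey packages discussed below) of the pweight calculation and the check that the weights sum to the population size:

```python
# Illustrative sketch: equal-probability sampling weights (pweights).
N = 10                 # population size
n = 3                  # sample size
pweight = N / n        # inverse of the inclusion probability, 10/3 = 3.33
total = pweight * n    # sum of the pweights over the whole sample
print(pweight, total)  # 3.33..., 10.0 -- the sum equals the population size
```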
PSU
: This stands for primary sampling unit, the first unit that is sampled in the
design. For example, school districts from
California may be sampled and then schools within districts may be sampled. The
school district would be the PSU. If states from the US were sampled, and then
school districts from within each state, and then schools from within each
district, then states would be the PSU. One does not need to use the same
sampling method at all levels of sampling. For example,
probability-proportional-to-size sampling may be used at level 1 (to select
states), while cluster sampling is used at level 2 (to select school districts).
In the case of a simple random sample, the PSUs and the elementary units are the
same.
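As a purely hypothetical sketch (the column names below are invented for illustration), the design variables for a two-stage sample might be carried on each record like this, with every school record tagged with the district (PSU) it was sampled from:

```python
# Hypothetical two-stage sample: districts are the PSUs, schools are sampled within them.
# The column names (psu, school, pweight) are illustrative, not from any particular data set.
sample = [
    {"psu": "District A", "school": "School 1", "pweight": 12.5},
    {"psu": "District A", "school": "School 2", "pweight": 12.5},
    {"psu": "District B", "school": "School 3", "pweight": 20.0},
]
# Every record carries its PSU so that survey software can estimate variances correctly.
psus = {row["psu"] for row in sample}
print(sorted(psus))  # ['District A', 'District B']
```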
Strata
: Stratification is a method of breaking the population up into
different groups, often by demographic variables such as gender or race.
Once these groups have been defined, a sample is selected from each group
independently of all of the other groups. For example, if a sample is to be
stratified on gender, men and women would be sampled independently of one another.
This means that the pweights for the men will likely be different from the pweights
for the women. In most cases, you need to have two or more PSUs in each stratum.
The purpose of stratification is to improve the precision of the estimates.
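A small sketch with made-up counts shows why the pweights differ by stratum; within each stratum the weight is simply N_h / n_h:

```python
# Illustrative sketch: stratum-specific pweights (N_h / n_h), with made-up counts.
strata = {
    "men":   {"N": 4800, "n": 120},   # 4,800 men in the population, 120 sampled
    "women": {"N": 5200, "n": 200},   # 5,200 women in the population, 200 sampled
}
for name, s in strata.items():
    pweight = s["N"] / s["n"]
    print(name, pweight)   # men 40.0, women 26.0 -- the weights differ by stratum
```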
Finite Population Correction (FPC)
: FPC
is used when the sampling fraction (the number of elements or respondents
sampled relative to the population) becomes large. The FPC is used in the
calculation of the standard error of the estimate. If the value of the FPC is
close to 1, it will have little impact and can be safely ignored. In some survey
data analysis programs, such as SUDAAN, this information will be needed if you
specify that the data were collected without replacement. To see the impact of the FPC for samples
of various proportions, suppose that you had a population of 10,000 elements.
Sample size (n)    FPC
              1    1.0000
             10    0.9995
            100    0.9950
            500    0.9747
           1000    0.9487
           5000    0.7071
           9000    0.3162
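The values in the table are consistent with the usual definition of the FPC as sqrt((N - n) / (N - 1)). A short sketch that reproduces them:

```python
import math

# FPC = sqrt((N - n) / (N - 1)); the table above uses N = 10,000.
def fpc(N, n):
    return math.sqrt((N - n) / (N - 1))

for n in (1, 10, 100, 500, 1000, 5000, 9000):
    print(n, round(fpc(10000, n), 4))   # 1.0, 0.9995, 0.995, 0.9747, 0.9487, 0.7071, 0.3162
```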
Imputation flag: This is a 0/1 variable that is associated
with a variable in the data set and indicates whether the corresponding value in
the associated variable was imputed or given by the respondent. For example, in
the data set below
Subject    Response    ImputeFlag
      1          60             0
      2          60             1
      3          63             0
the response for subject 2 was imputed. The flag does not tell you how the
imputation was done (e.g., mean substitution, mean of neighboring units, etc.). These
variables are useful for determining how much missing data each variable had
before imputation.
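A minimal sketch (with hypothetical variable names) of using an imputation flag to see how much of a variable was imputed rather than reported:

```python
# Hypothetical records mirroring the small table above.
rows = [
    {"subject": 1, "response": 60, "impute_flag": 0},
    {"subject": 2, "response": 60, "impute_flag": 1},   # this response was imputed
    {"subject": 3, "response": 63, "impute_flag": 0},
]
n_imputed = sum(r["impute_flag"] for r in rows)
print(f"{n_imputed} of {len(rows)} responses were imputed")   # 1 of 3 responses were imputed
```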
Non-response weight: There are both unit and item non-response
weights. The former down-weights an entire case because the respondent did not
respond to any of the items on the survey. The
latter down-weights "responses" from respondents who did not answer that particular item.
Certainty PSU: This is a PSU that was guaranteed to be in
the sample. This is independent of the sampling design: any sampling design can
have one or more certainty PSUs. Certainty PSUs are also called
self-representing units.
Poststratification: This is stratification that happens after the sample has been
collected, either because the information needed to do stratification was not
available when the sample was collected, or because it was not known at the time
of data collection that stratification on this variable would be
necessary/desirable. The purpose of poststratification is to improve the
precision of the estimates or to reduce bias caused by non-response.
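A hedged sketch with made-up counts of the usual poststratification adjustment, which rescales the weights within each poststratum so that they sum to a known population total for that group:

```python
# Made-up example: rescale weights within each poststratum so they sum to a
# known population count for that group.
known_totals = {"under_40": 6000, "40_and_over": 4000}   # hypothetical census counts
sample = [
    {"group": "under_40",    "weight": 25.0},
    {"group": "under_40",    "weight": 25.0},
    {"group": "40_and_over", "weight": 25.0},
]
weight_sums = {}
for r in sample:
    weight_sums[r["group"]] = weight_sums.get(r["group"], 0.0) + r["weight"]
for r in sample:
    r["post_weight"] = r["weight"] * known_totals[r["group"]] / weight_sums[r["group"]]
print([r["post_weight"] for r in sample])   # [3000.0, 3000.0, 4000.0]
```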
Sampling with and without replacement
Most samples collected in the real world are collected "without replacement".
This means that once a respondent has been selected to be in the sample and has
participated in the survey, that particular respondent cannot be selected again
to be in the sample. Many of the calculations change depending on whether a sample is
collected with or without replacement. Hence, programs like SUDAAN request that
you specify whether the sampling design was implemented with or without
replacement, and an FPC is used when sampling is done without replacement, even if
the value of the FPC is very close to one.
Replicate weights
Replicate weights are a feature of an increasing number of public use survey
data sets. Replicate weights are a series of weight variables that are used
instead of PSUs and strata in an effort to protect the respondents' identity.
Either replicate weights or Taylor series linearization, which is based on the
PSUs and/or strata, is necessary for variance estimation.
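As a rough sketch of the idea only (the exact formula and multiplier depend on the replication method, e.g., jackknife or balanced repeated replication, and the numbers below are made up), replicate-weight variance estimation re-computes the estimate under each set of replicate weights and combines the squared deviations from the full-sample estimate:

```python
# Rough sketch of replicate-weight variance estimation. The multiplier depends on
# the replication method (jackknife, BRR, etc.); the values here are made up.
full_sample_estimate = 52.3
replicate_estimates = [52.1, 52.6, 52.0, 52.5]   # estimate recomputed with each replicate weight
multiplier = (len(replicate_estimates) - 1) / len(replicate_estimates)   # a JK1-style factor

variance = multiplier * sum((r - full_sample_estimate) ** 2 for r in replicate_estimates)
standard_error = variance ** 0.5
print(round(standard_error, 3))
```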
Summary of four survey data analysis packages
We are now going to summarize some of the features of four survey data analysis
packages: Stata, SUDAAN, WesVar and SAS. One feature that all four programs share
is that once you specify the sampling design, it is either 1) applied to all
analyses until you change it or exit the program (Stata and WesVar), or 2) very
easy to apply to all analyses (SUDAAN and SAS). In other words, you only need to
go through the work of specifying the design once, and then it applies to all
analyses of that data set.
Stata:
- It handles most sampling designs, except two-stage cluster sampling, probability-proportional-to-size sampling, poststratification and certainty PSUs
- It has the most statistical procedures of any of the packages
- It does not handle replicate weights
- It has a relatively easy-to-use command interface (point-and-click in Stata version 8)
SUDAAN:
- It handles all sampling designs
- It has a fair number of statistical procedures
- It handles replicate weights (except for survival analysis)
- It has a relatively more difficult-to-use command interface
WesVar:
- It handles all sampling designs except two-stage cluster sampling
- It has a fair number of statistical procedures
- It handles replicate weights (and can create them from PSUs and strata)
- It has a relatively easy-to-use point-and-click interface
SAS:
- It handles all sampling designs except poststratification and two-stage cluster sampling
- It has a VERY limited number of statistical features (only means and regression in version 8; frequencies and perhaps logistic regression in version 9)
- It does not handle replicate weights
- It has a relatively more difficult-to-use command interface