Survey Data Analysis
Why do we need survey data analysis software?
Regular statistical software (that is not designed for survey data) analyzes
data as if the data were collected using simple random sampling. For
experimental and quasi-experimental designs, this is exactly what we want.
However, when surveys are conducted, a simple random sample is rarely collected.
Not only is it nearly impossible to do so, but it is not as efficient (both
financially and statistically) as other sampling methods. When any sampling
method other than simple random sampling is used, we need to use survey data
analysis software to take into account the differences between the design that
was used and simple random sampling. The sampling design affects the calculation
of the standard errors of the estimates. If you ignore the sampling design,
e.g., if you assume simple random sampling when another type of sampling design
was used, the standard errors will likely be underestimated, possibly leading to
results that seem to be statistically significant when, in fact, they are not.
The difference in point estimates and standard errors obtained using non-survey
software and survey software with the design properly specified will vary from
data set to data set, and even between variables within the same data set. While
it may be possible to get reasonably accurate results using non-survey software,
there is no practical way to know beforehand how far off the results from
non-survey software will be.
Sampling designs
Most people do not conduct their own surveys. Rather, they use survey data that
some agency or company collected and made available to the public. The
documentation must be read carefully to find out what kind of sampling design
was used to collect the data. This is very important because many of the
estimates and standard errors are calculated differently for the different
sampling designs. Hence, if you mis-specify the sampling design, the point
estimates and standard errors are likely to be wrong.
Below are some common features of many sampling designs.
Weights
: There are many types of weights that can be associated with a survey.
Perhaps the most common is the sampling weight, sometimes called a pweight,
which is the inverse of the probability of being included in the sample under
the sampling design (except for a certainty PSU). For an equal-probability
sample, the pweight is calculated as N/n, where N is the number of elements in
the population and n is the number of elements in the sample. For example, if a
population has 10 elements and 3 are sampled at random with replacement, then
the pweight would be 10/3 = 3.33. The sum of the pweights should equal the
population size, N.
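To make the arithmetic concrete, here is a minimal sketch in Python (not tied to any of the survey packages discussed below) of the pweight calculation and the check that the weights sum to the population size:

```python
# Illustrative sketch: equal-probability sampling weights (pweights).
N = 10                 # population size
n = 3                  # sample size
pweight = N / n        # inverse of the inclusion probability, 10/3 = 3.33
total = pweight * n    # sum of the pweights over the whole sample
print(pweight, total)  # 3.33..., 10.0 -- the sum equals the population size
```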
PSU
: This stands for primary sampling unit, the first unit that is sampled in the
design. For example, school districts from
California may be sampled and then schools within districts may be sampled. The
school district would be the PSU. If states from the US were sampled, and then
school districts from within each state, and then schools from within each
district, then states would be the PSU. One does not need to use the same
sampling method at all levels of sampling. For example,
probability-proportional-to-size sampling may be used at level 1 (to select
states), while cluster sampling is used at level 2 (to select school districts).
In the case of a simple random sample, the PSUs and the elementary units are the
same.
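As a purely hypothetical sketch (the column names below are invented for illustration), the design variables for a two-stage sample might be carried on each record like this, with every school record tagged with the district (PSU) it was sampled from:

```python
# Hypothetical two-stage sample: districts are the PSUs, schools are sampled within them.
# The column names (psu, school, pweight) are illustrative, not from any particular data set.
sample = [
    {"psu": "District A", "school": "School 1", "pweight": 12.5},
    {"psu": "District A", "school": "School 2", "pweight": 12.5},
    {"psu": "District B", "school": "School 3", "pweight": 20.0},
]
# Every record carries its PSU so that survey software can estimate variances correctly.
psus = {row["psu"] for row in sample}
print(sorted(psus))  # ['District A', 'District B']
```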
Strata
: Stratification is a method of breaking the population up into
different groups, often by demographic variables such as gender or race.
Once these groups have been defined, a sample is selected from each group
independently of all of the other groups. For example, if a sample is to be
stratified on gender, men and women would be sampled independently of one another.
This means that the pweights for the men will likely be different from the pweights
for the women. In most cases, you need to have two or more PSUs in each stratum.
The purpose of stratification is to improve the precision of the estimates.
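A small sketch with made-up counts shows why the pweights differ by stratum; within each stratum the weight is simply N_h / n_h:

```python
# Illustrative sketch: stratum-specific pweights (N_h / n_h), with made-up counts.
strata = {
    "men":   {"N": 4800, "n": 120},   # 4,800 men in the population, 120 sampled
    "women": {"N": 5200, "n": 200},   # 5,200 women in the population, 200 sampled
}
for name, s in strata.items():
    pweight = s["N"] / s["n"]
    print(name, pweight)   # men 40.0, women 26.0 -- the weights differ by stratum
```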
Finite Population Correction (FPC)
: FPC
is used when the sampling fraction (the number of elements or respondents
sampled relative to the population) becomes large. The FPC is used in the
calculation of the standard error of the estimate. If the value of the FPC is
close to 1, it will have little impact and can be safely ignored. In some survey
data analysis programs, such as SUDAAN, this information will be needed if you
specify that the data were collected without replacement. To see the impact of the FPC for samples
of various proportions, suppose that you had a population of 10,000 elements.
Sample size (n)    FPC
              1    1.0000
             10    0.9995
            100    0.9950
            500    0.9747
           1000    0.9487
           5000    0.7071
           9000    0.3162
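The values in the table are consistent with the usual definition of the FPC as sqrt((N - n) / (N - 1)). A short sketch that reproduces them:

```python
import math

# FPC = sqrt((N - n) / (N - 1)); the table above uses N = 10,000.
def fpc(N, n):
    return math.sqrt((N - n) / (N - 1))

for n in (1, 10, 100, 500, 1000, 5000, 9000):
    print(n, round(fpc(10000, n), 4))   # 1.0, 0.9995, 0.995, 0.9747, 0.9487, 0.7071, 0.3162
```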
Imputation flag: This is a 0/1 variable that is associated
with a variable in the data set and indicates whether the corresponding value in
the associated variable was imputed or given by the respondent. For example, in
the data set below
Subject    Response    ImputeFlag
      1          60             0
      2          60             1
      3          63             0
the response for subject 2 was imputed. The flag does not tell you how the
imputation was done (e.g., mean substitution, mean of neighboring units, etc.). These
variables are useful for determining how much missing data each variable had
before imputation.
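A minimal sketch (with hypothetical variable names) of using an imputation flag to see how much of a variable was imputed rather than reported:

```python
# Hypothetical records mirroring the small table above.
rows = [
    {"subject": 1, "response": 60, "impute_flag": 0},
    {"subject": 2, "response": 60, "impute_flag": 1},   # this response was imputed
    {"subject": 3, "response": 63, "impute_flag": 0},
]
n_imputed = sum(r["impute_flag"] for r in rows)
print(f"{n_imputed} of {len(rows)} responses were imputed")   # 1 of 3 responses were imputed
```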
Non-response weight: There are both unit and item non-response
weights. The former down-weights an entire case because the respondent did not
respond to any of the items on the survey. The
latter down-weights "responses" from respondents who did not answer that particular item.
Certainty PSU: This is a PSU that was guaranteed to be in
the sample. This is independent of the sampling design: any sampling design can
have one or more certainty PSUs. Certainty PSUs are also called
self-representing units.
Poststratification: This is stratification that happens after the sample has been
collected, either because the information needed to do stratification was not
available when the sample was collected, or because it was not known at the time
of data collection that stratification on this variable would be
necessary/desirable. The purpose of poststratification is to improve the
precision of the estimates or to reduce bias caused by non-response.
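A hedged sketch with made-up counts of the usual poststratification adjustment, which rescales the weights within each poststratum so that they sum to a known population total for that group:

```python
# Made-up example: rescale weights within each poststratum so they sum to a
# known population count for that group.
known_totals = {"under_40": 6000, "40_and_over": 4000}   # hypothetical census counts
sample = [
    {"group": "under_40",    "weight": 25.0},
    {"group": "under_40",    "weight": 25.0},
    {"group": "40_and_over", "weight": 25.0},
]
weight_sums = {}
for r in sample:
    weight_sums[r["group"]] = weight_sums.get(r["group"], 0.0) + r["weight"]
for r in sample:
    r["post_weight"] = r["weight"] * known_totals[r["group"]] / weight_sums[r["group"]]
print([r["post_weight"] for r in sample])   # [3000.0, 3000.0, 4000.0]
```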
Sampling with and without replacement
Most samples collected in the real world are collected "without replacement".
This means that once a respondent has been selected to be in the sample and has
participated in the survey, that particular respondent cannot be selected again
to be in the sample. Many of the calculations change depending on whether a sample is
collected with or without replacement. Hence, programs like SUDAAN request that
you specify whether the sampling design was implemented with or without
replacement, and an FPC is used when sampling is done without replacement, even if
the value of the FPC is very close to one.
Replicate weights
Replicate weights are a feature of an increasing number of public use survey
data sets. Replicate weights are a series of weight variables that are used
instead of PSUs and strata in an effort to protect the respondents' identity.
Either replicate weights or Taylor series linearization, which is based on the
PSUs and/or strata, is necessary for variance estimation.
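As a rough sketch of the idea only (the exact formula and multiplier depend on the replication method, e.g., jackknife or balanced repeated replication, and the numbers below are made up), replicate-weight variance estimation re-computes the estimate under each set of replicate weights and combines the squared deviations from the full-sample estimate:

```python
# Rough sketch of replicate-weight variance estimation. The multiplier depends on
# the replication method (jackknife, BRR, etc.); the values here are made up.
full_sample_estimate = 52.3
replicate_estimates = [52.1, 52.6, 52.0, 52.5]   # estimate recomputed with each replicate weight
multiplier = (len(replicate_estimates) - 1) / len(replicate_estimates)   # a JK1-style factor

variance = multiplier * sum((r - full_sample_estimate) ** 2 for r in replicate_estimates)
standard_error = variance ** 0.5
print(round(standard_error, 3))
```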
Summary of four survey data analysis packages
We are now going to summarize some of the features of four survey data analysis
packages: Stata, SUDAAN, WesVar and SAS. One feature that all four programs share
is that once you specify the sampling design, it is either 1) applied to all
analyses until you change it or exit the program (Stata and WesVar), or 2) very
easy to apply to all analyses (SUDAAN and SAS). In other words, you only need to
go through the work of specifying the design once, and then it applies to all
analyses of that data set.
Stata:
- It handles most sampling designs, except two-stage cluster sampling, probability-proportional-to-size sampling, poststratification and certainty PSUs
- It has the most statistical procedures of any of the packages
- It does not handle replicate weights
- It has a relatively easy-to-use command interface (point-and-click in Stata version 8)
SUDAAN:
- It handles all sampling designs
- It has a fair number of statistical procedures
- It handles replicate weights (except for survival analysis)
- It has a relatively more difficult-to-use command interface
WesVar:
- It handles all sampling designs except two-stage cluster sampling
- It has a fair number of statistical procedures
- It handles replicate weights (and can create them from PSUs and strata)
- It has a relatively easy-to-use point-and-click interface
SAS:
- It handles all sampling designs except poststratification and two-stage cluster sampling
- It has a VERY limited number of statistical features (only means and regression in version 8; frequencies and perhaps logistic regression in version 9)
- It does not handle replicate weights
- It has a relatively more difficult-to-use command interface