HCUP Calculating Standard Errors
Thank you for joining us for this Healthcare Cost and Utilization Project (HCUP) online tutorial on Calculating Standard Errors.
Before we get started, a quick word about HCUP:
The goal of this tutorial is to show you how to determine the precision of the estimates you calculate from HCUP nationwide databases so that you will be able to draw sound conclusions from your analyses.
Importance of Calculating Standard Errors
Standard error is a measure of the precision of a statistic. It reflects the amount that a sample statistic's value would fluctuate if a large number of samples were to be drawn using the same sampling design. Less precise estimates have larger standard errors, while more precise estimates have smaller standard errors.
HCUP Nationwide Database Sample Design
The HCUP nationwide databases are not simple random samples. The NIS (beginning with data year 2012), KID, and NRD are stratified samples. The NIS was redesigned in 2012 to improve national estimates. Prior to its redesign, the NIS was a stratified two-stage cluster sample without replacement. The NEDS also is a stratified two-stage cluster sample without replacement.

Standard formulas for a stratified two-stage cluster sample without replacement may be used to calculate standard errors in most applications for all four samples. Although a sample of hospitals is not drawn for the NIS (beginning with data year 2012), KID, or NRD, for estimation purposes hospitals should be treated as though they were selected at the first stage of sampling from the entire universe of hospitals within each stratum.

Examples provided in this tutorial use 2013 NIS data, but the same standard error calculations apply to prior data years of the NIS as well as to the NEDS, KID, and NRD. To review the sample designs, refer to the HCUP Sample Design Tutorial.

The Nationwide Inpatient Sample
Prior to data year 2012, the NIS was a stratified two-stage cluster sample, similar to the NEDS. Beginning with the 2012 data year, the NIS is a stratified sample of hospital discharges. Discharges in the sampling frame are stratified by five key hospital characteristics. Then, a systematic random sample of discharges is chosen from each of the strata after the discharges are sorted by "control" variables ordered as follows: encrypted hospital ID, Diagnosis-Related Group (DRG), admission month, and a random number. Although the NIS is not a cluster sample (discharges are sampled from all frame hospitals), discharges are still clustered within hospitals. Consequently, each hospital is considered a cluster for the purpose of calculating standard errors.

The Nationwide Emergency Department Sample
The NEDS is a stratified two-stage cluster sample. Hospital-based emergency departments in the sampling frame are stratified by five key hospital characteristics. Then, a random sample of hospital-based emergency departments is chosen from each of the strata. In sampling terminology, each emergency department is considered a cluster. The NEDS includes all discharges from the selected clusters, or emergency departments.

The Kids' Inpatient Database
The KID comprises a sample of pediatric discharges from all hospitals in the sampling frame. Discharges are stratified by whether they are an uncomplicated in-hospital birth, a complicated in-hospital birth, or a pediatric non-birth. For the KID, a random sample of 10% of uncomplicated in-hospital births and 80% of all other pediatric discharges is selected.

The Nationwide Readmissions Database
The NRD is drawn from HCUP State Inpatient Databases (SID) that contain reliable, verified patient linkage numbers that can be used to track a person across hospitals within a State, while adhering to strict privacy guidelines. All of the discharges in the sampling frame were included, making the NRD a sample of convenience. Discharges are post-stratified for the purpose of weighting by hospital characteristics (census region, urban/rural location, hospital teaching status, size of the hospital defined by the number of beds, and hospital control) and patient characteristics (sex and five age groups [0, 1-17, 18-44, 45-64, and 65 and older]).

The procedures described in this tutorial all assume inferences to a large population. Therefore, the finite population correction is not used. The correction is applied only when inferences are being made to the specific population of patients actually hospitalized during the year of the data. Usually analysts prefer not to use the finite population correction because they are interested in the long-run results for hospitals. For example, interest centers on the true, long-run mortality rate for a hospital over multiple years rather than on the mortality rate actually observed in a single year.
Several statistical programming packages can be used to calculate sample statistics and appropriate standard errors based on data from complex sampling designs. Some examples of these statistical programming packages are SAS®, SUDAAN®, Stata®, and SPSS®. I will use SAS in today's demonstrations. In particular, I will use the SAS survey sampling and analysis procedures.
These procedures incorporate the complex sample design of the HCUP nationwide databases into the analysis. They MUST be used when calculating national estimates, regional estimates, and standard errors. The HCUP reports Calculating Nationwide Inpatient Sample (NIS) Variances for Data Years 2011 and Earlier and Calculating National Inpatient Sample (NIS) Variances for Data Years 2012 and Later provide more information as well as example code for calculating standard errors using other statistical packages.

First, I will show you how to produce standard errors for statistics based on the entire National Inpatient Sample. The SAS program code below produces national estimates of the sums, the means, and the standard errors for the number of discharges, the length of stay, the percentage of people who died during hospitalization, and the total hospital charges from the 2013 NIS.
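    LIBNAME NIS2013 "C:\";

    /* FDIED. is a user-defined format and must exist before it is used.   */
    /* This definition is an assumption inferred from the value labels     */
    /* shown in the program output.                                        */
    PROC FORMAT;
      VALUE FDIED
        .  = ".: Missing"
        .A = ".A: Invalid"
        0  = "0: Did not die in hospital"
        1  = "1: Died in hospital";
    RUN;

    /* Keep every discharge and add a counter that equals 1 on each record */
    DATA NIS_2013_CORE;
      SET NIS2013.NIS_2013_CORE;
      LENGTH DISCHGS 3;
      RETAIN DISCHGS 1;
    RUN;

    /* Estimates that account for the weights, clusters, and strata */
    PROC SURVEYMEANS DATA=NIS_2013_CORE SUM STD MEAN STDERR MISSING;
      WEIGHT discwt;
      CLASS died;
      FORMAT died FDIED.;
      CLUSTER hosp_nis;
      STRATA nis_stratum;
      VAR DISCHGS los died totchg;
    RUN;

In all examples, the following conventions apply: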
Each statement and option in the program is explained below.

SET: Keep all observations in the CORE file.
    SET NIS2013.NIS_2013_CORE;

LENGTH: By default, numeric variables have a length of 8. To reduce the size of the file, a length of 3 is sufficient for binary variables.
    LENGTH DISCHGS 3;

RETAIN: Create a dummy variable to ensure that every observation will be included in the discharge count.
    RETAIN DISCHGS 1;

PROC SURVEYMEANS: The PROC SURVEYMEANS statement invokes the SAS procedure. The DATA=, SUM, STD, MEAN, STDERR, and MISSING options described next all appear on this statement.
    PROC SURVEYMEANS DATA=NIS_2013_CORE SUM STD MEAN STDERR MISSING;

DATA=: The DATA= option requests that the analysis be performed on the NIS 2013 Core file.

SUM: The SUM option requests the sum for variables listed in the VAR statement. For example, the variable DISCHGS is set to equal 1 for every record, so its sum estimates the total number of discharges.

STD: The STD option requests the standard deviation of the sum, which serves as the standard error of the estimated total.

MEAN and STDERR: The MEAN and STDERR options request that the mean and its standard error be printed.

MISSING: If you specify the MISSING option in the PROC SURVEYMEANS statement, the procedure treats missing values of a categorical variable as a valid category. Otherwise, observations with missing values of a categorical variable would be excluded from estimates.

WEIGHT: The WEIGHT statement weights each record by the value of the variable DISCWT.
    WEIGHT discwt;

CLASS: The CLASS statement identifies DIED as a categorical variable for which a ratio analysis is performed (ratio of sum of DIED to sum of DISCWT).
    CLASS died;

FORMAT: The FORMAT statement is used to add value labels. In this example, it assigns value labels for the class variable, DIED. If the FORMAT statement is not used, the SAS output will display only the values (i.e., 0 and 1). The value labels help clarify the results (e.g., 0 represents patients who did not die in the hospital and 1 represents patients who died in the hospital).
    FORMAT died FDIED.;

CLUSTER: The CLUSTER statement specifies HOSP_NIS as the cluster identifier. The cluster is the hospital.
    CLUSTER hosp_nis;

STRATA: The STRATA statement specifies NIS_STRATUM as the stratum identifier. In the case of the NIS, the strata are based on hospital characteristics.
    STRATA nis_stratum;

Here are the results of the program.
The SURVEYMEANS Procedure

Data Summary
Number of Strata                 202
Number of Clusters              4363
Number of Observations       7119563
Sum of Weights              35597792

Class Level Information
CLASS Variable  Label                        Levels  Values
DIED            Died during hospitalization       4  .: Missing, .A: Invalid, 0: Did not die in hospital, 1: Died in hospital

Statistics
Variable Level                       Label                              Mean  Std Error of Mean                Sum         Std Dev
DISCHGS                                                                 1.00               0.00         35,597,792         296,045
LOS                                  Length of stay (cleaned)           4.55               0.02        161,796,496       1,466,640
TOTCHG                               Total charges (cleaned)       39,513.25             480.47  1,378,643,839,214  21,505,352,862
DIED     .: Missing                  Died during hospitalization        0.00               0.00              9,585           2,359
         .A: Invalid                 Died during hospitalization        0.00               0.00              3,575             855
         0: Did not die in hospital  Died during hospitalization        0.98               0.00         34,912,122         290,483
         1: Died in hospital         Died during hospitalization        0.02               0.00            672,510           7,974

As you can see in the Data Summary, there are 202 sampling strata; 4,363 clusters, each of which is a hospital; and 7,119,563 unweighted sample records in the 2013 NIS.

According to the results, it is estimated that nationwide there were a total of 35,597,792 inpatient discharges with a standard deviation of 296,045.

The estimated average length of stay was 4.55 days with a standard error of 0.02 days.
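One common use of a standard error is to build an approximate 95 percent confidence interval around an estimate. The worked example below is a sketch that assumes the usual normal approximation (the tutorial itself does not show this step) and applies it to the length-of-stay estimate and standard error reported above.

```latex
\bar{y} \pm 1.96 \times \operatorname{SE}(\bar{y})
  = 4.55 \pm 1.96 \times 0.02
  = 4.55 \pm 0.0392
  \;\Rightarrow\; (4.51,\ 4.59)\ \text{days}
```

Because the published mean and standard error are rounded to two decimal places, the interval is approximate. PROC SURVEYMEANS can also print confidence limits for the mean directly when the CLM keyword is added to the PROC statement.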
The results of the example analysis can be verified using HCUPnet. Here are the results of an HCUPnet query corresponding to our SAS program. When the results of the SAS program are compared to HCUPnet output, all of the estimates and standard errors agree: total discharges, length of stay, total charges, and in-hospital deaths.

In other comparisons against HCUPnet, you may notice small discrepancies in some estimates. HCUPnet uses data that are stored as SAS files. The NIS files that are purchased through the HCUP Central Distributor are sent as ASCII files. Weights (for making national estimates) in the ASCII files are truncated at the fourth decimal place, so some resulting estimates will be slightly different from those from HCUPnet; however, the differences should be very small.

Calculating Standard Errors for Subsets
What if your research focuses on only a subset of discharges from the NIS, such as hospital stays in which a coronary artery bypass graft, or CABG (pronounced "cabbage"), was performed? Does calculating standard errors for a subset of discharges differ from calculating standard errors for estimates based on the entire sample? Yes. When you produce statistics based on all the discharges in the sample, you include discharges from all of the hospitals in the sample, and thus take all of the hospitals, or clusters, in the sample into account. When you analyze a subset, some hospitals may have no discharges in the subset, but they are still part of the sample design and must be accounted for to obtain correct standard errors.
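To see why hospitals with no discharges in the subset still matter, it can help to write the subset total with an indicator variable. The notation below is introduced only for illustration and does not come from the tutorial: h indexes strata, i indexes sampled hospitals, j indexes discharges, n_h is the number of sampled hospitals in stratum h, w_hij is the discharge weight, and delta_hij equals 1 when the discharge belongs to the subset and 0 otherwise. The sketch assumes the usual variance estimator for a stratified cluster design without a finite population correction.

```latex
\hat{Y}_D = \sum_{h}\sum_{i=1}^{n_h}\sum_{j} w_{hij}\,\delta_{hij}\,y_{hij},
\qquad
z_{hi} = \sum_{j} w_{hij}\,\delta_{hij}\,y_{hij},
\qquad
\bar{z}_{h} = \frac{1}{n_h}\sum_{i=1}^{n_h} z_{hi}

\widehat{\operatorname{Var}}\bigl(\hat{Y}_D\bigr)
  = \sum_{h}\frac{n_h}{n_h-1}\sum_{i=1}^{n_h}\bigl(z_{hi}-\bar{z}_{h}\bigr)^{2}
```

A hospital with no subset discharges has z_hi = 0, yet it still counts toward n_h and toward the stratum mean. Dropping such hospitals changes n_h and the stratum mean, and can remove whole strata, which is how the standard errors end up wrong.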
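There are two methods you can use to account for all of the hospitals in the sample:
Recommended method: keep every discharge in the analysis file, flag the discharges of interest, and request statistics for the flagged subpopulation with the DOMAIN statement.
Alternate method: subset the database to the discharges of interest, then append dummy observations from the HOSPITAL file so that every hospital in the sample is still represented.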
The recommended method for calculating standard errors requires more disk space and CPU time than the alternate method because the HCUP nationwide databases have a large number of records, all of which are involved in the recommended method. This may present a challenge in terms of disk space or software capabilities when using a database such as the 2013 NEDS, which contains roughly 30 million unweighted observations. In this case the alternate method, which we will look at shortly, may be more appropriate.

The code for the recommended method is shown below, followed by an explanation of each line.

    LIBNAME NIS2013 "C:\";

    /* FLAG CABG PROCEDURES (ALL RECORDS ARE KEPT) */
    DATA CABGSUBSET;
      SET NIS2013.NIS_2013_CORE;
      LENGTH DISCHGS CABG 3;
      RETAIN DISCHGS 1;
      IF PRCCS1=44 THEN CABG=1;
      ELSE CABG=0;
    RUN;

    /* FDIED. is a user-defined format; define it with PROC FORMAT before this step */
    PROC SURVEYMEANS DATA=CABGSUBSET SUM STD MEAN STDERR MISSING;
      WEIGHT discwt;
      CLASS died;
      FORMAT died fdied.;
      CLUSTER hosp_nis;
      STRATA nis_stratum;
      VAR DISCHGS los died totchg;
      DOMAIN CABG;
    RUN;

SET NIS2013.NIS_2013_CORE: Keep all observations in the CORE file.

RETAIN DISCHGS 1: Create a dummy variable to ensure that every observation will be included in the discharge count.

IF PRCCS1=44 THEN CABG=1: PRCCS1 is the data element in which the CCS principal procedure is stored, and the CCS code for CABG is 44. For more information on Clinical Classifications Software (CCS) and CCS codes, visit the HCUP-US Tools & Software page.

ELSE CABG=0: Together with the previous statement, this creates a variable that flags discharges for which coronary artery bypass graft, or CABG, was the principal procedure performed. All other discharges receive a value of 0.

DOMAIN CABG: Use the CABG flag in the SAS DOMAIN statement in the SURVEYMEANS procedure. The DOMAIN statement requests analyses for a subpopulation (i.e., CABG procedures) and enables appropriate calculations for statistics in each domain.
Subsets: Recommended Method Results
The data summary shows that the output accounts for all 4,363 hospitals in the sample and all 7 million unweighted observations. The first set of statistics, where CABG equals zero, is for discharges that did not have CABG as the principal procedure. The second set of statistics, where CABG equals one, is for those discharges for which CABG was the principal procedure.

The SURVEYMEANS Procedure

Data Summary
Number of Strata                 202
Number of Clusters              4363
Number of Observations       7119563
Sum of Weights              35597792

Class Level Information
CLASS Variable  Label                        Levels  Values
DIED            Died during hospitalization       4  .: Missing, .A: Invalid, 0: Did not die in hospital, 1: Died in hospital

Domain Statistics in CABG
CABG  Variable Level                       Label                              Mean  Std Error of Mean                Sum         Std Dev
0     DISCHGS                                                                 1.00               0.00         35,440,072         294,316
      LOS                                  Length of stay (cleaned)           4.52               0.02        160,334,536       1,449,387
      TOTCHG                               Total charges (cleaned)       38,971.03             476.96  1,353,657,499,120  21,351,712,415
      DIED     .: Missing                  Died during hospitalization        0.00               0.00              9,545           2,356
               .A: Invalid                 Died during hospitalization        0.00               0.00              3,545             854
               0: Did not die in hospital  Died during hospitalization        0.98               0.00         34,757,392         288,803
               1: Died in hospital         Died during hospitalization        0.02               0.00            669,590           7,935
1     DISCHGS                                                                 1.00               0.00            157,720           4,347
      LOS                                  Length of stay (cleaned)           9.27               0.06          1,461,960          41,387
      TOTCHG                               Total charges (cleaned)      160,477.45           2,469.74     24,986,340,094     738,469,446
      DIED     .: Missing                  Died during hospitalization        0.00               0.00                 40              18
               .A: Invalid                 Died during hospitalization        0.00               0.00                 30              25
               0: Did not die in hospital  Died during hospitalization        0.98               0.00            154,730           4,275
               1: Died in hospital         Died during hospitalization        0.02               0.00              2,920             142

Results show an estimated total of 157,720 hospitalizations in which CABG is the principal procedure, with a standard deviation of 4,347. The average length of stay, indicated as LOS, is estimated at 9.27 days with a standard error of 0.06 days. The estimated average total charges were $160,477.45 with a standard error of $2,469.74. The mean of the flags indicating death during hospitalization was 0.02. In other words, about 2 percent of stays resulted in death during hospitalization. The standard error rounds to 0.00 on the proportion scale, which means it is less than half a percentage point.

The results of the example analysis can be verified using HCUPnet. Here are the results of a query corresponding to our SAS program. When the results of the SAS program are compared to HCUPnet output, you can see that all of the estimates are the same.

The alternate method for calculating appropriate standard errors is to subset the nationwide database to the observations of interest. Then, append one "dummy" observation for each of the hospitals included in the nationwide database that is not represented in the subset. The dummy observations ensure that all the hospitals in the sample are taken into account, resulting in the accurate calculation of standard errors. To do this, you must concatenate the subset of interest with the HOSPITAL file.
    LIBNAME NIS2013 "C:\";

    /* CREATE SUBSET OF CABG PROCEDURES */
    DATA CABGSUBSET;
      SET NIS2013.NIS_2013_CORE;
      LENGTH DISCHGS 3;
      RETAIN DISCHGS 1;
      IF PRCCS1=44;
    RUN;

    /* CREATE ANALYSIS FILE: APPEND ONE DUMMY RECORD FOR EACH HOSPITAL IN THE HOSPITAL FILE */
    DATA CABGSUBSET;
      SET CABGSUBSET
          NIS2013.NIS_2013_HOSPITAL (IN=INHOSP KEEP=HOSP_NIS NIS_STRATUM);
      LENGTH INSUBSET 3;
      INSUBSET = 1;
      IF INHOSP THEN DO;
        INSUBSET = 2;   /* ASSIGN A VALUE OUTSIDE THE SUBSET */
        DISCWT = 1;     /* ASSIGN A VALID WEIGHT */
        /* ASSIGN ANALYSIS VARIABLES TO 0 */
        DISCHGS = 0; los = 0; died = 0; totchg = 0;
      END;
    RUN;

    /* FDIED. is a user-defined format; define it with PROC FORMAT before this step */
    TITLE "CABG Subset Statistics Using Alternative Method";
    PROC SURVEYMEANS DATA=CABGSUBSET SUM STD MEAN STDERR MISSING;
      WEIGHT discwt;
      CLASS died;
      FORMAT died FDIED.;
      CLUSTER hosp_nis;
      STRATA nis_stratum;
      VAR DISCHGS los died totchg;
      DOMAIN INSUBSET;
    RUN;

The Hospital File is a supplemental file which is provided with the NIS Core File. It contains a few key variables for each hospital included in the nationwide database.
NIS2013.NIS_2013_HOSPITAL: Append dummy observations from the HOSPITAL file. The variable INHOSP indicates which file the observation came from. In this case, INHOSP=1 indicates that the observation came from the HOSPITAL file.

INSUBSET = 1: Create a flag to indicate observations that came from the CABG subset.

IF INHOSP THEN DO; INSUBSET = 2: Set the value of INSUBSET to 2 to indicate the observation did not come from the CABG subset (i.e., did not have the CCS code = 44 for CABG procedures).

DISCWT = 1: Assign a valid weight value to non-CABG subset observations from the HOSPITAL file to ensure that every hospital will be included in the standard error calculations.

DISCHGS = 0; los = 0; died = 0; totchg = 0: Assign non-missing values to the variables of interest for non-CABG subset observations from the HOSPITAL file to ensure that every hospital will be included in the standard error calculations.

DOMAIN INSUBSET: The variable INSUBSET is used to indicate whether or not an observation came from the CABG subset. In this case the statistics will be calculated separately for observations that came from the CABGSUBSET file and those that did not. Thus, we will only be interested in the results for INSUBSET = 1.

Subsets: Alternate Method Results
The alternate method produces the same correct statistical output as the recommended method. Again, results of the analysis can be verified using HCUPnet.
The SURVEYMEANS Procedure

Data Summary
Number of Strata                 202
Number of Clusters              4363
Number of Observations         35907
Sum of Weights            162083.005

Domain Statistics in INSUBSET
INSUBSET  Variable Level                       Label                              Mean  Std Error of Mean                Sum         Std Dev
1         DISCHGS                                                                 1.00               0.00            157,720           4,347
          LOS                                  Length of stay (cleaned)           9.27               0.06          1,461,960          41,387
          TOTCHG                               Total charges (cleaned)      160,477.45           2,469.74     24,986,340,094     738,469,446
          DIED     .: Missing                  Died during hospitalization        0.00               0.00                 40              18
                   .A: Invalid                 Died during hospitalization        0.00               0.00                 30              25
                   0: Did not die in hospital  Died during hospitalization        0.98               0.00            154,730           4,275
                   1: Died in hospital         Died during hospitalization        0.02               0.00              2,920             142
2         DISCHGS                                                                 0.00               0.00                  0               0
          LOS                                  Length of stay (cleaned)           0.00               0.00                  0               0
          TOTCHG                               Total charges (cleaned)            0.00               0.00                  0               0
          DIED     .: Missing                  Died during hospitalization        0.00               0.00                  0               0
                   .A: Invalid                 Died during hospitalization        0.00               0.00                  0               0
                   0: Did not die in hospital  Died during hospitalization        1.00               0.00              4,363               0
                   1: Died in hospital         Died during hospitalization        0.00               0.00                  0               0

Remember, if the alternate method is not correctly applied and not all hospitals in the sample are included in the analysis, the standard errors will be incorrect.
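One way to guard against this mistake is to count the hospitals in the analysis file before running PROC SURVEYMEANS. The sketch below is an illustration rather than part of the HCUP program. It reuses the data set names from the alternate-method example, and the column aliases (HOSPITALS_IN_ANALYSIS, HOSPITALS_IN_SAMPLE) are made up for this example.

```sas
LIBNAME NIS2013 "C:\";

PROC SQL;
  /* Hospitals represented in the analysis file (subset plus dummy records) */
  SELECT COUNT(DISTINCT HOSP_NIS) AS HOSPITALS_IN_ANALYSIS
    FROM CABGSUBSET;

  /* Hospitals in the full sample, taken from the HOSPITAL file */
  SELECT COUNT(DISTINCT HOSP_NIS) AS HOSPITALS_IN_SAMPLE
    FROM NIS2013.NIS_2013_HOSPITAL;
QUIT;
```

If the two counts differ, some hospitals have dropped out of the analysis file. For the 2013 NIS both counts should match the 4,363 clusters reported in the Data Summary above.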
Here is an example of output from a program which does not account for all hospitals in the sample.

The SURVEYMEANS Procedure

Data Summary
Number of Strata                 124
Number of Clusters              1110
Number of Observations         31544
Sum of Weights            157720.005

Class Level Information
CLASS Variable  Label                        Levels  Values
DIED            Died during hospitalization       4  .: Missing, .A: Invalid, 0: Did not die in hospital, 1: Died in hospital

Statistics
Variable Level                       Label                              Mean  Std Error of Mean                Sum         Std Dev
DISCHGS                                                                 1.00               0.00            157,720           3,439
LOS                                  Length of stay (cleaned)           9.27               0.06          1,461,960          33,676
TOTCHG                               Total charges (cleaned)      160,477.45           2,373.93     24,986,340,094     615,460,503
DIED     .: Missing                  Died during hospitalization        0.00               0.00                 40              18
         .A: Invalid                 Died during hospitalization        0.00               0.00                 30              25
         0: Did not die in hospital  Died during hospitalization        0.98               0.00            154,730           3,385
         1: Died in hospital         Died during hospitalization        0.02               0.00              2,920             134

The number of strata and clusters (124 and 1,110) do not reflect the complete sample (202 and 4,363). The point estimates are unchanged, but the measures of precision are not: for example, the standard deviation of the estimated number of discharges is reported as 3,439 instead of 4,347. The standard errors produced when all hospitals are not accounted for are incorrect and could lead to erroneous conclusions in your research. It is critical to ensure you obtain a correct standard error.

Once you have calculated standard errors for the subset of discharges you are studying, you may want to check to see if there are any statistically significant differences between outcomes or measures of hospital stays in your subset and other subsets. The Z-Test calculator is a convenient way to do just that. It can be accessed by clicking the Z-test calculator link below any HCUPnet query results page.
To test if the length of stay of a discharge with a principal procedure of CABG is significantly different from that of stays which did not have a principal CABG procedure, select the Z-Test calculator.
Perhaps I am also interested in testing to see if there has been a statistically significant change in the number of hospital stays with CABG between 2003 and 2013.
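The length-of-stay comparison described above can also be sketched by hand in SAS. The DATA step below is a minimal illustration, not the HCUPnet calculator itself: it assumes the usual two-sample z statistic, which treats the two estimates as independent, and the data set and variable names (ZTEST, EST1, SE1, and so on) are made up for this example. The estimates and standard errors are the rounded values from the recommended-method output earlier in this tutorial (CABG = 1 and CABG = 0).

```sas
/* Hypothetical sketch: z-test comparing two estimates using their standard errors */
DATA ZTEST;
  EST1 = 9.27; SE1 = 0.06;   /* mean LOS, CABG as principal procedure */
  EST2 = 4.52; SE2 = 0.02;   /* mean LOS, all other discharges        */

  /* Difference divided by the standard error of the difference */
  Z = (EST1 - EST2) / SQRT(SE1**2 + SE2**2);

  /* Two-sided p-value from the standard normal distribution */
  P_TWO_SIDED = 2 * (1 - PROBNORM(ABS(Z)));
RUN;

PROC PRINT DATA=ZTEST NOOBS;
RUN;
```

With these inputs the z statistic is roughly 75, far beyond the conventional cutoff of 1.96, so the difference in length of stay would be judged statistically significant. The same arithmetic applies to a comparison of CABG stays between two data years once both estimates and their standard errors are in hand.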
As you calculate sample statistics and standard errors from the HCUP nationwide databases, you should consider the following key points:
If you are looking for more information on the subject matter covered here, several resources are available on the HCUP User Support (HCUP-US) website: www.hcup-us.ahrq.gov. If you can't find what you need, feel free to email the HCUP Technical Assistance staff at hcup@ahrq.gov. AHRQ has research personnel available to respond to technical questions you may have. Inquiries are answered within three business days.

Thank you for accessing this module. There are several other HCUP Online Tutorials. Take a look to see if there are other topics that could be helpful to you. If you have any feedback regarding this module, please email us at hcup@ahrq.gov.

Detailed documentation of HCUP is available on the HCUP User Support website (http://www.hcup-us.ahrq.gov). For documentation on each of the HCUP national databases, click on the links below:
Special Methods Documents are available at http://hcup-us.ahrq.gov/reports/methods.jsp. Specific reports of interest to this module include:
Internet Citation: HCUP Calculating Standard Errors - Accessible Version. Healthcare Cost and Utilization Project (HCUP). November 2016. Agency for Healthcare Research and Quality, Rockville, MD. www.hcup-us.ahrq.gov/tech_assist/standarderrors/508/508course_2016.jsp.
Last modified 11/18/16