
STAT 582 hw7

Qingling Jiao

Introduction to STATA

Stata is a statistical software package and programming language that runs on Windows, UNIX, and Linux systems. The maximum number of variables differs among the editions of Stata (Stata/IC, Stata/SE, and Stata/MP); the number of observations is limited only by memory. As of the current version 11, Stata offers both pull-down menus and a command-line window. When a command is executed from a pull-down menu, Stata records the equivalent command in the Review window, so users can reproduce results from commands executed earlier. This is a big advantage over statistical packages that are typically driven through pull-down menus, such as SPSS. Stata binary files can be converted into SPSS or SAS files with the third-party application Stat/Transfer.

Stata's graphics tools are well suited to exploratory data analysis, and publication-quality 2-D figures can be produced in many different formats. However, Stata does not have 3-D graphics capability; these features are said to be under development for a future version.

Stata includes many built-in functions, and many models can be fitted easily for different study purposes, including time series analysis, survival analysis, and multivariate analysis. Every function can be called either by typing a command or through the menus. From now on, I will focus on the command line for all of the functionality described, unless otherwise noted.

(1) Linear regression modeling

A linear regression model is fitted in Stata with the command "regress". The command is followed by the response variable and then the independent variables, e.g., regress y x1 x2 x3. Many options are available for the regress command. For example, the option beta reports standardized regression coefficients (e.g., regress y x1 x2 x3, beta).
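As a minimal sketch of the two forms of regress described above, the commands below use the auto dataset that ships with Stata. The variables price, mpg, and weight are stand-ins chosen for illustration; they are not part of the original example.

```stata
* Load the example dataset shipped with Stata
sysuse auto, clear
* price plays the role of y; mpg and weight play the roles of x1 and x2
regress price mpg weight
* Same model, reporting standardized (beta) coefficients in place of confidence intervals
regress price mpg weight, beta
```

Both calls fit the same model; only the reported columns differ, which makes it easy to compare the raw and standardized coefficients side by side.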

To get predicted values from a regression model, use the "predict" command. Predicted values can be calculated at any point after a regress command is run, but they are always based on the most recent regression. For example, "predict fv" computes the predicted values of the current regression and stores them in the variable fv. The "predict" command can also produce residuals: "predict e, residual" computes the residuals and stores them in the variable e.

(2) Random-effect modeling



Several commands let Stata users build a model with random effects, such as xtreg, anova, and xtmixed. Of these, xtmixed is the most powerful in handling random effects. Three examples are given here to illustrate how the xtmixed command is used.

Mixed models consist of fixed effects and random effects. The fixed effects are specified as a dependent variable followed by a set of regressors. The random effects are specified by first considering the grouping structure of the data: each grouping level gets its own equation, introduced by ||. The variable list in each equation describes how the random effects enter the model, either as random intercepts (constant term) or as random coefficients on regressors in the data. One may also specify the variance-covariance structure of the random effects within an equation, choosing among four structures: independent, exchangeable, identity, and unstructured. For example,

. xtmixed y x1 x2 x3 || school: z1, cov(un) || class: z1 z2 z3, nocons cov(ex)

Here class is nested within school. The school level has a random intercept and a random coefficient on z1, with an unstructured covariance matrix. At the class level, nocons suppresses the random intercept, so that equation contains only random coefficients on z1, z2, and z3. The covariance structure for the random effects at the class level is exchangeable, meaning that the random effects share a common variance and a common pairwise covariance.

(3) Nested variable modeling

A model containing nested variables can be built with the command xtmixed. For example, a mixed-effects model with a fixed, b random, and the a*b interaction also random is fitted by:

. egen ab = group(a b), label
. xi: xtmixed y i.a || _all: R.b || _all: R.ab, var

A random-effects model with both a and b random is fitted by:

. egen ab = group(a b), label
. xtmixed y || _all: R.a || _all: R.b || _all: R.ab, var

The anova command can also be used to model nested variables. For example,

. anova output machine / operator|machine /, dropemptycells

where operator is nested within machine.
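The random-intercept and random-coefficient specifications from section (2) can be tried on real data with the sketch below. It assumes internet access so that webuse can fetch the pig growth dataset used in the Stata manuals; the variables weight, week, and id come from that dataset, not from the original examples. In Stata 13 and later the command is named mixed, with xtmixed retained as the older name.

```stata
* Pig growth data from the Stata manuals (requires internet access)
webuse pig, clear
* Fixed effect of week, plus a random intercept for each pig (grouping variable id)
xtmixed weight week || id:
* Add a random coefficient on week, with an unstructured covariance matrix
xtmixed weight week || id: week, cov(unstructured)
```

Comparing the two outputs shows how the random-effects part of the table grows from one variance component to two variances and a covariance.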




(4) Non-linear modeling

The command "nl" fits an arbitrary nonlinear function by least squares. That is, given $y_j = f(x_j, b) + u_j$, the command "nl" finds $b$ to minimize $\sum_j u_j^2$ (nl can also obtain weighted least-squares estimates). The user need only specify the function $f()$, either as a substitutable expression typed on the command line or by writing a Stata program; derivatives do not have to be supplied. For example, to fit the negative exponential growth model $y = b_0 (1 - e^{-b_1 x}) + u$, users can obtain the estimates with the following command:

. nl (y = {b0=1}*(1 - exp(-1*{b1=0.1}*x)))

(5) Continuous/categorical variable analyses

ANCOVA is implemented easily with the command anova or with xi: regress. Both commands assume a continuous response. By default, regress treats all predictors as continuous, whereas anova treats all predictors as categorical (a separate indicator variable is created for each level of each predictor). Users need to state which predictors are continuous and which are categorical in order to use the two commands appropriately. For example,

. anova y x1 c1, cont(x1)

fits an ANCOVA model with x1 as a continuous variable, as specified by the option cont(), and c1 as a categorical variable, following the anova default.
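The nl and ANCOVA commands from sections (4) and (5) can be sketched end to end as follows. The first part simulates data from the negative exponential growth model with assumed true values b0 = 5 and b1 = 0.4, so the nl estimates should land near those values. The second part uses the auto dataset shipped with Stata and assumes Stata 11 or later, where factor-variable notation (c. for continuous, i. for categorical) expresses the same model as the cont() and xi: forms shown above.

```stata
* Part 1: nonlinear least squares on simulated data
* (the true values b0 = 5 and b1 = 0.4 are assumptions made for this sketch)
clear
set seed 12345
set obs 200
generate x = 10*runiform()
generate y = 5*(1 - exp(-0.4*x)) + rnormal(0, 0.2)
* Starting values b0 = 1 and b1 = 0.1 are given inside the braces
nl (y = {b0=1}*(1 - exp(-1*{b1=0.1}*x)))

* Part 2: ANCOVA on the auto dataset
sysuse auto, clear
* weight is continuous (c. prefix); foreign is categorical by the anova default
anova price c.weight foreign
* The same model as a regression; i.foreign creates the indicator variable
regress price weight i.foreign
```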

(6) Histogram and QQ plots

Quantile-quantile plots against the normal distribution are obtained with the command qnorm. For example,

. qnorm price
. qnorm price, grid

The grid option adds a grid to the QQ plot.

A histogram is generated with the command "histogram", which assumes that the variable is continuous. Users need only type histogram followed by the variable name. The histogram can be scaled in different ways for different purposes. For example,

. histogram y

Here the y axis reports the density of y, so the areas of the bars sum to 1.

. histogram y, fraction

The histogram is scaled so that the bar heights sum to 1.

. histogram y, frequency



The histogram is scaled so that the bar heights reflect the number of observations.

(7) Chi-squared testing

A chi-squared test is used to check whether there is a relationship between two categorical variables. In Stata, the "chi2" option of the "tabulate" command reports the test statistic and its associated p-value. For example,

. tabulate jobtype gender, chi2

For a goodness-of-fit test, users can locate and install the user-written command "csgof" by typing findit csgof.

. csgof race, expperc(10 10 10 70)

This command tests whether the observed sample proportions differ significantly from the hypothesized proportions of the races (10% Hispanic, 10% Asian, 10% African-American, and 70% white).
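The two-way test with tabulate can be tried on the auto dataset shipped with Stata, as in the sketch below. The variables foreign and rep78 are stand-ins for jobtype and gender and were chosen only because they are categorical.

```stata
sysuse auto, clear
* Cross-tabulate car origin against repair record and request Pearson's chi-squared
tabulate foreign rep78, chi2
* tabulate leaves the statistic and its p-value behind in r()
display "chi2 = " r(chi2) "   p = " r(p)
```

Reading the results from r() is convenient when the statistic is needed later in a do-file instead of being copied from the output table.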

(8) Reading in multiple data formats

Stata has its own built-in spreadsheet for entering data manually. However, it is not very powerful. It is preferable to prepare the raw data in Excel and then transfer them to Stata.

A comma- or tab-separated file, with or without variable names on line 1, can be read into Stata with the "insheet" command. Options are also available for different purposes.

. insheet var1 var2 var3 using data1.raw

A space-separated file can be read into Stata with the "infile" command.

. infile var1 var2 var3 using data2.raw

A fixed-format file (e.g., fixed-column data) can be read into Stata with the "infix" command. Each variable name is followed by the columns it occupies.

. infix var1 1-13 var2 15-16 var3 18-21 using data3.raw

Other methods of getting data into Stata

Data conversion programs can convert data from one file format into another. For example, they can directly create a Stata file from an Excel spreadsheet, a Lotus spreadsheet, an Access database, a dBase database, a SAS data file, an SPSS system file, etc. Two such programs are Stat/Transfer and DBMS Copy. Both of these products are available on SSC PCs, and DBMS Copy is available on Nicco and Aristotle. Finally, if you are using Nicco, Aristotle, or the RS/6000 Cluster, there is a command specifically for converting SAS data into Stata called sas2stata, which is a useful way to get SAS data into Stata.
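A round trip through a comma-separated file gives a quick way to try insheet without preparing a data file by hand. This is a sketch only: the file name auto_small.csv is hypothetical, and the file is written to the current working directory. In Stata 13 and later, import delimited is the newer command for the same job, although insheet continues to work.

```stata
sysuse auto, clear
keep make price mpg
* Write a comma-separated file with the variable names on line 1
outsheet using auto_small.csv, comma replace
* Read it back; insheet takes the variable names from line 1
insheet using auto_small.csv, comma clear
describe
list in 1/5
```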


(9) Data manipulation

Stata is an excellent tool for data manipulation: moving data from external sources into the program, cleaning it up, generating new variables, generating summary data sets, merging data sets and checking for merge errors, collapsing cross-section time-series data on either of its dimensions, reshaping data sets from "long" to "wide", and so on. In this context, Stata is an excellent program for answering ad hoc questions about any aspect of the data.

Stata has 27 numeric missing values: "." is the default, called the "system missing value" or sysmiss, and ".a, .b, .c, ..., .z" are called the "extended missing values". Numeric missing values are represented by large positive values, so the ordering is: all nonmissing numbers < . < .a < .b < ... < .z. Most Stata statistical commands deal with missing values by disregarding observations with one or more missing values (called "listwise deletion" or "complete cases only").

Matrix operations, including matrix multiplication, can be done in Stata, and many commands are available for them. The two examples below show some basic matrix operations. Create a 3 by 2 matrix A:

. matrix A = (2,1\3,2\-2,2)
. matrix list A

A[3,2]
    c1  c2
r1   2   1
r2   3   2
r3  -2   2
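Matrix multiplication, mentioned above, can be sketched with the same matrix A. The matrix B is a hypothetical 2 by 2 matrix introduced only for this illustration.

```stata
matrix A = (2,1\3,2\-2,2)
* B is a hypothetical 2 by 2 matrix chosen only for illustration
matrix B = (1,2\3,4)
* A is 3 by 2 and B is 2 by 2, so the product A*B is 3 by 2
matrix C = A*B
* The rows of C are (5, 8), (9, 14), and (4, 4)
matrix list C
* Confirm the dimensions of the product
display rowsof(C) " x " colsof(C)
```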

The transpose of matrix A is obtained with A'.

(10) PCA/factor analysis

Principal component analysis (PCA) is a statistical technique used for data reduction. The Stata commands "pca" and "pcamat" can be used for this purpose. The command "pca" takes a list of variables in the dataset as input, whereas "pcamat" takes a correlation or covariance matrix directly. The option vce(normal) assumes that the variables are multivariate normally distributed and that the variance-covariance matrix of the observations has distinct and strictly positive eigenvalues. For example,

. pca var1-var6, vce(normal)



. pcamat s, n(979) comp(2)

Besides pca and pcamat, Stata also provides scores, residuals, rotations, scree plots, score plots, and loading plots.

Stata's command factor allows estimation of either principal component or common factor models. By default, factor produces estimates using the principal factor method. It can alternatively produce iterated principal factor estimates, principal-components factor estimates, or maximum-likelihood estimates. For example,

. factor item13-item24, ipf factor(3)

The option ipf stands for iterated principal factors. After estimating a factor model, Stata allows rotation of the factor loading matrix using the varimax (orthogonal) and promax (oblique) methods. Stata can score a set of factor estimates using either rotated or unrotated loadings. Both regression and Bartlett scoring are available.

(11) Sample size calculations

For simple studies where only one measurement of the outcome is planned, the command sampsi computes sample size or power for four types of tests: two-sample comparison of means or proportions, and one-sample comparison of a mean or proportion.

(12) Smoothing

Stata offers many smoothing algorithms; only two are listed here. The command "lpoly" performs a kernel-weighted local polynomial regression of y on x and displays a graph of the smoothed values with (optional) confidence bands. It is found in the menu under Statistics > Nonparametric analysis > Local polynomial smoothing. The command "smooth" applies the specified resistant nonlinear smoother to a variable and stores the smoothed series in a new variable. It is found in the menu under Statistics > Nonparametric analysis > Robust nonlinear smoother.

More sophisticated tasks can also be done in Stata, such as survival models with frailty, dynamic panel data (DPD) regressions, generalized estimating equations (GEE), multilevel mixed models, models with sample selection, multiple imputation, ARCH, and estimation with complex survey samples.
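The sample-size and smoothing commands from sections (11) and (12) can be tried with the sketch below. The means and standard deviations passed to sampsi are illustrative assumptions, and the smoothing part uses the auto dataset shipped with Stata. In Stata 13 and later, the power command is the newer tool for sample-size calculations, although sampsi still runs.

```stata
* Sample size for a two-sample comparison of means at 80% power
* (the two means and standard deviations are illustrative values)
sampsi 132.86 127.44, sd1(15.34) sd2(18.23) power(0.8)

* Smoothing on the auto dataset
sysuse auto, clear
* Kernel-weighted local polynomial regression of mpg on weight, with confidence bands
lpoly mpg weight, ci
* Resistant nonlinear smoother applied to mpg, taken in order of weight
sort weight
smooth 3RSSH mpg, generate(mpg_sm)
list weight mpg mpg_sm in 1/5
```

Sorting by weight first matters for smooth, because it treats the observations as an ordered series.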


In summary, Stata is a complete, integrated statistical package. It offers a user-friendly interface, an intuitive command syntax, and online help. Moreover, Stata is easy to use, fast, and accurate. All analyses can be reproduced and documented for publication and review.

Reference

Christopher F. Baum, Introduction to Stata.


