LOGISTIC




(Version 1.11)

© J.H. Abramson

Revised October 13, 2006

What LOGISTIC does

LOGISTIC (previously named MULTI) is a WINPEPI program (Abramson 2004), part of the PEPI suite of computer programs for epidemiologists. ("PEPI" is an acronym for "Programs for EPIdemiologists".) This program has a single module, which performs multiple logistic regression analysis.

How to use LOGISTIC
Multiple logistic regression
References


It is unwise to use a statistical procedure whose use one does not understand. This manual cannot supply this knowledge, and it is certainly no substitute for the basic understanding of statistics and epidemiological thinking that is essential for the wise choice of methods and the correct interpretation of their results.



How to use LOGISTIC

Running the program: Click on "Start", and then follow the on-screen instructions. To return to the main menu, click on the "Back to main menu" button. LOGISTIC cannot be run in Windows 3.

Entry of data: Optionally, data can be pasted into entry boxes (see next page). If entries are required in different boxes, pressing Enter or Tab after entering a number will generally take you to the next box; pressing Escape will clear the entry. If several entries are required in the same box, press Enter or Space after each entry.

Use of data files: see p. 6.

Recalling results: Click on "View" in the top menu to display the current session's previous results.

Pasting results: Results shown on the screen are automatically placed in the Windows clipboard, from which they can be pasted to other applications, at the site of the cursor (usually by pressing Shift-Insert or Ctrl-V). If the current session's previous results are recalled (by clicking on "View"), text can be marked (drag the mouse over it with the button pressed) and copied to the clipboard (by pressing Ctrl-Insert or Ctrl-C) for pasting elsewhere.

Adding comments: Click on "Note" in the top menu if you wish to add explanatory comments to be placed in the clipboard, saved, or printed with the results.

Saving results: By default, all results of Pepi-for-Windows programs are saved in C:\PEPI.TXT, with a warning if it exceeds 500K. Results also go to C:\PEPI.TMP (for display in the "View" option); this file may be overwritten unless it is renamed on quitting LOGISTIC. Click on "Save" (in the top menu) to see the default procedure or to change it. TXT files can be combined with JOINTEXT, which is available free.

Printing results: Click on "Print". If this fails, try switching the printer off and on again. Alternatively, paste the results from the clipboard to Word or another program, and print from there. Results can also be printed from the file in which they are saved. Note: the "Print" option ejects full pages only.


If the data are available in a text file (e.g. a text file created by Notepad) or in a spreadsheet, they can be copied to the Windows clipboard [usually by pressing Ctrl-Insert or Ctrl-C], and then pasted into a data-entry box [usually by pressing Shift-Insert or Ctrl-V]. This can simplify data entry in boxes that require a number of entries (in rows or columns). [Data can also be copied from a data-entry box and pasted to a text file for future re-use; press Ctrl-A to mark the data for copying.]

Precautions:

The data must be pasted into the box as a single block, and not piecemeal.

The data must be in the format required in the box, with spaces between the numbers; exact alignment of the columns is not necessary. For example: 45 66 1 20 3 132 53 11 44

If a defined number of rows is required, this number must be entered first, e.g. in the "Number of strata" or "Number of categories" box.

If row numbers are shown on the left (1, 2, etc.), ensure that the "1" is visible before pasting.

The cursor must be in the top left corner of the box when the "paste" keys are pressed.




FINDER.HLP (provided with this program) is an alphabetical index that identifies the modules (in all WinPepi programs) that deal with a specific procedure or kind of study. It is called up by pressing F9 or clicking on "Finder" in any WinPepi program, or on the FINDER icon, and can be printed for easy reference.


1. The WinPepi suite of computer programs for epidemiologists, with their manuals. Can be downloaded free.
2. "Survey Methods in Community Medicine: Epidemiological Research, Programme Evaluation, Clinical Trials" (J.H. Abramson and Z.H. Abramson), fifth edition. Edinburgh: Elsevier Churchill Livingstone 2005.
3. "Making Sense of Data: A Self-Instruction Manual on the Interpretation of Epidemiological Data" (J.H. Abramson and Z.H. Abramson), third edition. Oxford: Oxford University Press 2001.


All WINPEPI (PEPI-for-Windows) and other PEPI programs can be downloaded free. This applies both to the latest versions of the WINPEPI programs (currently COMPARE2, DESCRIBE, ETCETERA, LOGISTIC, PAIRSetc, POISSON, and WHATIS) and to the latest release of Version 4 of PEPI, which contains over 40 DOS-based programs (which can be used in Windows) and WHATIS. COMPARE2, DESCRIBE, ETCETERA, LOGISTIC, PAIRSetc, and POISSON are distributed with PDF manuals. A printed manual is available for the DOS-based programs and WHATIS (Abramson and Gahlinger 2001).

WINPEPI programs are provided with no liability to users and without any warranties, whether expressed or implied. They are copyrighted, but may be freely copied and distributed for personal use; they may not be exploited commercially without permission.




Multiple logistic regression

J. H. Abramson and Eduardo L. Franco

This module performs multiple logistic regression analysis. It may be used in cohort studies and trials that examine the occurrence or nonoccurrence of a disease or other outcome, in case-control studies of risk or protective factors associated with a disease or other disorder, in studies that aim to determine how diagnostic or prognostic criteria can be combined to appraise the probability that a disease is present or likely to occur, and for other purposes. The procedure measures the effects of single variables or combinations of variables, and permits control of confounding effects and appraisal of modifying effects.

A number of choices are offered: the analysis can be unconditional or conditional; individual or grouped data can be analysed; data can be entered by pasting or by loading a data file; the names and sequence of the variables can be entered at the keyboard or by loading a dictionary file previously created by this program; the logistic model can include first-degree interactions; variables can be treated as simple (continuous) or categorical (with nominal or ordered categories); variables can be centred; values can be defined as missing; the analysis can be performed on a restricted sample; the confidence level can be altered; and the regression coefficients can be used to calculate both the probability of the outcome and odds ratios comparing different sets of values.

The program computes the logistic regression coefficients (with their standard errors), the corresponding odds ratios (with 90, 95 or 99% confidence intervals), Z-scores, crude odds ratios, the log-likelihood for the model and for the null model, and the G statistic for the model. Wald, likelihood-ratio and score tests are performed, and the Hosmer-Lemeshow goodness-of-fit test and other indicators of the aptness of the logistic model are provided: the Pearson correlation coefficient between the observed values of the dependent variable and the probabilities predicted by the logistic equation; an estimate of the proportion of explained variation; Darlington's logistic regression fit index; and the pseudo R-squared value.

To use the module, first enter the data, either by pasting or by loading a data file. If required, the choices listed above can then be made. A model (the list of independent variables and interactions to be included in the analysis) is then entered, or specifications for the automatic creation of a set of models. The program is then run. When the results have appeared, the model or options can be changed, or new data or a new dictionary file can be entered. At all stages, detailed instructions and help are provided on the screen.

Instructions

1. A data file can be used only if it has previously been put in the C:\PEPI\LR\ folder. If this folder does not exist, it can be created by opening the module and clicking on "Find data file". A test file (LA.DAT) is provided with this program, with its corresponding dictionary file (LA.DIC); to use them, first put them in the C:\PEPI\LR\ folder.
2. After opening the module, either click on "Find data file" (and follow the on-screen instructions), or paste the data. Then click on "OK".



3. Either enter a dictionary file (which supplies the names of the variables and specifies the dependent variable) or describe the variables (i.e., enter their names and specify the dependent variable). Then click on "OK".
4. Optionally, select one or more of the "choices" listed on the screen, to alter the default settings (many of which are determined by the content of the dictionary file). You can decide whether variables are to be treated as simple (continuous) or categorical (with nominal or ordered categories). You can also specify the use of unconditional or conditional analysis and of individual or grouped data, restrict the analysis to a specified subgroup of the sample, define missing values, centre variables, alter the confidence level, and change the dependent variable.
5. Optionally, click on "Update/create dictionary" to create a dictionary file (automatically placed in the C:\PEPI\LR folder) or update an existing one. This is advisable if information about the variables has been entered at the keyboard, or if "choices" have been made.
6. Either enter a model, or choose the "Automatically create a set of models" option. To enter a model, just enter each independent variable and interaction (e.g. age*sex), in any order. The dependent variable and constant need not be specified. The model will be displayed on the screen; e.g. AGE SEX AGE*SEX. If the "automatic" option is selected, a series of models will be created, each including a different "main" variable and the same "control" variables, all of which must be specified.
7. Click on "RUN".
8. When the results have appeared, click on "NEXT" to change the model or options, or to enter new data or a new dictionary file, or (if the "automatic" option was chosen) to run the next model in the series.
Optionally, click on "Use results" to calculate the probability of the outcome (except in case-control studies) for a specified set of values, or odds ratios expressing the contrast between two specified sets of values.

At each stage, detailed instructions and help are shown on the screen. Click on "Variables" to view a list of the variables and information about them, on "Values" to view information about a specific variable and its values, or on "Numbers" to display descriptive statistics (sample size, numbers of missing values, and the distribution of the dependent variable, for each independent variable or category).

Variables

The variables must be numerical. If a REC file is used (see "Data files", below) the program will translate "Y" and "N" to 1 and 0 respectively, "F" and "M" to 1 and 0, other entries in single-letter fields to 9, and blank fields to 9.

The dependent variable may be the occurrence of a disease or other outcome, or (in case-control studies) caseness, i.e. membership of the case or the control group. It is Y in the logistic regression equation

log-odds(Y = 1) = a + bB + cC + d(B*C)

where Y = 1 refers to a specific category of Y, and it may have only two values, 0 and 1. In a cohort study or trial, occurrence of the disease or other outcome should be coded 1; in a case-control study, cases should be coded 1. If other codes are used, the program will report this and offer options for their exclusion or their conversion to 1 or 0 for the purposes of the analysis.

The independent (explanatory) variables whose associations with the dependent variable are under study, or that may confound or modify these associations, e.g. B and C in the above equation, may have any numerical values (e.g. continuous or discrete numbers, or numerical labels representing categories).

If grouped data are to be analysed, there must be a frequency variable that specifies the number of subjects with identical specified data. If study findings displayed in a cross-tabulation are to be entered in the program, each cell represents a group, and the number in the cell is the frequency variable; zero cells can be ignored.

If conditional logistic regression analysis is to be used, a matching variable is required. This variable identifies the matched sets, the same number being allotted to each member of the set. For convenience a decimal point and an extra digit may be added to identify individual members (e.g. 34.1, 34.2, etc. for members of set 34), but this extra digit will be ignored in the analysis.

There may be other numerical variables, e.g. a case identity number, that are not used in the analysis.

Data files

The data file may be a text (ASCII) file, containing numbers only, prepared by a word processor or a data-entry or statistical program, or a REC file (which specifies the variables' names also) created by EpiData (Lauritsen and Bruus 2003-2004) or Epi Info version 6 (Dean et al. 1996) after direct entry of the data or after importing a data file created by SPSS or another program.

In a text file, the data for each subject or (for grouped data) each group must be in a separate line. The values in each line must be in a defined order, separated by spaces or commas. Vertical alignment is not essential.
For example, the first two lines might look like this: 1 1 74 0 0 1.56 1 3 96 1 1 2 1 0 75 0 0 9.03 0 0 0 0 0

A value must be shown for each variable, with no blanks (a "missing value" code should be used instead). If there are headers or comments before or after the numerical records, they should be deleted. The names and sequence of the variables must be supplied separately, either by entry at the keyboard or by loading a dictionary file previously created by this program.

A REC file supplies the names of the variables, and a dictionary file is not required. It is advisable, however, to use LOGISTIC's option for the creation of a dictionary file afterwards, so that other information about the variables will be available for future use. When LOGISTIC uses a REC file, it translates "Y" and "N" to 1 and 0 respectively, "F" and "M" to 1 and 0, other entries in single-letter fields to 9, and blank fields to 9. Data in other non-numeric formats, date formats, and records marked as "deleted" are not used. A REC file may not exceed 49 kb in size.
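The text-file layout described above can be checked before a file is loaded. The following Python sketch is not part of LOGISTIC; the function name `parse_records` and the sample values are hypothetical. It reads records whose values are separated by spaces or commas and verifies that every line supplies a value for each variable.

```python
def parse_records(text, n_variables):
    """Parse numeric records, one subject (or group) per line."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore completely empty lines
        # Values may be separated by spaces or commas.
        fields = line.replace(",", " ").split()
        if len(fields) != n_variables:
            raise ValueError(
                f"line {line_no}: expected {n_variables} values, "
                f"got {len(fields)}"
            )
        # float() raises ValueError for any non-numeric entry.
        records.append([float(field) for field in fields])
    return records


# Hypothetical two-line sample with six variables per record.
sample = "1 1 74 0 0 1.56\n1,3,96,1,1,2\n"
rows = parse_records(sample, n_variables=6)
```

A line with a blank instead of a "missing value" code has too few fields, so the sketch reports it rather than silently shifting the remaining values.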



Dictionary files

LOGISTIC can create dictionary files, which record information about the variables; it places them in the C:\PEPI\LR folder. The dictionary file stores the names and sequence of the variables, and specifies the dependent variable, the frequency variable (if data are grouped), the matching variable (if data are matched), whether variables are categorical and (if so) the number of categories, the cut-points, and the way the categories are to be handled in the analysis. LOGISTIC can create a new dictionary file at any time, recording the current set-up of the variables.

Pasting data

Instead of loading a data file, data that have been copied to the Windows clipboard (up to 56 kb) from a text file (created by Notepad or another word processor) or a spreadsheet (e.g. Excel) [usually by clicking on "Copy" or by pressing Ctrl-Ins or Ctrl-C] can be pasted into the data box provided for this purpose, by pressing Ctrl-V or Shift-Ins. The data box should be empty (it can be emptied by pressing Esc). Only numerical data should be pasted.

The data for each subject or (for grouped data) each group must be in a separate line. The values in each line must be in a defined order, separated by spaces. Vertical alignment is not essential. For example, the first two lines might look like this: 1 1 74 0 0 1.56 1 3 96 1 1 2 1 0 75 0 0 9.03 0 0 0 0 0

A value must be shown for each variable, with no blanks (a "missing value" code should be used). The names and sequence of the variables must be supplied separately, either by entry at the keyboard or by loading a dictionary file previously created by this program.

After entry in the box, the data can be modified if necessary, and copied and pasted to a text file for future re-use; to mark all the data in the box for copying, press Ctrl-A; copy to the clipboard by pressing Ctrl-Ins or Ctrl-C.
The model

The logistic regression equation is:

log-odds(Y = 1) = constant + aA + bB + cC

where Y is a dichotomous dependent variable, and A, B, etc. are independent variables. In addition, it may include first-degree interactions (terms involving two variables, such as A*B). In this program, the term "model" is taken to mean the list of independent variables and interactions, e.g. A B C A*C, using the names of the variables, e.g. AGE SEX WEIGHT AGE*WEIGHT.
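To make the equation concrete, the following Python sketch evaluates the log-odds and the corresponding probability for one subject under the example model AGE SEX WEIGHT AGE*WEIGHT. It is an illustration only, not LOGISTIC's own code: the function names, the constant and the coefficient values are all hypothetical.

```python
import math


def log_odds(constant, coefficients, values):
    """Return constant + sum of (coefficient * term value).

    An interaction term such as "AGE*WEIGHT" is evaluated as the
    product of the values of the two variables it names.
    """
    total = constant
    for term, coefficient in coefficients.items():
        product = 1.0
        for name in term.split("*"):
            product *= values[name]
        total += coefficient * product
    return total


def probability(constant, coefficients, values):
    """Convert the log-odds of Y = 1 into a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds(constant, coefficients, values)))


# Hypothetical constant and coefficients, for illustration only.
coefs = {"AGE": 0.04, "SEX": 0.5, "WEIGHT": 0.01, "AGE*WEIGHT": -0.0002}
subject = {"AGE": 50, "SEX": 1, "WEIGHT": 70}
subject_log_odds = log_odds(-3.0, coefs, subject)
subject_probability = probability(-3.0, coefs, subject)
```

With these made-up numbers the log-odds are -3 + 2 + 0.5 + 0.7 - 0.7 = -0.5, which corresponds to a probability of about 0.38.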



Unconditional or conditional analysis

The program does both unconditional analyses (for unmatched data) and conditional analyses (which are appropriate for matched or finely stratified data; Selvin 1996: 298-310), with easy switching between modes to permit comparison of the results. Conditional multiple logistic regression requires a matching variable that identifies matched sets. Conditional analyses may abort if the data are very extensive.

Categorical variables

Each independent variable can be treated either as simple (continuous) or as a categorical variable with up to 10 categories defined by the user (by specifying cutting-points); the program creates the required dummy variables. A category can contain a single value or more than one value. Optionally, the categories can be treated as nominal or ordinal. Nominal categories can be handled in two ways in the analysis: the reference category can be either the first (baseline) category (this is the default), or the preceding category. Contrasts with the preceding category (Walter et al. 1987) permit close scrutiny of the effects of successive levels, e.g. in trials using different doses. Variables with ordinal categories are treated as simple variables, the successive categories (1, 2, 3 etc.) constituting their levels.

Missing values

Missing values should preferably be denoted by an unattainably high number such as 9, 99 or 999, but other numbers can be defined as codes for missing values. Records that contain variables with missing values, or with categories that include missing values, are omitted from analyses.

Restriction of sample

Optionally, the analysis can be restricted to a specified subgroup of the sample, e.g. women of a certain age.

Centring

The program can centre simple independent variables by subtracting the mean from each observation. This reduces the effect of collinearity (highly correlated independent variables), and may be especially useful if both a variable (e.g. age) and its quadratic term (age-squared, or age*age) are included in the model (Selvin 1996: 256-259; Breslow and Day 1980: 233-236).

Logistic regression coefficients

The program provides logistic regression coefficients (with their standard errors) and the corresponding odds ratios (with 90, 95 or 99% confidence intervals). Results are displayed for simple variables and for all categories (except the first) of categorical variables.
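The relation between a coefficient and its odds ratio can be sketched in a few lines of Python. The example assumes the usual normal-approximation interval, exp(coefficient ± z × standard error); the manual does not spell out the formula LOGISTIC uses, so this is an assumption, and the function name and input values are hypothetical.

```python
import math
from statistics import NormalDist


def odds_ratio_with_ci(coefficient, standard_error, confidence=0.95):
    """Odds ratio and normal-approximation confidence limits.

    Assumes the interval exp(coefficient +/- z * standard_error);
    this is a common convention, not a statement of LOGISTIC's method.
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    odds_ratio = math.exp(coefficient)
    lower = math.exp(coefficient - z * standard_error)
    upper = math.exp(coefficient + z * standard_error)
    return odds_ratio, lower, upper


# Hypothetical coefficient (natural log of 2) and standard error.
estimate, low, high = odds_ratio_with_ci(math.log(2.0), 0.2)
```

Passing `confidence=0.90` or `confidence=0.99` gives the other two confidence levels mentioned above.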



Each coefficient reflects the influence of the relevant variable or interaction (i.e., its effect on the log-odds) when other influences (all the other covariates in the model) are held constant. For a simple variable, the coefficient and corresponding odds ratio express the effect of a change in magnitude of one unit. For a categorical variable, they express the contrast with the reference category (baseline or preceding) or (if the categories are ordinal) the effect of a rise from one level to the next. Crude odds ratios (not holding other covariates constant) are displayed for comparison. Z-scores (the coefficients divided by their standard errors) are displayed for use in comparing the impact of different variables (Selvin 1996: 262).

Statistical tests

The significance of each coefficient is tested by the Wald test, which uses the square of the Z-score as chi-square, and may be over-conservative for a large coefficient (Hauck and Donner 1977). Two tests are applied to the total model: a score test (recommended for small samples; Breslow and Day 1980: 207-208) and a likelihood-ratio test (based on a comparison of G-statistics). For a factor that is treated as ordinal, these are tests of trend. A likelihood-ratio test is also applied to the last variable added to the model, for use when a model is gradually incremented. For "automatic" models, likelihood-ratio tests are applied to each total model, and to each "main effect".

Log-likelihood

The log-likelihood is displayed for the total model and for the null model (for the constant only), and the G statistic (-2 times the log-likelihood) is displayed for the model.

Hosmer-Lemeshow goodness-of-fit test

Goodness of fit of the model is appraised by the Hosmer-Lemeshow test; a low P value suggests that the logistic model is unsuitable, i.e. that there is a poor fit with the observed data. A good fit does not necessarily mean that the results of the logistic analysis are valid, but a poor fit points to low validity.
If the sample is very small, the test has a low power for detecting a poor fit; and if the sample is very large, a very small deviation from the expected values may be highly significant. The test is not done if conditional analysis is used.

For this test, the subjects are arranged in a rising sequence of probabilities of the outcome (dependent variable = 1, or "yes") as predicted by the model, and split into equally-sized "deciles of risk", in which observed and expected status (for both "yes" and "no") are compared.

The test is replicated, because if there are tied probabilities (e.g. if grouped data were entered), tied individuals may be split among adjacent deciles in different ways, and their arrangement in the observed data may affect the findings. The test is therefore done 100 times, after randomly shuffling the subjects 20 times before each test, and the median P value is displayed.

A warning is shown if there are deciles with expected numbers (of "yes" or of "no") that are less than 5; the program may then repeat the test (once) after combining deciles so as to avoid these low expected numbers.
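A single pass of the grouping and comparison described above can be sketched in Python as follows. This is a simplified illustration rather than the program's algorithm: it does not shuffle the subjects, replicate the test, combine sparse groups, or convert the statistic to a P value, and the function name is hypothetical.

```python
def hosmer_lemeshow_chi_square(observed, predicted, groups=10):
    """Chi-square summed over the "yes" and "no" cells of each risk group.

    observed  : 0/1 values of the dependent variable
    predicted : probabilities of "yes" predicted by the model
    """
    # Arrange subjects in a rising sequence of predicted probability.
    pairs = sorted(zip(predicted, observed), key=lambda pair: pair[0])
    n = len(pairs)
    chi_square = 0.0
    for g in range(groups):
        # Split into (nearly) equally sized groups.
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        expected_yes = sum(p for p, _ in chunk)
        observed_yes = sum(y for _, y in chunk)
        expected_no = len(chunk) - expected_yes
        observed_no = len(chunk) - observed_yes
        chi_square += (observed_yes - expected_yes) ** 2 / expected_yes
        chi_square += (observed_no - expected_no) ** 2 / expected_no
    return chi_square


# Hypothetical data: two risk groups of ten subjects each.
example_predicted = [0.1] * 10 + [0.9] * 10
example_observed = [0] * 10 + [1] * 10
example_chi_square = hosmer_lemeshow_chi_square(
    example_observed, example_predicted, groups=2
)
```

Because tied probabilities are kept in their input order here, the result can depend on how the records are arranged, which is the problem that the program's shuffling and replication are designed to address.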



Other indicators of the suitability of the logistic model

The program provides several other indicators of the suitability of the logistic model: the Pearson correlation coefficient between the observed value of the dependent variable (0 = "no", 1 = "yes") and the probability (of "yes") predicted by the logistic equation; the square of the correlation coefficient, which is an estimate of the proportion of explained variation (Mittlboeck and Schemper 1996); and Darlington's logistic regression fit index (Darlington 1990: 449) and the pseudo R-squared value (Selvin 1996: 266), both of which are based on a comparison of likelihood statistics based on the full model and on the null model, and are not direct measures of goodness of fit.

Probability of the outcome

Optionally, the program can use the regression coefficients to compute the probability of the outcome (dependent variable = 1), for given values of the variables in the model. Confidence intervals are displayed. This option is not appropriate in case-control studies.

Odds ratios comparing different sets of values

Optionally, the program can use the regression coefficients to compute an odds ratio expressing the contrast between two given sets of values of some or all of the variables in the model. Confidence intervals are displayed. A variable with the same value in both sets can affect the result only if it is involved in an interaction.

The program can also compute an odds ratio expressing the effect of a given difference between two values of a variable. Confidence intervals are displayed. This option is not appropriate for categorical variables.
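The contrast between two sets of values can be illustrated with a short Python sketch. It computes only the point estimate (confidence intervals would need the covariance matrix of the coefficients, which is not shown), and the function names and coefficient values are hypothetical.

```python
import math


def term_value(term, values):
    """Value of a model term; "A*B" is the product of A and B."""
    product = 1.0
    for name in term.split("*"):
        product *= values[name]
    return product


def contrast_odds_ratio(coefficients, set_a, set_b):
    """Odds ratio for set_a relative to set_b.

    The constant cancels out, so only the coefficients are needed.
    """
    difference = sum(
        coefficient * (term_value(term, set_a) - term_value(term, set_b))
        for term, coefficient in coefficients.items()
    )
    return math.exp(difference)


# Hypothetical coefficients for the model AGE SEX AGE*SEX.
coefs = {"AGE": 0.04, "SEX": 0.5, "AGE*SEX": -0.01}

# Age 60 versus age 50, with SEX held at the same value in both sets.
or_sex_1 = contrast_odds_ratio(
    coefs, {"AGE": 60, "SEX": 1}, {"AGE": 50, "SEX": 1}
)
or_sex_0 = contrast_odds_ratio(
    coefs, {"AGE": 60, "SEX": 0}, {"AGE": 50, "SEX": 0}
)
```

The two results differ (exp(0.3) versus exp(0.4)) even though SEX is identical within each comparison, because SEX takes part in the AGE*SEX interaction.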


Basic statistical techniques for multiple logistic regression analysis are described by Breslow and Day (1980: 182-227). Variance formulae for use in the estimation of confidence intervals are provided by Hosmer and Lemeshow (1989: 103-106). The basic computation is based on MULTLR (Campos-Filho and Franco 1989), which uses adapted algorithms from LOGRESS (McGee 1986) and PECAN (Lubin 1981), respectively, for unconditional and conditional maximum likelihood estimation of the logistic coefficients. We are grateful to Dan McGee for providing us with the Fortran code for LOGRESS and to Dr A. Negassa for his helpful comments.

The maximum number of parameters (including dummy variables) in a model is 33.

Pseudo R-squared and Darlington's logistic regression fit index are both calculated from the log-likelihoods. Darlington's logistic regression fit index is defined as

{exp[(LLN - LLM) / N] - 1} / [exp(-LLN / N) - 1]

where

LLM = the log-likelihood for the model
LLN = the log-likelihood for the null model
N = sample size



The formula for the goodness-of-fit test (Hosmer and Lemeshow 1989: 140-145; Selvin 1996: 264-266) is

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where

O = observed number
E = expected number

counted separately for "Yes" and "No" categories in each decile (20 cells altogether). The test is done 100 times, each time shuffling the observations 20 times, using the ran0 function of Press et al. (1989), and the median P value is displayed.

Two test files may accompany this program. LA.DAT is an ASCII data file containing 315 individual records published by Breslow and Day (1980, Appendix III), from a study of endometrial cancer in Los Angeles (Mack et al. 1976), and LA.DIC is the corresponding dictionary file. To use them, they must be put in the C:\PEPI\LR folder. Missing values are coded 9 or 99.




References

Abramson JH (2004) WINPEPI (PEPI-for-Windows) computer programs for epidemiologists. Epidemiologic Perspectives & Innovations 1: 6.

Abramson JH, Gahlinger PM (2001) Computer programs for epidemiologists: PEPI version 4. Salt Lake City: Sagebrush Press.

Breslow NE, Day NE (1980) Statistical methods in cancer research. Vol. I. The analysis of case-control studies. Lyon: International Agency for Research on Cancer.

Campos-Filho N, Franco EL (1989) A microcomputer program for multiple logistic regression by unconditional and conditional maximum likelihood methods. American Journal of Epidemiology 129: 439-444.

Darlington RB (1990) Regression and Linear Models. New York: McGraw-Hill.

Dean AG, Dean JA, Coulombier D, Brendel KA, Smith DC, Burton AH, Dicker RC, Sullivan K, Fagan RF, Arner TG, Hathcock L (1996) Epi Info, Version 6: a word processing, database, and statistics program for public health on IBM-compatible microcomputers. Atlanta, Georgia, U.S.A.: Centers for Disease Control and Prevention.

Hauck WW, Donner A (1977) Wald's test as applied to hypotheses in logit analysis. Journal of the American Statistical Association 82: 371-386.

Hosmer DW Jr, Lemeshow S (1989) Applied Logistic Regression. New York: John Wiley & Sons.

Lauritsen JM, Bruus M (2003-2004) EpiData (version 3). A comprehensive tool for validated entry and documentation of data. Odense, Denmark: The EpiData Association.

Lubin JH (1981) A computer program for the analysis of matched case-control studies. Computers and Biomedical Research 14: 138-143.

Mack TM, Pike MC, Henderson BE, Pfeffer RI, Gerkins VR, Arthur BS, Brown SE (1976) Estrogens and endometrial cancer in a retirement community. New England Journal of Medicine 294: 1262-1267.

McGee DL (1986) A program for logistic regression on the IBM PC. American Journal of Epidemiology 124: 702-705.

Mittlboeck M, Schemper M (1996) Explained variation for logistic regression. Statistics in Medicine 15: 1987.

Press WH, Flannery BP, Teukolsky SA, Vetterling WT (1989) Numerical recipes in Pascal: The art of scientific computing. Cambridge: Cambridge University Press.

Selvin S (1996) Statistical analysis of epidemiologic data, 2nd edn. New York: Oxford University Press.

Walter SD, Feinstein AR, Wells CK (1987) Coding ordinal independent variables in multiple regression analyses. American Journal of Epidemiology 125: 319-323.


