Statistical Methods in Psychology Journals: Guidelines and Explanations

Leland Wilkinson and the Task Force on Statistical Inference
APA Board of Scientific Affairs

American Psychologist, August 1999, Vol. 54, No. 8, 594-604. Copyright 1999 by the American Psychological Association, Inc. 0003-066X/99/$2.00

Jacob Cohen died on January 20, 1998. Without his initiative and gentle persistence, this report most likely would not have appeared. Grant Blank provided Kahn and Udry's (1986) reference. Gerard Dallal and Paul Velleman offered helpful comments.

Correspondence concerning this report should be sent to the Task Force on Statistical Inference, c/o Sangeeta Panicker, APA Science Directorate, 750 First Street, NE, Washington, DC 20002-4242. Electronic mail may be sent to spanicker@apa.org.

In the light of continuing debate over the applications of significance testing in psychology journals and following the publication of Cohen's (1994) article, the Board of Scientific Affairs (BSA) of the American Psychological Association (APA) convened a committee called the Task Force on Statistical Inference (TFSI) whose charge was "to elucidate some of the controversial issues surrounding applications of statistics including significance testing and its alternatives; alternative underlying models and data transformation; and newer methods made possible by powerful computers" (BSA, personal communication, February 28, 1996). Robert Rosenthal, Robert Abelson, and Jacob Cohen (cochairs) met initially and agreed on the desirability of having several types of specialists on the task force: statisticians, teachers of statistics, journal editors, authors of statistics books, computer experts, and wise elders. Nine individuals were subsequently invited to join and all agreed. These were Leona Aiken, Mark Appelbaum, Gwyneth Boodoo, David A. Kenny, Helena Kraemer, Donald Rubin, Bruce Thompson, Howard Wainer, and Leland Wilkinson. In addition, Lee Cronbach, Paul Meehl, Frederick Mosteller and John Tukey served as Senior Advisors to the Task Force and commented on written materials.

The TFSI met twice in two years and corresponded throughout that period. After the first meeting, the task force circulated a preliminary report indicating its intention to examine issues beyond null hypothesis significance testing. The task force invited comments and used this feedback in the deliberations during its second meeting.

After the second meeting, the task force recommended several possibilities for further action, chief of which would be to revise the statistical sections of the American Psychological Association Publication Manual (APA, 1994). After extensive discussion, the BSA recommended that "before the TFSI undertook a revision of the APA Publication Manual, it might want to consider publishing an article in American Psychologist, as a way to initiate discussion in the field about changes in current practices of data analysis and reporting" (BSA, personal communication, November 17, 1997).

This report follows that request. The sections in italics are proposed guidelines that the TFSI recommends could be used for revising the APA publication manual or for developing other BSA supporting materials. Following each guideline are comments, explanations, or elaborations assembled by Leland Wilkinson for the task force and under its review. This report is concerned with the use of statistical methods only and is not meant as an assessment of research methods in general. Psychology is a broad science. Methods appropriate in one area may be inappropriate in another.

The title and format of this report are adapted from a similar article by Bailar and Mosteller (1988). That article should be consulted, because it overlaps somewhat with this one and discusses some issues relevant to research in psychology. Further detail can also be found in the publications on this topic by several committee members (Abelson, 1995, 1997; Rosenthal, 1994; Thompson, 1996; Wainer, in press; see also articles in Harlow, Mulaik, & Steiger, 1997).

Method

Design

Make clear at the outset what type of study you are doing. Do not cloak a study in one guise to try to give it the assumed reputation of another. For studies that have multiple goals, be sure to define and prioritize those goals.

There are many forms of empirical studies in psychology, including case reports, controlled experiments, quasi-experiments, statistical simulations, surveys, observational studies, and studies of studies (meta-analyses). Some are hypothesis generating: They explore data to form or sharpen hypotheses about a population for assessing future hypotheses. Some are hypothesis testing: They assess specific a priori hypotheses or estimate parameters by random sampling from that population. Some are meta-analytic: They assess specific a priori hypotheses or estimate parameters (or both) by synthesizing the results of available studies.

Some researchers have the impression or have been taught to believe that some of these forms yield information that is more valuable or credible than others (see Cronbach, 1975, for a discussion). Occasionally proponents of some research methods disparage others. In fact, each form of research has its own strengths, weaknesses, and standards of practice.

Population

The interpretation of the results of any study depends on the characteristics of the population intended for analysis. Define the population (participants, stimuli, or studies) clearly. If control or comparison groups are part of the design, present how they are defined.

Psychology students sometimes think that a statistical population is the human race or, at least, college sophomores. They also have some difficulty distinguishing a class of objects versus a statistical population: that sometimes we make inferences about a population through statistical methods, and other times we make inferences about a class through logical or other nonstatistical methods. Populations may be sets of potential observations on people, adjectives, or even research articles. How a population is defined in an article affects almost every conclusion in that article.

Sample

Describe the sampling procedures and emphasize any inclusion or exclusion criteria. If the sample is stratified (e.g., by site or gender), describe fully the method and rationale. Note the proposed sample size for each subgroup.

Interval estimates for clustered and stratified random samples differ from those for simple random samples. Statistical software is now becoming available for these purposes. If you are using a convenience sample (whose members are not selected at random), be sure to make that procedure clear to your readers. Using a convenience sample does not automatically disqualify a study from publication, but it harms your objectivity to try to conceal this by implying that you used a random sample. Sometimes the case for the representativeness of a convenience sample can be strengthened by explicit comparison of sample characteristics with those of a defined population across a wide range of variables.

Assignment

Random assignment. For research involving causal inferences, the assignment of units to levels of the causal variable is critical. Random assignment (not to be confused with random selection) allows for the strongest possible causal inferences free of extraneous assumptions. If random assignment is planned, provide enough information to show that the process for making the actual assignments is random.

There is a strong research tradition and many exemplars for random assignment in various fields of psychology. Even those who have elucidated quasi-experimental designs in psychological research (e.g., Cook & Campbell, 1979) have repeatedly emphasized the superiority of random assignment as a method for controlling bias and lurking variables. "Random" does not mean "haphazard." Randomization is a fragile condition, easily corrupted deliberately, as we see when a skilled magician flips a fair coin repeatedly to heads, or innocently, as we saw when the drum was not turned sufficiently to randomize the picks in the Vietnam draft lottery. As psychologists, we also know that human participants are incapable of producing a random process (digits, spatial arrangements, etc.) or of recognizing one. It is best not to trust the random behavior of a physical device unless you are an expert in these matters. It is safer to use the pseudorandom sequence from a well-designed computer generator or from published tables of random numbers. The added benefit of such a procedure is that you can supply a random number seed or starting number in a table that other researchers can use to check your methods later.

Nonrandom assignment. For some research questions, random assignment is not feasible. In such cases, we need to minimize effects of variables that affect the observed relationship between a causal variable and an outcome. Such variables are commonly called confounds or covariates. The researcher needs to attempt to determine the relevant covariates, measure them adequately, and adjust for their effects either by design or by analysis. If the effects of covariates are adjusted by analysis, the strong assumptions that are made must be explicitly stated and, to the extent possible, tested and justified. Describe methods used to attenuate sources of bias, including plans for minimizing dropouts, noncompliance, and missing data.

Authors have used the term "control group" to describe, among other things, (a) a comparison group, (b) members of pairs matched or blocked on one or more nuisance variables, (c) a group not receiving a particular treatment, (d) a statistical sample whose values are adjusted post hoc by the use of one or more covariates, or (e) a group for which the experimenter acknowledges bias exists and perhaps hopes that this admission will allow the reader to make appropriate discounts or other mental adjustments. None of these is an instance of a fully adequate control group.

If we can neither implement randomization nor approach total control of variables that modify effects (outcomes), then we should use the term "control group" cautiously. In most of these cases, it would be better to forgo the term and use "contrast group" instead. In any case, we should describe exactly which confounding variables have been explicitly controlled and speculate about which unmeasured ones could lead to incorrect inferences. In the absence of randomization, we should do our best to investigate sensitivity to various untestable assumptions.

Measurement

Variables. Explicitly define the variables in the study, show how they are related to the goals of the study, and explain how they are measured. The units of measurement of all variables, causal and outcome, should fit the language you use in the introduction and discussion sections of your report.

A variable is a method for assigning to a set of observations a value from a set of possible outcomes. For example, a variable called "gender" might assign each of 50 observations to one of the values male or female. When we define a variable, we are declaring what we are prepared to represent as a valid observation and what we must consider as invalid. If we define the range of a particular variable (the set of possible outcomes) to be from 1 to 7 on a Likert scale, for example, then a value of 9 is not an outlier (an unusually extreme value). It is an illegal value. If we declare the range of a variable to be positive real numbers and the domain to be observations of reaction time (in milliseconds) to an administration of electric shock, then a value of 3,000 is not illegal; it is an outlier.

Naming a variable is almost as important as measuring it. We do well to select a name that reflects how a variable is measured. On this basis, the name "IQ test score" is preferable to "intelligence" and "retrospective self-report of childhood sexual abuse" is preferable to "childhood sexual abuse." Without such precision, ambiguity in defining variables can give a theory an unfortunate resistance to empirical falsification. Being precise does not make us operationalists. It simply means that we try to avoid excessive generalization.

Editors and reviewers should be suspicious when they notice authors changing definitions or names of variables, failing to make clear what would be contrary evidence, or using measures with no history and thus no known properties. Researchers should be suspicious when code books and scoring systems are inscrutable or more voluminous than the research articles on which they are based. Everyone should worry when a system offers to code a specific observation in two or more ways for the same variable.

Instruments. If a questionnaire is used to collect data, summarize the psychometric properties of its scores with specific regard to the way the instrument is used in a population. Psychometric properties include measures of validity, reliability, and any other qualities affecting conclusions. If a physical apparatus is used, provide enough information (brand, model, design specifications) to allow another experimenter to replicate your measurement process.

There are many methods for constructing instruments and psychometrically validating scores from such measures. Traditional true-score theory and item-response test theory provide appropriate frameworks for assessing reliability and internal validity. Signal detection theory and various coefficients of association can be used to assess external validity. Messick (1989) provides a comprehensive guide to validity.

It is important to remember that a test is not reliable or unreliable. Reliability is a property of the scores on a test for a particular population of examinees (Feldt & Brennan, 1989). Thus, authors should provide reliability coefficients of the scores for the data being analyzed even when the focus of their research is not psychometric. Interpreting the size of observed effects requires an assessment of the reliability of the scores.

Besides showing that an instrument is reliable, we need to show that it does not correlate strongly with other key constructs. It is just as important to establish that a measure does not measure what it should not measure as it is to show that it does measure what it should.

Researchers occasionally encounter a measurement problem that has no obvious solution. This happens when they decide to explore a new and rapidly growing research area that is based on a previous researcher's well-defined construct implemented with a poorly developed psychometric instrument. Innovators, in the excitement of their discovery, sometimes give insufficient attention to the quality of their instruments. Once a defective measure enters the literature, subsequent researchers are reluctant to change it. In these cases, editors and reviewers should pay special attention to the psychometric properties of the instruments used, and they might want to encourage revisions (even if not by the scale's author) to prevent the accumulation of results based on relatively invalid or unreliable measures.

Procedure. Describe any anticipated sources of attrition due to noncompliance, dropout, death, or other factors. Indicate how such attrition may affect the generalizability of the results. Clearly describe the conditions under which measurements are taken (e.g., format, time, place, personnel who collected data). Describe the specific methods used to deal with experimenter bias, especially if you collected the data yourself.

Despite the long-established findings of the effects of experimenter bias (Rosenthal, 1966), many published studies appear to ignore or discount these problems. For example, some authors or their assistants with knowledge of hypotheses or study goals screen participants (through personal interviews or telephone conversations) for inclusion in their studies. Some authors administer questionnaires. Some authors give instructions to participants. Some authors perform experimental manipulations. Some tally or code responses. Some rate videotapes.

An author's self-awareness, experience, or resolve does not eliminate experimenter bias. In short, there are no valid excuses, financial or otherwise, for avoiding an opportunity to double-blind. Researchers looking for guidance on this matter should consult the classic book of Webb, Campbell, Schwartz, and Sechrest (1966) and an exemplary dissertation (performed on a modest budget) by Baker (1969).

Power and sample size. Provide information on sample size and the process that led to sample size decisions. Document the effect sizes, sampling and measurement assumptions, as well as analytic procedures used in power calculations. Because power computations are most meaningful when done before data are collected and examined, it is important to show how effect-size estimates have been derived from previous research and theory in order to dispel suspicions that they might have been taken from data used in the study or, even worse, constructed to justify a particular sample size. Once the study is analyzed, confidence intervals replace calculated power in describing results.

Largely because of the work of Cohen (1969, 1988), psychologists have become aware of the need to consider power in the design of their studies, before they collect data. The intellectual exercise required to do this stimulates authors to take seriously prior research and theory in their field, and it gives an opportunity, with incumbent risk, for a few to offer the challenge that there is no applicable research behind a given study. If exploration were not disguised in hypothetico-deductive language, then it might have the opportunity to influence subsequent research constructively.

Computer programs that calculate power for various designs and distributions are now available. One can use them to conduct power analyses for a range of reasonable alpha values and effect sizes. Doing so reveals how power changes across this range and overcomes a tendency to regard a single power estimate as being absolutely definitive.

Many of us encounter power issues when applying for grants. Even when not asking for money, think about power. Statistical power does not corrupt.

Results

Complications

Before presenting results, report complications, protocol violations, and other unanticipated events in data collection. These include missing data, attrition, and nonresponse. Discuss analytic techniques devised to ameliorate these problems. Describe nonrepresentativeness statistically by reporting patterns and distributions of missing data and contaminations. Document how the actual analysis differs from the analysis planned before complications arose. The use of techniques to ensure that the reported results are not produced by anomalies in the data (e.g., outliers, points of high influence, nonrandom missing data, selection bias, attrition problems) should be a standard component of all analyses.

As soon as you have collected your data, before you compute any statistics, look at your data. Data screening is not data snooping. It is not an opportunity to discard data or change values to favor your hypotheses. However, if you assess hypotheses without examining your data, you risk publishing nonsense.

Computer malfunctions tend to be catastrophic: A system crashes; a file fails to import; data are lost. Less well-known are more subtle bugs that can be more catastrophic in the long run. For example, a single value in a file may be corrupted in reading or writing (often in the first or last record). This circumstance usually produces a major value error, the kind of singleton that can make large correlations change sign and small correlations become large.

Graphical inspection of data offers an excellent possibility for detecting serious compromises to data integrity. The reason is simple: Graphics broadcast; statistics narrowcast. Indeed, some international corporations that must defend themselves against rapidly evolving fraudulent schemes use real-time graphic displays as their first line of defense and statistical analyses as a distant second. The following example shows why.

Figure 1 shows a scatter-plot matrix (SPLOM) of three variables from a national survey of approximately 3,000 counseling clients (Chartrand, 1997). This display, consisting of pairwise scatter plots arranged in a matrix, is found in most modern statistical packages. The diagonal cells contain dot plots of each variable (with the dots stacked like a histogram) and scales used for each variable. The three variables shown are questionnaire measures of respondent's age (AGE), gender (SEX), and number of years together in current relationship (TOGETHER). The graphic in Figure 1 is not intended for final presentation of results; we use it instead to locate coding errors and other anomalies before we analyze our data. Figure 1 is a selected portion of a computer screen display that offers tools for zooming in and out, examining points, and linking to information in other graphical displays and data editors. SPLOM displays can be used to recognize unusual patterns in 20 or more variables simultaneously. We focus on these three only.

Figure 1. Scatter-Plot Matrix (variables: AGE, SEX, TOGETHER). Note. M = male; F = female.

There are several anomalies in this graphic. The AGE histogram shows a spike at the right end, which corresponds to the value 99 in the data. This coded value most likely signifies a missing value, because it is unlikely that this many people in a sample of 3,000 would have an age of 99 or greater. Using numerical values for missing value codes is a risky practice (Kahn & Udry, 1986).

The histogram for SEX shows an unremarkable division into two values. The histogram for TOGETHER is highly skewed, with a spike at the lower end presumably signifying no relationship. The most remarkable pattern is the triangular joint distribution of TOGETHER and AGE. Triangular joint distributions often (but not necessarily) signal an implication or a relation rather than a linear function with error. In this case, it makes sense that the span of a relationship should not exceed a person's age. Closer examination shows that something is wrong here,
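The two screening checks suggested by the Figure 1 discussion (a numeric code standing in for a missing AGE, and a TOGETHER value that exceeds AGE) can also be run as a simple pass over the records before any statistics are computed. What follows is only a minimal sketch under stated assumptions: the rows are invented for illustration and are not the survey data, the function name `screen_records` is hypothetical, and treating 99 as the missing-value code is an assumption based on the text's remark that the value "most likely" signifies missing data.

```python
# Assumed missing-value code for AGE; the report only infers that 99 is such a code.
MISSING_AGE_CODE = 99


def screen_records(records, missing_code=MISSING_AGE_CODE):
    """Return two lists of row indices:
    rows whose AGE equals the suspected missing-value code, and
    rows where TOGETHER exceeds AGE (a logically impossible pair)."""
    coded_missing = []
    impossible = []
    for i, rec in enumerate(records):
        age, together = rec["AGE"], rec["TOGETHER"]
        if age == missing_code:
            coded_missing.append(i)
            continue  # AGE is unknown here, so the AGE/TOGETHER check is skipped
        if together > age:
            impossible.append(i)
    return coded_missing, impossible


# Invented illustrative rows (hypothetical, not from the survey).
sample = [
    {"AGE": 34, "TOGETHER": 10},
    {"AGE": 99, "TOGETHER": 5},
    {"AGE": 27, "TOGETHER": 30},
    {"AGE": 51, "TOGETHER": 0},
]

coded_missing, impossible = screen_records(sample)
print("suspected missing-value codes at rows:", coded_missing)
print("TOGETHER exceeds AGE at rows:", impossible)
```

The function only flags rows for inspection; it does not delete or alter any values, in keeping with the point that data screening is not an opportunity to discard data or change values.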