Sour Cream

18/05/2022

Abstract
An increasingly popular approach to statistical inference is to focus on the estimation of effect size. Yet this approach is implicitly based on the assumption that there is an effect while ignoring the null hypothesis that the effect is absent. We demonstrate how this common null-hypothesis neglect may result in effect size estimates that are overly optimistic. As an alternative to the current approach, a spike-and-slab model explicitly incorporates the plausibility of the null hypothesis into the estimation process. We illustrate the implications of this approach and provide an empirical example.

Keywords
effect size, Bayesian estimation, modeling, shrinkage, open materials
Consider the following hypothetical scenario: A colleague from the biology department has just conducted an experiment and approaches you for statistical advice. The analysis yields p < .05, and your colleague believes that this is grounds to reject the null hypothesis. In line with recommendations both old (e.g., Grant, 1962; Loftus, 1996) and new (e.g., Cumming, 2014; Harrington et al., 2019), you convince your colleague that it is better to replace the p value with a point estimate of effect size and a 95% confidence interval (CI; but see Morey et al., 2016). You also manage to convince your colleague to plot the data (see Fig. 1). Mindful of the reporting guidelines of the Psychonomic Society and Psychological Science, your colleague reports the result as follows: “Cohen’s d = 0.30, 95% CI = [0.02, 0.58].”

Fig. 1. Standard estimation results for the fictitious plant growth example. (Left) A descriptives plot with the mean and 95% confidence interval of plant growth in the two conditions. (Right) Point estimate and 95% confidence interval for Cohen’s d.

Given these results, what would be a reasonable point estimate of effect size? A straightforward and intuitive answer is “0.30.” However, your colleague now informs you of the hypothesis that the experiment was designed to assess: “Plants grow faster when you talk to them.” Suddenly, a population effect size of zero appears eminently plausible. Any observed difference may merely be due to the inevitable sampling variability.

The example above is rhetorical but serves to underscore the potential conflict between standard reporting guidelines and common sense. The example raises the following question: When are effect sizes overestimated? Standard point estimates and confidence intervals ignore the possibility that the effect is spurious (i.e., the null hypothesis, H0). This is not problematic when H0 is deeply implausible, either because H0 was highly unlikely a priori or because the data decisively undercut H0. But when the data fail to undercut H0 or when H0 is highly likely a priori (i.e., “plants do not grow faster when you talk to them”), then H0 is not ruled out as a plausible account of the data. Effect size estimates that ignore a plausible H0 are generally overly optimistic and overly confident: The fact that H0 provides an acceptable account of the data should shrink effect size estimates toward zero. The statistical benefits of shrinkage are described in Efron and Morris (1977; see also Davis-Stober et al., 2018; Rouder & Lu, 2005; Shiffrin et al., 2008); the benefits of shrinking estimates toward zero are discussed, for instance, in George and McCulloch (1993), Iverson et al. (2010), and van Erp et al. (2019).

The above point estimate, 0.30, may seem purely data-driven, but it is based on a model that assumes an effect size different from zero. In this article, we propose an alternative model to estimate effect size: the so-called spike-and-slab model. First, we formally introduce the spike-and-slab model. Second, we apply the spike-and-slab model to the example in the introduction and illustrate how it tempers the estimated effect size. Third, we visualize how the spike-and-slab model may shrink the estimated effect size toward zero in general. Fourth, we demonstrate the spike-and-slab model by reanalyzing the data of Heycke et al. (2018). Finally, we conclude with practical recommendations and a discussion on when to use the spike-and-slab model.

A Spike-and-Slab Perspective
The spike-and-slab approach has been widely discussed in the statistical literature (e.g., Clyde et al., 1996; Geweke, 1996; Ishwaran & Rao, 2005; Mitchell & Beauchamp, 1988; O’Hara & Sillanpää, 2009) and in the psychological literature (e.g., Bainter et al., 2020; Iverson et al., 2010; Rouder et al., 2018; Yu et al., 2018). Conceptually, the approach is relatively straightforward.

As usual, the statistical goal is to infer the population effect size from a set of sample observations. Let $\delta$ denote the population effect size, let $\hat{\delta}$ denote a point estimate, and let $\hat{\delta} \mid H_1$ denote a point estimate assuming the alternative hypothesis, $H_1$. Assuming the null hypothesis $H_0$ leads to $\hat{\delta} \mid H_0$, which usually equals 0. The key point is that both estimates, $\hat{\delta} \mid H_1$ and $\hat{\delta} \mid H_0$, are conditional on the hypotheses. For example, $\hat{\delta} \mid H_1$ should be read as “the estimated effect size under the alternative hypothesis that the effect exists.” To the best of our knowledge, all existing guidelines for reporting effect size estimates recommend that researchers provide $\hat{\delta} \mid H_1$; implicitly, the guidelines suggest ignoring $H_0$, resulting in the notion that the population effect size is nonzero. In contrast, in the spike-and-slab model, the estimate of effect size is determined by both $H_1$ and $H_0$.

As the name suggests, the spike-and-slab model consists of two components. The first component, the spike, corresponds to the position that talking to plants does not affect their growth (i.e., δ=0), whereas the second component, the slab, corresponds to the position that speaking to plants does affect their growth (i.e., δ≠0). The spike and slab are analogous to H0 and H1 discussed above. Both components are commonly deemed a priori equally likely such that the prior probability for each component is one half. One can assign prior probabilities other than one half if this is motivated by prior research, prior data, or existing theories (e.g., Wilson & Wixted, 2018). After observing the data, the prior probabilities (Pr) of both components, Pr (spike) and Pr (slab), are updated to posterior probabilities, Pr (spike | data) and Pr (slab | data).

By applying the spike-and-slab model, we learn about the relative plausibility of the two components; in addition, the spike-and-slab model produces a marginal estimate of effect size—a weighted combination of effect sizes from the spike and from the slab (for mathematical detail, see the Appendix in the Supplemental Material available online). In other words, the spike-and-slab model yields an overall effect size averaged across the spike and the slab, with averaging weights determined by the respective posterior probabilities:

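$$\hat{\delta} = (\hat{\delta} \mid \text{spike}) \, \Pr(\text{spike} \mid \text{data}) + (\hat{\delta} \mid \text{slab}) \, \Pr(\text{slab} \mid \text{data}). \tag{1}$$
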
Marginalizing across model components according to their posterior plausibility is a uniquely Bayesian operation, and this is the statistical framework we adopt in this article (for an accessible introduction to Bayesian inference, see Vandekerckhove et al., 2018). Researchers who prefer a frequentist approach can accomplish shrinkage by using penalized maximum likelihood methods such as the least absolute shrinkage and selection operator (LASSO) and ridge regression (Tibshirani et al., 2005). Another option open to frequentists is to marginalize across the spike and the slab, for instance by using the Akaike information criterion (AIC; Akaike, 1973) and defining the averaging weights as follows. Let $\Delta\mathrm{AIC} = (\mathrm{AIC} \mid \text{spike}) - (\mathrm{AIC} \mid \text{slab})$ be the difference in AIC between the spike and the slab. Next we use the Akaike weight, $w_{\text{spike}}$, as a substitute for the posterior probability of the spike: $w_{\text{spike}} = \exp(-\tfrac{1}{2}\Delta\mathrm{AIC}) / (1 + \exp(-\tfrac{1}{2}\Delta\mathrm{AIC}))$ (Burnham & Anderson, 2002; Wagenmakers & Farrell, 2004). The substitute for the posterior probability of the slab is simply $w_{\text{slab}} = 1 - w_{\text{spike}}$.
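
The Akaike-weight formula in the preceding paragraph can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name `akaike_weights` and the AIC values used in the example call are hypothetical and do not come from the article or its R code.

```python
import math


def akaike_weights(aic_spike: float, aic_slab: float) -> tuple[float, float]:
    """Return (w_spike, w_slab) for a two-component comparison.

    Follows the formula in the text:
    w_spike = exp(-dAIC / 2) / (1 + exp(-dAIC / 2)),
    where dAIC = AIC(spike) - AIC(slab).
    """
    delta_aic = aic_spike - aic_slab
    x = math.exp(-0.5 * delta_aic)
    w_spike = x / (1.0 + x)
    return w_spike, 1.0 - w_spike


# Hypothetical AIC values, chosen only to show the direction of the effect:
# the spike has the higher (worse) AIC, so it receives the smaller weight.
w_spike, w_slab = akaike_weights(aic_spike=102.0, aic_slab=100.0)
print(w_spike < w_slab)
```

With equal AIC values the two weights are both one half, which mirrors the equal prior probabilities used in the Bayesian version.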

Note that when the spike is located at $\delta = 0$, as is usually the case, then $(\hat{\delta} \mid \text{spike}) \, \Pr(\text{spike} \mid \text{data}) = 0$, and consequently, Equation 1 simplifies to

$$\hat{\delta} = (\hat{\delta} \mid \text{slab}) \, \Pr(\text{slab} \mid \text{data}). \tag{2}$$

This equation shows that the spike-and-slab estimate $\hat{\delta}$ equals the estimate that is generally recommended in reporting guidelines, $(\hat{\delta} \mid \text{slab})$, but reduced by the posterior probability for $H_1$. This shrinkage toward zero becomes negligible when the posterior probability for $H_1$ approaches 1.

To illustrate both the overestimation and the spike-and-slab model, we reanalyze the fictitious data from Figure 1. R code for the analysis is available at https://osf.io/uq8st/. Remember that the frequentist point estimate for the effect size conditional on $H_1$, or the slab, was $\hat{\delta} = 0.30$, with 95% CI = [0.02, 0.58]. The Bayesian equivalent is $\hat{\delta} = 0.29$, with 95% credible interval (CRI) = [0.02, 0.57]. Figure 2 contrasts this Bayesian slab-only estimate against the spike-and-slab estimate.

Fig. 2. The spike-and-slab model. The black line represents the posterior distribution of effect size given the slab (i.e., the effect is nonzero). The posterior is scaled so that its mode ($\hat{\delta} = 0.29$) equals the posterior probability of the alternative model (i.e., $p(\text{slab} \mid \text{data}) = 0.48$). The gray line represents the posterior probability of the spike (i.e., $\hat{\delta} = 0$: the effect is absent). The error bars and dots above the density show 95% credible intervals and the posterior mean for the slab-only model and for the spike-and-slab model.

Compared with the traditional results based only on the slab, the posterior mean and central 95% CRI of the spike-and-slab model are shrunken toward zero (i.e., 0.14, 95% CRI = [0.00, 0.48] vs. 0.29, 95% CRI = [0.02, 0.57]). This shrinkage is due to the nonnegligible probability that the effect is absent. Here, the posterior probability of the spike after seeing the data, 0.52, is almost identical to its prior probability. In Figure 2, the plausibility that the effect is absent is represented by the height of the spike, and the uncertainty about the effect’s magnitude, given that it is present, is represented by the width of the slab. Note that if the posterior probability of the spike were reduced, the spike-and-slab results would approach those of the slab-only model.
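
As a quick arithmetic check, the reported spike-and-slab posterior mean can be recovered from the slab-only estimate and the posterior probability of the slab. The sketch below simply evaluates the model-averaging formula with the rounded values quoted above (0.29 and 0.48), so agreement is only expected up to rounding; the function name `spike_and_slab_estimate` is a hypothetical helper, not part of the article's R code.

```python
def spike_and_slab_estimate(slab_estimate: float,
                            pr_slab: float,
                            spike_estimate: float = 0.0) -> float:
    """Average the spike and slab estimates, weighted by posterior probability."""
    pr_spike = 1.0 - pr_slab
    return spike_estimate * pr_spike + slab_estimate * pr_slab


# Rounded values from the text: slab-only posterior mean 0.29, Pr(slab | data) = 0.48.
estimate = spike_and_slab_estimate(slab_estimate=0.29, pr_slab=0.48)
print(round(estimate, 2))  # prints 0.14
```

If the posterior probability of the slab were 1, the function would return the slab-only estimate unchanged, which is the limiting case described in the text.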

The Influence of the Spike
In the fictitious example, the spike-and-slab model reduces the estimated effect size by shrinking estimates of effect size toward zero. The result may not be surprising given that the effect was small. However, it makes one wonder to what extent the spike-and-slab model helps with estimation. What are the differences between a slab-only model and the spike-and-slab model? In this section, we illustrate how the estimated effect size shrinks toward zero under various circumstances. We visualize the shrinkage as a function of the observed effect size, the prior on the standard deviation of effect size under the slab, the sample size, and the prior probability of the spike. We chose these parameters because the posterior distribution is fully determined by these quantities (see the Appendix in the Supplemental Material).

Figure 3 shows the relation between the observed effect size and the estimated effect size for the slab and for the spike-and-slab models for 40 observations and 100 observations. All plots show that a smaller prior standard deviation of the slab induces some shrinkage toward zero. This effect is most obvious in the top left panel, and it makes sense because a small prior standard deviation implies there is more prior mass near the mean of the prior, which is zero. This influence of the prior standard deviation is typically referred to as prior shrinkage, and it is intrinsic to a Bayesian approach but not to the spike-and-slab model. Comparing the plots between the two columns illustrates the influence of the spike; whenever the observed effect size is near zero, the estimate is shrunken toward zero in the right column but not in the left column. However, when the observed effect size is far from zero, there is little additional shrinkage to the prior shrinkage.

Fig. 3. Observed effect size versus posterior mean for different model components and prior standard deviations. The left column shows inference based on the slab-only model, and the right column shows inference based on the spike-and-slab model. In the top row, the sample size was 40, and in the bottom row, the sample size was 100. Different lines represent different standard deviations for the prior distribution on δ. The prior probability of the spike was one half. Inspired by Figure 5 of Rouder et al. (2018).

The shrinkage in the spike-and-slab model can be explained in the following way. Whenever the observed effect size is small, the data are well described by an effect size of zero, and thus the posterior probability of the spike is substantial. As a result, the marginal estimate is shrunken toward the spike’s estimate, zero. In contrast, when the observed effect size is large, the data are poorly described by an effect size of zero and the posterior probability of the spike is negligible. As a consequence, the estimate of the spike-and-slab is practically equivalent to the estimate of the slab. The plots in the right column of Figure 3 show the effect of sample size on the shrinkage. For the bottom right plot, N=100. If the observed effect size is small, then the estimate is still shrunken toward zero, but as the observed effect size grows, the shrinkage decreases much more quickly than in the top right plot, where N=40. This makes sense from a signal-detection perspective. If the observed effect size is, for example, 0.3 after 40 observations, the posterior probability of the spike is substantial. However, after collecting 60 additional observations, while the observed effect size remains 0.3, the posterior probability of the spike decreases as it becomes increasingly less probable that the data-generating model had an effect size of zero.
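
The qualitative pattern described above can be reproduced with a deliberately simplified sketch. It is not the model from the article's Appendix or R code: it assumes that the observed effect size is normally distributed around the true effect with variance 1/N, and that the slab is a normal prior centered on zero. All function and parameter names are hypothetical.

```python
import math


def normal_pdf(x: float, var: float) -> float:
    """Density of a zero-mean normal distribution with variance `var` at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def spike_and_slab_mean(d_obs: float, n: int,
                        prior_sd: float = 1.0,
                        prior_spike: float = 0.5) -> tuple[float, float]:
    """Return (Pr(spike | data), spike-and-slab posterior mean).

    Simplifying assumptions (not from the article):
    d_obs ~ Normal(delta, 1 / n); slab prior delta ~ Normal(0, prior_sd**2).
    """
    se2 = 1.0 / n
    slab_var = prior_sd ** 2
    # Marginal likelihood of the observed effect under each component.
    m_spike = normal_pdf(d_obs, se2)
    m_slab = normal_pdf(d_obs, se2 + slab_var)
    post_spike = (prior_spike * m_spike
                  / (prior_spike * m_spike + (1.0 - prior_spike) * m_slab))
    # Slab-only posterior mean already includes prior shrinkage.
    slab_mean = d_obs * slab_var / (slab_var + se2)
    return post_spike, (1.0 - post_spike) * slab_mean


# Same observed effect, two sample sizes.
for n in (40, 100):
    post_spike, estimate = spike_and_slab_mean(d_obs=0.3, n=n)
    print(n, round(post_spike, 2), round(estimate, 2))
```

Under these assumptions, holding the observed effect at 0.3 and increasing N from 40 to 100 lowers the posterior probability of the spike, so the averaged estimate moves closer to the slab-only estimate. Lowering `prior_spike` has a similar effect, in line with the discussion of prior probabilities that follows.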

Next, we explore the relationship between shrinkage and the prior probability of the spike. Figure 4 shows the shrinkage for various prior probabilities. The smaller the prior probability of the spike, the less the effect size is shrunken toward zero. If the prior probability is small, then the spike was a priori implausible, and less evidence is needed to make its influence negligible.

Fig. 4. Observed effect (x-axis) versus the posterior mean of the spike-and-slab model (y-axis). The different lines represent different prior probabilities of the spike. The figure is based on 40 observations with a prior standard deviation of one.

Empirical Example: Reanalysis of Two Minds
We now highlight how the spike-and-slab approach can be used in psychological practice by reanalyzing the results of Heycke et al. (2018), who conducted two registered replications of Rydell et al. (2006). We first briefly explain the design of the study before reanalyzing the explicit evaluation and implicit evaluation analyses with a spike-and-slab model. For a detailed description, see the Procedure section in Heycke et al. (2018). Finally, we provide a robustness analysis.

The goal of Heycke et al. (2018) was to replicate key evidence for implicit-attitude formation. In the original study, Rydell et al. (2006) reported that attitudes induced by subliminal primes manifest when they are assessed by an implicit-attitude measure and that attitudes induced by supraliminal cues manifest when they are assessed by an explicit-attitude measure. This finding corresponds to a perhaps surprising dissociation of implicit- and explicit-attitude measures. In the Heycke et al. experiments, participants were briefly flashed a positive or negative prime followed by an image of a person. Next, several behavioral descriptions that were either negative or positive appeared with the image of the person (e.g., “Bob cheated during a poker game”). Afterward, participants explicitly evaluated the target person and performed an implicit association task (IAT). In total, data of 51 participants were analyzed. Heycke et al. could not find the dissociation between explicit- and implicit-attitude measures. They found that although positive descriptions resulted in a more favorable explicit evaluation than negative descriptions, positive subliminal primes did not result in more favorable IAT scores than negative subliminal primes. In contrast, both explicit- and implicit-attitude measures were in line with the explicit descriptions they learned during the experiment.

Explicit evaluation
In the analysis of the explicit evaluations, Heycke et al. (2018, p. 10) conducted a paired t test and concluded that the rating of the target character is more positive if positive information is shown before negative information: t(27)=11.52, p

17/05/2022

How Can Scope and Parsimony Be Clarified?
Recently, there has been much activity to protect psychology from fraud, improve the quality of research, and strengthen theory. This work has led to prominent recommendations for good practice with respect to a variety of goals. We now review and comment on some of these from our perspective of theoretical scope and parsimony.

One prominent recommendation is to preregister studies. Preregistration is often promoted as a way to decrease post hoc analyses and theorizing because it forces researchers to identify key hypotheses before data collection (e.g., Mistler, 2012; Moore, 2016; Simmons et al., 2021; Wagenmakers et al., 2012). However, as noted by others, preregistration is not a panacea for poor theory development, mediocre methods, or undiagnostic data (see e.g., Lakens & DeBruine, 2021; Szollosi & Donkin, 2021; Szollosi et al., 2020). It is unclear how preregistration guards against the problems and errors we have discussed here. Preregistered studies can engage in the same logical fallacies, use the same stylized statistics, perpetuate the same double standards, repeat the same asymmetric philosophies of science, and be as internally inconsistent as nonpreregistered studies. Preregistering a design that perpetuates ambiguous scope and ambiguous parsimony merely documents study flaws in advance. On the upside, preregistration provides an opportunity for scholars to discuss issues of scope, parsimony, diagnosticity of stimuli, fairness of model selection, and so on ahead of running a study if they so choose.

A second prominent recommendation is increased emphasis on replication (see e.g., Pashler & Harris, 2012; Simons, 2014). Replication helps to improve measurement precision and assess the reliability of an effect of interest. This is inherently useful and can also be leveraged to compute lower and upper bounds on the number of people who satisfy a theoretical claim or display a phenomenon (see e.g., Bogdan et al., n.d.; Davis-Stober & Regenwetter, 2019; Heck, 2021). Thus, replication can help assess scope. However, along with others, we advocate that replicability is far from a panacea: For one thing, efforts invested into reproducing and replicating a prior study as identically as possible are efforts not invested into exploring how the finding extends to other people, novel stimuli, different tasks, or new contexts. Relatedly, for arguments on the relative merits (e.g., of direct and conceptual replication), see also Carpenter (2012), Nosek et al. (2012), Pashler and Harris (2012), Schmidt (2009), and Simons (2014). In other words, replication can be orthogonal to explorations of theoretical scope. Just as importantly, like preregistration, successful replications of a phenomenon would not guard against most of the errors we identify here, such as fallacies of sweeping generalization, conjunction fallacies, and other problems associated with stylized statistics. To the contrary, it can repeat, reinforce, and even perpetuate reasoning errors and scientific biases (Davis-Stober & Regenwetter, 2019; Irvine, 2021; Regenwetter & Robinson, 2017, 2019a, 2019b; Rotello et al., 2015; Yarkoni, 2022). On the upside, we can envision situations in which scholars could both reproduce a prior study and enhance it with additional features that aim to bring theoretical scope and parsimony into better focus. We also advocate that scholars preface replication with a discussion of its impact on understanding scope.

A third recommendation, which is gaining traction especially in cognitive psychology, is to replace or supplement verbal theories with formal computational or mathematical models (e.g., Borsboom et al., 2021; Grahek et al., 2021; Guest & Martin, 2021; Navarro, 2021; Oberauer & Lewandowsky, 2019; Robinaugh et al., 2021; van Rooij & Baggio, 2020). We agree that formal modeling can force researchers to think more explicitly about both the intended scope and the flexibility of their theories. However, formal modeling on its own is not sufficient for addressing many of the double standards and ambiguity problems that we have identified. Clearly, CPT is a formal model. Yet our discussion above demonstrates that the value of formal modeling hinges on how it is implemented (and this point is reinforced by all of the references in this paragraph). Formal modeling can give the appearance of rigor and mask systemic errors (Chen et al., 2021). Formally precise models often force simplifying assumptions or omit hidden variables. These can become counterproductive (for related general points, see also Kellen et al., 2021; Yarkoni, 2022). To keep formal models tractable, scholars may limit themselves to overly simple tasks or simple stimuli (for a discussion, see e.g., Navarro, 2021). On the upside, in contrast to verbal theories, which we would consider inherently ambiguous, formal modeling does provide a common language (logic, computer code, mathematics, and/or statistics) through which to discuss theoretical scope, parsimony, and standards of scientific discourse openly and rigorously. However, as we show when we review the fifth recommendation, although mathematical formulas ostensibly eschew rhetoric, the connection between the mathematics and the substantive questions of interest can also be ambiguous. Our discussion of CPT in this article highlights an example of that broad problem.

A fourth prominent recommendation is to supplement or replace data fitting with prediction to other tasks or unseen data (e.g., Busemeyer & Wang, 2000; Erev et al., 2010, 2017; Pitt et al., 2003; Yarkoni & Westfall, 2017). This has been advocated as a tool for addressing overfitting and for developing theories that generalize. We agree that prediction is an important step toward recognizing and avoiding heuristic approaches to parsimony. However, to avoid asymmetries and double standards, scholars should provide a clear explanation why the participants, tasks, stimuli, and contexts are designed in such a way as not to provide an unfair advantage to some theories over others. Notice that searching a parameter space for best fitting parameters need not make a theory unparsimonious or cause overfitting. The number of parameters is no more than a heuristic measure of parsimony. CPT is a prominent example in which prediction has gone awry: ‘Refuting’ CPT by testing predictions from CPTMED or other stylized distortions of the theory gives no consideration to either the theory’s scope or its parsimony. In some prediction tournaments (e.g., Erev et al., 2010), although the parameters of some theories could characterize specific properties of individuals, the tournament rules required participants to reduce their theory down to a single set of specific parameter values, thereby obfuscating individual differences and collapsing the scope of each theory to a single set of stylized predictions. Likewise and more generally, claiming that a theory performs poorly in predictions is counterproductive when those predictions hinge on untested and unquestioned auxiliary assumptions, such as off-the-shelf statistical models. We turn to this next.

A final major recommendation is increased attention to the problem of “coordination” in psychological research: Theory simultaneously presumes and guides measurement of latent constructs (Irvine, 2021; Kellen et al., 2021; Singmann et al., 2021; van Fraassen, 2008). Some of these articles warn that heated debates about the relative merit of competing theories often pay no attention to the pivotal role of technical and auxiliary assumptions, such as analysis of variance or other off-the-shelf models. Attending to the circular connection between theory and measurement (e.g., attending to auxiliary assumptions) forces researchers to consider both jointly. For a related literature, see the extensive work on meaningfulness in psychological measurement and theory (Falmagne & Doble, 2016; Falmagne & Narens, 1983; Narens, 2002, 2007; Roberts, 1985; Roberts & Rosenbaum, 1986). Attention to the coordination problem and to meaningfulness may lead to a more nuanced understanding of both theoretical scope and theoretical parsimony.

Flagging symptoms of ambiguous scope or parsimony
We have touched on a number of features of Tversky and Kahneman (1992) whose parallels and analogues in other paradigms can flag ambiguous scope or ambiguous parsimony in psychological theory more broadly. First, the most prominent flags are all forms of asymmetric reasoning in which scholars point out shortcomings of others’ theories or evidence without discussing the possible shortcomings of the replacements they propose. Double standards, such as using many more or many fewer stimuli to test the old theory than the new one, using stimuli (even if picked ‘randomly’) that pressure the old theory but not the new one, or pushing a novel theory merely on the basis of its ability to accommodate some ‘anomalies’ that the old theory does not explain, may all create systematic biases against existing theory and in favor of the proposed new theory. When the latter is custom designed to handle certain phenomena, it is important to also understand the associated cost in parsimony. Extreme forms of asymmetric reasoning occur when scholars provide evidence only against special cases of a theory (e.g., CPTMED), thereby literally misrepresenting the theory they question. Second, serious questions of scope arise with mathematical errors or omissions. For example, the mathematical model in Tversky and Kahneman (1992) does not actually imply a fourfold pattern on their own fourfold pattern study stimuli. Third, more broadly, any internal inconsistencies in reasoning can flag problems with parsimony and/or scope. A very troubling yet extremely common practice is strawman null hypothesis testing. Testing hypotheses whose violation is a foregone conclusion (e.g., perfectly calibrated coin flipping as a null model of behavior, the null that two groups are identical) cannot legitimately provide evidence in favor of a proposed theoretical claim (see also Cohen, 1994; Meehl, 1978).
More broadly, meaningless statistics, such as the ‘number of correct predictions’ or the number of correct modal choices in decision-making, generate useless evidence. Fourth, claims of “converging evidence” are often unsubstantiated. Here, it is useful to see whether it is possible to calculate or estimate how many people satisfy the conjunction of evidentiary phenomena (see also Davis-Stober & Regenwetter, 2019; Regenwetter et al., in press). Fifth, nontechnical readers should be aware that some commonly used model selection criteria, such as AIC and BIC, are heuristic in nature and thus may give an analysis a sheen of rigor that need not be warranted. A major improvement is the use of Bayes’s factors, especially in cases in which the researchers provide information on the possible range of Bayes’s factors for a given study.

How parsimonious is CPT?
We end with an illustration of Bayes’s factors as a quantitative measure of parsimony for CPT. We concentrate on the 8 + 8 + 17 + 17 = 50 stimuli from Tversky and Kahneman’s (1992) fourfold pattern study used in their Table 4 (we show and label some of them in our Table 2). As we have already seen, there are 12 possible preference patterns for the 25 gains prospects and 12 possible preference patterns for the 25 loss prospects. We briefly consider two “probabilistic specifications” of CPT (with Equations 1–3) from Regenwetter et al. (2014) and Zwilling et al. (2019). According to the “aggregation-based” model, each individual has one of the $12 \times 12 = 144$ allowed patterns (out of $2^{50} > 10^{15}$ possible ones) as his or her single “true” preference state. If the person prefers prospect $f$ to prospect $g$ and if we allow the person to make a response error with probability at most $\tau$, then the person will choose $f$ with probability $\geq 1 - \tau$. According to the “random preference model”, the probability of choosing $f$ over $g$ is the total probability of those preference patterns in which $f$ is preferred to $g$ in an unspecified probability distribution over all 144 possible preference patterns. Considering just the prospects and preference patterns in our Table 1, note that a decision maker who prefers the lottery in Prospect I also does so in Prospect IV and vice versa. The same holds for Prospects III and V. Moreover, in any pattern in which the lottery is preferable in Prospect II, the lottery is also preferable in Prospect IV, but the converse does not hold. Using similar reasoning or using polyhedral combinatorics, one can show that no matter what probability distribution we consider over these preferences, writing $P_i$ for the probability of choosing the lottery in Prospect $i$,

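$$1 \geq P_{\mathrm{I}} = P_{\mathrm{IV}} \geq P_{\mathrm{II}} \geq P_{\mathrm{V}} = P_{\mathrm{III}} \geq 0; \qquad P_{\mathrm{II}} \geq P_{\mathrm{VII}} \geq P_{\mathrm{VIII}} \tag{4}$$

$$P_{\mathrm{IV}} \geq P_{\mathrm{VI}} \geq P_{\mathrm{VIII}} \geq 0; \qquad P_{\mathrm{V}} + P_{\mathrm{VI}} \geq P_{\mathrm{VII}}. \tag{5}$$
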
The remaining choice probabilities for the other 17 loss prospects are 1. Similar constraints hold among the choice probabilities for the 25 gain prospects.
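
To make Constraints 4 and 5 concrete, the following sketch checks whether a given set of lottery-choice probabilities for Prospects I through VIII satisfies them. It rests on our reading of the two displayed constraints as chains of equalities and inequalities among the eight probabilities; the function name, the dictionary layout, and the example probabilities are hypothetical and are not taken from the article.

```python
def satisfies_constraints(p: dict[str, float], tol: float = 1e-9) -> bool:
    """Check Constraints 4 and 5 for probabilities keyed by 'I' ... 'VIII'."""

    def eq(a: float, b: float) -> bool:
        return abs(a - b) <= tol

    def ge(a: float, b: float) -> bool:
        return a >= b - tol

    return all([
        # Constraint 4: 1 >= P_I = P_IV >= P_II >= P_V = P_III >= 0
        ge(1.0, p["I"]), eq(p["I"], p["IV"]), ge(p["IV"], p["II"]),
        ge(p["II"], p["V"]), eq(p["V"], p["III"]), ge(p["III"], 0.0),
        # Constraint 4, second part: P_II >= P_VII >= P_VIII
        ge(p["II"], p["VII"]), ge(p["VII"], p["VIII"]),
        # Constraint 5: P_IV >= P_VI >= P_VIII >= 0 and P_V + P_VI >= P_VII
        ge(p["IV"], p["VI"]), ge(p["VI"], p["VIII"]), ge(p["VIII"], 0.0),
        ge(p["V"] + p["VI"], p["VII"]),
    ])


# Hypothetical choice probabilities, for illustration only.
example = {"I": 0.9, "II": 0.7, "III": 0.4, "IV": 0.9,
           "V": 0.4, "VI": 0.6, "VII": 0.5, "VIII": 0.2}
print(satisfies_constraints(example))
```

Changing any one of the example values so that, say, the probabilities for Prospects I and IV differ makes the check fail, which illustrates how restrictive the random preference model is on these stimuli.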

Regenwetter et al. (2018) and Zwilling et al. (2019) reviewed how to calculate the range of possible Bayes’s factors between such a given model and an unconstrained “encompassing” model. The aggregation-based model can generate a Bayes’s factor anywhere between 0 and $\frac{1}{144 \times \tau^{50}}$. For $\tau = 0.5$, the upper bound exceeds $10^{12}$, and for $\tau = \frac{1}{4}$, it exceeds $10^{27}$. Because the random preference model predicts deterministic choice of the lottery in Tversky and Kahneman’s (1992) 17 “high-probability” loss prospects and deterministic choice of the sure gain in Tversky and Kahneman’s 17 “high-probability” gain prospects, the Bayes’s factor, either in favor or against the random preference model, is unbounded! To conclude about CPT’s parsimony through the lens of Bayes’s factors, for both of these probabilistic specifications, there is no limit as to how much evidence could be provided against CPT on these stimuli.
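
The two numerical bounds quoted above can be verified with exact rational arithmetic. The sketch assumes the upper bound is read as 1 / (144 × τ^50), that is, one over the product of the 144 allowed patterns and τ raised to the number of stimuli; the function name `aggregation_bf_upper_bound` is a hypothetical helper.

```python
from fractions import Fraction


def aggregation_bf_upper_bound(tau: Fraction,
                               n_patterns: int = 144,
                               n_stimuli: int = 50) -> Fraction:
    """Upper bound on the Bayes factor, read as 1 / (n_patterns * tau**n_stimuli)."""
    return 1 / (n_patterns * tau ** n_stimuli)


bound_half = aggregation_bf_upper_bound(Fraction(1, 2))
bound_quarter = aggregation_bf_upper_bound(Fraction(1, 4))

print(bound_half > 10**12, bound_quarter > 10**27)  # prints: True True
print(2**50 > 10**15)                               # prints: True
```

Using `Fraction` keeps the comparison exact, so the check does not depend on floating-point rounding of very large numbers.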

Conclusion and Discussion
Since the publication of prospect theory some 40 years ago (Kahneman & Tversky, 1979), scholars have prolifically cited Tversky and Kahneman’s (and others’) findings that EUT suffers from limited scope. Yet for nearly 30 years, it has gone unnoticed that Tversky and Kahneman (1992) provided evidence for the exact same limitation in CPT’s scope: Just like half their participants violated EUT early in their article, so did ostensibly half the respondents violate CPT in Problem 7 of their loss aversion study later in the same article. In addition, some aggregate measures do not align with each other, and the exact role of fourfold patterns is ambiguous. All in all, this raises the question about the overall balance of evidence that the original CPT article ultimately provided in favor of or against its own theory. Likewise, moving beyond the 1992 article and considering CPT’s entire “functional menagerie” (Stott, 2006) of potential utility and weighting functions, it is unclear to us whether these modifications make the theory’s scope or parsimony any less ambiguous and in what way. We are not aware of any consensus in the field as to how to weigh the many versions of the theory against each other and against competing theories or how to determine which version of the theory provides unambiguously the best trade-off between scope and parsimony across all possible stimuli and tasks.

The primary purpose of this article is not to question the validity of CPT as a theory or to provide further grounds to endorse it but, rather, to call attention to the ambiguity of the evidence provided by the authors in support of their own theory. Nor is our goal to single out Tversky and Kahneman for a practice that appears rather widespread. Our goal is conceptual: How should one really think of theoretical scope in psychology? History has been repeating itself in that much behavioral decision research turns CPT’s limitations against it. Scientific cost-benefit analysis in decision science all too often appears to focus on highlighting the cost of others’ theories and the benefits of one’s own proposals. The resulting fault lines have left us with various ‘camps’: Some endorse EUT as their preferred theoretical idealization. Others consider CPT as their sweet spot for a theory of risky choice. Meanwhile, countless articles expand or modify prior theories to accommodate behavioral regularities or irregularities that have been reported as evidence against those theories. Over time, each new generation has tended to develop its new proposals by calling out limitations in previous ideas with little attention to the limitations of the new theses (for examples of notable exceptions, see Brandstätter et al., 2006, 2008; Loomes, 2010, who acknowledged and specified limitations of their own models) and, perhaps more importantly, with little discussion about what limitations are acceptable or unacceptable. The literature in risky choice often appears to follow a three-pronged research strategy: (a) Scrutinize the old theory by showing certain weaknesses and promote the new theory by showing that it overcomes that particular set of weaknesses, (b) leave it to others to explore the new proposed theory’s weaknesses, and (c) either ignore the other camps or defend the new theory vigorously against their challenges. We see little effort toward reconciliation among schools of thought that highlight different aspects of and approaches to decision-making. We also detect little effort to weigh strengths and weaknesses of competing theories in a comprehensive manner.

Although our discussion centered on decision-making, our conclusions apply to the discipline more widely. In our view, psychology should move from using theoretical scope primarily as a bludgeon to attack others’ theories and proceed toward pursuing more constructive goals. Every scientific theory has some limitations, especially in psychology. When proposing a new idea, behavioral scientists should make every effort to spell out the intended scope of this new theory. Proposals for new theories are far more interesting when they also delineate what would constitute critical tests, what would qualify as refutation of a new proposal, what is considered beyond a theory’s intended scope, and to whom the theory applies, when, where, and why. Likewise, more adversarial collaborations between camps would help bring sense to the balkanized landscape of entrenched schools of thought (for useful guidelines on how to run such projects, see, e.g., Mellers et al., 2001, Table 1).

So, what is one to make of the fact that half of Tversky and Kahneman’s (1992) participants in one of their studies appear to have violated their own theory on one stimulus? What is one to make of the internal inconsistencies among reported findings within one article? It is unclear how to weigh the evidence in favor of CPT on the one hand and the limitations of CPT on the other hand against the corresponding strengths and weaknesses of competing theories. Statistical science actively studies the trade-off between complexity and parsimony in statistical models by counting parameters and degrees of freedom, computing heuristic model selection indices such as AIC and BIC, and applying quantitative model selection tools such as Bayes’s factors. As Yarkoni (2022) argued in somewhat different words, the ‘sampling’ of stimuli, participants, and design features of psychological research effectively hides uncounted degrees of freedom in the data. Psychology still needs to properly define theoretical scope and theoretical parsimony beyond post hoc statistical models. The discipline should move beyond weighing, say, ‘good’ and ‘bad’ stimuli or study designs heuristically and develop methods, concepts, and standards for weighing theoretical scope against scientific simplicity. A first step is for scholars of different schools of thought to cooperate more systematically in synergizing the strengths of different theories. A second and easy-to-implement step is for scholars to specify what they mean by “diagnostic stimuli” and/or “critical tests” not only for existing theories but also for their own proposed new theory. A third step is for scholars to be more cognizant that every behavioral theory has limitations and therefore to spell out, as explicitly as they can, what scope they envision for their proposed theory.
