SpaceP, Kyiv (2026)

18/05/2022

Caveats
There exist a number of caveats for both the proposed Bayesian meta-analysis approach specifically and meta-analysis in general. The main danger is that researchers treat the outcome of a meta-analysis as definitive without taking into account the assumptions and limitations of the approach. In general, there are many uncertainties when applying meta-analysis; the proposed approach attempts to address one of these uncertainties (i.e., whether a fixed-effect or random-effects model should be used) using Bayesian model averaging. One uncertainty that is not addressed by the approach is whether the assumption of a normal distribution of true study effects is plausible. It may be argued that this assumption is problematic for a number of reasons. For example, there may be dependencies between different effect sizes due to including multiple effect sizes from the same articles or multiple studies from the same lab. Moreover, there may be sequential dependencies given that researchers may inform their study designs by reading the literature (this may be less of a concern for many-labs meta-analyses). Furthermore, researchers should be aware that there may be measurement-error and range-restriction issues. A number of methods have been proposed to address these caveats (e.g., Cheung & Chan, 2008; Schmidt & Hunter, 2015; Tipton, 2015). Another caveat is that the presence of publication bias may distort the meta-analytic result. Publication bias can be ruled out when the complete set of studies has been preregistered (e.g., in the form of a Registered Replication Report; Chambers, 2017; van Elk et al., 2015).
Whenever publication bias cannot be ruled out, a number of methods have been proposed for estimating the extent of this publication bias and for correcting the meta-analytic effect size estimate (e.g., Gronau, Duizer, et al., 2017; Simonsohn et al., 2014a, 2014b; van Assen et al., 2015). Furthermore, our lab has recently proposed an extension of the Bayesian model-averaged meta-analysis procedure that takes into account the possibility of publication bias (Bartoš et al., 2020; Maier et al., 2020). In any case, it is important to emphasize that researchers should not blindly trust meta-analysis results but should take into account substantive expertise and knowledge about the limitations of the procedure.

Beyond overall effects
In addition to the key questions Q1 and Q2, researchers may often be interested in incorporating discrete and continuous moderators at the study level. Although we did not discuss this possibility here, the metaBMA package does provide functionality for including moderators. Including moderators in the analysis is one way of accounting for the fact that different subsets of studies might have different latent effect sizes. Another possible way of incorporating and testing this assumption would be to change the distribution of the latent study effects. Instead of assuming a single continuous normal distribution of effect sizes, one could assume a latent mixture of normal distributions and then test how many components are necessary to describe the distribution of latent study effects best (e.g., Moreau & Corballis, 2019).

An additional approach to a Bayesian meta-analysis is to focus on the entire distribution of study effects instead of the overall effect. For instance, Rouder et al. (2019) proposed to test whether all studies in the meta-analytic sample show an effect in the same, expected direction or whether some studies show an opposite effect. An appropriate model for this analysis is one in which both the distribution of the overall effect and the distribution of individual study effects are truncated; the latter truncation is imposed to allow individual study effects in one direction only (upper level of Fig. 1). This model can then be compared with the unconstrained alternative (i.e., the random-effects alternative). Similar tests have been proposed in the clinical literature, in which meta-analysis also serves the purpose of testing whether one treatment is superior for one patient population and another treatment is superior for another patient population (Gail & Simon, 1985). Such a “Does every study show an effect?” analysis is implemented in the metaBMA package.

As a final word of caution, we would like to stress again that, in line with the adage “garbage in, garbage out,” no statistical analysis can provide high-quality inference based on low-quality data that might be the result of problematic study design, shortcomings of the implementation or sample, publication bias, significance chasing, and so on; Bayesian model-averaged meta-analysis is no exception. For instance, one may use the procedure to analyze studies that have not been preregistered; however, the conclusions might need to be interpreted with skepticism if the quality of the included studies is questionable or if the included studies represent a biased sample of all conducted studies in a field. In contrast, when the set of studies is of high quality, preregistered, and possibly even the result of a Registered (Replication) Report, we believe that Bayesian model-averaged meta-analysis can be a valuable tool that allows researchers to address key questions of interest in a principled manner.

Appendix
Changing the prior probabilities of the hypotheses
When computing Bayes factors (BFs) that compare two models, such as BFHf1,Hf0 (see Equation 2 and Equation 3), the prior probabilities of the hypotheses do not affect the resulting BF. For instance, when inserting the expressions for the posterior probabilities in Equation 3, the prior probabilities cancel out:

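\[
\mathrm{BF}_{\mathcal{H}_{f1},\mathcal{H}_{f0}} = \frac{p(\text{data} \mid \mathcal{H}_{f1})\,p(\mathcal{H}_{f1})}{p(\text{data} \mid \mathcal{H}_{f0})\,p(\mathcal{H}_{f0})} \bigg/ \frac{p(\mathcal{H}_{f1})}{p(\mathcal{H}_{f0})} = \frac{p(\text{data} \mid \mathcal{H}_{f1})}{p(\text{data} \mid \mathcal{H}_{f0})}. \tag{6}
\]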
In contrast, when computing inclusion BFs that involve more than two models, the prior probabilities affect the resulting BFs. For instance, when inserting the expressions for the posterior probabilities in Equation 4, the prior probabilities do not cancel out:

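\[
\mathrm{BF}_{10} = \frac{p(\text{data} \mid \mathcal{H}_{f1})\,p(\mathcal{H}_{f1}) + p(\text{data} \mid \mathcal{H}_{r1})\,p(\mathcal{H}_{r1})}{p(\text{data} \mid \mathcal{H}_{f0})\,p(\mathcal{H}_{f0}) + p(\text{data} \mid \mathcal{H}_{r0})\,p(\mathcal{H}_{r0})} \bigg/ \frac{p(\mathcal{H}_{f1}) + p(\mathcal{H}_{r1})}{p(\mathcal{H}_{f0}) + p(\mathcal{H}_{r0})}. \tag{7}
\]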
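The contrast between Equation 6 and Equation 7 can be checked numerically. The following is a minimal Python sketch, not the metaBMA implementation: the marginal likelihoods in `ml` are made-up illustrative numbers, and the function names are hypothetical. It shows that the two-model BF is unchanged when the prior probabilities change, whereas the inclusion BF is not.

```python
def pairwise_bf(ml_a, ml_b):
    """Two-model Bayes factor: ratio of marginal likelihoods (Equation 6, right-hand side)."""
    return ml_a / ml_b


def pairwise_bf_from_odds(ml_a, ml_b, prior_a, prior_b):
    """Same BF written as posterior odds divided by prior odds; the priors cancel."""
    posterior_odds = (ml_a * prior_a) / (ml_b * prior_b)
    prior_odds = prior_a / prior_b
    return posterior_odds / prior_odds


def inclusion_bf10(ml, prior):
    """Inclusion BF for an effect (Equation 7): both H1 models versus both H0 models."""
    posterior_odds = (ml["f1"] * prior["f1"] + ml["r1"] * prior["r1"]) / (
        ml["f0"] * prior["f0"] + ml["r0"] * prior["r0"]
    )
    prior_odds = (prior["f1"] + prior["r1"]) / (prior["f0"] + prior["r0"])
    return posterior_odds / prior_odds


# Hypothetical marginal likelihoods p(data | H); NOT taken from the example data.
ml = {"f0": 8.0, "r0": 2.0, "f1": 1.0, "r1": 0.2}

uniform = {"f0": 0.25, "r0": 0.25, "f1": 0.25, "r1": 0.25}
favour_f0 = {"f0": 0.70, "r0": 0.10, "f1": 0.10, "r1": 0.10}

for label, prior in [("uniform", uniform), ("favour Hf0", favour_f0)]:
    bf_pair = pairwise_bf_from_odds(ml["f1"], ml["f0"], prior["f1"], prior["f0"])
    print(label, "pairwise:", bf_pair, "inclusion:", inclusion_bf10(ml, prior))
```

With these illustrative inputs, the pairwise BF stays at the ratio of the two marginal likelihoods under both prior settings, while the inclusion BF moves because the prior probabilities act as weights inside the sums.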
Here we demonstrate the effect of changing the prior probabilities of the hypotheses using the self-concept maintenance example. Specifically, we show how the posterior probabilities of the hypotheses and the inclusion BFs change when (a) increasing the prior probability of the winning hypothesis Hf0 from 0.25 to 0.70 and (b) increasing the prior probability of the worst hypothesis Hr1 from 0.25 to 0.70.

The remaining prior probability, 0.30, is distributed evenly across the other three hypotheses (i.e., each of the remaining hypotheses is assigned prior probability 0.10).
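As a rough illustration of this setup, the sketch below builds such a prior vector and applies Bayes' rule to obtain posterior model probabilities. It is a sketch under stated assumptions: the marginal likelihoods are invented for illustration (they are not the values from the self-concept maintenance example), and `spread_prior` and `posterior_probs` are hypothetical helper names.

```python
def spread_prior(hypotheses, favoured, mass=0.70):
    """Give `mass` to one hypothesis and split the remainder evenly over the others."""
    rest = (1.0 - mass) / (len(hypotheses) - 1)
    return {h: (mass if h == favoured else rest) for h in hypotheses}


def posterior_probs(ml, prior):
    """Posterior model probabilities: p(H | data) is proportional to p(data | H) p(H)."""
    evidence = sum(ml[h] * prior[h] for h in ml)
    return {h: ml[h] * prior[h] / evidence for h in ml}


# Hypothetical marginal likelihoods p(data | H), chosen only for illustration.
ml = {"f0": 8.0, "r0": 2.0, "f1": 1.0, "r1": 0.2}

for favoured in ("f0", "r1"):
    prior = spread_prior(list(ml), favoured)
    post = posterior_probs(ml, prior)
    ranking = sorted(post, key=post.get, reverse=True)
    print("prior mass 0.70 on", favoured, "-> posterior ranking:", ranking)
```

The posterior probabilities always sum to 1, and a large prior mass on a poorly supported hypothesis can lift it past a better-supported competitor in the ranking without making it the top-ranked model.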

Increasing the prior probability of Hf0
Hypotheses posterior probabilities
Table 2 displays the prior probabilities of the hypotheses and the posterior probabilities of the hypotheses for each of the three different prior specifications for μ. Although the numbers changed, the ordering of the posterior probabilities is identical to the one obtained when using equal prior probabilities for all four hypotheses: For all prior specifications, the fixed-effect null hypothesis (Hf0) receives the most posterior probability, followed by the random-effects null hypothesis (Hr0), the fixed-effect alternative hypothesis (Hf1), and the random-effects alternative hypothesis (Hr1).

Table 2. Prior and Posterior Probabilities of the Four Hypotheses of Interest

Model-averaged BF for an overall effect
For the default (two-sided) prior setting, BF10 ≈ 0.077. Consequently, BF01 ≈ 12.987, which indicates strong evidence for the absence of an effect. Recall that equal prior probabilities for all four hypotheses yielded BF01 ≈ 8.696, which indicates moderate evidence for the absence of an effect. For the default (one-sided) prior setting, BF10 ≈ 0.016. Consequently, BF01 ≈ 62.5, which indicates very strong evidence for the absence of an effect. Equal prior probabilities for all four hypotheses yielded BF01 ≈ 47.619, which also indicates very strong evidence for the absence of an effect. For the informed (one-sided) prior setting, BF10 ≈ 0.004. Consequently, BF01 ≈ 250, which indicates extreme evidence for the absence of an effect. Equal prior probabilities for all four hypotheses yielded BF01 ≈ 200, which also indicates extreme evidence for the absence of an effect. In sum, the inclusion BFs based on the different settings of the prior probabilities of the four hypotheses (see Table 2) qualitatively agree with the ones obtained when using equal prior probabilities: There is evidence for the absence of an effect. However, they differ in the degree of evidence for the absence of an effect.
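The BF01 values reported here are reciprocals of the corresponding BF10 values, and the verbal labels appear to follow fixed cut-offs. The sketch below makes both steps explicit. The thresholds (3, 10, 30, 100) are an assumption inferred from how the labels are applied in this appendix, and the function names are hypothetical.

```python
def invert_bf(bf10):
    """Evidence for the null relative to the alternative: BF01 = 1 / BF10."""
    return 1.0 / bf10


def evidence_label(bf):
    """Verbal label for a Bayes factor >= 1 (assumed cut-offs: 3, 10, 30, 100)."""
    if bf < 1:
        raise ValueError("invert the Bayes factor first so that it is >= 1")
    for upper, label in [(3, "anecdotal"), (10, "moderate"),
                         (30, "strong"), (100, "very strong")]:
        if bf < upper:
            return label
    return "extreme"


# BF10 values quoted in the text for the three prior settings.
for bf10 in (0.077, 0.016, 0.004):
    bf01 = invert_bf(bf10)
    print(f"BF10 = {bf10}: BF01 = {bf01:.3f} ({evidence_label(bf01)})")
```

Because the labels are step functions of a continuous quantity, two analyses can agree qualitatively while receiving different labels, which is the pattern described in this subsection.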

Model-averaged BF for heterogeneity
For the default (two-sided) prior setting, BFrf ≈ 0.119. Consequently, BFfr ≈ 8.403, which indicates moderate evidence for the absence of heterogeneity. Recall that equal prior probabilities for all four hypotheses yielded BFfr ≈ 5.291, which also indicates moderate evidence for the absence of heterogeneity. For the default (one-sided) prior setting, BFrf ≈ 0.111. Consequently, BFfr ≈ 9.009, which indicates moderate evidence for the absence of heterogeneity. Equal prior probabilities for all four hypotheses yielded BFfr ≈ 5.263, which also indicates moderate evidence for the absence of heterogeneity. For the informed (one-sided) prior setting, BFrf ≈ 0.107. Consequently, BFfr ≈ 9.346, which indicates moderate evidence for the absence of heterogeneity. Equal prior probabilities for all four hypotheses yielded BFfr ≈ 5.263, which also indicates moderate evidence for the absence of heterogeneity. In sum, the inclusion BFs based on the different settings of the prior probabilities of the four hypotheses (see Table 2) qualitatively agree with the ones obtained when using equal prior probabilities: There is evidence for the absence of heterogeneity. However, they differ in the degree of evidence for the absence of heterogeneity.

Increasing the prior probability of Hr1
Hypotheses posterior probabilities
Table 3 displays the prior probabilities of the hypotheses and the posterior probabilities of the hypotheses for each of the three different prior specifications for μ. Although the numbers changed, the ordering of the posterior probabilities is similar to the one obtained when using equal prior probabilities for all four hypotheses: For all prior specifications, the fixed-effect null hypothesis Hf0 receives the most posterior probability, followed by the random-effects null hypothesis Hr0. However, now the fixed-effect alternative hypothesis Hf1 receives less posterior probability than the random-effects alternative hypothesis Hr1.

Table 3. Prior and Posterior Probabilities of the Four Hypotheses of Interest

Model-averaged BF for an overall effect
For the default (two-sided) prior setting, BF10 ≈ 0.056. Consequently, BF01 ≈ 17.857, which indicates strong evidence for the absence of an effect. Recall that equal prior probabilities for all four hypotheses yielded BF01 ≈ 8.696, which indicates moderate evidence for the absence of an effect. For the default (one-sided) prior setting, BF10 ≈ 0.011. Consequently, BF01 ≈ 90.909, which indicates very strong evidence for the absence of an effect. Equal prior probabilities for all four hypotheses yielded BF01 ≈ 47.619, which also indicates very strong evidence for the absence of an effect. For the informed (one-sided) prior setting, BF10 ≈ 0.003. Consequently, BF01 ≈ 333.333, which indicates extreme evidence for the absence of an effect. Equal prior probabilities for all four hypotheses yielded BF01 ≈ 200, which also indicates extreme evidence for the absence of an effect. In sum, the inclusion BFs based on the different settings of the prior probabilities of the four hypotheses (see Table 3) qualitatively agree with the ones obtained when using equal prior probabilities: There is evidence for the absence of an effect. However, they differ in the degree of evidence for the absence of an effect.

Model-averaged BF for heterogeneity
For the default (two-sided) prior setting, BFrf ≈ 0.076. Consequently, BFfr ≈ 13.158, which indicates strong evidence for the absence of heterogeneity. Recall that equal prior probabilities for all four hypotheses yielded BFfr ≈ 5.291, which indicates moderate evidence for the absence of heterogeneity. For the default (one-sided) prior setting, BFrf ≈ 0.054. Consequently, BFfr ≈ 18.519, which indicates strong evidence for the absence of heterogeneity. Equal prior probabilities for all four hypotheses yielded BFfr ≈ 5.263, which indicates moderate evidence for the absence of heterogeneity. For the informed (one-sided) prior setting, BFrf ≈ 0.049. Consequently, BFfr ≈ 20.408, which indicates strong evidence for the absence of heterogeneity. Equal prior probabilities for all four hypotheses yielded BFfr ≈ 5.263, which indicates moderate evidence for the absence of heterogeneity. In sum, the inclusion BFs based on the different settings of the prior probabilities of the four hypotheses (see Table 3) qualitatively agree with the ones obtained when using equal prior probabilities: There is evidence for the absence of heterogeneity. However, they differ in the degree of evidence for the absence of heterogeneity.

Summary
In sum, changing the prior probabilities of the hypotheses has, as expected, an effect on the posterior probabilities of the hypotheses. It also has an effect on the inclusion BFs, that is, on the degree of model-averaged evidence. However, in this particular example, and for the particular changes to the prior probabilities that we considered, it does not change the qualitative overall conclusions that there is evidence for the absence of an effect and that there is evidence for the absence of heterogeneity. In general, we believe that unless there is strong prior knowledge that suggests setting the prior probabilities differently, it is prudent to set the prior probabilities of all four hypotheses uniformly to 0.25.

Transparency

17/05/2022

Lexical norm data sets and megastudies in psycholinguistics
Similar to behavioral genetics, psycholinguistics has witnessed a number of developments that have made it amenable to the introduction of evaluation benchmarks. The long-standing popularity of lexical decision paradigms has guaranteed a certain degree of methodological consensus, especially in research on lexical processing and semantics, and the use of large data sets has become common practice.

A number of large-scale lexical norms data sets have been developed and publicly shared (especially for English and Dutch) over the last decades. The first significant effort in this direction was the MRC Psycholinguistic database (Coltheart, 1981), which gathered semantic, syntactic, and lexical information for 98,538 words, although many of these norms are now obsolete. More recent megastudies (Kessler et al., 2002; Spieler & Balota, 1997; Treiman et al., 1995; for a comprehensive list, see http://crr.ugent.be/programs-data/megastudy-data-available) have indexed increasingly large portions of the lexicon on wider sets of variables (Brysbaert et al., 2014; Brysbaert & New, 2009; Chateau & Jared, 2003; Kuperman et al., 2012; Tucker et al., 2019; Warriner et al., 2013). Among the most notable examples is the English Lexicon Project data set (Balota et al., 2007), which includes a wide range of psycholinguistic norms and reaction time data gathered from a large participant sample for around 40,000 words and around 40,000 nonwords. More recently, data sets explicitly targeting psychologically and neurologically grounded features (Binder et al., 2016; Lynott et al., 2020) and word association norms (De Deyne et al., 2016, 2019) have been developed and made publicly available.

The development of large-scale data sets has not only sped up the transition to big data in psycholinguistics but has also contributed to a shift away from significance testing in favor of an increased focus on model performance, although this is mostly still based on goodness-of-fit metrics. Furthermore, the use of large-scale reference data sets has enabled convergence on a set of consensus predictive tasks of common interest and on the operationalization of relevant variables, which paves the way for the introduction of common evaluation benchmarks.

Other potential applications
Beyond these examples, there are many other domains in which existing resources could be adapted into useful benchmarks or in which entirely new data sets could be acquired with this goal in mind. As a general heuristic, we expect that an emphasis on benchmarking will prove most productive in fields that have an applied focus, a robust methodological apparatus (solid constructs and consistent operationalizations), and/or agreement on core empirical problems and target metrics. However, there is no predefined set of criteria defining which fields are likely (or eligible) to develop into prominently prediction-based and benchmark-based fields. Ultimately, successful introduction of common evaluation benchmarks will depend on the community’s willingness to engage in constructive methodological rethinking and on individual efforts to pioneer benchmark development. Even in fields that do not necessarily check the above boxes, thinking about potentially useful benchmarks is possible, and it can draw much needed attention to the practical implications (or lack thereof) of existing research programs. Below, we provide a few examples drawn from diverse areas of psychology that vary considerably in feasibility and ambitiousness:

In educational psychology, prediction of student outcomes from individual features (e.g., demographics, proxies for personality/cognitive profiles) and features of the learning portfolio is an example of a task with immediate practical applicability. High-performing models could, in fact, not only be used for theoretical purposes, that is, to gain a better understanding of how individual profiles and teaching strategies interact in shaping successful learning experiences, but also deployed as tools for education professionals to tailor teaching strategies to target audiences or for individuals to optimize their educational choices (e.g., choice of higher education programs) on the basis of their own profile (for an overview of existing work using machine learning for precision education, see Luan & Tsai, 2021). Efforts to develop similar tools driven by the research communities and conducted in partnership with institutions would be highly beneficial to the public and provide a transparent, ethical, nonprofit framework that could help improve public education systems. A number of useful resources already exist that could be gathered into large data sets to train and evaluate relevant models. A number of international surveys are periodically conducted that target student outcomes at different ages and in a variety of domains and that also gather information about student background, demographics, attitudes, home environment, and school characteristics. Relevant resources include Program for International Student Assessment reports as well as data from the Progress in International Reading Literacy Study, Trends in International Mathematics and Science Study, and Program for International Assessment of Adult Competencies surveys. But there is also considerable opportunity for development of valuable new data sets. 
Massive open online courses, for example, are potentially rich data sources particularly suited to the purpose of benchmark development both by virtue of their scale and the advantages yielded by their natively digital format. Course metadata (e.g., duration and type of video lectures, degree of interaction between students, number of practical exercises, and amount of group work involved), automated annotations of the teaching material (e.g., quantification of linguistic styles and prosodic traits in videos), student feedback, and background information on learners’ profiles are some examples of features that could be extracted and used to develop models targeting prediction of student outcomes.

Traditionally, personality psychology has almost exclusively focused on profiling personality by using questionnaires and self-reports. Yet personality assessments based on traditional approaches have low predictive validity when tested on real-world outcomes, especially when aggregated traits are used as features rather than individual items (see Mõttus et al., 2017; Revelle et al., 2021; Saucier et al., 2020; Wiernik et al., 2020). Furthermore, self-reports are affected by biases (Kreitchmann et al., 2019; Müller & Moshagen, 2019), and standard questionnaires are extremely time-consuming, often consisting of hundreds of items, which intrinsically limits scalability. The availability of social media data is a potential game changer for data-driven approaches to personality modeling given that personality traits are likely expressed in online behavioral patterns. Building large data sets pairing social media data (e.g., Reddit submissions, https://reddit.com) with known personality scores and/or indices of real-world outcomes (shared, of course, with consent and safe data-protection protocols) could pave the way to the development of innovative (and potentially predictively powerful) approaches to personality modeling. Some attempts in this direction have already been made (e.g., predicting personality from Facebook likes or word use, see R. M. Brown et al., 2018; Park et al., 2015; or from musical preferences, see Nave et al., 2018). But in general, because of the scarce availability of suitable open-access data sets, these approaches have been mostly beyond reach for the academic community and remain a prerogative of corporate research.

A large proportion of clinical research has real-world prediction as its immediate goal and aims at early detection, prevention, monitoring, or treatment of clinical conditions. Extensive efforts are being made in clinical psychology to standardize measures and screening procedures administered to participants at intake of various studies as well as outcome measures administered to participants at fixed follow-up intervals. These developments are highly favorable to the creation of benchmarks (potentially paired with competitive challenges) enabling evaluation of models predicting, for example, changes in measures of global function, attrition, or other clinically useful targets using intake measures (either collapsing across all treatments or separately for specific interventions).

Prediction research in political and social psychology can inform policymaking or even guide intervention aimed at countering the emergence of toxic societal dynamics. A large amount of work has been devoted, for example, to modeling the dynamics of polarization. Models able to predict the emergence of extreme-polarization scenarios or even explicit violent outbursts of social conflict would be a highly valuable tool for policymakers and other agents involved in shaping public discourse. Once again, social media are highly relevant data sources. Data from social networks allow modeling of semantic and pragmatic aspects of public discourse and the dynamics of social interactions (e.g., the formation of echo chambers), both likely linked to levels of social polarization. Pairing these data with explicit indices of social polarization (e.g., periodic surveys) or conflict outbursts (e.g., relevant news events) would provide researchers with valuable resources to conduct predictive work on the matter. In the domain of political psychology, many other interesting challenges could be tackled by harvesting and adapting existing resources. Political prediction markets are another promising example. Prediction markets data could be leveraged not only to inform the understanding of what drives changes in political affiliation and beliefs but also to develop and evaluate models able to forecast the outcome of key political events (Wijesinghe & Rodrigues, 2012).

In social psychology, efforts have been made to develop large data sets that enable fine-grained modeling of social network structures alongside individuals’ psychological traits and behavior. One example is the Copenhagen Networks Study, which involved tracking the location and digital interactions of a large cohort of students over an extended period of time and profiling their personality and cognitive traits (see Sapiezynski et al., 2019). Similar longitudinal data sets could be leveraged to develop growth models predicting, for example, student dropout, performance, or likelihood to seek mental health help.

Common Pitfalls and Concerns
Benchmarks have the potential to revolutionize model evaluation practices in psychology in favor of higher reliability, objectivity, and practical utility. They are not, however, exempt from potential drawbacks. Furthermore, the strong emphasis placed on predictive validity is likely to raise (more or less well founded) concerns in the research community. We now review several caveats, potential pitfalls, and common objections that deserve special attention and provide suggestions on strategies to sidestep them wherever possible.

Generalization is hard
Some metrics are better than others in yielding trustworthy information on model performance. We have already highlighted how metrics that explicitly account for the risk of overfitting and value out-of-sample predictive accuracy tend to be more reliable than commonly used goodness-of-fit metrics. Still, no metric is perfect. The extent to which performance estimates from predictive metrics generalize to unseen data depends on characteristics of the data set, even for robust metrics such as cross-validation criteria. Small and/or nonrepresentative samples often yield unreliable estimates. High predictive accuracy on data generated using a single experimental paradigm, for instance, may be an artifact of methodology-specific biases in the sample (e.g., systematic measurement error), and those accuracy levels may not generalize to samples collected using different methodologies. Metrics such as the generalization criterion (Busemeyer & Wang, 2000) attempt to address this issue by imposing the constraint that validation sets should come from experimental designs different from those used in training. This is, however, only one of the many ways in which samples can fall short of being fully representative of the phenomenon or population of interest. Nonrepresentative participant samples or stimulus sets may also affect generalizability to novel data.
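The gap between goodness of fit and out-of-sample accuracy can be illustrated with a small, self-contained Python sketch. Everything in it is an assumption made for illustration: the data are synthetic, the two models (a least-squares line and a nearest-neighbour "memorizer") are stand-ins, and the helper names are hypothetical. Note also that the folds are drawn from the same synthetic design, so this is plain k-fold cross-validation, not the generalization criterion.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns a prediction function."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_nearest(xs, ys):
    """1-nearest-neighbour 'memorizer': predicts the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(predict, xs, ys):
    """Mean squared prediction error on the given points."""
    return mean([(predict(x) - y) ** 2 for x, y in zip(xs, ys)])


def kfold_mse(fit, xs, ys, k=5):
    """Average held-out MSE over k folds (fold i holds out indices i, i + k, ...)."""
    fold_errors = []
    for i in range(k):
        held_out = set(range(i, len(xs), k))
        train_x = [x for j, x in enumerate(xs) if j not in held_out]
        train_y = [y for j, y in enumerate(ys) if j not in held_out]
        test_x = [xs[j] for j in sorted(held_out)]
        test_y = [ys[j] for j in sorted(held_out)]
        fold_errors.append(mse(fit(train_x, train_y), test_x, test_y))
    return mean(fold_errors)


# Synthetic data: a linear trend plus Gaussian noise (seeded for reproducibility).
rng = random.Random(0)
xs = [i / 10 for i in range(40)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

for name, fit in [("line", fit_line), ("memorizer", fit_nearest)]:
    in_sample = mse(fit(xs, ys), xs, ys)
    held_out = kfold_mse(fit, xs, ys)
    print(f"{name}: in-sample MSE = {in_sample:.3f}, 5-fold MSE = {held_out:.3f}")
```

The memorizer reproduces its training data exactly, so its in-sample error is zero by construction, yet its held-out error is not. A goodness-of-fit metric would therefore rank it first, which is the kind of ranking that out-of-sample evaluation is meant to guard against.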

Note that generalization issues cannot always be fully solved through adoption of good practices such as cross-validation and out-of-sample evaluation. Concerns about result replicability and generalizability have also been raised for machine-learning models (Emmery et al., 2019; National Academies of Sciences, Engineering, and Medicine, 2019; Vijayakumar & Cheung, 2019). As we discuss in the next sections, careful design of large, representative samples; adoption of multibenchmark model evaluation; and dynamic updating of reference benchmarks (as in the ImageNet case) are further measures that could help limit (although not fully eradicate) the risk of generalization flaws in predictive contexts.

Goodhart’s law: when a measure becomes a target
Specific benchmarks can prove extremely useful in certain stages of development of a discipline. But their limitations tend to emerge as soon as they become the primary or only criterion to evaluate model performance. As Goodhart’s law states, “when a measure becomes a target, it ceases to be a good measure.” As the ImageNet versus ObjectNet example discussed earlier illustrates, when optimizing for a specific task becomes the exclusive focus of model engineering efforts, researchers may lose sight of the fundamental goal of generating good predictions on other (e.g., real-world) instances of the task at stake. The development of good benchmarks should consequently always be viewed as an ongoing process in which tasks and metrics are continuously evaluated, reevaluated, and updated.

One way to mitigate Goodhart’s law is to develop multidimensional benchmarks (i.e., batteries of tasks on which models are jointly evaluated). The concept of transfer learning in machine learning captures a similar principle—models trained to optimize performance on one set of tasks are evaluated according to their performance on multiple tasks (potentially including entirely new ones) without updating (or fine-tuning) parameters separately for each task (for an example and an overview of transfer learning in natural language processing, see McCann et al., 2018; Ruder et al., 2019).

Once again, note that these measures cannot protect from all potential side effects involved in placing exclusive emphasis on predictive accuracy. Recent literature has highlighted a number of scientific and ethical shortcomings associated with common machine-learning practices, especially with respect to recent developments in natural language processing (for a comprehensive overview, see Bender et al., 2021; Bender & Koller, 2020). These may not apply to psychology in the short term, but it is important to factor them in when designing benchmarks (e.g., by developing evaluation tools attuned to ethical concerns) and learn critically from the experience of machine learning rather than blindly imitating its mistakes.

Beyond predictive accuracy: validity, parsimony, and the ethical cost
High predictive accuracy is, of course, not the only criterion models should be evaluated on in psychology, and our advocacy of benchmarking is not meant to discourage researchers from taking other criteria into account when designing or evaluating models. Indeed, good predictive performance can sometimes be achieved by models that fail basic validity tests (J. I. Myung & Pitt, 2018).

Criteria such as simplicity and parsimony are also often important considerations in the choice of model. There are several reasons for this. A sophisticated algorithm that performs well but is too complicated for a clinician to effectively apply in a real-world setting may be less useful than a “fast-and-frugal” heuristic that performs more poorly under optimal conditions (Gigerenzer et al., 1999). And in cases in which predictive performance is roughly comparable, simpler models are generally more desirable; fewer parameters reduce the risk of overfitting and the amount of compute required for training and deploying. Conversely, large models can easily become prohibitive in terms of computing resources; they may come with large environmental costs (Strubell et al., 2019), which disproportionately affect populations that are not from Western, educated, industrialized, rich, and democratic societies and pose ethical dilemmas; and their complexity may not be entirely justified in terms of performance gains. Some state-of-the-art transformer language models, for example, have proven to be overparameterized (Clark et al., 2019; Kovaleva et al., 2019): They contain many idle architectural units, and their performance levels are matched by smaller models trained using parsimonious schemes such as model distillation (e.g., Sanh et al., n.d.). It is thus important to incentivize researchers to strive for parsimonious solutions whenever possible—for example, by introducing the custom of quantifying and reporting complexity and resource requirements alongside any mention of predictive performance (several proposals in this direction have been advanced in machine learning; e.g., see Henderson et al., 2020; Rogers, 2019).

At the same time, increased focus on benchmarking could potentially help discourage the somewhat common tendency to dismiss complex models out of hand merely because of their complexity or because of more subjective considerations. Qualitative criteria such as model plausibility, explanatory adequacy, or faithfulness, which reflect the extent to which models are grounded in existing literature or theoretical considerations, are often also advocated as central to model evaluation (I. J. Myung & Pitt, 2002; J. I. Myung & Pitt, 2018). However, the question of what role these criteria should play relative to clever model engineering is an open one. We lean toward a solution in which qualitative constraints should, in general, not play a role in benchmark-based evaluation (e.g., model scores should not depend on a panel of judges’ subjective assessment of theoretical elegance).

This does not mean that considerations such as model complexity, computational efficiency, and so on cannot be explicitly factored into a model’s score on a benchmark, only that the model’s score should be a deterministic function of clearly stated metrics and should not depend on irreproducible components. Our position should also not be taken to imply that subjective considerations are unimportant in model evaluation; indeed, we think that a large element of subjectivity is probably unavoidable in psychology given the complexity and underdetermined nature of most psychological phenomena. But at the very least, researchers should strive to make it clear where objective metrics end and subjective valuations begin.
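One way to picture a score that is a deterministic function of clearly stated metrics is the minimal sketch below. The `Submission` fields, the penalty weights, and the example numbers are all hypothetical and chosen purely for illustration; they are not taken from any existing benchmark. The only point is that accuracy, model size, and compute enter the score through an explicit, reproducible formula.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Submission:
    """Hypothetical benchmark entry; all fields are illustrative assumptions."""
    name: str
    accuracy: float      # held-out predictive accuracy in [0, 1]
    n_parameters: int    # number of trainable parameters
    train_hours: float   # compute spent on training, in hours


def benchmark_score(sub, complexity_weight=0.01, compute_weight=0.001):
    """Deterministic score: accuracy minus explicit, published penalties.

    The weights are arbitrary placeholders; a real benchmark would have to
    justify and document its own choices.
    """
    complexity_penalty = complexity_weight * math.log10(max(sub.n_parameters, 1))
    compute_penalty = compute_weight * sub.train_hours
    return sub.accuracy - complexity_penalty - compute_penalty


# Made-up entries, used only to exercise the scoring function.
submissions = [
    Submission("linear_baseline", accuracy=0.71, n_parameters=20, train_hours=0.1),
    Submission("large_transformer", accuracy=0.78, n_parameters=110_000_000, train_hours=40.0),
]

# Same inputs always yield the same ranking; no judges' panel is involved.
leaderboard = sorted(submissions, key=benchmark_score, reverse=True)
for entry in leaderboard:
    print(f"{entry.name}: {benchmark_score(entry):.4f}")
```

With these particular (arbitrary) weights, the smaller model ranks first despite its lower raw accuracy; different weights would yield a different ordering, which is why the weights themselves need to be stated alongside the results.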

Interpretability
As we argued above, to achieve maximal reliability and utility, benchmark-based evaluation should rely on metrics that adequately capture the predictive validity (and, therefore, the generalizability) of a model. However, one particularly common objection to a push for greater emphasis on predictive accuracy is that models that yield higher predictive accuracy are intrinsically less interpretable than simpler (e.g., linear) ones and thus fail to contribute to understanding the empirical phenomenon of interest.

It is true that in many cases, the models yielding best performance are the more complex ones because more parameters reduce prediction bias and offer the flexibility needed to capture complex patterns in the data. It is also true that parameters from complex models (e.g., weights in deep-learning models) often lack a straightforward direct interpretation. However, the argument that higher complexity corresponds to lower interpretability—and for that matter, that interpretability is intrinsically desirable—builds on a number of questionable assumptions. Debunking these fallacies is especially important because these arguments are among the reasons why complex models (e.g., deep-learning models) are often “discarded for scientific purposes such as theory building and testing [omit understanding]” (Shmueli, 2010, p. 291).

One of the assumptions behind the interpretability counterargument is that simpler (linear) models are more straightforwardly interpretable than complex (e.g., nonlinear) ones. This statement is far from uncontroversial. Direct interpretability of individual parameter estimates in linear models is, in fact, highly dependent on overall properties of the model (i.e., covariates, degree of collinearity, interactions), and although researchers in the explanatory tradition often tend to forget it, parameter estimates are always conditional on the model itself.
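The claim that parameter estimates are conditional on the model can be made concrete with a small sketch. The data below are invented for illustration, and the helper `ols` is a hypothetical, bare-bones least-squares solver written only for this example. The outcome is constructed as exactly `x1 + 2 * x2`, with `x1` and `x2` correlated, so the coefficient attributed to `x1` depends on whether `x2` is included as a covariate.

```python
def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations.

    Illustrative helper only; real analyses should use a numerical library.
    """
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Build X'X and X'y.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)] for i in range(k)]
    b = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        b[i], b[pivot] = b[pivot], b[i]
        for r in range(i + 1, k):
            factor = A[r][i] / A[i][i]
            for c in range(i, k):
                A[r][c] -= factor * A[i][c]
            b[r] -= factor * b[i]
    # Back substitution.
    beta = [0.0] * k
    for i in reversed(range(k)):
        beta[i] = (b[i] - sum(A[i][c] * beta[c] for c in range(i + 1, k))) / A[i][i]
    return beta


# Invented data: x2 is correlated with x1, and y = 1*x1 + 2*x2 without noise.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [a + 2.0 * b for a, b in zip(x1, x2)]

simple = ols([x1], y)       # model: y ~ x1
full = ols([x1, x2], y)     # model: y ~ x1 + x2

print("coefficient of x1 without x2:", round(simple[1], 3))
print("coefficient of x1 with x2:   ", round(full[1], 3))
```

In the model without `x2`, the coefficient of `x1` absorbs part of the effect of the omitted, correlated covariate and is therefore much larger than in the full model, even though the data are identical. Neither number is "the" effect of `x1` in isolation from the model that produced it.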

This point is nicely illustrated by recent neuroimaging and psychology studies (Botvinik-Nezer et al., 2020; Silberzahn et al., 2018) in which independent research teams were asked to test a given set of hypotheses on the same data set. Analytical approaches adopted by the teams can vary along a number of parameters (e.g., choice of covariates, statistical tests, coding procedures) and result in quantitatively different feature coefficients and thus, potentially, in qualitatively different interpretations of the effect of such features. This variability in estimates cannot always be explained away by disparities in performance. Even models achieving comparable predictive accuracy can diverge widely in the importance attributed to individual features (Churchill et al., 2014; Schmah et al., 2010) regardless of the similarity between the models’ structure or analytical form.

The flip side of this assumption is the idea that complex models are intrinsically uninterpretable, or, to put it more mildly, that they are much less interpretable than the cognitive processes they are meant to model. This common misconception likely stems from a certain ambiguity in the way interpretability is defined in the literature, but it does not ultimately stand up to scrutiny under any plausible definition (Lipton, 2018). If interpretability is defined in terms of transparency, that is, as the possibility of directly observing the sequence of algorithmic steps through which a decision/output has been produced, human cognitive processes are obviously no more interpretable than complex models (and probably much less so). If, on the other hand, interpretability is defined as the possibility of providing a post hoc, compact, natural language explanation of why a certain output was produced in response to a certain input, then humans and complex artificial models can, in principle, be equally interpretable. Saliency maps (Simonyan et al., 2013), behavioral testing (Ribeiro et al., 2020), probing methods (Bolukbasi et al., 2016; Bordia & Bowman, 2019; Gardner et al., 2020; Kim et al., 2019; Linzen & Baroni, 2020), game-theoretic feature attribution methods (Lundberg & Lee, 2017), and adversarial attacks (Goodfellow, Pouget-Abadie, et al., 2014; Goodfellow, Shlens, & Szegedy, 2014) are among the techniques that can be used to make the behavior of machine-learning models more transparent (for a more comprehensive overview of interpretability research in machine learning, see Molnar, 2020), which can, in turn, inform theoretical accounts of target phenomena.

As these examples show, emphasis on predictive accuracy and, consequently, increased reliance on complex models in psychology need not result in a loss of interpretability. Likewise, it does not imply relinquishing ambitions of scientific understanding: Even highly complex models can be probed and compared in ways that can inform mechanistic explanations of cognitive phenomena. Once again, machine learning offers some useful examples. A few studies have already shown how systematic comparison of benchmark performance of language models (e.g., Talmor et al., 2019) that differ in the presence/absence and type of self-attention mechanism or in the size of input context (long- vs. short-context models) can be used to corroborate theoretical intuitions on key aspects of human language processing (e.g., the role and time scales of predictive processing and context integration in online language understanding). For instance, in a recent study, Schrimpf et al. (2020) compared the performance of different language model architectures in predicting psycholinguistic and neural language comprehension data and showed that whereas bidirectional attention models perform best on traditional natural language processing tasks, unidirectional language models predict fMRI data best. Recent work has more generally advocated for the development of cognitive benchmarks for language models (Artemova et al., 2020; Hollenstein et al., 2019), which would enable comparative assessment of the cognitive plausibility of natural language processing models but also shed light on key characteristics of the cognitive underpinnings of human linguistic behavior.

The ability to retain control over the heuristics that complex models develop to generate predictions ensures that a stronger emphasis on prediction will not undermine psychology’s ability to cumulatively build knowledge whose scope goes beyond practical applicability. Predicting and generating knowledge about phenomena are incompatible goals only if the latter is reduced to an impoverished notion of explanation, namely inference about (causal) relationships between variables achieved through direct interpretation of model estimates (a notion that, as we argued above, is deeply problematic). There are several alternative and more fruitful pathways to generating knowledge through predictive modeling, some combining computational research practices with insights from experimental design familiar to psychologists. These “mixed” approaches are far from new to psychology and its neighboring fields; similar frameworks have been widely adopted, for example, in agent-based modeling research.

What role for domain expertise?
A final concern associated with an increased emphasis on benchmarking is that achieving good predictive performance on many tasks might turn out to require less traditional domain expertise than many researchers currently suppose. That is, it may turn out that the skills and knowledge bases needed to build good applied models in psychology have relatively little overlap with those emphasized in traditional psychological theorizing. This worry is not entirely without basis; it has been frequently observed that the winners of prediction-focused competitions, even when ostensibly focused on narrowly domain-specific problems, disproportionately tend to be teams of machine-learning experts or data scientists with relatively little, and sometimes no, domain expertise (e.g., AlQuraishi, 2019; Bentzien et al., 2013).

We think that such a prospect—although perhaps uncomfortable in some ways—should, if anything, increase the utility of benchmarks in psychology. If it turned out that achieving good performance on major benchmarks in applied fields such as educational psychology, psychopathology, and I/O psychology primarily requires computational rather than substantive expertise, this would constitute a powerful argument for emphasizing those skills to a greater extent in our training programs. To the degree that we believe that the public’s investment in psychological science is ultimately aimed at improving the human condition, one might even argue that we have a collective moral obligation to subject the field to rigorous challenges of this kind—in much the same way that we have advocated for the objective evaluation of individual models.

Conclusion
Lack of consensus on robust metrics for model evaluation has long hindered progress in psychology. In this article, we argued that introducing benchmarks can help overcome this fundamental issue, and we provided general guidelines and concrete suggestions on how to go about developing metrics that suit the multifaceted profile of the field. Benchmarks inspired by real-world predictive challenges will not only provide psychology with communal evaluation metrics but also motivate the community to redirect efforts toward solving problems with practical implications and high societal relevance.

Address

Kyiv

Website

http://www.dior.com/
