Preparatory steps

Phase 1: Item generation

Fake news

There is a debate in the literature on whether the misinformation items administered in misinformation studies should be actual news items circulating in society, or news items created by experts that are fictional but feature common misinformation techniques. The former approach arguably provides better ecological validity (Pennycook, Binnendyk, et al., 2021), while the latter provides a cleaner and less confounded measure, since it is less influenced by memory and identity effects (van der Linden & Roozenbeek, 2020). Considering these two approaches and reflecting on representative stimulus sampling (Dhami et al., 2004), we opted for a novel approach that combines the best of both worlds. We employed the generative pretrained transformer 2 (GPT-2)—a neural-network-based artificial intelligence developed by OpenAI (Radford et al., 2019)—to generate fake news items (cf. Götz et al., 2022; Hommel et al., 2022). GPT-2 is one of the most powerful open-source text generation tools currently available for free use by researchers. It was trained on eight million text pages, comprises 1.5 billion parameters, and is able to write coherent and credible articles based on just one or a few words of input. Specifically, we asked GPT-2 to generate a list of fake news items inspired by a smaller set of items. This smaller set contained items from five different scales that encompass a wide range of misinformation properties: the Belief in Conspiracy Theories Inventory (BCTI; Swami et al., 2010), the Generic Conspiracist Beliefs scale (GCB; Brotherton et al., 2013), specific Conspiracy Beliefs scales (van Prooijen et al., 2015), the Bullshit Receptivity scale (BSR; Pennycook et al., 2015), and the Discrediting-Emotion-Polarization-Impersonation-Conspiracy-Trolling deceptive headlines inventory (DEPICT; Maertens et al., 2021; Roozenbeek & van der Linden, 2019). We set out to generate 100 items of good quality, but as this is a new approach, we opted for the generation of at least 300 items. More specifically, we let GPT-2 generate thousands of fake news headlines, and discarded any duplicates and clearly irrelevant items (see Supplement S1 for a full overview of all items generated and those that were removed).
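To make the generation step concrete, below is a minimal sketch of prompted headline generation, written in R via reticulate and the Python transformers library. The seed headlines (taken from items quoted later in this paper), the prompt format, and all generation settings are illustrative assumptions rather than the authors' exact configuration.

```r
# Minimal sketch of prompted fake-headline generation with GPT-2.
# Assumes a Python installation with the "transformers" package
# is available to reticulate.
library(reticulate)

transformers <- import("transformers")
generator <- transformers$pipeline("text-generation", model = "gpt2")

# Seed the model with example items in a list format (illustrative prompt)
prompt <- paste(
  "Fake news headlines:",
  "1. A Small Group of People Control the World Economy by Manipulating the Price of Gold and Oil",
  "2. The Government Is Conducting a Massive Cover-Up of Their Involvement in 9/11",
  "3.",
  sep = "\n"
)

out <- generator(prompt, max_length = 120L, num_return_sequences = 5L,
                 do_sample = TRUE)

# Inspect candidates; duplicates and irrelevant completions are then
# discarded by hand, as described above
for (candidate in out) cat(candidate$generated_text, "\n---\n")
```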

Real news

For the real news items, we decided to include items that met each of the following three selection criteria: (1) the news items are actual news items (i.e., they circulated as real news), (2) the news source is the most factually correct (i.e., accurate), and (3) the news source is the least biased (i.e., nonpartisan or politically centrist). To do this, we used the Media Bias/Fact Check database (MBFC; https://mediabiasfactcheck.com/) to select news sources marked as least biased and scoring very high on factual reporting. The news sources we chose were Pew Research (https://www.pewresearch.org/), Gallup (https://www.gallup.com/), MapLight (https://maplight.org/), Associated Press (https://www.ap.org/), and World Press Review (http://worldpress.org/). We also diversified the selection by including the non-US outlets Reuters (https://www.reuters.com/), Africa Check (https://africacheck.org/), and JSTOR Daily (https://daily.jstor.org/). All outlets received the maximum MBFC score at the time of item selection. A full list of the real news items selected can be found in Supplement S1.

Overall, this item-generation process resulted in an initial pool of 413 items. The full list of items we produced and methods through which each of them was obtained can be found in Supplement S1.

Phase 2: Item condensation

To reduce the number of headlines generated in Phase 1, we followed previous scale development research and practices (Carpenter, 2018; Haynes et al., 1995; Simms, 2008) and established an expert committee with misinformation researchers from four different cultural backgrounds: Canada, Germany, the Netherlands, and the United States. Each expert conducted an independent review and classified each of the 413 items generated in Phase 1 as either fake news or real news. All items with a three-fourths expert consensus and matching the correct answer key (i.e., the source veracity category)—a total of 289 items—were selected for the next phase. A full list of the expert judgments and inter-rater agreement can be found in Supplement S1.

Phase 3: Quality control

As a final quality control before continuing to the psychometrics study, the two-person item generation committee, together with an additional third expert—who had not been previously exposed to any of the items—made a final selection of items from Phase 2. Applying a two-thirds expert consensus as the cutoff, we selected 100 items (44 fake news, 56 real news) out of the 289 from the previous stage (i.e., we cut 189 items), thus creating a fairly balanced item pool for empirical probing that contained five times as many items as the final scale we aimed to construct—in keeping with conservative guidelines (Boateng et al., 2018; Weiner et al., 2012). A full list of the item sets selected per expert and expert agreement can be found in Supplement S1.

Implementation

Participants

In line with widespread recommendations to assess at least 300 respondents during initial scale implementation (Boateng et al., 2018; Clark & Watson, 1995, 2019; Comrey & Lee, 1992; Guadagnoli & Velicer, 1988), we recruited a community sample of 452 US residents (for a comprehensive sample description see Table 1). The study was carried out on Prolific Academic (https://www.prolific.co/), an established crowd-working platform that provides competitive data quality (Palan & Schitter, 2018; Peer et al., 2017). Based on the exclusion criteria laid out in the preregistration, we removed incomplete cases, participants who took either an unreasonably short or long time to complete the study (less than 8 minutes or more than 2 hours), participants who failed an attention check, underage participants, and participants who did not live in the United States, retaining 409 cases for data analysis. Of these, 225 participants (i.e., 55.01%) participated in the follow-up data collection eight months later (T2).

Participants received a set remuneration of 1.67 GBP (equivalent to US$ 2.32) for participating in the T1 questionnaire and 1.10 GBP (equivalent to US$ 1.53) for T2.

Procedure, measures, transparency, and openness

The preregistrations for T1 and T2 are available on AsPredicted (https://aspredicted.org/m7vb3.pdf; https://aspredicted.org/js2jz.pdf); any deviations can be found in Supplement S2. The supplement, raw and clean datasets, and all analysis scripts in R can be found in the OSF repository (https://osf.io/r7phc/).

Participants took part in a preregistered online survey. After providing informed consent, participants had to sort the 100 news headlines from Phase 3 (i.e., the items that were retained after the previous three phases) into two categories: Fake/Deceptive and Real/Factual. Participants were told that each headline had only one correct answer. See the preregistration or the Qualtrics files on the OSF repository for the exact survey framing (https://osf.io/r7phc/).

After completing the 100-item categorization task, participants completed the 21 items from the DEPICT inventory (a misleading social media post reliability judgment task; Maertens et al., 2021), a 30-item COVID-19 fact-check task (a classical true/false headline evaluation task; Pennycook, McPhetres, et al., 2020), the Bullshit Receptivity scale (BSR; Pennycook et al., 2015), the Conspiracy Mentality Questionnaire (CMQ; Bruder et al., 2013), the Cognitive Reflection Test (CRT; Frederick, 2005), a COVID-19 compliance index (sample item: “I kept a distance of at least two meters to other people”; 1 = does not apply at all, 4 = applies very much), and a demographics questionnaire (see Table 1 for an overview). Finally, participants were debriefed. Eight months later, the participants were recruited again for a test-retest follow-up survey. In the follow-up survey, after participants provided informed consent to participate, the final 20-item MIST was administered, the same COVID-19 fact-check task (Pennycook, McPhetres, et al., 2020) and CMQ (Bruder et al., 2013) were repeated, a new COVID-19 compliance index was administered, and finally a full debrief was presented. The complete surveys are available in the OSF repository: https://osf.io/r7phc/.

The full study received institutional review board (IRB) approval from the Psychology Research Ethics Committee of the University of Cambridge (PRE.2019.108).

Analytical strategy 1: Exploratory factor analysis (EFA) and item response theory (IRT)

To extract the final MIST-20 and MIST-8 scales from the pre-filtered MIST-100 item pool, we followed an item selection decision tree, which can be found in Supplement S3. Specifically—after ascertaining the general suitability of the data for such procedures—the following EFA- and IRT-based exclusion criteria were employed: (1) factor loadings below .40 (Clark & Watson, 2019; Ford et al., 1986; Hair et al., 2010; Rosellini & Brown, 2021); (2) cross-loadings above .30 (Boateng et al., 2018; Costello & Osborne, 2005); (3) communalities below .40 (Carpenter, 2018; Fabrigar et al., 1999; Worthington & Whittaker, 2006); (4) Cronbach’s α reliability analysis; (5) differential item functioning (DIF) analysis (Holland & Wainer, 1993; Nguyen et al., 2014; Reise et al., 1993); and (6) item information function (IIF) analysis. Finally, we sought to establish initial evidence for construct validity (Cronbach & Meehl, 1955). To do this, we investigated the associations between the MIST scales and the DEPICT deceptive headline recognition task (Maertens et al., 2021) and COVID-19 fact-check (Pennycook et al., 2020; concurrent validity). We further examined the additional predictive accuracy of the MIST in accounting for variance in DEPICT and fact-check scores above and beyond the CMQ (Bruder et al., 2013), BSR (Pennycook et al., 2015), and CRT (Frederick, 2005; incremental validity).

Analytical strategy 2: Exploratory graph analysis (EGA)

In this section we explore an alternative method of scale development based on the new field of exploratory graph analysis (Golino & Epskamp, 2017), which is rooted in network methods. Network methods in psychology gained momentum with the publication of the mutualism model of intelligence (Van Der Maas et al., 2006) and the network perspective on psychopathology (Borsboom, 2008; Borsboom et al., 2011; Cramer et al., 2010), giving rise to a new subfield of quantitative psychology called network psychometrics (Epskamp et al., 2017; Epskamp et al., 2018). Network models are used to estimate the relationships among multiple variables—typically using the Gaussian graphical model (GGM; Lauritzen, 1996)—where nodes (e.g., test items) are connected by edges (or links) that indicate the strength of the association between the variables (Epskamp & Fried, 2018), forming a system of mutually reinforcing elements (Christensen et al., 2020b; Cramer, 2012). Network and latent variable models have been shown to be closely related and can produce model parameters that are consistent with one another (Boker, 2018; Christensen & Golino, 2021c; Epskamp et al., 2017; Golino et al., 2021; Golino & Epskamp, 2017; Marsman et al., 2018). These statistical similarities can be exploited to explore the dimensionality structure of measurement instruments in a new framework termed exploratory graph analysis (Christensen et al., 2019; Golino & Demetriou, 2017; Golino & Epskamp, 2017; Golino et al., 2020a, 2020b).

In network psychometrics (Christensen et al., 2019; Epskamp et al., 2018; Epskamp et al., 2017; Golino & Demetriou, 2017; Golino & Epskamp, 2017; Golino et al., 2020a, 2020b), networks are typically estimated with the Gaussian graphical model (Lauritzen, 1996) via the EBICglasso approach (Epskamp & Fried, 2018). The EBICglasso approach operates by minimizing a penalized log-likelihood function and selecting the best model fit (i.e., the optimum level of sparsity in a network) using the extended Bayesian information criterion (EBIC; Chen & Chen, 2008). As Golino et al. (2022) argue, the use of weighted network models in psychology opened the door to applying network science methods developed in other areas of science to psychological problems such as dimensionality assessment (e.g., factor analysis).
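For concreteness, below is a minimal sketch of this estimation step with the EGAnet package, assuming the item responses are stored in a data frame mist (an assumed object name):

```r
# Hedged sketch: dimensionality estimation via EGA
library(EGAnet)

ega_fit <- EGA(
  data      = mist,        # assumed data frame of item responses
  model     = "glasso",    # GGM estimated with EBICglasso
  algorithm = "walktrap",  # community detection to identify dimensions
  plot.EGA  = TRUE
)

ega_fit$n.dim  # number of estimated dimensions
ega_fit$wc     # item-to-dimension memberships
```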

Exploratory graph analysis was originally proposed by Golino and Epskamp (2017), who showed that the GGM combined with a clustering algorithm for weighted networks (Walktrap; Pons & Latapy, 2005) could accurately recover the number of simulated factors, presenting higher accuracy than traditional factor-analytic methods. Later, Golino, Shi, et al. (2020b) compared EGA with different types of factor-analytic methods (including two types of parallel analysis), finding that EGA achieves the highest overall accuracy (87.91%) in estimating the number of simulated factors, followed by traditional parallel analysis with principal components (Horn, 1965; 83.01%) and parallel analysis using principal axis factoring (Humphreys & Ilgen, 1969; 81.88%).

Golino et al. (2022) summarized the advantages of the EGA framework over more traditional methods (Golino, Shi, et al., 2020b): (1) unlike exploratory factor analysis (EFA) methods, EGA does not require a rotation method to interpret the estimated first-order factors (although rotations are rarely discussed in the validation literature, they have significant consequences for validation, e.g., for the estimation of factor loadings; Sass & Schmitt, 2010); (2) EGA automatically places items into factors without the researcher’s direction, in contrast to exploratory factor analysis, where researchers must decipher a factor loading matrix (such automated placement opens the door for the dimension and item stability methods presented next); and (3) the network representation depicts how items relate within and between dimensions.

Over the past couple of years, the EGA framework has expanded into several important areas of psychometrics. Christensen and Golino (2021c) developed a new metric termed network loadings, computed by standardizing node strength—the sum of the weights of the edges connected to a node—split between the dimensions identified by EGA. Christensen and Golino (2021c) showed in their simulation study that network loadings are akin to factor loadings, but with different reference values: network loadings of .15, .25, and .35 are equivalent to low (.40), moderate (.55), and high (.70) factor loadings, respectively (Christensen & Golino, 2021c). The development of network loadings opened new lines of research, such as the development of metric invariance testing using EGA and permutation tests from a network perspective (Jamison et al., 2022), and determining whether data are generated from a factor or network model (Christensen & Golino, 2021b).

Based on the automated item placement of EGA, Christensen and Golino (2021a) developed a bootstrap approach to investigate the stability of items and dimensions estimated by EGA, termed bootstrap exploratory graph analysis (bootEGA), and proposed two new metrics of psychometric quality: item stability and structural consistency. Item stability indicates how often an item replicates in its designated EGA dimension, with values lower than .75 (i.e., items estimated in their original dimension in fewer than 75% of the bootstrapped samples) indicating problematic (or unstable) items. Structural consistency, in turn, indicates how often an EGA dimension replicates exactly; it can be used to verify configural (or structural) invariance and to identify poorly functioning items (Golino et al., 2022). A complementary approach, called unique variable analysis, was developed to identify redundant items and can be used to pinpoint why some items function poorly (Christensen, Garrido, & Golino, 2020a).
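A sketch of the corresponding bootEGA call, again assuming a data frame mist (the argument values follow the EGAnet documentation but are otherwise illustrative):

```r
# Hedged sketch: parametric bootstrap EGA and item stability
library(EGAnet)

boot <- bootEGA(data = mist, iter = 500, type = "parametric")

boot$summary.table           # median number of dimensions and 95% CI
stab <- itemStability(boot)  # proportion of replications per item and dimension
```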

The fit of a dimensionality structure estimated using EGA can be verified using an innovative fit index termed the total entropy fit index (TEFI; Golino, Moulder, et al., 2020a), developed as an alternative to traditional fit measures used in factor analysis and structural equation modeling (SEM). In a comprehensive simulation study, the TEFI demonstrated higher accuracy in correctly identifying the number of simulated factors than the comparative fit index (CFI), the root mean square error of approximation (RMSEA), and other indices used in SEM (Golino, Moulder, et al., 2020a). The TEFI is based on the von Neumann entropy (Von Neumann, 1927)—a measure developed to quantify both the amount of disorder in a system and the entanglement between two subsystems (Preskill, 2018). The TEFI is a relative measure of fit that can be used to compare two or more dimensionality structures; the structure with the lowest TEFI value fits the data best.
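In EGAnet, such a comparison might look as follows; the membership vector two_factor_membership for the theoretical structure is an assumed object, as is the data frame mist:

```r
# Hedged sketch: comparing two structures with the total entropy fit index
library(EGAnet)

tefi(data = mist, structure = ega_fit$wc)             # EGA solution
tefi(data = mist, structure = two_factor_membership)  # theoretical solution
# The structure with the lower TEFI value fits the data better
```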

Another recent development within the EGA framework is the hierarchical EGA (hierEGA) technique by Jimenez et al. (2022). In their work, Jimenez et al. (2022) proposed a variation of a popular clustering algorithm called Louvain (Blondel et al., 2008) to detect lower- and higher-order factors in data, and showed that this new technique is more effective than traditional factor-analytic techniques at estimating the structure of first- and second-order factors in generalized bifactor structures.

All the EGA-based techniques/metrics mentioned above are implemented in the free and open-source R package EGAnet (Golino & Christensen, 2019), which has become one of the main software programs in network psychometrics. In the current paper, version 1.2.4 of the EGAnet package (Golino & Christensen, 2019) was used, and several strategies were implemented. The first strategy aimed at estimating the dimensionality structure of the 100 MIST items. Then, redundant items were identified using unique variable analysis (Christensen et al., 2020a), and for every group or pair of redundant items, the one with the higher ratio of main network loadings to cross-loadings was kept in the analysis. The stability of the items and the structural consistency of the dimensions were obtained via bootstrap exploratory graph analysis (Christensen & Golino, 2021a) with 500 iterations (using parametric bootstrapping), and items with stability lower than 75% or network loadings lower than .15 were removed from subsequent steps. Once a subset of stable items with at least low to moderate network loadings was found, a subset of the best items per dimension (i.e., those with moderate to high network loadings, with a network loading of at least .23) was identified, and item stability and structural consistency metrics were computed again until all items were highly stable (item stability greater than 90%). The metric invariance of the final pool of best items per dimension (moderate to high network loadings and high item stability) was investigated using the EGA permutation test developed by Jamison et al. (2022), with sex, age (above or below the median birth year), and education (above or below the median level of formal education received) as reference groups. The fit of the EGA-estimated dimensions to the data was computed using the total entropy fit index (Golino, Moulder, et al., 2020a) and compared to the two-factor structure of real and fake news items identified using EFA. The CFI and RMSEA computed after fitting a confirmatory factor model to the EGA-estimated dimensions were also obtained and compared to the CFI and RMSEA of the two-factor structure. Additionally, the Satorra (2000) scaled difference test was used to verify which structure best fit the data.
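As a rough illustration of the remaining steps in this pipeline, a sketch using EGAnet function names is given below. The data object (mist), the grouping variable (sex), and all arguments are assumptions; in particular, the invariance() interface shown follows the current EGAnet documentation and may differ from the package version used here.

```r
# Hedged sketch of the EGA item-selection pipeline (not the authors' script)
library(EGAnet)

# Step 1: flag redundant item pairs/groups via unique variable analysis
uva_res <- UVA(data = mist)          # `mist` = assumed response data frame

# Step 2: network loadings per item and dimension
loads <- net.loads(ega_fit)          # `ega_fit` = an EGA() result
loads$std                            # standardized network loadings

# Step 3: permutation-based metric invariance across an assumed grouping
inv_res <- invariance(data = mist, groups = sex, iter = 500)
```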

Results

EFA/IRT results

Item selection

Using parallel analysis with the psych package (Revelle, 2021), we aimed to select a parsimonious factor structure, retaining only factors with eigenvalues above the 95th percentile of corresponding eigenvalues from 500 simulated random datasets. Parallel analysis (with 500 iterations) suggested a total of six factors, but only five factors (eigenvalues: F1 = 10.89, F2 = 7.82, F3 = 1.89, F4 = 1.42, F5 = 1.23, F6 = 0.98) met our criterion of exceeding the 95th percentile of corresponding eigenvalues from the 500 simulated random datasets (eigenvalue 95th percentile = 0.99). Two factors explained most of the variance, which is in line with our theoretical model of two main factors (fake news detection and real news detection). An EFA using the tetrachoric correlation matrix with unweighted least squares (ULS) estimation without rotation, using the EFAtools package (Steiner & Grieder, 2020), indicated that for both the two-factor and the five-factor structure, the first two factors were specifically linked to the real news items and the fake news items, respectively, while the other three factors did not show an easily interpretable pattern and in general showed low factor loadings (< .30). See Supplement S4 for a pattern matrix.
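A minimal sketch of this step, assuming the dichotomously scored responses are in a data frame mist (an assumed object name):

```r
# Hedged sketch: parallel analysis with 500 simulated datasets and a
# 95th-percentile criterion, using tetrachoric correlations for the
# dichotomous responses
library(psych)

pa <- fa.parallel(
  mist,
  fa     = "fa",   # compare factor (not component) eigenvalues
  n.iter = 500,    # 500 simulated random datasets
  quant  = .95,    # 95th percentile of simulated eigenvalues
  cor    = "tet"   # tetrachoric correlations for binary items
)
pa$nfact           # suggested number of factors
```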

As we set out to create a measurement instrument for two distinct abilities, real news detection and fake news detection, we continued with a two-factor EFA, employing principal axis factoring and varimax rotation using the psych package (Revelle, 2021). Theoretically, we would expect a balancing out of positive and negative correlations between the two factors: positive because of the underlying veracity discernment ability, and negative because of response biases. In line with this, we chose an orthogonal rotation instead of an oblique rotation to separate out fake news detection and real news detection as cleanly as possible.
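For illustration, the corresponding EFA call might look as follows; mist is the assumed response data frame from the sketch above:

```r
# Hedged sketch: two-factor EFA with principal axis factoring and an
# orthogonal (varimax) rotation on the tetrachoric correlation matrix
library(psych)

efa2 <- fa(
  mist,
  nfactors = 2,
  fm       = "pa",       # principal axis factoring
  rotate   = "varimax",  # orthogonal rotation
  cor      = "tet"       # tetrachoric correlations
)
print(efa2$loadings, cutoff = .40)  # inspect loadings against the .40 rule
```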

Three iterations were needed to remove all items with a factor loading under .40 (43 items were removed). After this pruning, no items showed cross-loadings larger than .30. Communality analysis using the three-parameter logistic model function in the mirt package (Chalmers, 2012) with a 50% guessing chance (c = .50) indicated two items with communality lower than .40 after one iteration. These items were removed. No further iterations yielded any additional removals. A final list of the communalities can be found in Supplement S5. Cronbach’s α reliability analysis with the psych package was used to remove all items that had negative effects (∆α > .001) on the overall reliability of the test (Revelle, 2021). No items had to be removed based on this analysis. Differential item functioning (DIF) analysis using the mirt package was used to explore whether differences in gender or ideology would alter the functioning of the items (Chalmers, 2012). None of the items showed differential functioning for gender or ideology.

Finally, using the three-parameter logistic model IRT functions in the mirt package (Chalmers, 2012), we selected the 20 best items (10 fake, 10 real) and the 8 best items (4 fake, 4 real), resulting in the MIST-20 and the MIST-8, respectively. These items were selected based on their discrimination and difficulty values, where we aimed to select a diverse set of items that have high discrimination (a ≥ 2.00 for the MIST-20, a ≥ 3.00 for the MIST-8) yet span a wide range of difficulties (b = [−0.50, 0.50], for each ability), while keeping the guessing parameter at 50% chance (c = .50). We also took the topics into account to ensure both that we covered a wide range of news areas and that there was no repetition of content (Flake et al., 2017). A list of the IRT coefficients and plots can be found in Supplement S1 and Supplement S6, respectively. See Fig. 3 for a MIST-20 item trace line plot, and Fig. 4 for a multidimensional plot of the MIST-20 IRT model predictions. The final items that make up the MIST-20 and MIST-8 are shown in Table 2. An overview of different candidate sets and how they performed, as well as the full analysis scripts and the supplement, can be found in the OSF repository: https://osf.io/r7phc/.
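A hedged sketch of this IRT step: a unidimensional model is fit per ability with the guessing (lower asymptote) parameter fixed at .50, and the a and b parameters are screened against the stated cutoffs. The object fake_items and the exact call structure are assumptions, not the authors' script; supplying the guess argument to 2PL items in mirt is one documented way to obtain a fixed lower asymptote.

```r
# Hedged sketch: unidimensional model with guessing fixed at c = .50
library(mirt)

mod_fake <- mirt(fake_items, model = 1, itemtype = "2PL", guess = 0.5)

# Extract the classical IRT parameterization: a (discrimination), b (difficulty)
pars <- coef(mod_fake, IRTpars = TRUE, simplify = TRUE)$items

# Screen against the MIST-20 cutoffs reported in the text
keep <- rownames(pars)[pars[, "a"] >= 2.00 &
                       pars[, "b"] >= -0.50 & pars[, "b"] <= 0.50]
```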

Fig. 3 Item trace lines for MIST-20 items, for the fake news items in Panel A and real news items in Panel B. The items in the legend are ordered according to their difficulty level

Fig. 4 Multidimensional IRT plot representing the final MIST-20 test

Table 2 Final items selected for MIST-20 and MIST-8

Reliability

Inter-item correlations show good internal consistency for both the MIST-8 (IIC min = .20, IIC max = .27) and the MIST-20 (IIC min = .22, IIC max = .29). Item-total correlations also show good reliability for both the MIST-8 (ITC min = .44, ITC max = .53) and the MIST-20 (ITC min = .31, ITC max = .54).

Looking further into the MIST-20, we analyze the reliability of veracity discernment (V; M = 15.71, SD = 3.35), real news detection (r; M = 7.62, SD = 2.43), and fake news detection (f; M = 8.09, SD = 2.10). In line with the guidelines by Revelle and Condon (2019), we calculate a two-factor McDonald’s ω (McDonald, 1999) as a measure of internal consistency using the psych package (Revelle, 2021), and find good reliability for the general scale and the two facet scales (ωg = .79, ωF1 = .78, ωF2 = .75). Also using the psych package (Revelle, 2021), we calculate the variance decomposition metrics as a measure of stability, finding that F1 explains 14% of the total variance and F2 explains 12% of the total variance. Of all variance explained, 53% comes from F1 (r) and 47% comes from F2 (f), demonstrating a good balance between the two factors.
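A minimal sketch of the reliability computation, assuming the 20 retained items are in a data frame mist20; the omega() arguments shown (two group factors, polychoric/tetrachoric correlations) mirror the description above but are otherwise assumptions:

```r
# Hedged sketch: two-factor McDonald's omega for the MIST-20
library(psych)

om <- omega(mist20, nfactors = 2, poly = TRUE)

om$omega_h      # omega for the general factor (veracity discernment)
om$omega.group  # omega and explained variance per group factor (r, f)
```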

Finally, test–retest reliability analysis indicates that MIST scores are moderately positively correlated over a period of eight to nine months (rT1,T2 = .58).

Validity

To assess initial validity, we examined the associations between the MIST scales and two scales that have been used regularly in previous misinformation research—the COVID-19 fact-check by Pennycook, McPhetres, et al. (2020) and the DEPICT task by Maertens et al. (2021)—expecting high correlations (r > .50; concurrent validity) and additional variance explained compared to the existing CMQ, BSR, and CRT scales (incremental validity; Clark & Watson, 2019; Meehl, 1978). As can be seen in Table 3, the MIST-8 displays a medium to high correlation with the fact-check (r = .49) and the DEPICT task (r = .45), while the MIST-20 shows a large positive correlation with both the fact-check (r = .58) and the DEPICT task (r = .50). Using a linear model, we found that the MIST-20 by itself explains 33% (adjusted R²) of the variance in fact-check scores. The CMQ, BSR, and CRT combined account for 19%. Adding the MIST-20 on top provides an incremental 18% of explained variance (adjusted R² = .37). The MIST-20 is the strongest predictor in the combined model (t(404) = 10.82, p < .001, β = 0.49, 95% CI [0.40, 0.57]). For the DEPICT task, we found that the CMQ, BSR, and CRT combined explain 12% of the variance in deceptive headline recognition, and 26% when the MIST-20 is added (ΔR² = .14), while the MIST-20 alone explains 25%. For the DEPICT task, the MIST-20 was the only significant predictor in the combined model (t(404) = 8.94, p < .001, β = 0.43, 95% CI [0.34, 0.53]).
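Expressed as nested linear models, the incremental-validity comparison looks roughly as follows (a sketch; the variable and data frame names are assumptions):

```r
# Hedged sketch: incremental validity as nested linear models
base_mod <- lm(factcheck ~ cmq + bsr + crt, data = d)
full_mod <- lm(factcheck ~ cmq + bsr + crt + mist20, data = d)

summary(base_mod)$adj.r.squared  # ~.19 in the reported data
summary(full_mod)$adj.r.squared  # ~.37 once the MIST-20 is added
anova(base_mod, full_mod)        # test of the R-squared increment
```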

Table 3 Incremental validity of MIST-8 and MIST-20 with existing measures

EGA results

In this section we re-analyze the pool of 100 MIST items using EGA. EGA estimated four dimensions (see Fig. 5), which can be identified as two dimensions of real news headlines and two of fake news headlines. Dimension 1 (red nodes in Fig. 5) is a combination of US and international real news headlines, with items such as MIST 96 (US Hispanic Population Reached New High in 2018, But Growth Has Slowed), MIST 92 (Taiwan Seeks to Join Fight Against Global Warming), and MIST 60 (Hyatt Will Remove Small Bottles from Hotel Bathrooms by 2021). Dimension 2 (blue nodes in Fig. 5) contains fake news items about science, such as item MIST 8 (Climate Scientists’ Work Is “Unreliable”, a “Deceptive Method of Communication”), and false statements against people with a liberal worldview, such as items MIST 16 (Left-Wingers Are More Likely to Lie to Get a Good Grade) and MIST 20 (New Study: Left-Wingers Are More Likely to Lie to Get a Higher Salary). The third dimension (green nodes in Fig. 5) contains real news items related to politically charged topics in the US, such as items MIST 70 (Majority in US Still Want Abortion Legal, with Limits), MIST 74 (Most Americans Say It’s OK for Professional Athletes to Speak out Publicly about Politics), and MIST 94 (United Nations Gets Mostly Positive Marks from People Around the World). Dimension 4 (orange nodes in Fig. 5) contains fake news items related to general conspiracy beliefs, such as item MIST 1 (A Small Group of People Control the World Economy by Manipulating the Price of Gold and Oil), and conspiracies related to the government, such as items MIST 31 (The Government Is Actively Destroying Evidence Related to the JFK Assassination) and MIST 32 (The Government Is Conducting a Massive Cover-Up of Their Involvement in 9/11).

Fig. 5 Structure of the 100 MIST items estimated using exploratory graph analysis

The unique variable analysis technique (Christensen et al., 2020a) identified two redundant items: MIST 43 (UN: New Report Shows Shark Fin Soup as ‘the Most Important Source of Protein’ for World’s Poor) and MIST 17 (New Data Show Shark Fins Are the ‘Most Important Source of Protein’ for the World’s Poor). The ratio of main network loadings to cross-loadings for these items (8.47 and 6.90, respectively) suggested that item MIST 43 should be kept in the subsequent analyses. A bootstrap exploratory graph analysis with 500 iterations (parametric bootstrapping) identified a median of four dimensions (95% CI [2.11, 5.89]), but with very low structural consistency for each dimension (0.09, 0.14, 0.07, and 0.43 for dimensions 1, 2, 3, and 4, respectively). The item stability metric (Christensen & Golino, 2021a) varied from 23% to 98%, with 40% of items presenting inadequate or moderate stability (i.e., lower than 75%; see Fig. 6).

Fig. 6 Item stability metric of the MIST-100 items in Study 1

Removing the items with item stability lower than 75% and repeating the parametric bootstrap EGA technique with 500 iterations improved stability considerably, yielding structural consistency between 0.61 (dimension 2) and 0.96 (dimension 4) and mean item stability of 93%. From the 59 items selected in the steps above, a subset with network loadings equal to or higher than .155 was selected from each dimension estimated via EGA, resulting in 34 items. A parametric bootstrap EGA with 500 iterations followed by item stability analysis was implemented once again, and items with stability lower than 75% were removed, resulting in 32 items.

The final selection of items was implemented using the following strategy. Out of the 32 items selected in the previous steps, only those with relatively high network loadings (≥ .23 or ≥ .235) were used in the subsequent bootEGA and item stability analysis, which identified 16 highly stable items (see Fig. 7). Exploratory graph analysis identified the same four dimensions described in the first paragraph of this section, but they now presented very high structural consistency, ranging from .982 to 1, and very high item stability (ranging from 98% to 100%). The network loadings of the final MIST-16 EGA items are presented in Table 4.

Fig. 7 Final structure of the MIST-16 EGA items (left) and their stability indices (right) estimated using parametric bootstrap EGA with 500 iterations

Table 4 Network loadings per item and dimension estimated via EGA. Network loadings of .15, .25, and .35 are equivalent to low (.40), moderate (.55), and high (.70) factor loadings, respectively (Christensen & Golino, 2021c)

A metric invariance analysis for EGA using permutation tests (Jamison et al., 2022) was conducted using sex, age (median split), and education (median split) as grouping variables. None of the items exhibited a significant (p < .05) difference in network loadings across the tested groups, suggesting that the 16 items selected using the EGA framework work similarly irrespective of sex, age, and education (see Supplement S19 for an overview).

The fit of the four-dimensional structure estimated via EGA was compared to the fit of the two-factor structure of real and fake news items using the total entropy fit index (Golino, Moulder, et al., 2020a) and two traditional factor-analytic fit measures (CFI and RMSEA). To compute the traditional factor-analytic fit indices, a confirmatory factor analysis was implemented using the WLSMV estimator for each structure (see Fig. 8). Table 5 shows that the EGA four-factor structure presented the lowest TEFI and RMSEA and the highest CFI, suggesting that the four first-order dimensions estimated via EGA fit the data better than the theoretical two-factor structure, although the two-factor structure also has an acceptable fit. The Satorra (2000; see Table 6) scaled difference test likewise showed that the EGA four-factor structure is preferable to the theoretical two-factor first-order structure.
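A hedged sketch of this comparison using lavaan; the model syntax strings and item names are placeholders built from the EGA and theoretical item assignments, not the authors' exact specification:

```r
# Hedged sketch: CFAs with the WLSMV estimator and a scaled difference test
library(lavaan)

model_two <- '
  real =~ MIST60 + MIST70 + MIST74   # placeholder real news items
  fake =~ MIST1  + MIST8  + MIST16   # placeholder fake news items
'
# model_ega4: analogous four-factor syntax (omitted for brevity)

fit_two <- cfa(model_two,  data = mist, estimator = "WLSMV", ordered = TRUE)
fit_ega <- cfa(model_ega4, data = mist, estimator = "WLSMV", ordered = TRUE)

fitMeasures(fit_two, c("cfi", "rmsea"))
fitMeasures(fit_ega, c("cfi", "rmsea"))

# Satorra (2000) scaled difference test between the two solutions
lavTestLRT(fit_two, fit_ega, method = "satorra.2000")
```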

Fig. 8 Plot of the confirmatory factor model estimated using the EGA four-factor structure (left) and the theoretical two-factor structure (right)

Table 5 Comparison of fit indices of the EGA four-factor model and the theoretical two-factor model

Table 6 The Satorra scaled difference test comparing the EGA four-factor structure to the theoretical two first-order factor structure

Two different traditions were used to select a subset of items: one relying on traditional techniques (EFA and IRT) and another relying on modern network psychometric methods (EGA). Comparing item stability and structural consistency between the two approaches, we found that the MIST-16 EGA items are stable and consistent, indicating that the four dimensions estimated using exploratory graph analysis are robust and likely to be identified in independent samples. The 20 items selected using EFA/IRT were less robust in terms of stability (see Supplement S19: EGA Metric Invariance Tests). The low stability of some of the MIST-20 items might indicate that a different number of dimensions underlies the data. The parametric bootstrap EGA analysis (with 500 iterations) of the MIST-20 items indicates that two dimensions are estimated in 21.0% of the bootstrapped samples, three dimensions in 68.2%, and four dimensions in 10.0%. The item stability of the most common structure (three dimensions, see Supplement S20) reveals that the items are relatively stable, but still not as stable as the MIST-16 EGA items. A comparison of the three-dimensional structure estimated using EGA on the MIST-20 items with the theoretical two-factor structure (see Table 7) shows that the three-factor solution performs slightly better, as it presents lower TEFI and RMSEA, and higher CFI.

Table 7 Fit of the three- and two-dimensional structures of the MIST-20 items

Discussion

In Study 1, we generated 413 news items, using GPT-2 automated item generation for the fake news items and trusted sources for the real news items. Through two independent expert committees, we reduced the item pool to 100 items (44 fake and 56 real). We then combined item response theory with factor analysis to reduce the item set to the 20 best items for the MIST-20 and the 8 best items for the MIST-8. We found that the final items demonstrate good reliability. In an initial test of validity, we found strong concurrent validity for both the MIST-8 and the MIST-20, as evidenced by their strong associations with the COVID-19 fact-check (a headline evaluation task) and the DEPICT deceptive headline recognition task (a social media post reliability judgment task). Moreover, both the MIST-20 and the MIST-8 outperformed the combined model of the CMQ, BSR, and CRT in explaining variance in fact-check and DEPICT scores, evidencing incremental validity. This study provides the first indication that the MIST-20 and MIST-8 are psychometrically sound and can assess misinformation susceptibility above and beyond the existing scales. Finally, we also presented an alternative approach to item selection based on EGA, which uses network psychometrics to identify the best partition of the multidimensional space, combined with a bootstrap analysis of item and dimensional stability (structural consistency), to identify a set of highly stable items with moderate or high network loadings. This approach led to the selection of 16 items measuring four dimensions of misinformation susceptibility.