Introduction

Bullying at work denotes a class of “situations where an employee repeatedly and over a prolonged time period is exposed to harassing behavior from one or more colleagues (including subordinates and leaders) and where the targeted person is unable to defend him-/herself against this systematic mistreatment” (Nielsen & Einarsen, 2018, p. 73). Scholars largely agree on two characteristics of bullying at work, namely that it “is repeated and systematic negative social behavior [and it] endures over a longer period of time” (Notelaers & Van der Heijden, 2021, p. 370). Workplace bullying (see Footnote 1) is a widely studied phenomenon in organizations (Nielsen & Einarsen, 2018) for two main reasons: First, it is well documented that workplace bullying has severe negative consequences for both the work organization and the target’s mental health (Balducci et al., 2020); second, it is not a rare phenomenon, since a meta-analysis of prevalence rates has shown that approximately 15% of workers worldwide are exposed to some level of bullying at work (Hershcovis et al., 2015; Nielsen et al., 2010).

Against this background, an easy-to-use self-report instrument for assessing workplace bullying is of pivotal importance for researchers and practitioners in order to (a) study this phenomenon in large national surveys, (b) enhance the investigation of person × environment antecedents, of health and organizational consequences, and of the moderators/mediators affecting those relationships (Balducci et al., 2020, 2021; Howard et al., 2020; Reknes et al., 2019; Van den Brande et al., 2016), and (c) help organizations and their managers through initial assessments and (possibly) subsequent monitoring of the course of interventions.

Over the years, many self-report instruments have been developed for assessing various types of work-related abusive behaviors (see Table A1 in the Appendix for an overview). Among them, the 22-item Negative Acts Questionnaire – Revised (NAQ-R; Einarsen et al., 2009) has proved to be a valid and reliable instrument for assessing workplace bullying, and its recent 9-item short version (SNAQ; Notelaers & Einarsen, 2008; Notelaers et al., 2019) has the advantage of being easily administered by researchers and practitioners. In the Italian context, a first validation of the scale was provided by Balducci et al. (2010); here we aim to expand their work by investigating its structural validity and classification performance. Moreover, given that SNAQ scores may be affected by a certain amount of measurement error (like most self-report instruments), and consistent with researchers’ recent attention to controlling for measurement error in predictive models (e.g., in machine learning analytic strategies; Jacobucci & Grimm, 2020), classification performance was investigated through a novel procedure that uses Structural Equation Modeling to build ROC curves (the gold standard tool for evaluating the classification performance of a continuous variable; Fawcett, 2006; Kuhn & Johnson, 2020).

In what follows, we outline the rationale and procedures adopted in this contribution to investigate the SNAQ’s structural validity and classification performance.

Investigating the SNAQ’s structural validity

Structural validity concerns the internal characteristics of a measurement instrument (Bleidorn & Hopwood, 2019; Loevinger, 1957). We investigated the structural validity of the Italian SNAQ by means of a SEM approach for categorical data, given that participants rate each item on a 5-point frequency scale. More specifically, we investigated measurement invariance and reliability through a series of categorical confirmatory factor analyses (CFAs). Measurement invariance is among the most important latent variable approaches for demonstrating that the measurement properties of latent variables are stable and, thus, for ensuring that the meaning of the construct being assessed is consistent across groups or time (Little, 2013; Millsap, 2012; Newsom, 2015; van de Schoot et al., 2012, 2015). We chose to investigate gender invariance because it is one of the most studied types of invariance in organizational measures (Vandenberg & Lance, 2000), in that it allows verification of the “generalizability of scale properties across gender” (Vandenberg & Lance, 2000, p. 22), which is a fundamental component of the internal robustness of the test (see Ock et al., 2020). Finally, given that gender differences in workplace bullying have been widely studied (Rosander et al., 2020) and we were interested in estimating classification performance across gender, the Italian SNAQ must show an acceptable degree of invariance before being used for these purposes.

Investigating the SNAQ’s classification performance

Self-labeling bullying as a classifier

Classification performance is the ability of a continuous variable or instrument to correctly predict a qualitative (usually dichotomous) variable (James et al., 2021). In this case, the aim is to assess how well the SNAQ can classify people who felt they were bullied vs. those who did not feel they were bullied at their workplace. In doing so, we adopted a self-labeling approach (Notelaers & Van der Heijden, 2021), which consists of providing a thorough definition of workplace bullying along with a list of its characteristics, and then asking workers whether they perceive that they have been bullied. This approach has been used consistently in workplace bullying research since its seminal introduction by Einarsen and Skogstad (1996), but with some refinements. For example, a meta-analysis by Nielsen et al. (2010) showed the importance of including a definition of workplace bullying when using a self-labeling approach, given that the prevalence of reported workplace bullying may be overestimated if no definition is provided. Indeed, they found a workplace bullying rate of 11.3% for studies using a “self-labeling with definition” approach, 14.8% for studies using a “behavioral experiences” approach (e.g., self-report instruments like the SNAQ), and 18.1% for studies using a “self-labeling without definition” approach. Hence, they concluded that the best method to investigate workplace bullying is to use both the “self-labeling with definition” and the “behavioral experiences” approach. This conclusion is echoed in more recent contributions (Nielsen et al., 2020; Notelaers & Van der Heijden, 2021), in which the authors pointed out the importance of estimating “whether respondents were exposed to systematic harassment at their workplace with the behavioural experience method [as well as to] ask them whether they perceive themselves as victimized (by this exposure) with the self-labelling method” (Nielsen et al., 2020, p. 256).

From an empirical perspective, a number of studies have used the self-labeling approach (see Notelaers & Van der Heijden, 2021, for a review). For example, Bonde et al. (2016), in a longitudinal study of 7502 public service and private sector Danish employees, showed that self-labeled bullying is negatively associated with a number of health-related variables (e.g., self-rated health, sleep quality, and mood symptoms) and that self-labeled bullying persisted even after 4 years for 20–40% of the sample. Similarly, Rosander and Blomberg (2019) found a good degree of overlap between self-labeled victimization and the NAQ-R cut-off score provided by Notelaers and Einarsen (2013), albeit with some discrepancies.

Taken together, the above findings suggest that self-labeled bullying may be a valid classifier for SNAQ validation purposes, given that it showed good external validity (Bonde et al., 2016) and good stability across time (Bonde et al., 2016), and it overlaps sufficiently with behavioral experience methods (Rosander & Blomberg, 2019).

Building ROC curves with a SEM approach

The classification performance of the Italian SNAQ was investigated through a two-step approach. In the first step, we used the final (i.e., best fitting) CFA model from the structural validity analyses to predict a dichotomous variable of self-labeled bullying, which served as the classifier for the reasons outlined above. This SEM was used to gather the predicted values (in terms of factor scores) of the effect of the SNAQ (predictor) on self-labeled bullying (outcome). In the second step, the predicted values (factor scores) were extracted and then used to analyze the classification performance of the SNAQ through a widely used machine learning tool, namely the ROC curve. As Calì and Longobardi (2015, p. 395) put it: “a ROC curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. By considering all possible values of the cut-off c, the ROC curve can be constructed as a plot of sensitivity (TPR) versus 1 − specificity (FPR)”, where TPR stands for True Positive Rate and FPR stands for False Positive Rate. Hence, following Calì and Longobardi’s (2015) notation, a ROC curve is defined as

$$\mathrm{ROC}(\cdot)\:=\:\left\{\mathrm{FPR}(c),\mathrm{TPR}(c),c\:\in\:(-\infty,\:+\infty)\right\},$$

that may also be written as

$$\mathrm{ROC}(\cdot)=\left\{(t,\;\mathrm{ROC}(t)),\;t\:\in\:(0,1)\right\},$$

where the ROC function maps t to TPR(c), and c is the cut-off corresponding to FPR(c) = t (Calì & Longobardi, 2015, p. 395). The most informative index of the classification performance of a continuous variable is the Area Under the Curve (AUC) of the ROC(t) function, which ranges from 0 to 1 and can be interpreted as a summary (in terms of percentage) of “the discriminatory accuracy of a test” (Gonçalves et al., 2014, p. 5; see also Fawcett, 2006). More formally, the AUC is defined by

$$\mathrm{AUC}={\int }_{0}^{1}\mathrm{ROC}\left(t\right)dt$$
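As a concrete illustration of these definitions, the following R sketch (with simulated data and hypothetical object names) computes the TPR/FPR pairs across all cut-offs, plots the resulting ROC curve, and returns the AUC using the pROC package employed later in this study.

```r
## Minimal sketch with simulated data (hypothetical object names).
library(pROC)

set.seed(123)
outcome <- rbinom(300, size = 1, prob = 0.15)              # binary criterion (e.g., bullied vs. not)
score   <- rnorm(300, mean = ifelse(outcome == 1, 1, 0))   # continuous predictor

roc_obj <- roc(response = outcome, predictor = score)      # TPR/FPR across all cut-offs c
plot(roc_obj)                                              # ROC curve; the diagonal is a random classifier
auc(roc_obj)                                               # area under the curve, between 0 and 1
```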

However, the predicted values used to build a ROC curve are usually gathered from analyses with “observed” predictors, such as logistic regression and random forest (Lantz, 2019; Zumel & Mount, 2020), which do not control for the predictors’ degree of measurement error (Jacobucci & Grimm, 2020). In fact, Jacobucci and Grimm (2020) contended that the consequences of measurement error are widely known in traditional psychological science (e.g., Bollen, 1989; Cole & Preacher, 2014; McNeish & Wolf, 2020), whereas in machine learning and predictive models those consequences have not been sufficiently addressed (see also Brandmaier & Jacobucci, in press). In organizational psychology, it is rare to deal with perfectly reliable tools, and the SNAQ is certainly an instrument with a certain degree of measurement error (Notelaers & Einarsen, 2008; Notelaers et al., 2019). Hence, in this study we present a method for assessing the classification performance of an instrument through a ROC curve estimated by a model (i.e., a SEM) that takes into account the measurement error of the predictor variable (i.e., SNAQ factor scores) as well as its structural validity (i.e., gender invariance). This strategy yields an AUC estimate that is not attenuated by the predictor’s measurement error, consistent with Jacobucci and Grimm’s (2020) recommendations.

The present study

To summarize, the present study has three aims. First, to test whether the Italian SNAQ is a reliable, gender-invariant, and valid instrument for classifying those who labeled themselves as bullied and those who did not. Second, to provide a data analytic strategy for building ROC curves that takes into account the measurement quality (i.e., degree of measurement error and gender invariance parameters) of the predictor variable, thereby improving the predicted values and, in turn, the estimate of the AUC. Third, to compare the results obtained with our approach (i.e., ROC curves estimated with the best fitting SEM obtained from the structural validity routine) with those obtained with more classical approaches for estimating ROC curves that use observed variables, namely logistic regression and random forest.

Method

Procedure

In the first half of 2018, the Provincial Antimobbing Committee (see Footnote 2) of an Italian province asked the local University to develop a survey project on the state of workplace bullying in the province. To this end, during 2019 the first and third author (University of Trento) collaborated with the local branch of the national statistical center, a public institution officially appointed to collect data, in full compliance with national regulations protecting the privacy and anonymity of respondents (the procedures adopted were approved by a public guarantee body, namely its research ethics committee). In more detail, a two-phase sampling strategy was carried out: 23 municipalities were selected, starting from a subdivision into three groups based on the distribution of employees: (a) the so-called “representative municipalities” – the 5 most populous municipalities – with more than 4500 employees; (b) municipalities with between 1000 and 4500 employees; (c) municipalities with fewer than 1000 employees. Subsequently, the statistical offices of the sampled municipalities were asked to extract 4 random samples (one “base” sample and three “back-up” samples) of 1060 households each. Each household included at least one worker and received a letter, delivered at home, which contained (a) a brief description of the project; (b) a link to a website where an anonymous self-report battery could be completed; (c) the specific requirements for filling in the battery: be at least 18 years old; be currently employed (not on leave, laid off, or retired); be an employee or assimilated worker (not a freelancer, self-employed worker, or entrepreneur); and (d) a username and password (unique to each household) that allowed starting the battery and ensured that only people from the selected households could complete it (the survey started in July 2019). After the survey ended (November 2019), the local branch of the national statistical center removed the usernames and passwords from the database and returned the anonymized dataset to the researchers.

Participants

Participants were 357 workers (i.e., ~ 8% of the 4240 total invitations and ~ 33% of the 1060 target sample size) from various sectors (see Table A2 in the Appendix) who took part in the above-mentioned provincial survey. There were 177 males (49.6%) and 180 females (50.4%). Regarding age, 66 participants were aged 18–34 (18.5%), 200 were aged 35–54 (56%), and 91 were aged 55 or older (25.5%). Tenure ranged from 1 to 42 years (M = 14.90, SD = 11.22). Regarding education, 29 (8.1%) had no certificate or an elementary/junior-high school degree, 192 (53.8%) had a vocational/high school degree, and 136 (38.1%) had a bachelor’s degree or higher. Concerning the type of contract, 52 (14.6%) had a fixed-term contract and 305 (85.4%) had a permanent contract. Regarding sector, 230 (64.4%) worked in the private sector, while 127 (35.6%) worked in the public sector. As for job position, 72 (20.2%) were laborers, 213 (59.6%) were clerks, 55 (15.4%) were quadri (a specific Italian job classification, ranked between clerks and managers), and 17 (4.8%) were managers.

Measures

Short-Negative Acts Questionnaire (SNAQ)

We used the Italian version of the SNAQ (Notelaers & Einarsen, 2008; Notelaers et al., 2019) validated in Italy by Balducci et al. (2010). The SNAQ measures the extent to which a worker feels they have been exposed to workplace bullying behaviors. Items describe common negative acts at work, covering work-related bullying (example item: “Someone withholding information which affects your performance”), personal bullying (example item: “Having insulting or offensive remarks made about your person, attitudes or your private life”), and social isolation/exclusion (example item: “Being ignored or excluded”). The SNAQ consists of 9 items rated on a 5-point scale expressing frequency of exposure (anchors: 1 = never, 2 = once or occasionally, 3 = monthly, 4 = weekly, 5 = daily). The introduction to the scale reads, “Please, indicate how often you have experienced each of the following behaviors in your workplace, over the past six months”.

Self-Labeled Bullying

In order to measure whether each participant perceived that they had been bullied at their workplace, we followed the approach used by the Workplace Bullying Institute (2017). First, we provided a specific definition and description of workplace bullying, reported as follows:

Workplace bullying is defined as a set of aggressive behaviors, carried out by one or more people (the “bully”), which are protracted over time and directed towards a worker (the victim), and whose purpose is to affect one or more of the following aspects:

- the victim’s reputation

- his/her ability to communicate

- his/her social relations

- his/her employment quality (e.g., no or only insignificant tasks are assigned)

- his/her health and well-being

Then, according to the provided definition of workplace bullying, participants were asked to respond Yes (coded as 1) or No (coded as 0) to the following statement: “I have experienced it now or have experienced it in the last year”.

Measures used for external validity

In order to preliminarily assess the external validity of the SNAQ, we used six measures of constructs belonging to three areas that have proved to be significantly correlated with workplace bullying: organizational climate, namely organizational social climate (Balducci et al., 2010) and unethical climate (Bulutlar & Öz, 2009); job attitudes, namely burnout (Conway et al., 2021b), job satisfaction (Rodríguez-Muñoz et al., 2009), and intention to quit (Bentley et al., 2021); and personality, namely self-efficacy beliefs (Fida et al., 2018).

Organizational social climate (α = .77) was measured by means of 4 items drawn from Vartia (1996) and Balducci et al. (2010, p. 152). The Likert scale ranged from 1 (disagree) to 5 (agree). Three items were reverse-scored so that the scale measures positive organizational climate (e.g., “There is envy at my workplace”, “Workmates compete with each other”; an example of a non-reversed item is, “In my workplace, there is a good level of cooperation and agreement among workers”).

Unethical climate (α = .88) was measured by means of 4 items drawn from the self-interest subscale of the ethical climate questionnaire by Cullen et al. (1993; Italian version: Pagliaro et al., 2018). The Likert scale ranged from 1 (disagree) to 5 (agree). An example item is, “There is no room for one’s own personal morals or ethics in my workplace”.

Burnout (α = .83) was measured by means of 4 items drawn from the Copenhagen Burnout Inventory by Kristensen et al. (2005; Italian validation: Avanzi et al., 2013). The Likert scale ranged from 1 (never) to 5 (always). An example item is, “I feel worn out at the end of the work day”.

Job satisfaction was measured with one item drawn from the Satisfaction with Job – General scale by Dubinsky and Hartley (1986; Barbaranelli et al., 2010), which reads, “Generally speaking, I am very satisfied with my job”, rated on a Likert scale ranging from 1 (disagree) to 5 (agree).

Intention to quit (α = .90) was measured by means of 3 items drawn from Landau and Hammer (1986, p. 404). The Likert scale ranged from 1 (disagree) to 5 (agree). An example item is, “I am seriously thinking about quitting my job”.

Self-efficacy beliefs in managing negative emotions at work (α = .81) were measured by means of the six-item scale used by Alessandri et al. (2018). This scale assesses how well a worker perceives they are able to manage negative emotions arising from adverse or stressful work-related events. The Likert scale ranged from 1 (not well at all) to 5 (very well). The introductory question was, “At work, how well can you:”, and an example item is, “Get over irritation quickly after the experience of a failure?”.

Data analyses

All analyses were conducted using the statistical open source software R (Version 4.0.2; R Core Team, 2020) and proceeded through the following steps.

Descriptive statistics and external validity

After examining the correlations between SNAQ items – which we expected to be of at least medium size (≥ .30), given that the tool has been used as a unidimensional measure – we investigated the external validity of the SNAQ by inspecting a series of zero-order correlations between the SNAQ and the six external constructs mentioned in the “Measures used for external validity” section. First, we built a composite score for each of the seven constructs (computing the mean of all items); we expected the SNAQ to correlate (a) significantly and positively with unethical climate, burnout, and intention to quit, and (b) significantly and negatively with organizational social climate, job satisfaction, and self-efficacy beliefs in managing negative emotions at work. Data wrangling and descriptive statistics were conducted with the R packages dplyr (part of the tidyverse ecosystem; Wickham et al., 2019; Wickham & Grolemund, 2017) and apaTables (Stanley, 2018).
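As a minimal sketch of this step (assuming a data frame dat with hypothetical item columns snaq1–snaq9 and burn1–burn4), composite scores and a zero-order correlation can be obtained as follows.

```r
## Minimal sketch (hypothetical data frame and column names).
library(dplyr)

dat <- dat %>%
  mutate(
    snaq    = rowMeans(across(snaq1:snaq9)),   # SNAQ composite score
    burnout = rowMeans(across(burn1:burn4))    # example external construct
  )

cor.test(dat$snaq, dat$burnout)   # expected: positive and significant correlation
```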

Structural validity

Structural validity was investigated through inspection of reliability indices and gender invariance. Both analyses were conducted by taking into account the categorical nature of SNAQ items and through the R packages lavaan (Rosseel, 2012) and semTools (Jorgensen et al., 2020).

First, structural validity was investigated through the analysis of two reliability coefficients, namely the nonlinear SEM reliability coefficient (ρNL; Yang & Green, 2015) and the Average Variance Extracted (AVE; Fornell & Larcker, 1981). These indices were computed for the overall scale and for males/females separately. Both indices were computed by means of the reliability() function of the semTools package, so as to take into account the ordinal nature of the scale (Flora, 2020). These reliability indices were computed on the basis of a confirmatory factor model (composed of one latent variable and the nine SNAQ items) estimated in lavaan with the WLSMV estimator, which is the most widely used estimator for structural equation models with categorical indicators (Finney & DiStefano, 2013). Reliability was considered good if ρNL > .70 and AVE > .50.
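A minimal lavaan/semTools sketch of this step, assuming hypothetical item names snaq1–snaq9 and a gender grouping variable, is the following.

```r
## Minimal sketch (hypothetical item and grouping variable names).
library(lavaan)
library(semTools)

snaq_model <- 'snaq =~ snaq1 + snaq2 + snaq3 + snaq4 + snaq5 +
                       snaq6 + snaq7 + snaq8 + snaq9'

fit_all <- cfa(snaq_model, data = dat, estimator = "WLSMV",
               ordered = paste0("snaq", 1:9))                    # whole sample
fit_sex <- cfa(snaq_model, data = dat, estimator = "WLSMV",
               ordered = paste0("snaq", 1:9), group = "gender")  # males/females

reliability(fit_all)   # includes the nonlinear SEM coefficient and the AVE
reliability(fit_sex)   # same indices, reported separately by group
```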

Second, structural validity was investigated by probing the tenability of measurement invariance across gender. In particular, we tested measurement invariance for categorical variables following Wu and Estabrook’s (2016) approach, through the routine provided by Svetina et al. (2020). Gender invariance is supported if two conditions are met: (a) constraining thresholds (see Footnote 3) to be equal across gender (Equal Thresholds Model) does not worsen the fit of the Baseline Model and (b) constraining loadings to be equal across gender (Equal Thresholds and Loadings Model) does not worsen the fit of the Equal Thresholds Model. A non-significant worsening of model fit is indicated if (a) the scaled chi-square difference test is not significant, (b) the scaled CFI does not decrease by more than 0.01, and (c) the scaled RMSEA does not increase by more than 0.01 (see Svetina et al., 2020). In order to ensure the robustness of findings (see Svetina et al., 2020) – and to provide scripts for a software package more familiar to latent variable users – we also estimated the measurement invariance models with Mplus Version 8.4 (Muthén & Muthén, 1998–2017).
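The following sketch illustrates, under the same hypothetical names, how the Wu and Estabrook (2016) models can be generated and compared with semTools::measEq.syntax(), along the lines of the Svetina et al. (2020) routine.

```r
## Minimal sketch (hypothetical names); WLSMV is lavaan's default for ordered indicators.
library(lavaan)
library(semTools)

common <- list(configural.model = snaq_model, data = dat,
               ordered = paste0("snaq", 1:9), parameterization = "delta",
               ID.fac = "std.lv", ID.cat = "Wu.Estabrook.2016",
               group = "gender", return.fit = TRUE)

fit_base <- do.call(measEq.syntax, c(common, list(group.equal = "configural")))   # Baseline
fit_thr  <- do.call(measEq.syntax, c(common, list(group.equal = "thresholds")))   # Equal Thresholds
fit_load <- do.call(measEq.syntax, c(common, list(group.equal = c("thresholds",
                                                                  "loadings"))))  # Equal Thr. and Loadings

lavTestLRT(fit_base, fit_thr, fit_load)                  # scaled chi-square difference tests
fitMeasures(fit_load, c("cfi.scaled", "rmsea.scaled"))   # to inspect changes in CFI/RMSEA
```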

Classification performance

The classification performance of the SNAQ was investigated by estimating a receiver operating characteristic (ROC) curve with the dichotomous variable self-labeled bullying as the outcome and the SNAQ as the independent variable. In order to take into account the measurement error of the SNAQ (Jacobucci & Grimm, 2020), the ROC curve was estimated using predicted values (regarding being vs. not being bullied) gathered from a structural equation model (SEM). In particular, we estimated two nested SEMs: in the first model (Conditional and Unconstrained model), we specified the same measurement structure as the Equal Thresholds and Loadings Model and added a path from the latent SNAQ variable to the dichotomous variable self-labeled bullying; in the second model (Conditional and Constrained model), we constrained that path to be equal across gender and tested whether model fit significantly worsened. We then used the best fitting model to extract the factor scores used to estimate the area under the ROC curve. Moreover, in order to compare the models’ performance and consistency (see Lantz, 2019), ROC curves were also estimated using predicted probabilities gathered from two of the most popular predictive models, namely a logistic regression model and a random forest model.
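A minimal sketch of the two nested conditional models and of the extraction of factor scores, assuming the same hypothetical variable names and a simplified set of equality constraints, could look as follows.

```r
## Minimal sketch (hypothetical names; constraints simplified relative to the full
## Wu & Estabrook specification used in the paper).
library(lavaan)
library(pROC)

cond_model <- '
  snaq =~ snaq1 + snaq2 + snaq3 + snaq4 + snaq5 + snaq6 + snaq7 + snaq8 + snaq9
  bullied ~ snaq    # path of interest (beta21 in Fig. 1)
'
ord_vars <- c(paste0("snaq", 1:9), "bullied")

fit_uncon <- sem(cond_model, data = dat, group = "gender", ordered = ord_vars,
                 estimator = "WLSMV", group.equal = c("thresholds", "loadings"))
fit_con   <- sem(cond_model, data = dat, group = "gender", ordered = ord_vars,
                 estimator = "WLSMV",
                 group.equal = c("thresholds", "loadings", "regressions"))

lavTestLRT(fit_uncon, fit_con)             # does constraining the path worsen model fit?

fs  <- lavPredict(fit_uncon)               # factor scores, one matrix per group
idx <- lavInspect(fit_uncon, "case.idx")   # rows of `dat` used within each group

roc_sem <- roc(response  = dat$bullied[idx[[1]]],        # first group (e.g., males)
               predictor = as.numeric(fs[[1]][, "snaq"]))
auc(roc_sem)                               # area under the curve for that group
```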

We evaluated the validity of the SNAQ as a classifier of workplace bullying by inspecting the area under the ROC curve (for the above three models, and for males and females separately), according to the following criteria outlined by Lantz (2019, p. 333): outstanding classifier = 0.9 to 1.0; excellent/good classifier = 0.8 to 0.9; acceptable/fair classifier = 0.7 to 0.8; poor classifier = 0.6 to 0.7; no discrimination = 0.5 to 0.6; classifier with no predictive value < 0.50.

In order to improve the visualization of results, we also provide double density plots for each model (Zumel & Mount, 2020). The above analyses were conducted with the R packages pROC (for ROC curves; Robin et al., 2011), randomForest (for the random forest model; Liaw & Wiener, 2002), and WVPlots (for double density plots; Mount & Zumel, 2020).
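For completeness, a double density plot of the kind reported in the Appendix can be obtained with WVPlots as sketched below (simulated data, hypothetical object names).

```r
## Minimal sketch with simulated data (hypothetical object names).
library(WVPlots)

set.seed(123)
bullied <- rbinom(300, size = 1, prob = 0.15)
pred_df <- data.frame(
  score   = rnorm(300, mean = ifelse(bullied == 1, 1, 0)),
  bullied = factor(bullied, levels = c(0, 1), labels = c("no", "yes"))
)

DoubleDensityPlot(frame = pred_df, xvar = "score", truthVar = "bullied",
                  title = "Predicted scores by self-labeled bullying (simulated)")
```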

Comparison of ROC curves

In order to verify whether our latent variable approach for the estimation of ROC curves outperforms traditional observed-variable approaches, we compared the ROC curves obtained with the SEM approach with those obtained with the logistic regression and random forest models by means of the roc.test() function of the pROC package (Robin et al., 2011), using the bootstrap method (2,000 resamples). This method estimates a statistic \(D=\frac{\mathrm{AUC}_1-\mathrm{AUC}_2}{s}\), where s is the standard deviation of the bootstrap differences and AUC1 and AUC2 are the AUCs of the two (original) ROC curves. The D statistic is z-distributed, and if the resulting p-value is significant (we fixed our alpha level at .05, thus p < .05), there is evidence of a significant difference between AUC1 and AUC2.
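A minimal sketch of this comparison (hypothetical names; roc_sem is the SEM-based ROC object from the sketch above, and in the paper the comparisons are run separately by gender) is the following.

```r
## Minimal sketch (hypothetical names); roc_sem is the SEM-based ROC object sketched earlier.
library(pROC)
library(randomForest)

fit_glm <- glm(bullied ~ snaq, data = dat, family = binomial)   # composite SNAQ score as predictor
fit_rf  <- randomForest(factor(bullied) ~ snaq1 + snaq2 + snaq3 + snaq4 + snaq5 +
                          snaq6 + snaq7 + snaq8 + snaq9, data = dat)

roc_glm <- roc(dat$bullied, predict(fit_glm, type = "response"))
roc_rf  <- roc(dat$bullied, predict(fit_rf, newdata = dat, type = "prob")[, 2])  # P(positive class)

roc.test(roc_sem, roc_glm, method = "bootstrap", boot.n = 2000)   # D statistic and p-value
roc.test(roc_sem, roc_rf,  method = "bootstrap", boot.n = 2000)
```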

Results

Descriptive statistics and external validity

In Table 1 we report descriptive statistics and correlation matrices (see Footnote 4) for the SNAQ items, for the whole sample and separately by gender (given that we subsequently ran a multiple-group analysis). As can be seen, all correlations were significant at p < .01 and ranged from .28 (item 6 with item 3) to .67 (item 4 with item 2) for the whole sample; from .30 (item 6 with item 1, item 6 with item 3, item 7 with item 6) to .73 (item 4 with item 2) for the male sample; and from .18 (item 9 with item 7) to .69 (item 7 with item 3) for the female sample. Overall, the size of the correlations ranged from medium to high (as expected for a unidimensional measure), with only 7 zero-order correlations below |.30| (a medium effect size, according to Cohen, 1992): 2 for the whole sample and 5 for the female sample. Interestingly, 5 (out of 7) involve Item 6 (“Repeated reminders of your errors or mistakes”). This finding suggests that the content of this item is less related to the others; at the same time, there is no reason to eliminate it, given that its correlations are in any case positive and significant. In Table 2, we report the external validity analyses along with descriptive statistics for each composite score. All zero-order correlations were in the expected direction and significant, with the exception of the correlation with self-efficacy beliefs in managing negative emotions at work in the female sample. This finding might suggest that organizational climate and job attitudes are more related to workplace bullying than a person-oriented variable like self-efficacy (Balducci et al., 2021), although such a low effect size for self-efficacy was not expected (Fida et al., 2018). Nonetheless, overall, the external validity of the scale was supported.

Table 1 Descriptive statistics and zero-order correlations for SNAQ items
Table 2 Zero-order correlations for SNAQ composite score (external validity analyses) and descriptive statistics for each composite score

Structural validity

Before running structural validity analyses, we replaced the value of ‘5’ with ‘4’ in all items, given that we noticed an imbalance of cells across gender (i.e., two zero-count cells in the female sample: cell ‘5’ for item 8, cell ‘4’ for item 9). This “collapsing” procedure is suggested when a dataset includes few (or zero) responses in extreme categories (DiStefano et al., 2021). Hence, subsequent models have 3 thresholds instead of 4.
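As a sketch, assuming the hypothetical item names used above, this collapsing can be done in one dplyr step.

```r
## Minimal sketch (hypothetical item names): collapse category 5 into category 4,
## leaving four observed categories (and hence three thresholds) per item.
library(dplyr)

dat <- dat %>%
  mutate(across(snaq1:snaq9, ~ ifelse(.x == 5, 4, .x)))
```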

The first step of the structural validity analyses was the examination of the above-mentioned reliability indices. First, we ran two SEMs for the measurement model of the scale: a model for computing reliability indices for the whole sample and a multiple-group SEM for computing reliability indices for the male and female samples. The models fit the data well, with the exception of the scaled (or robust) RMSEA (see Footnote 5). For the whole sample, the unscaled (or standard) fit indices were: WLSMV-based χ2(27) = 51.884, p = .003; CFI = 0.995; TLI = 0.994; RMSEA = 0.051; the scaled (robust) fit indices were: WLSMV-based χ2(27) = 93.554, p < .001; CFI = 0.976; TLI = 0.968; RMSEA = 0.083. For the multiple-group model, the unscaled (or standard) fit indices were: WLSMV-based χ2(54) = 87.526, p = .003; CFI = 0.995; TLI = 0.993; RMSEA = 0.059; the scaled (robust) fit indices were: WLSMV-based χ2(54) = 150.014, p < .001; CFI = 0.970; TLI = 0.960; RMSEA = 0.100. From these models we extracted reliability indices through the reliability() function of the semTools package: for the whole sample, ρNL = .90 and AVE = .59; for the male sample, ρNL = .91 and AVE = .61; for the female sample, ρNL = .90 and AVE = .58. All indices were above the recommended thresholds; thus, we can conclude that the SNAQ has a good degree of reliability.

The second step of the structural validity analyses was the examination of measurement invariance across gender. Results of measurement invariance for categorical variables (Svetina et al., 2020; Wu & Estabrook, 2016) are reported in Table 3. The model comparisons showed that both thresholds and loadings are invariant across gender (all delta chi-square tests were non-significant, and neither CFI nor RMSEA worsened by more than 0.01). Parameter estimates from the Equal Thresholds and Loadings Model are reported in Table 4.

Table 3 Measurement invariance results
Table 4 Parameters of interest from the ‘Equal Thresholds and Loadings Model’ estimated in lavaan

Classification performance

In the Appendix, we report descriptive statistics (Table A3) and double density plots (Fig. A1) for SNAQ composite scores as a function of gender and self-labeled bullying.

The first step of the classification performance analysis was the estimation and comparison of the Conditional and Unconstrained model and the Conditional and Constrained model. A representation of both models is reported in Fig. 1; the only difference is that the parameter β21 was free in the first model and constrained to be equal across gender in the second. Both models fit the data well [Conditional and Unconstrained model: WLSMV-based χ2(87) = 183.628, p < .001; CFI = 0.973; TLI = 0.972; RMSEA = 0.079. Conditional and Constrained model: WLSMV-based χ2(88) = 201.340, p < .001; CFI = 0.968; TLI = 0.967; RMSEA = 0.085], but constraining the path β21 to be equal across groups significantly worsened the fit of the model (Scaled-Δχ2 = 7.6138, Δdf = 1, p = .006). Thus, we gathered predictive estimates from the Conditional and Unconstrained model, in which β21 was not constrained to be equal across gender (for the male group: β21 = 0.774, SE = 0.078, z = 9.890, p < .001; R2 = 0.600; for the female group: β21 = 1.134, SE = 0.138, z = 8.228, p < .001; R2 = 0.826). After fitting a logistic regression model and a random forest model, we computed the areas under the curve obtained from the estimates of the three models. In Fig. 2, we report the ROC curve and the percentage of area under the curve for each model, separately by gender. Following the aforementioned criteria (Lantz, 2019), the classification performance of the SNAQ was outstanding for the SEM model, excellent (males) and outstanding (females) for the logistic regression model, and excellent for the random forest model.

Fig. 1

Specified structural equation model. Note. Asterisks indicate latent continuous variables assumed to underlie the observed categorical indicators. Given that we collapsed category 5 into category 4 (due to cell imbalance across gender), each item has 3 thresholds. The residual variance of each y* is fixed to zero. Paths for the mean-level structure are reported in grey for the sake of clarity

Fig. 2

ROC curves estimated from three different models. Note. The line along the diagonal represents the hypothetical performance of a random classifier (AUC = 50%). SEM (WLSMV) = Structural Equation Model estimated with WLSMV estimator; AUC = Area Under the Curve

Comparison of ROC curves

For the male sample, the AUC computed from the ROC curve built using SEM (90.3%) outperformed the AUC computed from logistic regression (87.3%, D = 1.9773, p = .048), but was not significantly better than the one computed from random forest (87.6%, D = 1.2438, p = .214). For the female sample, the AUC computed from the ROC curve built using SEM (96.6%) outperformed both that of logistic regression (94.5%, D = 2.6164, p = .009) and that of random forest (84.4%, D = 2.7108, p = .007). In sum, 3 out of 4 comparisons showed that the AUC computed with our approach outperformed those obtained with observed-variable approaches.

Discussion

This contribution had a three-fold aim: to further confirm the good psychometric properties of the Italian version of the SNAQ, to present a method for computing AUC values from ROC curves that are not biased by measurement error (unlike more traditional approaches), and to compare the AUC obtained after controlling for measurement error with the AUC computed by means of more traditional (observed-variable) approaches.

Our findings showed that the Italian SNAQ has optimal reliability values and gender-invariant parameters, thus supporting the structural validity of the instrument. In particular, the gender-invariance analysis showed the consistency of latent variable parameters across gender. This implies that the scale is able to (a) assess the same construct without being affected by potential gender-related differences (Ock et al., 2020; Vandenberg & Lance, 2000) and (b) detect genuine parameter differences across gender, since “the assumption that observed scores on a scale accurately reflect respondents’ standings on a measured construct” (Ock et al., 2020, p. 657) has been supported. Given that no previous study has investigated the gender invariance of the SNAQ, this finding is an important contribution to the literature on the psychometric properties of the SNAQ. Furthermore, the ROC curves showed that the classification performance of the Italian SNAQ was outstanding (according to Lantz, 2019, p. 333) for both the male (90.3%) and female (96.6%) samples; hence, the SNAQ is a good and brief instrument for classifying those who label themselves as bullied and those who do not. In conclusion, the Italian SNAQ showed very satisfactory psychometric properties that support its validity.

Another important point is that the definition we used for the self-labeled measure included a clear intent to harm, that is, the bully performs the listed behaviors on purpose (see Footnote 6). There is disagreement with regard to the inclusion of intent to harm among the features of bullying at work, in particular in the European tradition (see Einarsen et al., 2020, for an in-depth discussion). However, as Einarsen et al. (2020) put it, “whereas intent may be a controversial feature of bullying definitions, there is little doubt that perception of intent is important as to whether an individual decides to label their experience as bullying or not” (Einarsen et al., 2020, pp. 12–13). Our classification performance analysis supports the above reasoning, given that the SNAQ appears to be an optimal classifier of a self-labeled measure that includes “perceived” intent to harm, and the prevalence rate we found is similar to those commonly reported in the literature (i.e., from 6 to 11%; Zapf et al., 2020).

In this contribution, we also provided a data analytic strategy for investigating the classification performance of a continuous variable that has a certain degree of measurement error (Jacobucci & Grimm, 2020). Our results are consistent with Jacobucci and Grimm (2020), in that the comparison of ROC curves highlighted the importance of taking measurement error into account when calculating AUC values in order to evaluate the classification performance of an imperfectly reliable instrument (in our case, the SNAQ). Indeed, in all cases the AUCs computed from the SEM were higher than those computed with observed-variable techniques, and in three out of four cases this difference was significant. Furthermore, our data analytic approach allows one to verify whether the path linking a predictor variable to its classifier (β21 in Fig. 1) can be constrained to equality across groups and, accordingly, to decide whether to build an overall ROC curve or separate ROC curves. Indeed, if the likelihood ratio test shows that the aforementioned path cannot be constrained to equality across groups, then ROC curves should be estimated separately, because the classification performance of the instrument may vary across groups. Given that this information requires a preliminary multiple-group analysis along with a likelihood ratio test, it would not be available with a more traditional observed-variable approach. This is another advantage of using SEM when predicted values are of interest (Jacobucci & Grimm, 2020).

Related to the above point, Appendix Fig. A1 (see also Appendix Table A3) appears to explain the gender difference in the parameter β21: the SNAQ distribution of those who self-labeled as bullied differs between males and females; males reported a higher SNAQ mean but a wider distribution, as evidenced by a standard deviation approximately double that found in the female sample. Accordingly, the results from the ROC curves showed that the Italian SNAQ is more accurate in classifying self-labeled bullying among females than among males. In fact, as Appendix Fig. A1 shows, females who responded yes to the self-labeled bullying single item were mostly concentrated in the right part of the SNAQ score distribution, while this was not the case for men. This finding is consistent with Rosander et al. (2020), who found (a) that gender moderates workplace bullying across different measurement methods (behavioral experience vs. self-labeling) and (b) that men are more reluctant to admit being victims of bullying. These findings reinforce the need to establish the gender measurement invariance of behavioral experience methods before using them to study workplace bullying.

Limitations and future research

The present study has several limitations that should be acknowledged. First of all, the data are self-reported and cross-sectional. While we believe that the stated purposes of the paper match the design – and thus justify the use of a cross-sectional, self-report dataset with retrospective assessment of work-related bullying experiences (see Spector, 2019) – we also acknowledge that the use of longitudinal data and/or objective measures of bullying may significantly enhance analyses of the validity of the SNAQ.

Second, in predictive analytic models (i.e., machine learning, data mining, and big data approaches), models are both trained and tested in order to prevent overfitting and maximize predictive accuracy. In this study, we only trained models (without a separate test set) for two reasons: first, the low sample size and the low frequency of self-labeled bullied respondents (see Appendix Table A3) made it difficult to create k folds and cross-validate results; second, to our knowledge, there are no contributions or guidelines on train-and-test procedures when using structural equation models (e.g., for CFA designs – such as categorical CFA – and measurement invariance routines). Hence, we decided to use classical approaches for estimating SEMs, but we hope that future contributions will provide specific guidelines on training/testing data and on using k-fold cross-validation in such scenarios (e.g., Brandmaier & Jacobucci, in press).

Third, given the presence of some zero-cell counts, we collapsed categories before testing measurement invariance. While this procedure is advised by some scholars (DiStefano et al., 2021), it is also criticized by others (Rutkowski et al., 2019); again, we hope that future studies will resolve this matter.

Fourth, ROC analysis relies upon “a gold standard” (Streiner & Cairney, 2007). In the Introduction, we reported important reasons for testing the classification performance of the SNAQ by means of the self-labeling method. However, as Notelaers and Van der Heijden (2021) put it, while in medical science the gold standard is in most cases objective in nature (e.g., having or not having a disease), in psychology (in particular, in organizational psychology) the standard is weaker and subject to the researcher’s decisions. Indeed, “in workplace bullying and harassment research, an objective standard is often not available, as also in this scholarly field the empirical evidence is less strong” (Notelaers & Van der Heijden, 2021, p. 404). Thus, in future research alternative outcomes could be used to assess the classification performance of the Italian SNAQ. For example, Notelaers and Einarsen (2013) relied on a diagnostic criterion that identified the need for psychiatric treatment to prevent depression by means of the Hopkins Checklist (Derogatis et al., 1974). While future studies could adopt a similar “gold standard” (as well as a multi-item self-labeling measure of workplace bullying; see Conway et al., 2021a), our contribution adds to the literature a way of taking into account the measurement error of the SNAQ (and of other instruments with a non-negligible degree of measurement error) when building ROC curves.

Fifth, given that we used a categorical latent variable approach, it was not possible to extract a specific cut-off, as in Notelaers and Einarsen (2013). However, this approach is useful for the computation of an overall classification performance for validity purposes (as in this case).

Sixth, as a constraint on the generalizability of results (Simons et al., 2017), our findings may be generalized to a specific Italian province, while country-level generalization would require data collected from a representative sample of the whole country’s workforce (e.g., Notelaers & Einarsen, 2013).

Seventh, we used a six-month time frame for the SNAQ (following Balducci et al., 2010) but a one-year time frame for self-labeled bullying, due to constraints related to the purposes of the broader project. However, given the high stability of self-labeled victimization (attested even after four years; see Rosander & Blomberg, 2019), the results should not be significantly affected by this slight difference. Nonetheless, in future research it would be preferable to use the same time frame for both methods.

Finally, we ran a series of comparisons between ROC curves estimated with a latent variable approach and those estimated with observed-variable approaches. Although our findings are promising (three out of four AUCs were significantly higher), we recognize that a large-scale simulation study is needed to establish the differences between these methods.

Conclusion

In conclusion, our study demonstrated the validity of the Italian version of the SNAQ, hence providing further support for the utility of this instrument in assessing workplace bullying with only 9 items. Moreover, we provided new insights for the organizational research methodology literature by proposing a novel analytic strategy for building ROC curves through a Structural Equation Modeling approach, which performed significantly better than classical observed-variable approaches in analyzing the classification performance of an instrument that is not perfectly reliable – as is often the case with self-report instruments commonly used in organizational psychology and management. Finally, we hope that our contribution may be a step forward in the ongoing integration between predictive modeling approaches and latent variable models (e.g., Brandmaier & Jacobucci, in press; Jacobucci & Grimm, 2020), in particular with regard to the classification performance of measures with a non-negligible degree of measurement error.