The Hierarchical Rater Thresholds Model for Multiple Raters and Multiple Items


In educational measurement, various methods have been proposed to infer student proficiency from the ratings of multiple items (e.g., essays) by multiple raters. However, suitable models quickly become numerically demanding or even infeasible, as separate latent variables are needed to account for local dependencies between ratings of the same response. Therefore, in the present paper we derive a flexible approach based on Thurstone's law of categorical judgment. The advantage of this approach is that it can be fit using weighted least squares estimation, which is computationally less demanding than most previous approaches when the number of latent variables increases. In addition, the new approach can be applied using existing latent variable modeling software. We illustrate the model on a real dataset from the Trends in International Mathematics and Science Study (TIMSS) comprising ratings of 10 items by 4 raters for 150 subjects. In addition, we compare the new model to existing models including the facet model, the hierarchical rater model, and the hierarchical rater latent class model.


Introduction
In the field of educational measurement, inferences about students' latent proficiencies underlying educational tests are commonly based on item response theory (IRT) modeling tools. In standard cases, the educational test is purported to measure a single proficiency (unidimensionality) without residual associations between the students' observed scores on the different items (local independence). In educational practice, however, items may be clustered because they measure multiple latent variables (e.g., in worded arithmetic problems) or because they belong to different testlets. In addition, students may be clustered within classrooms and schools. Neglecting these properties of the testing situation violates the assumptions of the IRT model, which can result in biased or inefficient inferences concerning the students' proficiency (Sireci, Wainer, & Thissen, 1991; Wainer, 1995; Wainer & Thissen, 1996). Therefore, more sophisticated models need to be considered, including multidimensional IRT models (Béguin & Glas, 2001; Reckase, 2009), models for testlets (Bradlow, Wainer, & Wang, 1999; Cai, 2010a), and multilevel or hierarchical IRT models (Adams, Wilson, & Wu, 1997; Mislevy & Bock, 1989; Fox & Glas, 2001).
A specific testing situation in educational measurement concerns the case in which the scoring of the student's performance is not objectively possible using an answer key. For instance, in an essay assignment, the grade that a student obtains may depend on the rater that graded the essay. A straightforward and standard solution is to use multiple raters for the same essay. However, the resulting data will violate both the assumption of unidimensionality and the assumption of local independence, as the items and the students are clustered within the raters. The challenge is to account for these violations by modeling differences among raters in their rater characteristics. The so-called facet model (Linacre, 1989) has previously been used to address this challenge (e.g., Engelhard, 1994, 1996; Wilson & Wang, 1995) and can be considered a common method to model students' ratings. The facet model extends the standard IRT modeling approach by adding a fixed effect for 'rater severity' to account for differences in rater characteristics. Rater severity refers to the tendency of a rater to assign lower scores to a given response as compared to other raters. That is, the rater severity parameter can be seen as the rater counterpart of the item difficulty parameter.
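To fix ideas, the facet model's core can be sketched in adjacent-category (partial credit) form; this is a generic rendering of Linacre's (1989) idea rather than the exact parameterization used in any of the cited studies:

```latex
% Facet model sketch for the probability of category c versus c - 1:
% theta_p = student proficiency, b_i = item difficulty, gamma_ic = category
% parameter, phi_r = rater severity. Here a larger phi_r means a more severe
% rater, analogous to item difficulty; sign conventions vary across sources.
\log \frac{P(X_{pir} = c)}{P(X_{pir} = c - 1)}
  = \theta_p - b_i - \gamma_{ic} - \varphi_r
```

Because the severity term enters exactly like an item difficulty, all rater-item combinations remain locally independent given θ, which is the limitation discussed next.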
However, although the rater severity parameter accounts for differences across raters, it does not account for the dependency between ratings of the same responses. That is, all rater-item combinations are treated as locally independent. Neglecting the fact that only a subset of these data are locally independent indicators of the students' proficiency has been shown to lead to overestimation of the reliability of the students' assessment (see Mariano, 2002, and Wilson & Hoskens, 2001).
As will be discussed below, effort has been devoted to developing suitable models that take both the common rater and the common item effects into account. Although valuable, these models quickly become numerically demanding as the number of items and the number of subjects increase. Therefore, in this paper, we propose a new approach that takes the different dependencies in the data into account in a similar way as the existing models, but which can be estimated in a numerically less demanding way. The outline of this paper is as follows: Below, we review the current modeling approaches and discuss the numerical challenges that motivated the development of our model. Next, as our modeling approach builds upon these models, we formally present the hierarchical rater model by Patz et al. (2002), the hierarchical rater latent class model by DeCarlo et al. (2011), and the generalized rater model by Wang et al. (2014). We then derive our new model, the hierarchical rater thresholds model, and present an application to real data from the Trends in International Mathematics and Science Study (TIMSS) comprising ratings of 10 items by 4 raters for 150 subjects. We end with a general discussion.

Current Models
As mentioned above, effort has been devoted to developing suitable models for local dependencies due to multiple ratings of the same responses. Among these approaches are the IRT model for multiple raters (Verhelst & Verstralen, 2001), the rater bundle model (Wilson & Hoskens, 2001), the hierarchical rater model (Casabianca, Junker, & Patz, 2016; Patz, Junker, Johnson, & Mariano, 2002), the hierarchical rater latent class model (DeCarlo, Kim, & Johnson, 2011), and the generalized rater model (Wang, Su, & Qiu, 2014). The models are similar in that they account for the common rater effect by rater parameters, similar to the facet model. The models differ, however, in the way that they account for the common item effect; see Table 1. The rater bundle model is the most different from the other approaches. In this model, the common item effects are introduced by means of a fixed linear interaction term between pairs of raters of the same item. The interaction effect accounts for the degree to which two raters agree more than would be expected on the basis of the student proficiency and the raters' severities.
In the other models, the common item effect is modeled by introducing a separate latent variable for each item, denoted the ideal rating. The most important difference between the models in Table 1 is that the hierarchical rater model and the hierarchical rater latent class model treat these ideal ratings as categorical variables (see also DeCarlo, 2005), while in the IRT model for multiple raters and the generalized rater model, the ideal ratings are continuous latent variables. Besides these differences, the models differ in the number of rater parameters. That is, all models except the facet model and the IRT model for multiple raters include an additional rater parameter besides the rater severity. This so-called rater variability parameter models the variability in the ratings by the raters. That is, some raters can be more variable in their ratings as compared to other raters. In these models, the rater severity and the rater variability can be either rater specific, or item and rater specific. In the hierarchical rater latent class model by DeCarlo et al. (2011), the rater parameters are specified in a somewhat different way, with the rater severities being item, rater, and response category dependent, and the rater variability being rater and item dependent. Note that DeCarlo et al. (2011) use 'response criteria' and 'detection parameters', respectively, to refer to these parameters. We will explicate this difference later in the formal model presentation.

Estimation Challenges
The models in Table 1 differ in the estimation procedures used to apply them. That is, while the hierarchical rater model and the generalized rater model rely on Markov chain Monte Carlo (MCMC) estimation, the hierarchical rater latent class model, the rater bundle model, and the facet model rely mainly on marginal maximum likelihood (MML). Although certainly feasible, both MCMC and MML have their computational challenges with respect to models with a large number of latent variables. As all models but the rater bundle model and the facet model include a separate latent variable for each item, the number of latent variables increases rapidly with the number of items. In such high dimensional models, MCMC will become slow and MML may even become infeasible (see e.g., Wood et al., 2002; although more efficient MML algorithms have been proposed by e.g., Cai, 2010b). In the case of the rater bundle model, as noted by Patz et al. (2002) and Wilson and Hoskens (2001), MML estimation becomes computationally infeasible for an increasing number of items and/or raters due to the rapid increase in the number of interaction terms that are needed. In addition, for the IRT model for multiple raters no estimation algorithm has been developed yet, as the maximization of the marginal likelihood function is not straightforward, as discussed in Verhelst and Verstralen (2001).

The Hierarchical Rater Thresholds Model
As discussed above, both MML and MCMC have their challenges when it comes to datasets with an increasing number of items. If the number of items is large, a practical model is therefore desirable that accounts for the variability due to multiple raters and multiple items, but that can also be estimated in large datasets in a numerically efficient way. In this paper, we present such an approach. We refer to it as 'The Hierarchical Rater Thresholds Model', as the model is a hierarchical model and accounts for differences across raters by explicitly separating the raters' effects from the item effects on the threshold parameters of the categorical observed variables.
The hierarchical rater thresholds model draws from the hierarchical rater model by Patz et al. (2002), the hierarchical rater latent class model by DeCarlo et al. (2011), and the generalized rater model by Wang et al. (2014). In addition, it utilizes Thurstone's model for categorical judgement (1928; see Torgerson, 1958) in the rater part of the model, and it includes the IRT model for multiple raters by Verhelst and Verstralen (2001) as a special case. The most important features of the hierarchical rater thresholds model are: 1) the model includes fewer latent variables as compared to the generalized rater model while still accounting for both variability due to differences in items and differences in raters; 2) the model is formulated in such a way that the parameters can be estimated using weighted least squares (WLS; Muthén, 1984), an estimation procedure that is more robust to an increasing number of latent variables in terms of convergence and computation time as compared to MCMC and MML; and 3) similarly to the generalized rater model by Wang et al. (2014), the proposed model is a generalized linear latent variable model (Bartholomew, Knott, & Moustaki, 2011; Moustaki & Knott, 2000; Skrondal & Rabe-Hesketh, 2004), which makes it possible to fit the model in standard and flexible software packages like Mplus (Muthén & Muthén, 2007), Lisrel (Jöreskog & Sörbom, 2001), Amos (Arbuckle, 1997), Mx (Neale, Boker, Xie, & Maes, 2006), SAS (SAS Institute, 2011), and OpenMx (Boker et al., 2010).

The Hierarchical Rater Model
Level 1. On the first level of the hierarchical rater model, the observed ordinal rating, X pir , of student p = 1, …, N on item i = 1, …, n by rater r = 1, …, R, is linked to the ordinal ideal rating, ξ pi , by a normal model. That is, the probability that X pir equals c is given by

P(X_pir = c | ξ_pi = k) ∝ exp( −(c − (k + φ_ir))² / (2ψ_ir²) ),    (1)

where ∝ denotes 'proportional to' and φ ir and ψ ir are respectively the item specific rater severity and rater scale parameters (or rater variability parameter ψ²_ir).¹ A lenient rater who scores on average more towards the upper end of the scale, as compared to the other raters, will be characterized by a large positive φ ir . A severe rater who scores on average more in the lower end of the scale, as compared to the other raters, will be characterized by a large negative φ ir . In addition, a rater who shows more variability in the scores as compared to the other raters will be characterized by a larger ψ ir . Note that these rater effects may be item dependent; that is, a rater can be more lenient or variable on one item and less lenient or variable on another item.

¹ The original parameterization uses ψ r to represent the rater specific variance; however, to enable a comparison to the hierarchical rater thresholds model later, we use ψ r as a scale parameter with variability ψ²_r.

Level 2. At the next level, the ordinal ideal ratings, ξ pi , are linked to the continuous latent student proficiency variable, θ p , using a (generalized) partial credit model. That is, the probability that ξ pi equals k is given by

P(ξ_pi = k | θ_p) ∝ exp( Σ_{v=1}^{k} a_i (θ_p − b_i − γ_iv) ),    (2)

where a i is an item discrimination parameter, b i is the general difficulty of item i (some items are harder than other items), and γ iv is the difficulty of category v with respect to this general difficulty level. At the top level of the model, a normal distribution for θ p is specified.
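To make the two levels concrete, the following sketch computes the level-1 and level-2 category probabilities under the standard HRM formulation; all parameter values in any example call are hypothetical:

```python
import math

def hrm_level1_probs(k, phi, psi, n_cat):
    """P(X = c | xi = k): discrete 'normal' rating model centered at k + phi.
    k: ideal rating category; phi: rater severity; psi: rater scale."""
    w = [math.exp(-((c - (k + phi)) ** 2) / (2 * psi ** 2)) for c in range(n_cat)]
    z = sum(w)
    return [wi / z for wi in w]

def gpcm_probs(theta, a, b, gammas):
    """P(xi = k | theta) under the (generalized) partial credit model.
    gammas[v-1] is the category parameter for step v; k = 0 has an empty sum."""
    n_cat = len(gammas) + 1
    logits = [sum(a * (theta - b - gammas[v]) for v in range(k))
              for k in range(n_cat)]
    m = max(logits)  # stabilize the exponentials
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    return [wi / z for wi in w]
```

For example, a lenient rater (phi = 0.8) viewing an ideal rating of 2 on a 6-point scale places most probability on the observed category 3, one step above the ideal rating.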
The hierarchical rater model above is attractive as it accounts for both the variability in the data due to common raters and the variability due to common items. In addition, parameters φ ir and ψ ir make it possible to quantify rater characteristics, which might be helpful to assess rater reliability and to judge individual ratings by a given rater.

The Hierarchical Rater Latent Class Model
Level 1. The hierarchical rater latent class model is a latent class version of the model above. That is, at level 1, the observed data X pir are connected to the categorical ideal ratings ξ pi by using a graded response model for categorical latent variables. The probability of a rating in category c or higher is then given by

logit P(X_pir ≥ c | ξ_pi = k) = λ_ir k − τ_irc.    (3)

Note that although DeCarlo et al. (2011) explicitly leave the link function open, here we focus on the logit link. In the model, λ ir is a rater discrimination parameter (referred to as a 'detection parameter' by DeCarlo et al., 2011) which can be item specific, and τ irc are item, rater, and response category specific threshold parameters (referred to as 'relative criteria' by DeCarlo et al., 2011). Note that the model is general, containing many parameters, but both λ ir and τ irc can be constrained to only contain rater, item, or category effects. Level 2. At level 2 of the hierarchical rater latent class model, a generalized partial credit model is specified similar to Eq. 2 above, but with category specific difficulty parameters, β ik , instead of a separate overall difficulty b i and a category parameter γ ik , that is:

P(ξ_pi = k | θ_p) ∝ exp( Σ_{v=1}^{k} a_i (θ_p − β_iv) ).    (4)
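The rater part of the latent class model can be sketched as follows; the logit link matches the presentation above, and the parameter values in the example are purely illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hrlcm_level1_probs(k, lam, taus):
    """Category probabilities P(X = c | xi = k) under the logit graded
    response model: P(X >= c) = sigmoid(lam * k - tau_c).
    taus must be strictly increasing so probabilities are nonnegative."""
    # cumulative P(X >= c): bounded by 1 (c = 0) and 0 (c = C)
    cum = [1.0] + [sigmoid(lam * k - t) for t in taus] + [0.0]
    return [cum[c] - cum[c + 1] for c in range(len(taus) + 1)]
```

A higher ideal rating k shifts the whole distribution of the observed rating upward, while lam (the 'detection parameter') governs how sharply the observed rating tracks the ideal one.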

The Generalized Rater Model
Both hierarchical rater models above are based on two categorizations, one on the rater level and one on the observed variable level. This complicates model estimation as each category on each level of the model is associated with separate parameters. An important feature of the generalized rater model by Wang et al. (2014) is that it assumes the ideal scores, ξ pi , to be continuous variables with a normal distribution. Due to this assumption, the model becomes numerically less complex as there are fewer parameters involved at the latent level (i.e., in the model by Wang et al., the distribution of the latent variable ξ pi can be characterized by a mean and a variance parameter, while in the hierarchical rater model separate parameters for all K − 1 latent categories need to be estimated). Besides these continuous latent ideal score variables, the model includes a separate normally distributed latent variable ω pr for each rater, with its mean equal to the item-invariant rater severity, μ_ω = φ r , and its variance equal to the item-invariant rater specific variability, σ²_ω = ψ²_r . The observed rating X pir then depends on both the ideal score ξ pi and the rater effect ω pr , where all other parameters have interpretations similar to those above. Note that the model does not include the hierarchical structure that is characteristic of the hierarchical rater model, because there are no higher-order random person effects that are nested in lower-order random person effects. However, the model does account for the local dependencies in the data, with ξ pi accounting for the dependencies due to the raters rating the same items, and ω pr accounting for the dependencies due to the items being rated by the same raters.

The Hierarchical Rater Thresholds Model
The model we propose in this paper, the hierarchical rater thresholds model, has a hierarchical structure similar to the hierarchical rater model by Patz et al. (2002) and DeCarlo et al. (2011). In addition, the rater thresholds model assumes continuous ideal scores ξ pi similarly as in the generalized rater model by Wang et al. (2014). However, here, we will treat the observed ratings as if they arose from the categorization of the ideal ratings at increasing thresholds. This idea is motivated by Thurstone's model for categorical judgement (1928; see Torgerson, 1958). The resulting models are all special cases in the item factor analysis framework (Wirth & Edwards, 2007) and can therefore be applied using WLS estimation. See Figure  1 for a graphical representation of the hierarchical rater threshold model. Below we describe the model.

Level 1: Thurstone's model for categorical judgement
Thurstone's model for categorical judgement postulates that the categorical judgements in X pir arise from the categorization of an underlying normally distributed continuum at increasing thresholds. Here, we use this idea by assuming that the ideal score ξ pi represents the continuous score that item i by student p would have obtained if an unambiguous and objective continuous scoring rule had been available. These continuous scores ξ pi are then internally categorized by the raters at rater and item specific thresholds. Thus, the model postulates that observed ratings of the same item by the same student may differ across raters because raters differ in the thresholds they use to categorize the continuous ideal scores ξ pi , that is,

X_pir = c  if  τ_ir(c−1) < ξ_pi ≤ τ_irc,    (5)

where τ irc denotes the c-th threshold of rater r on item i, with τ_ir0 = −∞ and τ_irC = ∞. Note that the thresholds do not depend on the student p as we assume, as explicitly discussed in Verhelst and Verstralen (2001), that the raters solely base their assessment of item i by student p on the quality of the response and not on other indications of that specific student's proficiency (as might for instance happen when the raters are also the students' teachers, such that they have an a priori expectation about the student's proficiency). The thresholds in Eq. 5 do not yet include separate item and rater parameters. That is, irrespective of the rater, on some items it may be harder to obtain a high rating because the item is more difficult. In addition, irrespective of the item, it may be harder to obtain a higher rating from some raters because these raters are less lenient. Therefore, we separate item and rater effects on the thresholds by using

τ_irc = ψ_ir β_ic − φ_ir,    (6)

where β ic is the difficulty for the c-th category of item i, and φ ir and ψ ir are the item specific rater severity and rater scale parameter (or rater variability ψ²_r), similarly to the hierarchical rater model in Eq. 1.
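The categorization of Eq. 5 and an item-rater decomposition of the thresholds can be illustrated as follows. The decomposition used in the sketch, tau = psi * beta − phi, is one way to write the separation of item and rater effects described above (a lenient rater, positive phi, lowers all thresholds); it and all numeric values are assumptions for illustration:

```python
import bisect

def categorize(xi, cuts):
    """Thurstonian categorization: returns the category c such that
    cuts[c-1] < xi <= cuts[c], with implicit -inf and +inf end thresholds.
    cuts must be sorted; ties on a threshold fall in the lower category."""
    return bisect.bisect_left(cuts, xi)

def rater_thresholds(betas, phi, psi):
    """Item-rater threshold decomposition (assumed form): a lenient rater
    (phi > 0) lowers every threshold, so the same ideal score xi ends up in
    an equal or higher observed category."""
    return [psi * b - phi for b in betas]
```

For a 6-point scale with hypothetical category difficulties betas = [-1.0, -0.3, 0.3, 1.0, 1.8], a reference rater (phi = 0, psi = 1) maps xi = 0.5 to category 3, while a lenient rater with phi = 1 maps the same xi to category 4.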
For reasons of identification, φ ir and ψ ir should be fixed to 0 and 1, respectively, for all items of an arbitrarily chosen rater to provide a reference point. Note that φ ir and ψ ir are then identified as they account for item and rater specific departures from the main item effect in β ic . Parameters φ ir and ψ ir can be tested to be item-invariant, that is, φ ir = φ r and ψ ir = ψ r . However, these parameters cannot be specified as rater-invariant (φ ir = φ i and ψ ir = ψ i ), as this will result in an unidentified model: parameters φ i and ψ i can be absorbed in β ic . Therefore, in the model specification in Eq. 6, φ ir and ψ ir can truly be seen as rater parameters and not as item parameters. In addition, β ic can truly be seen as item parameters and not as rater parameters.
The way we apply Thurstone's model here is different from the so-called underlying variable approach that is applied in item factor analysis and item response theory (Wirth & Edwards, 2007;Takane & De Leeuw, 1987;Samejima, 1969). Specifically, in the underlying variable approach, a separate variable is assumed to underlie each X pir while, here, we assume a common underlying variable for all r (i.e., ξ pi ). Differences in X pir across raters arise due to the differences in the thresholds and not due to differences in the underlying variable.
As the ideal scores are assumed to be normal, applying the categorization scheme in Eq. 5 results in the following model for the probability of a score in category c or higher:

P(X_pir ≥ c | ξ_pi) = Φ( λ_ir ξ_pi − τ_irc ),    (7)

with the thresholds τ irc given by Eq. 6, with Φ denoting the standard normal distribution function, and with λ ir being an item specific rater discrimination parameter. Note that in this model, the thresholds (and thus φ ir and ψ ir ) are on a probit scale. The item specific discrimination parameter can be identified by fixing λ ir = 1 for all items i of an arbitrary rater r, or by fixing σ²_ξi = 1 for all items i. Parameter λ ir can be constrained to only reflect a rater effect (λ ir = λ r ), an item effect (λ ir = λ i ), or a common effect (λ ir = λ), which is equivalent to omitting λ ir from Eq. 7 (as the common effect λ can be absorbed in σ²_ξi ). All these extensions can be combined with item specific or item invariant rater parameters for τ irc in Eq. 6. We illustrate this in the real data application. A final note is that the model above is equivalent to a graded response model (Samejima, 1969) with linear constraints on the item category parameters. In addition, for C = 2, φ ir = φ r , and ψ ir = 1, the model simplifies to the model by Verhelst and Verstralen (2001).
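A cumulative probit model of this kind implies category probabilities by differencing adjacent cumulative probabilities. The following self-contained sketch (with hypothetical values) makes that computation explicit:

```python
import math

def Phi(x):
    """Standard normal CDF via the error function (no scipy needed)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def probit_category_probs(xi, lam, taus):
    """P(X = c | xi) implied by the cumulative model
    P(X >= c | xi) = Phi(lam * xi - tau_c), taus strictly increasing."""
    cum = [1.0] + [Phi(lam * xi - t) for t in taus] + [0.0]
    return [cum[c] - cum[c + 1] for c in range(len(taus) + 1)]
```

With symmetric thresholds around zero and xi = 0, the implied category distribution is symmetric, which is a quick sanity check on the differencing.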

Level 2: A linear latent variable model
Now that the ideal ratings ξ pi in Eq. 5 are defined and linked to the observed data, they can be submitted to an appropriate measurement model. Here we use the linear common factor model as proposed by Mellenbergh (1994) for continuous item scores, that is,

ξ_pi = α_i θ_p + ε_pi,    (8)

where α i is the item discrimination parameter and ε pi is the first-order item specific residual with variance σ²_εi . Note that the regression does not include an intercept as this parameter is not identified in single group applications. To finalize the model specification, at the top of the model we assume, similarly as in the hierarchical rater model, that the student proficiency variable θ p is (standard) normally distributed. For reasons of identification, VAR(θ p ) should be fixed to 1 or α i should be fixed to 1 for an arbitrary item.
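Putting the two levels together, a minimal data-generating sketch of the model can be written as follows; the discrimination values and threshold lists are hypothetical, and residual variances are fixed to 1 for simplicity:

```python
import random, bisect

def simulate_hrtm(n_students, alphas, rater_cuts, seed=1):
    """Generate ratings from the hierarchical rater thresholds structure:
    theta_p ~ N(0, 1); xi_pi = alpha_i * theta_p + eps_pi (unit residual sd);
    rater r then assigns the category in which xi_pi falls given his or her
    own item specific thresholds. rater_cuts[r][i] is a sorted cut list."""
    rng = random.Random(seed)
    data = []  # data[p][i][r] = observed ordinal rating
    for _ in range(n_students):
        theta = rng.gauss(0.0, 1.0)
        student = []
        for i, a in enumerate(alphas):
            xi = a * theta + rng.gauss(0.0, 1.0)  # continuous ideal score
            student.append([bisect.bisect_left(rc[i], xi) for rc in rater_cuts])
        data.append(student)
    return data
```

Because every rater categorizes the same xi_pi, ratings of the same response are dependent even after conditioning on theta, which is exactly the local dependence the model is meant to capture.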

Estimation of the Hierarchical Rater Thresholds Model
The full hierarchical rater thresholds model is given by Eq. 6, Eq. 7, and Eq. 8 with a normal distribution for θ p . It is a hierarchical model with first-order factors ξ pi , a second-order factor θ p , and constraints on the item category parameters. As all relations in this full model are generalized linear, the model is a member of the generalized linear latent variable modeling framework (Bartholomew, Knott, & Moustaki, 2011; Moustaki & Knott, 2000; Skrondal & Rabe-Hesketh, 2004). An appealing property of the hierarchical rater thresholds model being a generalized linear latent variable model is that we can estimate the parameters by WLS estimation (see Muthén, 1984). As discussed above, WLS is a suitable and numerically less demanding estimation procedure in the case of an increasing number of latent variables. As in typical models for raters, the number of random effects increases rapidly if the number of items increases. In WLS, the following fitting function is minimized with respect to the free model parameters:

F = (s − σ)ᵀ W⁻¹ (s − σ),

where σ is the vector of model implied polychoric means and correlations (which are a function of the free model parameters in Eq. 6, Eq. 7, and Eq. 8), s is the vector of the estimated polychoric means and correlations of the observed data, and W is a weight matrix, commonly the asymptotic covariance matrix of these estimates, V. Matrix W can become large for an increasing number of observed variables. As W has to be inverted during estimation, the true benefits of WLS are obtained if W is taken to be the diagonal of V (see e.g., Li, 2016; Flora & Curran, 2004). In that case, the standard errors of the model parameters and the χ² goodness-of-fit statistic need to be corrected for the loss of information due to the neglect of the off-diagonal elements of V during estimation (see Muthén, du Toit, & Spisic, 1997).
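With a diagonal W, the fitting function reduces to a weighted sum of squared residuals between sample and model-implied polychoric statistics, which is what makes the diagonal variant cheap. A minimal sketch (illustrative vectors only):

```python
def dwls_fit_function(s, sigma, v_diag):
    """Diagonally weighted least squares discrepancy
    F = (s - sigma)' W^{-1} (s - sigma) with W = diag(v_diag):
    each squared residual between a sample statistic (s) and its
    model-implied counterpart (sigma) is weighted by 1 / variance."""
    return sum((si - mi) ** 2 / w for si, mi, w in zip(s, sigma, v_diag))
```

An optimizer would minimize this function over the free parameters that generate sigma; no inversion of a full weight matrix is required, which is the computational advantage noted in the text.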
As in WLS the expectations are taken over the latent variables ξ pi and θ p in the vector σ, these latent variables do not have to be estimated (as in MCMC) or marginalized out (as in MML). The estimation procedure is therefore numerically less demanding, even if there are many latent variables ξ pi (see Muthén, Muthén, & Asparouhov, 2005). In a simulation study, Beauducel and Herzberg (2006) showed that a sample size of 250 subjects is enough to fit models with up to 8 latent variables and 40 ordinal variables with 2-, 3-, 4-, 5-, or 6-point scales. To compare: Wood et al. (2002) discuss that more than 5 latent variables is already infeasible for MML (i.e., MML using conventional Gauss-Hermite quadrature; the maximum number of latent variables can be increased by using more efficient approximations as proposed by e.g., Cai, 2010b, but estimation can still quickly become, at least practically, infeasible).

Data
In this section, we apply the hierarchical rater thresholds model from Eq. 6, Eq. 7, and Eq. 8 to a real data set pertaining to mathematics ability. Specifically, the data were collected within the Trends in International Mathematics and Science Study (TIMSS) in 2007 in Turkey. The data comprise ratings from four raters of 10 items by 150 8th-grade students. The raters are all middle school teachers. The items were rated on a 6-point scale. The data were collected in a fully crossed design, with all students responding to all items, which in turn were rated by all teachers.
Although the data come from a large-scale international assessment, the size of the dataset is admittedly not extremely large (10 items, 4 raters, and 150 subjects). However, we note that the dataset is large enough to demonstrate the value of our approach. That is, in the present dataset MML based approaches are already (practically) infeasible as the model contains 10 latent variables for the common item effects. In addition, as described below, estimation time for the hierarchical rater model by MCMC is already five times longer than the estimation time of our WLS based approach.

Models
To the data, we fit various versions of the hierarchical rater threshold model with different structures for τ irc and λ ir in Eq. 6. In addition, for a comparison, we fit two rater thresholds models that are based on the facet model in which we omit the common item effects, ξ pi , from the model similarly as in the original facet model by Linacre (1989). That is, we consider a model without ξ pi but with item specific rater parameters, φ ir and ψ ir , and a model without ξ pi but with item invariant rater thresholds, φ r and ψ r . Finally, we also consider (different versions of) the hierarchical rater latent class model by DeCarlo (2005) and the hierarchical rater model by Patz et al. (2002).

Estimation
The hierarchical rater threshold models (including the facet model versions) are fit in Mplus (Muthén & Muthén, 2007) using WLS estimation with a diagonal weight matrix W (also referred to as "diagonally weighted least squares") and a correction on the standard errors and the χ² goodness of fit statistic (Muthén et al., 1997). The script to fit the full model in the case of 4 raters and 10 items is available in the Supplementary Materials. As changing this script to a desired number of items or raters can be cumbersome, we wrote an R script that can be used to generate Mplus scripts for any number of items and any number of raters, and for all special cases of the model as discussed in this paper. The R script is available on MASKED.
To assess model fit we use the Root Mean Square Error of Approximation (RMSEA; Browne & Cudeck, 1993), the Comparative Fit Index (CFI; Bentler & Bonett, 1980), and the Tucker-Lewis Index (TLI; Bentler & Bonett, 1980). According to the guidelines by Schermelleh-Engel, Moosbrugger, and Müller (2003), the RMSEA indicates an acceptable fit for values between 0.05 and 0.08, and a good fit for values smaller than 0.05. For the CFI and TLI, values between 0.95 and 0.97 indicate acceptable model fit, and values above 0.97 indicate good model fit. The hierarchical rater model by Patz et al. (2002) is estimated using MCMC in the immer R package (Robitzsch & Steinfeld, 2018a; see also Robitzsch & Steinfeld, 2018b; we used 5000 iterations with a burn-in of 1000). In addition, the hierarchical rater latent class model by DeCarlo (2005) is estimated using maximum likelihood in the R package sirt (Robitzsch, 2020; see also Robitzsch & Steinfeld, 2018b). Model fit for the latter model is assessed using the AIC, BIC, CAIC, and AICc fit indices.
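For convenience, the cited cut-offs can be wrapped in a small helper; this is only a rough classification aid reflecting the guidelines quoted above, not part of the guidelines themselves:

```python
def classify_fit(rmsea, cfi, tli):
    """Classify fit per the Schermelleh-Engel et al. (2003) cut-offs cited
    in the text: RMSEA < 0.05 good, 0.05-0.08 acceptable; CFI/TLI > 0.97
    good, 0.95-0.97 acceptable. Cut-offs only; not a substitute for judgment."""
    def idx_label(x):  # CFI / TLI: higher is better
        return "good" if x > 0.97 else "acceptable" if x >= 0.95 else "poor"
    rmsea_label = ("good" if rmsea < 0.05 else
                   "acceptable" if rmsea <= 0.08 else "poor")
    return {"RMSEA": rmsea_label, "CFI": idx_label(cfi), "TLI": idx_label(tli)}
```

For instance, a model with RMSEA = 0.06, CFI = 0.98, and TLI = 0.96 would be labeled acceptable on RMSEA and TLI and good on CFI.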

Estimation time
To illustrate the efficiency of the present model, we first compared estimation time between the full hierarchical rater thresholds model estimated using WLS and the hierarchical rater model estimated using MCMC as discussed above. The rater thresholds model took 55 seconds to estimate on an average laptop (Intel Core i5 CPU, 2.30 GHz, with 8 GB RAM), while the hierarchical rater model took 4 minutes and 38 seconds. Of course, the estimation time of the hierarchical rater model depends on the number of iterations, but we think this does illustrate the difference between both estimation approaches.

Modeling results: Rater threshold models
The model fit results concerning the rater threshold models are presented in Table 2. As can be seen, all hierarchical rater thresholds models have an acceptable fit according to the RMSEA, with values between 0.058 and 0.079. According to the CFI and TLI, all rater thresholds models fit well, with values close to 0.98/0.99. In contrast, the facet versions of the rater threshold model fit poorly. The facet model with item specific rater parameters has a CFI value of 0.911, a TLI value of 0.916, and an RMSEA value of 0.178. The item invariant model also fits poorly, with a CFI value of 0.908, a TLI value of 0.919, and an RMSEA value of 0.175.
Parameter estimates for the rater parameters φ ir and ψ ir in the different rater threshold models are presented in Tables 4, 5, and 6 for Models 1a to 2d from Table 2. Note that Models 3a to 3d do not have separate rater parameters (i.e., φ ir = φ = 0 and ψ ir = ψ = 1 for identification reasons in these models). As can be seen from Tables 4 and 5, estimates for φ ir and ψ ir hardly differ across Models 1a to 1d. That is, the exact configuration of λ ir does not importantly affect the rater severity and variability estimates, at least for this dataset. However, it can be seen that, within a rater, φ ir and ψ ir differ across items. For instance, in Table 4 it can be seen that the estimated severity of rater 2 equals -0.445 (SE: 0.234) on item 5 and 0.535 (SE: 0.079) on item 8. Similar differences are notable for ψ ir in Table 5. Thus, the rater threshold model clearly outperforms the facet model, which was to be expected as the facet model does not take the common item effects into account. However, for the different rater threshold models, it is more difficult to select the best fitting model as all fit indices (especially the TLI and CFI) are very close. Therefore, the question arises which model to choose. As can be seen in Table 2, having restrictions on φ ir and ψ ir generally decreases model fit. That is, all models with restrictions on φ ir and ψ ir fit less well as compared to their corresponding model without restrictions on φ ir and ψ ir . With respect to the restrictions on λ ir , it can be seen that having item specific (λ i ) or rater specific (λ r ) parameters generally decreases model fit as compared to an unrestricted λ ir . However, having a common parameter, λ ir = λ, is associated with the best fit. Thus, statistically, one would choose the model with unrestricted φ ir and ψ ir and a common parameter λ, as this is, at least statistically, the best fitting model.
However, as the model fit indices are close, in practice one would want to take practical and substantive considerations into account as well. In addition, one should keep in mind that the fit measures that we used (CFI, TLI, and RMSEA) reflect how well the model describes the polychoric correlations. For instance, a difference in RMSEA of 0.002 between Model A and Model B indicates that Model B describes the polychoric correlation matrix better than Model A. Thus in practice, if models have similar RMSEA values, the question is whether these differences (i.e., differences in how well the polychoric correlation matrix is described) are important for the research goal at hand.
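For reference, the fit indices discussed here can be computed from the model and baseline chi-square statistics with the standard textbook formulas below; this is a generic sketch, and the input values are hypothetical rather than estimates from Table 2.

```python
import math

def fit_indices(chi2_m, df_m, chi2_b, df_b, n):
    """Compute CFI, TLI, and RMSEA from the target model (m) and the
    independence/baseline model (b) chi-squares; n is the sample size."""
    # CFI: 1 minus the ratio of the non-centrality of the target model
    # to that of the baseline model
    cfi = 1.0 - max(chi2_m - df_m, 0.0) / max(chi2_b - df_b, chi2_m - df_m, 1e-12)
    # TLI (non-normed fit index): compares chi-square/df ratios
    tli = (chi2_b / df_b - chi2_m / df_m) / (chi2_b / df_b - 1.0)
    # RMSEA: per-degree-of-freedom non-centrality, scaled by sample size
    rmsea = math.sqrt(max(chi2_m - df_m, 0.0) / (df_m * (n - 1)))
    return cfi, tli, rmsea

# Hypothetical chi-square values for a model fitted to n = 150 subjects
cfi, tli, rmsea = fit_indices(chi2_m=95.0, df_m=60, chi2_b=2000.0, df_b=78, n=150)
```

With these hypothetical inputs, CFI and TLI land near 0.98 and RMSEA near 0.06, i.e., in the "acceptable fit" region discussed in the text.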

Modeling results: Hierarchical latent class models
See Table 3 for the model fit of the different hierarchical latent class models according to the AIC, BIC, CAIC, and AICc. As can be seen, the fit indices are somewhat mixed with respect to which model is indicated as the best fitting one. That is, the AIC prefers the model with τ irc = τ ic and λ ir = λ ir while the AICc prefers the model with τ irc = τ rc and λ ir = λ. The BIC and CAIC both agree that the model with τ irc = τ rc and λ ir = λ ir is the best fitting model. Therefore, we accept this model as the final hierarchical latent class model.
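The information criteria in Table 3 follow the standard definitions sketched below; the log-likelihoods and parameter counts here are hypothetical, not the fitted values from the table. Note that the AICc is undefined when the number of parameters approaches the sample size.

```python
import math

def information_criteria(loglik, k, n):
    """Standard information criteria from a log-likelihood,
    k free parameters, and n observations (lower is better)."""
    aic = -2.0 * loglik + 2.0 * k
    bic = -2.0 * loglik + k * math.log(n)
    caic = -2.0 * loglik + k * (math.log(n) + 1.0)  # consistent AIC
    # AICc adds a small-sample correction; undefined when n <= k + 1
    aicc = aic + 2.0 * k * (k + 1) / (n - k - 1) if n > k + 1 else float("nan")
    return {"AIC": aic, "BIC": bic, "CAIC": caic, "AICc": aicc}

# Hypothetical comparison of a larger vs. a more parsimonious latent class
# model on n = 150 subjects
ic_a = information_criteria(loglik=-2100.0, k=40, n=150)
ic_b = information_criteria(loglik=-2110.0, k=25, n=150)
```

Because BIC and CAIC penalize parameters more heavily than AIC, they can prefer a more parsimonious model even when its log-likelihood is lower, which is one way the "mixed" pattern in Table 3 can arise.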

Modeling results: Comparison
A direct comparison between the rater threshold model and the hierarchical rater latent class model is not possible as estimation of the former is not based on a likelihood function. As a result, for the rater threshold model, no likelihood based model fit statistics can be calculated. However, an indirect comparison is possible as the best fitting models can be compared in terms of the effects that are included. That is, the final rater threshold model was argued to be the model with item and rater specific severity and scale parameters φ ir and ψ ir and with a rater and item invariant parameter λ. For the hierarchical rater latent class model, the final model includes τ irc = τ rc and λ ir = λ ir . Both models thus agree on rater and item specific effects; however, in the latent class model the rater effects are modeled in the thresholds, while the item effects are modeled in the rater discrimination parameter λ ir . This is in accordance with how DeCarlo et al. (2011) proposed to compare the hierarchical rater latent class model to the hierarchical rater model by Patz et al. (2002). That is, DeCarlo et al. compare τ irc and λ ir from the hierarchical latent class model to φ r and ψ r from the hierarchical rater model, respectively.
Table 4: Parameter estimates (standard errors) for the rater severity, φ ir , across Models 1a to 1d and the hierarchical rater model (HRM).
A comparison in terms of parameter estimates between the rater threshold model and the hierarchical rater latent class model is challenging as the threshold parameters τ irc are restricted in a different way (but see DeCarlo et al., 2011 for a possible graphical approach). Here, we focus on a comparison between the rater threshold model and the hierarchical rater model by Patz et al. (2002) as both models include comparable rater severity and rater scale parameters φ ir and ψ ir . See below.

Modeling results: Hierarchical rater models
Parameter estimates for the rater parameters φ ir and ψ ir from the hierarchical rater model are presented in Tables 4 and 5. Table 6 contains the rater parameters in a model with item invariant rater parameters φ r and ψ r . Note that in the original hierarchical rater model by Patz et al. (2002), ψ 2 ir is estimated, but here we focus on ψ ir to facilitate comparison with the rater threshold models.

Modeling results: Comparison
When comparing the rater parameters φ ir and ψ ir between the hierarchical rater model and the rater threshold model, a transformation of the hierarchical rater model parameters is necessary. That is, in the rater threshold model, the rater parameters reflect the differences among the raters relative to rater 1, as for this rater, φ i1 = 0 and ψ i1 = 1 for identification reasons. Therefore, the comparison between the rater parameters in Tables 4, 5, and 6 should be based on the ordering of the raters. That is, in Table 5 it can be seen that for rater severity, φ r , the hierarchical rater model and the rater threshold model result in the same ordering of the raters (i.e., from least to most severe: 1, 2, 4, 3). For the rater scale parameter, however, the models differ in their implied ordering. That is, for the rater threshold model the ordering from smallest to widest scale is: 1, 2, 4, and 3, while the ordering for the hierarchical rater model is: 4, 2, 1, and 3.
To look further into the relation between the rater parameters from the two models, we transform the hierarchical rater model parameters to reflect rater severity and scale relative to the first rater, that is, φ ' HRM,ir = φ ir - φ i1 and ψ ' HRM,ir = ψ ir / ψ i1 . Next, for raters 2, 3, and 4, we correlate these transformed parameters from the hierarchical rater model with the parameters from the rater threshold model. In addition, we consider the observed marginal means (M ir = MEAN p (X pir )) and standard deviations (S ir = SD p (X pir )) of the raters' item ratings. Similar to above, we also transformed these statistics into measures relative to rater 1 (M ' ir and S ' ir ). Table 7 depicts the correlations between the relative hierarchical rater parameters, the relative rater statistics, and the rater parameters from the rater threshold model. As can be seen, the correlations between M ' ir and φ ' HRM are generally large for all three raters (around 0.970). The correlation between M ' ir and φ RTM is about as large for rater 2 (0.936), but somewhat smaller for rater 3 (0.853) and rater 4 (0.729). The correlation between φ ' HRM and φ RTM is large for raters 2 and 3 (around 0.900), but somewhat smaller for rater 4 (0.648). Next, the correlation between S ' ir and ψ ' HRM is small and negative (between -0.129 and -0.363) while the correlation between S ' ir and ψ RTM is larger and positive (between 0.358 and 0.609). The correlation between ψ ' HRM and ψ RTM is also negative (between -0.311 and -0.521). Thus overall, it seems that both φ ' HRM and φ RTM capture relative differences in the marginal means between raters well, with the hierarchical rater model performing somewhat better, while the rater threshold model captures relative differences in the marginal standard deviations better. As we focused on the transformed parameters of the hierarchical rater model relative to rater 1, we also provide the correlations between the untransformed hierarchical rater parameters and M ir and S ir , see Table 8.
As can be seen, S ir now correlates positively with the rater scale parameter ψ ir , although this correlation is small for raters 1 and 3.
In conclusion, φ ir from the hierarchical rater model and from the rater threshold model can both be used to quantify relative differences between raters in the marginal means. The scale parameter ψ ir from the hierarchical rater model is not suitable to quantify relative differences between raters in the marginal standard deviations, while ψ ir from the rater threshold model performs better in this respect. Thus, the rater threshold model should ideally be preferred if the aim is to quantify relative differences between raters. However, if the aim is to quantify differences between items within a rater, the hierarchical rater model is better suited, as the rater threshold model only quantifies relative differences between raters.
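As an illustration of the transformation relative to rater 1 used in this comparison, the sketch below applies the difference/ratio logic described in the text to hypothetical severity and scale parameters; the values are placeholders (not the TIMSS estimates) and the helper names are ours.

```python
import statistics

def relative_to_rater1(phi, psi):
    """Transform severity (phi) and scale (psi) parameters, given per rater
    as lists over items, to values relative to rater 1 (difference for
    severity, ratio for scale)."""
    phi1, psi1 = phi[0], psi[0]
    phi_rel = [[p - p1 for p, p1 in zip(pr, phi1)] for pr in phi[1:]]
    psi_rel = [[s / s1 for s, s1 in zip(sr, psi1)] for sr in psi[1:]]
    return phi_rel, psi_rel

def pearson(x, y):
    """Plain Pearson correlation, used to relate transformed parameters
    to the marginal rating statistics."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical severity/scale parameters for 2 raters on 4 items;
# rater 1 is fixed at phi = 0 and psi = 1 as in the rater threshold model
phi = [[0.0, 0.0, 0.0, 0.0], [0.3, -0.4, 0.5, 0.1]]
psi = [[1.0, 1.0, 1.0, 1.0], [1.2, 0.8, 1.1, 0.9]]
phi_rel, psi_rel = relative_to_rater1(phi, psi)
```

In the paper's comparison, `pearson` would then be applied per rater to the transformed parameters and the relative marginal means and standard deviations (M ' ir and S ' ir).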

Discussion
In this paper we presented a method to infer student proficiency from items rated by multiple raters. As we relied on WLS estimation, we argued that our approach is suitable for large scale educational settings where multiple latent variables are needed to account for violations of local independence. Our argument is mainly a pragmatic one. There may be good reasons to choose the hierarchical rater model or a Bayesian approach over the present approach. For instance, contrary to the Bayesian hierarchical rater model by Patz et al. (2002), our approach is sensitive to small sample sizes. That is, for decreasing sample sizes, parameter estimates in the present model will become biased, or estimation may even become infeasible. In such cases, a Bayesian approach like the hierarchical rater model is preferable. In addition, there may be substantive/conceptual reasons to choose the hierarchical rater model due to the categorical nature of the ideal scores in this model. For instance, as argued by Mariano & Junker (2007), if the common item effects are treated as continuous variables, they cannot be interpreted as "ideal scores" because their scale differs from that of the observed ratings.
Despite sample size considerations, there are two more trade-offs associated with our approach (see also Muthén, Muthén, & Asparouhov, 2015). First, our approach is more sensitive to the number of variables in the analysis than MCMC or MML. That is, as the estimation of our model is based on the asymptotic covariance matrix of the polychoric means and correlations, it becomes increasingly challenging to estimate the full matrix if there are both many raters and many items. However, as in practice the number of raters is commonly limited, the present approach will be suitable in many cases. Second, contrary to MML and MCMC approaches, our approach is a limited information approach which only uses the information from the first two polychoric moments in the data. However, the effect of neglecting higher-order moments is generally considered to be small (see Christoffersson, 1975).
Table 8: For each rater, the correlations between the rater mean (M r ) and the rater standard deviation (SD r ) of the observed item ratings (X pir ), and the rater parameters φ ir and ψ ir from the hierarchical rater model (HRM).
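To make the first trade-off concrete, the following sketch counts how the number of observed rating variables, the number of unique polychoric correlations, and hence the size of their asymptotic covariance matrix grow with the numbers of items and raters; the counts are generic combinatorics, not figures from the paper.

```python
def wls_dimensions(n_items, n_raters):
    """Number of observed variables, unique polychoric correlations, and
    unique elements of the asymptotic covariance (weight) matrix in a
    WLS fit with one rating variable per item-rater pair."""
    p = n_items * n_raters               # observed rating variables
    n_corr = p * (p - 1) // 2            # unique polychoric correlations
    n_acov = n_corr * (n_corr + 1) // 2  # unique weight matrix elements
    return p, n_corr, n_acov

# The TIMSS illustration from this paper: 10 items rated by 4 raters
p, n_corr, n_acov = wls_dimensions(10, 4)
```

Already with 10 items and 4 raters there are 40 variables, 780 polychoric correlations, and over 300,000 unique weight matrix elements, which illustrates why the approach is sensitive to having both many raters and many items.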
The main difference between the present approach and other approaches (e.g., Patz et al., 2002; Wilson & Hoskens, 2001) is that the rater effects are treated as effects on the thresholds at which an underlying continuum is categorized by the raters. We chose this approach both for pragmatic reasons (i.e., to stay within the generalized linear latent variable modeling framework) and substantive reasons (i.e., to facilitate the psychological interpretation of the parameters using Thurstone's law of categorical judgment; Thurstone, 1928). This choice implies that the rater effect is treated as a fixed effect. This differs from, for instance, Snijders and Bosker (1999, Chapter 11) and Wang et al. (2014). The question whether the rater effect is a random or a fixed effect is probably an empirical one. If the raters are truly selected randomly from a larger pool of raters, it might be best to account for this source of variation. However, we think that in practice, raters are selected according to specific criteria (e.g., availability, rating skill, acquaintance with the topic to be rated, etc.), such that the assumption of fixed rater effects is defensible.
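The threshold mechanism described above can be illustrated with a minimal simulation: a rater categorizes a latent impression of a response using common thresholds that are shifted by a severity parameter φ and scaled by a parameter ψ. The location-scale form below is a simplified illustration of this idea, not the paper's exact parameterization.

```python
def categorize(eta, base_thresholds, phi=0.0, psi=1.0):
    """Return the ordinal rating implied by latent impression eta, where
    the rater applies the common thresholds shifted by phi and scaled by
    psi (a location-scale form used here purely for illustration)."""
    thresholds = [phi + psi * t for t in base_thresholds]
    # the rating is the number of thresholds the impression exceeds
    return sum(eta > t for t in thresholds)

base = [-1.0, 0.0, 1.0]   # common category boundaries (4 categories: 0..3)
eta = 0.2                 # latent impression of one response

# A lenient rater (lower thresholds) assigns a higher category than a
# severe rater (higher thresholds) to the same response
lenient = categorize(eta, base, phi=-0.5)
severe = categorize(eta, base, phi=0.5)
```

In this toy setup the same impression receives a higher rating from the lenient rater than from the severe one, which is exactly the kind of between-rater difference that φ ir and ψ ir capture in the model.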