UvA-DARE (Digital Academic Repository)
CrowdGP: A Gaussian process model for inferring relevance from crowd annotations

Test collections have been a crucial factor in developing information retrieval systems. Constructing a test collection requires annotators to assess the relevance of massive numbers of query-document pairs. Relevance annotations acquired through crowdsourcing platforms alleviate the enormous cost of this process, but they are often noisy. Existing models to denoise crowd annotations mostly assume that annotations are generated independently, based on which a probabilistic graphical model is designed to model the annotation generation process. However, tasks are often correlated with each other in reality. It is an understudied problem whether and how task correlation helps in denoising crowd annotations. In this paper, we relax the independence assumption to model task correlation in terms of relevance. We propose a new crowd annotation generation model named CrowdGP, where true relevance labels, annotator competence, annotator's bias towards relevancy, task difficulty, and task's bias towards relevancy are modelled through a Gaussian process and multiple Gaussian variables, respectively. The CrowdGP model shows better performance in terms of inferring true relevance labels compared with state-of-the-art baselines on two crowdsourcing datasets on relevance. The experiments also demonstrate its effectiveness in terms of selecting new tasks for future crowd annotation, which is a new functionality of CrowdGP. Ablation studies indicate that the effectiveness is attributed to the modelling of task correlation based on the auxiliary information of tasks and the prior relevance information of documents to queries.


INTRODUCTION
Test collections have significantly benefited the development of information retrieval (IR) systems [42]. They are used both as training data to develop new retrieval models and as test data to evaluate model performance. Building such test collections requires assessing the relevance of massive numbers of documents to a set of queries. Traditionally the assessment of relevance is performed by trained professionals in a controlled lab environment [42]. As the need for creating new test collections to support the development of algorithms increases, so does the number of documents that need to be labeled. This makes the construction of test collections expensive and time-consuming. Crowdsourcing has been widely adopted as a cost-effective solution by the IR community [1, 12, 15-17]. Typically, the task of obtaining the relevance label of a document to a query is assigned to crowd annotators (or workers) in the form of human intelligence tasks.
Despite its cost-effectiveness, crowdsourcing introduces a major challenge: improving the quality of the crowd labels through careful label aggregation approaches [19]. Majority voting (MV) has been the most prominent aggregation method, with early experiments showing that labels derived by a majority vote of multiple untrained crowd annotators can reach a quality comparable to that of a trained NIST assessor [2]. More recent work has shown that the quality of crowd annotations is affected by a number of factors such as the difficulty of tasks and the competence of annotators [13, 17], which are not considered by majority voting.
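As a concrete baseline, majority voting can be sketched in a few lines. This is a minimal illustration; the tie-breaking rule towards the non-relevant label is an arbitrary choice for the sketch, not necessarily the one used in [2].

```python
from collections import Counter

def majority_vote(annotations):
    """Aggregate binary crowd labels per task by majority vote.

    annotations: dict mapping task_id -> list of labels (0/1).
    Ties are broken towards the non-relevant label 0 here.
    """
    aggregated = {}
    for task, labels in annotations.items():
        counts = Counter(labels)
        # pick the relevant label only on a strict majority
        aggregated[task] = 1 if counts[1] > counts[0] else 0
    return aggregated
```

For example, `majority_vote({"t1": [1, 1, 0], "t2": [0, 0, 1]})` returns `{"t1": 1, "t2": 0}`. Note that every annotator's vote counts equally, regardless of task difficulty or annotator competence.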
Mainstream approaches assume different annotation generation processes and use different probabilistic graphical models (PGMs) to model the two factors. The hypothesis is that if a task is inherently easy, the labels from different crowd annotators will be consistent; otherwise, the labels will differ. Similarly, if an annotator is familiar with the topic of the search query and is well motivated, he or she is more likely to give correct answers. Note that they also assume that the latent true labels of tasks are independent, and that the annotations are generated independently by different annotators as well.
However, the independence assumption in most PGMs oversimplifies the actual crowd annotation process, and hence these models do not show consistently better performance than majority voting on diverse benchmarks [22]. In fact, the label of a task may be roughly inferred from the labels of tasks similar to it. How to model this property to improve label quality from noisy crowd annotations has not been well investigated. In this paper, we extend the existing work in two ways: (1) relaxing the independence assumption and instead modelling task correlation in terms of relevance, and (2) assuming a different annotation generation process that models the latent true labels using a Gaussian process (GP), and models annotator competence, annotator's bias towards relevancy, task difficulty, and task's bias towards relevancy through multiple Gaussian variables. The model is named CrowdGP as it is a GP-based model to denoise crowd annotations.
In particular, we make the hypothesis that a document is correlated with other documents close to it in some document feature space in terms of their relevance labels to an information need. This hypothesis has been examined extensively in the IR community, e.g., in the cluster hypothesis [28]. The consideration behind using a GP prior to model task correlation is that it allows us to integrate auxiliary information about tasks (such as query-document text and document ranks produced by different system runs) and prior knowledge of relevance (such as pretrained ranking models); moreover, the model learned on crowd annotations can be further used to predict labels for new tasks that have no crowd annotations, which is not supported in most existing PGM approaches. Second, we assume that both the noise from tasks and the noise from annotators affect the observed crowd labels. This is evidenced by the work of Maddalena et al. [27], who have empirically shown that topic difficulty, annotator competence, and relevance label (relevant or non-relevant) all affect the quality of crowd annotations. We use multiple Gaussian variables to model task noise and annotator noise, respectively. The advantage is that we can use the variance parameter of a Gaussian distribution to model how the labels differ, and use the mean parameter to model any bias towards or against relevancy, both per task and per annotator. Moreover, this choice makes it easy to calculate the likelihood function in Equation (15). Finally, the new annotation generation process is assumed as follows: an observed (discrete) crowd label is generated from a Bernoulli distribution, whose positive probability is determined by three variables: the latent GP, the task noise, and the annotator noise. The model is optimized using a variational expectation maximization (VEM) algorithm [33].
The main contributions of this paper are the following:
• We propose a new probabilistic graphical model, CrowdGP, which captures latent true labels, annotator competence, annotator's bias towards relevancy, task difficulty, and task's bias towards relevancy for observed tasks. The model is able to infer true labels from crowd annotations and to predict labels for new tasks that have no crowd annotations.
• We propose to use a VEM algorithm to effectively and efficiently learn model parameters, and empirically demonstrate its effectiveness compared against stochastic gradient descent (SGD).
• We empirically demonstrate the effectiveness of CrowdGP in terms of inferring latent true labels and selecting new tasks for crowd annotation. We provide a detailed analysis of the effect of its components, including task features and the mean function, on performance, as well as the learning behaviour of CrowdGP under different hyperparameters.

RELATED WORK

Crowdsourcing in Information Retrieval
Relevance assessment is critical to generating high-quality IR evaluation collections. The wide application of crowdsourced relevance assessment introduces a new challenge of quality control. The Text REtrieval Conference (TREC) crowdsourcing tracks from 2011 to 2013 [39-41] investigated the use of crowdsourcing techniques to evaluate retrieval systems. Alonso and Mizzaro [1] have shown that crowdsourced labels and the expert labels produced by TREC assessors correlate well in IR evaluation measures. The quality of crowd annotations is significantly affected by human factors. For example, Kazai et al. [17] showed that an annotator's motivation, interest and familiarity with the task, perceived task difficulty, and satisfaction with the offered payment all influence the quality of crowd annotations; Han et al. [13] concluded that task instruction, task subjectivity, task type, and the monetary reward of tasks all influence the quality of crowd annotations. Various quality control methods to ensure crowd annotation quality during the running of crowdsourcing tasks have been investigated. For example, limiting the time available for relevance labelling is useful for constructing a high-quality test collection [26]; asking annotators for a rationale, i.e., a justification beyond just a relevance label, improves the quality of crowd annotations [30]. These studies lay the foundation of the hypotheses in Section 4.2.

Probabilistic Models for Aggregating Crowd Annotations
The goal of label aggregation is to infer the true label of each task given redundant and possibly inconsistent crowd annotations. The mainstream solution is to design a PGM to model the annotation generation process. Usually, the latent true labels of tasks are assumed to be independent from each other; and given the latent true label of a task, each observed label is assumed to be generated independently from other observed labels. Let y_i^j denote annotator j's label for task i, and y_i denote the latent true label for task i. Let Y denote the annotations for all the tasks and the corresponding annotators, and y denote the latent true labels for all tasks. The joint distribution of Y and y is defined as p(Y, y | θ) = ∏_i p(y_i | θ) ∏_j p(y_i^j | y_i, θ) for most PGMs; the major difference among them is the way they model p(y_i | θ) and p(y_i^j | y_i, θ). The parameter θ models factors like the difficulty of each task and the competence of each annotator; θ is learned by maximizing the likelihood.
Detailed reviews of most existing work can be found in [21, 22, 46]. We briefly explain several representative models related to the work in this paper. Dawid and Skene (DS) [8] model p(y_i^j = l | y_i = k) by a parameter v_jkl for each annotator j, which can be understood as an annotator competence (confusion) matrix, and model p(y_i) by a categorical distribution τ_k = p(y_i = k); the expectation maximization (EM) algorithm is then employed to optimize the model parameters v_jkl and τ_k. Learning from crowds (LFC) [36] extends DS by adding a Dirichlet prior on v_jk and τ; again, the EM algorithm is employed to optimize model parameters. Independent Bayesian classifier combination (iBCC) [23] is a Bayesian version of LFC; the Gibbs sampling method is used for parameter learning. The generative model of labels, abilities, and difficulties (GLAD) [43] models p(y_i^j | y_i) by a logistic function 1/(1 + e^{-α_i β_j}), where α_i is the difficulty of task i and β_j is the competence of annotator j, and models p(y_i) by the same categorical distribution τ_k = p(y_i = k); the model parameters are thereby largely compressed to M + N instead of M × K × K as in DS; the parameters are inferred with the EM algorithm. Multi-annotator competence estimation (MACE) [14] models p(y_i^j = l | y_i = k) using a confusion matrix similarly to DS, where v_jkl = (1 - θ_j) ϵ_jl if k ≠ l and v_jkl = θ_j + (1 - θ_j) ϵ_jl if k = l; again, this confusion matrix reduces the number of parameters; the EM algorithm is used for parameter learning. Although these approaches model the annotation generation process differently, it is difficult to determine which performs the best in practice [22]. Our model relaxes the independence assumption of these models and models the correlation of tasks in terms of relevance; further, it allows us to incorporate auxiliary information about tasks and prior knowledge on relevance.
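As a small illustration of the GLAD parameterization described above, the probability that an annotator labels a task correctly can be computed directly. The sketch follows the notation used in this text (α_i for the task parameter, β_j for annotator competence); the original GLAD paper's notation differs slightly.

```python
import math

def glad_correct_prob(alpha_i, beta_j):
    """Probability that annotator j labels task i correctly under
    the logistic parameterization 1 / (1 + exp(-alpha_i * beta_j)).
    A product near zero (hard task or incompetent annotator) gives
    a probability near 0.5, i.e. a random guess."""
    return 1.0 / (1.0 + math.exp(-alpha_i * beta_j))
```

For instance, an easy task and a competent annotator (`alpha_i = 2, beta_j = 2`) yield a probability above 0.98, while `alpha_i = 0` always yields exactly 0.5.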
There is also work jointly tackling label aggregation with other tasks such as downstream classification and the evaluation of retrieval systems. For example, Zhan et al. [45] proposed a joint classification and aggregation framework where the classification module provides feedback to and boosts the aggregation module. Ferrante et al. [10] use crowd relevance annotations as sources of uncertainty and design IR evaluation metrics based on the crowd annotations.

Gaussian Process for Aggregating Crowd Annotations
Several GP-based models have been proposed for the label aggregation task [11, 31, 37, 38]. Groot et al. [11] aggregated crowd annotations of real-value type by averaging multiple crowd labels into one single label and applying a vanilla GP regression model. Rodrigues et al. [37] proposed a GP-based model to account for multiple annotators with different levels of expertise. Ruiz et al. [38] proposed a model for binary label aggregation by adding a GP prior on top of the confusion matrix in the DS model; a novel variational inference algorithm is proposed for the model. Further, the model is extended to deal with large-scale datasets, e.g., with approximately 1 million tasks, in the work of [31]. Our work is different from these models in the sense that it assumes a different annotation generation process, and accordingly, a different inference method is used to learn model parameters.

PRELIMINARIES: GAUSSIAN PROCESS CLASSIFICATION
The Gaussian process classification model is introduced here to provide the necessary background for understanding our model. Given a set of training samples C ≜ {(x_i, y_i)}_{i=1}^N ≜ (X, y), where N is the number of samples in C, x_i is an input point, and y_i is the corresponding label, the goal is to predict the label for a new point x_*.
A GP is a stochastic process with the important characteristic that any finite number of its random variables follow a joint Gaussian distribution [3]. A GP classification model assumes the observed data are generated through the following process: a GP first maps the input point x ∈ R^D to a latent variable f ∈ R, then a link function maps f to a real value y ∈ [0, 1]. The link function can be a logistic function or a probit function. We use the probit function Φ(f) = ∫_{-∞}^{f} N(z | 0, 1) dz in this paper. In the binary classification setting, we denote the positive class as 1 and the negative class as 0. Hence, according to the property of Φ(f), the positive class probability is p(y = 1 | x) = Φ(f) and the negative class probability is p(y = 0 | x) = 1 - Φ(f). Formally, the data generation process is rephrased as follows. First, the latent variable f follows a GP: f ~ GP(m(x), k(x, x′)). Second, each observed variable y_i follows a Bernoulli distribution conditioned on its corresponding latent variable f_i: y_i ~ Bernoulli(Φ(f_i)). Now let us calculate the predictive probability of a new point x_* being classified as positive, p(y_* = 1 | x_*, X, y). We need to first compute the posterior distribution of the latent variable f_* when the training data C is observed, p(f_* | X, y, x_*) = ∫ p(f_* | X, x_*, f) p(f | X, y) df, and then compute the positive class probability p(y_* = 1 | x_*, X, y) = ∫ Φ(f_*) p(f_* | X, y, x_*) df_*. The first integral is easy to calculate once p(f_* | X, x_*, f) is known; it is built from K, the covariance matrix of X, K_*, the covariance vector between X and x_*, and finally K_**, the covariance between x_* and x_*.
Based on the conditional Gaussian distribution rule [35], the distribution of f_* conditioned on f is p(f_* | X, x_*, f) = N(m(x_*) + K_*^T K^{-1} (f - m(X)), K_** - K_*^T K^{-1} K_*). The prior p(f | X) is presented in Equation (5), and the likelihood p(y | f) is the product of the Bernoulli likelihoods. However, the computation of the evidence p(y | X) is not trivial, because it is not a Gaussian distribution due to the multiplication of the Bernoulli likelihood with the GP prior. The solution can be either analytic approximations of integrals (e.g., expectation propagation (EP) and the Laplace approximation) or numerical approximation methods (e.g., Monte Carlo sampling), where a Gaussian distribution is used to approximate the posterior distribution p(f | X, y) and the evidence p(y | X) [35].
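The conditional Gaussian rule above can be checked numerically. The sketch below assumes a zero-mean GP with an RBF covariance (the kernel choice is illustrative); at a training input, the conditional collapses onto the observed latent value with near-zero variance, as expected.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """RBF covariance between row-vector sets a (N, D) and b (M, D)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def conditional_gaussian(X, f, x_star, jitter=1e-8):
    """Mean and variance of p(f* | f) under a zero-mean GP prior,
    following the conditional Gaussian rule:
      mean = K_*^T K^{-1} f,  var = K_** - K_*^T K^{-1} K_*."""
    K = rbf(X, X) + jitter * np.eye(len(X))  # jitter for stability
    K_star = rbf(X, x_star)                  # (N, 1)
    K_ss = rbf(x_star, x_star)               # (1, 1)
    K_inv = np.linalg.inv(K)
    mean = K_star.T @ K_inv @ f
    var = K_ss - K_star.T @ K_inv @ K_star
    return mean.item(), var.item()
```

Querying `conditional_gaussian` at one of the training inputs returns (approximately) the corresponding entry of `f` with vanishing variance, while points far from the data revert to the prior mean 0 and prior variance.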
In general, the mean function m(·) and the covariance function k(·, ·) of the GP classification model both contain parameters to be learned. To learn these parameters, one can maximize the likelihood of the observed data, argmax_θ p(y | X, θ); or adopt a Bayesian view by considering both the likelihood and the prior of the parameters, which becomes argmax_θ p(y | X, θ) p(θ). Note that here we rewrite p(y | X) as p(y | X, θ) when we explicitly take θ into consideration. The prior term p(θ) can prevent model overfitting, and the likelihood p(y | X, θ) "incorporates a trade-off between model fit and model complexity" [35].

GAUSSIAN PROCESS CLASSIFICATION ON CROWD ANNOTATIONS

Problem Formulation
Different from the vanilla GP classification model, where each example in the observed data has one single label, in the crowd annotation case each task has labels annotated by multiple annotators. Assume that there are N unique tasks and M unique annotators in the crowd annotation dataset. Denote the annotations for the i-th task by C_i, and denote the complete annotations by C ≜ {C_1, C_2, ..., C_N}. Our goal is to infer the true label y_i for each task x_i. Moreover, we also want to select new tasks for future crowd annotation.

The Model
In this section we propose three hypotheses about the crowd annotation generation process, based on which we build the GP classification model for crowd annotations, named CrowdGP.

Hypothesis 1 (Correlation between tasks).
A task is correlated with other tasks close to it in some space in terms of their labels.
In the domain of IR this hypothesis is reflected in the widely accepted cluster hypothesis, which states that closely associated documents tend to be relevant to the same information need [28]. We use a GP to capture the correlation between tasks, presented formally as f ~ GP(m(x), k(x, x′)), where f ≜ [f_1, f_2, ..., f_N] are continuous values that can be converted to discrete labels through a link function such as the probit, the covariance function k(x, x′) captures the correlation across tasks, and the mean function m(x) captures prior knowledge on labels.
Hypothesis 2 (Noise from tasks and annotators). Both the noise from tasks and the noise from annotators affect the observed crowd labels.
Intuitively, if a task is easy, different crowd annotators tend to reach a consensus, which leads to the same crowd labels; otherwise, the crowd labels will differ considerably. To this end, we use a Gaussian variable ϵ_i ~ N(µ_i, σ_i²) to model the noise of each task T_i, where µ_i models the inherent bias of the task towards relevance and σ_i² models the difficulty level of the task. Similarly, the competence of an annotator determines the quality of his or her annotations. Besides, an annotator has his or her own criterion or bias in relevance assessment. Analogously to ϵ_i, we use another Gaussian variable ϵ_j ~ N(µ_j, σ_j²), where µ_j models the bias of the annotator towards relevance, and σ_j² models his or her competence.
Hypothesis 3 (Annotation generation process). An observed crowd label (discrete) is generated from a Bernoulli distribution, whose positive probability is determined by three variables: the latent GP, the task noise, and the annotator noise.
Based on the three hypotheses, we assume the observed crowd annotations are generated through the following process. Each task T_i corresponds to a latent variable f_i ∈ R, and the latent variables for all the tasks conform to a GP. When a task T_i is distributed to an annotator A_j, a Gaussian noise ϵ_i is added to f_i to generate a_i^j, and a Gaussian noise ϵ_j is added to a_i^j to generate b_i^j. We assume that ϵ_i is independent of f_i and any other noise ϵ_l (l ≠ i); and ϵ_j is independent of any f_i, any ϵ_i, and any other ϵ_k (k ≠ j). A probit function maps b_i^j to a value in the interval [0, 1], which represents the probability of y_i^j being positive. The process is illustrated in Figure 1.
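The generation process above can be simulated directly. The sketch below draws one crowd label given a latent value f_i and the task and annotator noise parameters; it is an illustration of the three-step process (task noise, annotator noise, probit-Bernoulli), not the paper's code.

```python
import math
import random

def probit(x):
    """Standard normal CDF computed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def generate_crowd_label(f_i, mu_i, sigma_i, mu_j, sigma_j, rng):
    """Simulate one crowd label per Hypothesis 3."""
    a_ij = f_i + rng.gauss(mu_i, sigma_i)    # add task noise eps_i
    b_ij = a_ij + rng.gauss(mu_j, sigma_j)   # add annotator noise eps_j
    p_pos = probit(b_ij)                     # probit link -> [0, 1]
    return int(rng.random() < p_pos)         # Bernoulli draw
```

With a strongly positive latent value and small noise variances, most simulated labels come out relevant; a large annotator variance σ_j (low competence) pushes the labels back towards coin flips.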

Label Inference for Crowd Tasks
Given a new point or an existing point x_*, our goal is to predict the latent true label y_*. Similar to the vanilla GP classification model (Equation (4)), we calculate the positive class probability p(y_* = 1 | x_*, X, Y) = ∫ Φ(f_*) p(f_* | X, Y, x_*) df_*. The term p(f | X, Y) inside the predictive integral can be further rewritten using Bayes' rule as p(f | X, Y) ∝ p(Y | f) p(f | X). The prior p(f | X) is the GP prior in Equation (5). The likelihood of the observed crowd annotations, p(Y | f), consists of multiple Bernoulli likelihoods. Based on the independence assumption in the annotation generation process, p(Y | f) can be written as p(Y | f) = ∏_i ∏_j p(y_i^j | f_i). We give the detailed derivation below, which is one of the contributions of this work.
Equation (16b) applies the total probability rule; Equation (16c) holds because a_i^j is only dependent on f_i, b_i^j is only dependent on a_i^j, and y_i^j is only dependent on b_i^j; Equation (16d) holds because the sum of a constant f_i and a Gaussian variable ϵ_i is a Gaussian variable, and similarly the sum of a constant a_i^j and a Gaussian variable ϵ_j is a Gaussian variable; Equation (16e) integrates a_i^j out by applying the Gaussian marginal and conditional rule [32, page 93]. Now we continue the discussion of Equation (14). Note that p(f | X, Y) is not a Gaussian distribution due to the multiplication of the Bernoulli likelihood with the GP prior, and thus the computation of the integral is not trivial. The main idea is to use a multivariate Gaussian distribution q(f) to approximate p(f | X, Y). The problem is solved together with the optimization of the model parameters (see Section 4.5).
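The marginalization steps above can be verified numerically. Under our reading of the derivation, integrating the probit link against the combined Gaussian noise gives a closed form via the standard identity E[Φ(b)] = Φ(m / √(1 + s²)) for b ~ N(m, s²); the sketch below compares that closed form against a Monte Carlo estimate of the generation process. The exact statement of the paper's Equation (15) may differ; this is only a numerical sanity check of the marginalization.

```python
import math
import random

def probit(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def closed_form(f_i, mu_i, s2_i, mu_j, s2_j):
    """Positive-label probability after integrating out both
    Gaussian noise terms, using E[Phi(b)] = Phi(m / sqrt(1 + s^2))
    with m = f_i + mu_i + mu_j and s^2 = s2_i + s2_j."""
    m = f_i + mu_i + mu_j
    s2 = s2_i + s2_j
    return probit(m / math.sqrt(1.0 + s2))

def monte_carlo(f_i, mu_i, s2_i, mu_j, s2_j, n=100000, seed=0):
    """Monte Carlo estimate of the same quantity, sampling the two
    noise terms explicitly as in the generation process."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        b = (f_i + rng.gauss(mu_i, math.sqrt(s2_i))
                 + rng.gauss(mu_j, math.sqrt(s2_j)))
        total += probit(b)
    return total / n
```

The two estimates agree to within Monte Carlo error, confirming that the noise terms can be analytically integrated out of the Bernoulli likelihood.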

Selection of New Tasks
Besides inferring latent true labels for tasks that have crowd annotations, we are also interested in selecting new tasks for future annotation. Finding all relevant documents for relevance assessment is one of the goals of building a high-quality test collection [20]. Note that the CrowdGP model outputs a predictive Gaussian distribution for a new point, as shown in Equation (6), making it possible to apply existing acquisition functions to the search for new tasks.
An acquisition function is a scoring function over a search space which finds an optimal trade-off between exploration (the predicted variance is high) and exploitation (the predicted mean is high). In this work, we use expected improvement (EI) [9] as the acquisition function α(·), where x is a new point, f(x) is a random variable conforming to the predictive Gaussian distribution, and β is a trade-off parameter. A negative β means exploration and a positive β means exploitation. For example, if β is negative, given two points that have the same predicted mean value but different predicted variance values, EI will prioritize the point with the larger variance.
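A common closed form of EI under a Gaussian predictive distribution is sketched below. This is one standard parameterization of EI with an offset; the exact form and the sign convention of β in CrowdGP may differ from this sketch.

```python
import math

def std_norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, f_best, beta=0.0):
    """EI of a candidate with predictive mean mu and standard
    deviation sigma over the incumbent value f_best, with a
    trade-off offset beta:
      EI = (mu - f_best - beta) * Phi(z) + sigma * phi(z),
      z  = (mu - f_best - beta) / sigma."""
    if sigma <= 0.0:
        return max(mu - f_best - beta, 0.0)
    z = (mu - f_best - beta) / sigma
    return (mu - f_best - beta) * std_norm_cdf(z) + sigma * std_norm_pdf(z)
```

With two candidates of equal predictive mean, the one with the larger predictive variance receives the higher EI score, which is the exploration behaviour described above.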

Model Optimization
The proposed model contains parameters from the mean function and the covariance function, which are common to all GP models, as well as the parameters of the task and annotator noise variables. We denote all the parameters by θ. Our goal is to optimize the model with regard to θ. Similar to the vanilla GP classification model, we adopt a Bayesian view and maximize the log likelihood of the observed data plus the log of the parameter prior: log p(Y, θ | X) = log p(Y | X, θ) + log p(θ). As the first part, log p(Y | X, θ), is intractable, as explained in the discussion of Equation (14), we instead maximize its evidence lower bound (ELBO), which is tractable: ELBO(ψ, θ) = E_{q(f)}[log p(Y | f, θ)] - KL(q(f) ∥ p(f | X, θ)), where q(f) ≜ q(f | ψ) is the parameterized variational distribution to be learned, assumed to be a multivariate Gaussian approximating p(f | X, Y); p(f | X, θ) is the GP prior in Equation (5) and p(Y | f, θ) is the likelihood of the observed data in Equation (15), both parameterized by θ. Therefore we also write ELBO ≜ ELBO(ψ, θ). Finally, we adopt the VEM algorithm [33] to maximize the objective function, which reduces to ELBO(ψ, θ) + log p(θ). Both the E and M steps maximize the same function; the difference is that the E step maximizes it with respect to the variational parameters ψ of q(f), while the M step maximizes it with respect to the model parameters θ. The optimization method is summarized in Algorithm 1.
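The alternation between the two steps can be sketched generically. All names below are illustrative placeholders; the paper's implementation performs both steps with Adam inside GPflow, whereas `optimize` here stands in for any gradient-based maximizer of a scalar objective.

```python
def variational_em(objective, var_params, model_params, optimize, n_rounds=50):
    """Skeletal VEM loop: alternately maximize the same objective
    ELBO(psi, theta) + log p(theta), first w.r.t. the variational
    parameters psi (E step), then w.r.t. the model parameters
    theta (M step).

    objective(psi, theta) -> scalar to maximize
    optimize(f, p) -> argmax of the one-argument function f, started at p
    """
    for _ in range(n_rounds):
        # E step: update q(f)'s parameters psi with theta fixed
        var_params = optimize(lambda psi: objective(psi, model_params),
                              var_params)
        # M step: update model parameters theta with psi fixed
        model_params = optimize(lambda th: objective(var_params, th),
                                model_params)
    return var_params, model_params
```

Because both steps ascend the same objective, the objective value is monotonically non-decreasing across rounds (up to optimizer noise), which is the usual convergence argument for coordinate-ascent VEM.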

Algorithm 1: Optimization of the CrowdGP model.
Input: A set of task features and crowd annotations (X, Y). Output: ψ and θ.

Task Representation
Different from most label aggregation models, where tasks are represented as a set of indicators {1, 2, ..., N}, the covariance function k(x, x′) in CrowdGP requires the tasks to be represented as a set of vectors {x_1, x_2, ..., x_N}. The vectors can incorporate any auxiliary information. In this work, we use lexical features, semantic features, and ranking features, as they are easy to acquire for most collections and have been demonstrated effective in IR-related tasks.
Lexical features We consider several lexical features which have been shown effective in learning-to-rank algorithms [24]: (1) a term frequency score measuring the frequency of a query in a document (denoted by Σ_{t∈q∩d} TF(t, d)), (2) an inverse document frequency score measuring how much information a query provides (Σ_{t∈q} IDF(t)), (3) a TF-IDF score measuring both (denoted by Σ_{t∈q∩d} TF(t, d) · IDF(t)), (4) a cosine value between the TF-IDF vectors of a query and a document, measuring their lexical similarity, (5) a BM25 score between a query and a document, measuring their lexical similarity (Σ_{t∈q} IDF(t) · TF(t, d)(k1 + 1) / (TF(t, d) + k1(1 - b + b|d|/avgdl))), and (6) the probability of observing a query given a document, which is based on the language modelling method and measures their lexical similarity [44] (denoted by Σ_{t∈q∩d} log(p(t|d) / (α_d p(t|C))) + |q| log α_d + Σ_{t∈q} log p(t|C)). In the aforementioned formulas, q is the query, d is the document, |·| is the number of terms, avgdl is the average document length in the collection, C is the document collection, and k1, b, and α_d are hyperparameters with default values of 1.2, 0.75, and 0.5.
Semantic features Semantic features are an important supplement to lexical features for capturing task correlation. To acquire a relevance-guided representation of query-document text, we pre-train a text representation model (BERT-FirstP) following [7], because it uses the same ClueWeb09 collection as our crowdsourcing dataset and it is a relevance classification task. We input query [SEP] document and take the vector of the [CLS] token as the semantic features.
Ranking features Besides lexical and semantic features, it is easy to acquire document rank information from multiple retrieval systems for a relevance crowdsourcing task. Combining multiple ranked lists produced by various retrieval systems (known as meta search) does help relevance classification [4]. We use the available ranked lists that cover the queries and documents of our crowdsourcing datasets, 35 in total, produced by the participating teams in the TREC 2009 million query track.
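A few of the lexical features above can be sketched as follows. This is a minimal illustration using standard TF, IDF, and BM25 definitions; the paper's exact formulas, smoothing, and preprocessing may differ.

```python
import math
from collections import Counter

def lexical_features(query_terms, doc_terms, df, n_docs, avgdl,
                     k1=1.2, b=0.75):
    """Compute summed TF, summed IDF, summed TF-IDF, and BM25 for a
    query-document pair.

    df: dict mapping term -> document frequency in the collection.
    n_docs: collection size; avgdl: average document length.
    """
    tf = Counter(doc_terms)
    dl = len(doc_terms)

    def idf(t):
        # standard BM25-style IDF with +0.5 smoothing
        return math.log((n_docs - df.get(t, 0) + 0.5)
                        / (df.get(t, 0) + 0.5) + 1.0)

    q = set(query_terms)
    sum_tf = sum(tf[t] for t in q if t in tf)
    sum_idf = sum(idf(t) for t in q)
    sum_tfidf = sum(tf[t] * idf(t) for t in q if t in tf)
    bm25 = sum(idf(t) * tf[t] * (k1 + 1)
               / (tf[t] + k1 * (1 - b + b * dl / avgdl))
               for t in q if t in tf)
    return {"tf": sum_tf, "idf": sum_idf, "tfidf": sum_tfidf, "bm25": bm25}
```

Each query-document pair then contributes one feature vector (concatenated with the semantic and ranking features) as the task representation x_i.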


Mean and Covariance Function
The mean function of CrowdGP gives a set of latent function values without seeing any crowdsourcing data. We can use any pretrained relevance classification model as the mean function, in order to incorporate prior relevance knowledge. For idea validation, we simply pretrain a logistic regression model using the ground truth relevance labels in the corresponding training set of the crowdsourcing dataset, and use the predicted logits as the mean function values.
The covariance function of CrowdGP measures the correlation between two tasks. We employ the linear covariance k(x, x′) = σ² x · x′ for semantic features and the RBF covariance k(x, x′) = σ² exp(-‖x - x′‖² / (2l²)) for lexical and ranking features. The length scale parameter l indicates at what scale changes in the input will not cause "large" changes in the output; the signal variance σ indicates the amplitude of the latent function.
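The two covariance functions can be written compactly as a NumPy sketch:

```python
import numpy as np

def linear_cov(x, x2, variance=1.0):
    """Linear covariance k(x, x') = sigma^2 * (x . x'),
    used for semantic features."""
    return variance * (x @ x2.T)

def rbf_cov(x, x2, lengthscale=1.0, variance=1.0):
    """RBF covariance k(x, x') = sigma^2 exp(-||x - x'||^2 / (2 l^2)),
    used for lexical and ranking features.
    x: (N, D), x2: (M, D) -> (N, M) covariance matrix."""
    d2 = ((x[:, None, :] - x2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)
```

The RBF kernel decays with the feature-space distance between two tasks, so nearby query-document pairs get strongly correlated latent values, which is exactly the cluster-hypothesis behaviour the model relies on.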

Implementation
We employ GPflow [29] to implement our model. The model contains a number of parameters, including the length scale l and variance σ of the covariance function, the mean and variance of the Gaussian noise for the annotators {(µ_j, σ_j²) | j = 1, ..., M}, and the mean and variance of the Gaussian noise for the tasks {(µ_i, σ_i²) | i = 1, ..., N}. Their initial values are set to l = 1, σ = 1, µ_i = µ_j = 0, and σ_i = σ_j = 1, respectively. We set N(0, 1) as the prior for all the mean parameters, and Gamma(1, 1) as the prior for the length scale parameter and all the variance parameters. We use Adam [18] as the gradient descent optimizer in both the E step and the M step in Algorithm 1. The number of training steps is set to 5000 to make sure CrowdGP converges. The code is publicly available. 1

EXPERIMENTS

Research Questions
In the remainder of the work we aim to answer the following research questions:
RQ1 How does the model perform in terms of inferring latent true labels from crowd annotations compared with baselines?
RQ2 How does the model perform in terms of selecting new tasks for future crowd annotation?
RQ3 How do the auxiliary information of tasks (via task features) and prior relevance knowledge (via the mean function) affect model performance?
RQ4 How do different optimization methods and different initializations of model hyperparameters affect the learning curve?
5.2 Experimental Setup 5.2.1 Dataset. We evaluate CrowdGP on two crowdsourcing datasets: the crowdsourcing dataset of the TREC 2010 relevance feedback track [5] (CS2010²) and the crowdsourcing development dataset of the TREC 2011 crowdsourcing track aggregation task (CS2011³). Each example is a tuple of query ID, document ID, annotator ID, ground truth label, and crowd label. The queries are from the TREC 2009 million query track [6] and the documents are from the ClueWeb09 collection. We also use the corresponding document ranks produced by the participating retrieval systems in the TREC 2009 million query track. The original CS2010 dataset contains 100 queries and 19,902 documents, with in total 20,232 unique query-document pairs and 96,883 relevance annotations given by 766 annotators. The ground truth labels were given by NIST experts in previous TREC tracks. We remove examples that are marked as invalid due to reasons such as broken links, that have no corresponding ground truth label, or that have no text or ranks available for the document. Consequently, there remain 3,275 unique query-document pairs and 18,479 relevance annotations with ground truth labels. As our model is designed for binary labels, we turn the original ternary scale into a binary scale by mapping both highly relevant and relevant labels to relevant. The original CS2011 dataset contains 25 queries and 3,557 documents, with in total 3,568 unique query-document pairs and 10,752 binary relevance annotations given by 181 annotators. The ground truth labels are also from NIST experts. Similarly, invalid annotations are removed, resulting in 711 unique query-document pairs and 2,181 relevance annotations with ground truth labels. The statistics of the two datasets after preprocessing are shown in Table 1.
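The preprocessing described above can be sketched as follows; the tuple layout follows the example description (query ID, document ID, annotator ID, ground truth, crowd label), while the helper name and the use of None for missing ground truth are our assumptions, and the filtering of broken links and missing text is omitted.

```python
# Ternary graded labels -> binary: highly relevant (2) and relevant (1) -> 1.
TERNARY_TO_BINARY = {2: 1, 1: 1, 0: 0}

def preprocess(examples):
    """Drop examples without ground truth and map labels to a binary scale.

    Each example: (query_id, doc_id, annotator_id, gold, crowd).
    """
    kept = []
    for qid, did, aid, gold, crowd in examples:
        if gold is None:  # no corresponding ground truth label -> remove
            continue
        kept.append((qid, did, aid, TERNARY_TO_BINARY[gold], TERNARY_TO_BINARY[crowd]))
    return kept

examples = [("q1", "d1", "a1", 2, 1), ("q1", "d2", "a1", None, 0), ("q2", "d3", "a2", 0, 0)]
print(preprocess(examples))
# [('q1', 'd1', 'a1', 1, 1), ('q2', 'd3', 'a2', 0, 0)]
```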
To understand the annotation distribution and quality of the two datasets, we are interested in: (1) how many redundant crowd annotations are collected for each task (task redundancy), (2) how accurate the crowd annotations are for each task (task accuracy), (3) how many annotations each annotator gives (annotator redundancy), and (4) how accurate each annotator is (annotator accuracy).
Formally, we define the task redundancy of task i as the number of its crowd annotations, denoted by M_i; following [21], we define the task accuracy of task i as the fraction of its crowd annotations that agree with the ground truth label; we define the annotator redundancy of annotator j as the number of crowd annotations he or she gives, denoted by N_j; following [21], we define the annotator accuracy of annotator j as the fraction of his or her crowd annotations that agree with the ground truth labels. We plot the histograms in Figures 2 and 3. Overall, the quality of the crowd annotations of CS2011 is better than that of CS2010.
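The four statistics can be computed directly from the annotation tuples; a minimal sketch with hypothetical toy data (the IDs and values are ours for illustration):

```python
from collections import defaultdict

# Toy annotations: (task_id, annotator_id, crowd_label), plus ground truth per task.
annotations = [("t1", "a1", 1), ("t1", "a2", 0), ("t1", "a3", 1),
               ("t2", "a1", 0), ("t2", "a2", 0)]
truth = {"t1": 1, "t2": 0}

task_labels = defaultdict(list)
annotator_labels = defaultdict(list)
for task, annotator, label in annotations:
    task_labels[task].append(label)
    annotator_labels[annotator].append((task, label))

# Task redundancy M_i and task accuracy.
task_redundancy = {t: len(ls) for t, ls in task_labels.items()}
task_accuracy = {t: sum(l == truth[t] for l in ls) / len(ls)
                 for t, ls in task_labels.items()}
# Annotator redundancy N_j and annotator accuracy.
annot_redundancy = {a: len(ps) for a, ps in annotator_labels.items()}
annot_accuracy = {a: sum(l == truth[t] for t, l in ps) / len(ps)
                  for a, ps in annotator_labels.items()}

print(task_redundancy["t1"], task_accuracy["t1"])    # 3 annotations, 2/3 correct
print(annot_redundancy["a1"], annot_accuracy["a1"])  # 2 annotations, all correct
```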

Baselines.
In order to evaluate its capability of inferring true labels given crowd annotations, we compare CrowdGP with MV and four popular PGMs which model annotators and tasks through different annotation generation assumptions: DS [8], LFC [36], MACE [14], and GLAD [43]. Furthermore, as our CrowdGP model allows using existing training crowd annotations (when pretraining the mean function) and auxiliary task information (task features), for a fair comparison we propose an intuitive approach, a Classifier, where the auxiliary information of tasks and the crowd annotations are used in a supervised way to infer true labels of tasks given their crowd annotations. Finally, to understand the role of the GP prior and the annotation generation process in CrowdGP, we propose two CrowdGP variants: Gaussian process with majority-voted labels (MVGP) removes the annotation generation process component, and likelihood (LK) removes the GP prior component.
MV MV is a straightforward method to aggregate crowdsourcing annotations. The label with the highest number of votes among the annotators is considered the true label. Annotators are treated with equal weights, which is not the case in reality. DS [8] DS is a widely used PGM for label aggregation. It models the true label of tasks with a multinomial distribution over K classes, and the competence of each annotator with a K × K confusion matrix, where K is the number of classes. In total, there are M × K × K parameters for annotators and K parameters for tasks. It is an unsupervised method, and only the crowd annotations are needed to infer the true labels. In the experiments we use the implementation of Zheng et al. [46]. LFC [36] LFC extends DS by adding a Dirichlet prior over its parameters. In the experiments we use the implementation of Zheng et al. [46]. MACE [14] MACE reduces the parameterized confusion matrix of DS to a parameterized vector. As compensation, it introduces a new variable that indicates whether an annotator j is spamming on task i. The number of annotator parameters is M × K + M. In the experiments we use the implementation of Hovy et al. [14]. GLAD [43] GLAD models task difficulty and annotator competence with scalar-valued parameters. The parameters are largely compressed to M + N (M parameters for annotators and N parameters for tasks). In the experiments we use the implementation of Zheng et al. [46]. Classifier As a set of crowd annotations with ground truth labels is available, it is also possible to model the task of inferring true labels given crowd annotations as a supervised classification task. An intuitive way is to use the crowd annotations as task features and train a simple classifier such as a logistic regression model. The features of the Classifier consist of two parts: the features that are the same as those used in CrowdGP (lexical/semantic/ranking), and the features constructed from the crowd annotations. We construct the second type of features in the following way: assuming that a set of crowd annotations is available for each query-document pair, we use the mean, standard deviation, median, maximum, and minimum of the crowd annotations as the features. For example, if a query-document pair is associated with the 5 crowd annotations 1, 1, 1, 0, 0, the second type of features will be [0.6, 0.5, 1, 1, 0]. The goal of this baseline is to explore another way of utilizing existing crowd annotations: CrowdGP utilizes crowd annotations by calculating the likelihood of a probabilistic generative model, while the Classifier utilizes crowd annotations as its features. We implement this baseline ourselves. MVGP MVGP is a simplified version of CrowdGP. It is the same as CrowdGP except that the multiple crowd annotations are replaced with the single label aggregated by MV. It can be viewed as a vanilla GP classification model. The goal is to examine how important a role the crowd annotations play in CrowdGP. We implement this baseline ourselves. LK LK is also a simplified version of CrowdGP. It removes the GP prior in CrowdGP and only keeps the annotation generation process part (see Figure 1(a)). The goal is to examine how important a role the GP prior plays in CrowdGP. We implement this baseline ourselves.
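The Classifier's annotation-statistics features can be sketched as follows. The source does not state whether the population or sample standard deviation is used; we assume the population form, under which the worked example 1, 1, 1, 0, 0 gives ≈0.49 (which rounds to the 0.5 shown in the text).

```python
import statistics

def annotation_features(labels):
    """Mean, (population) std, median, max, min of a task's crowd annotations."""
    return [statistics.mean(labels), statistics.pstdev(labels),
            statistics.median(labels), max(labels), min(labels)]

feats = annotation_features([1, 1, 1, 0, 0])
print([round(f, 2) for f in feats])  # [0.6, 0.49, 1, 1, 0]
```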

Label Inference for Crowd Tasks (RQ1)
In this experiment we study whether CrowdGP is able to correctly infer the latent true labels of tasks, and whether it performs better than the baseline models. We run the experiment on both the CS2010 and CS2011 datasets and report the accuracy and the F1 score. Note that Classifier needs extra data containing crowdsourcing relevance annotations, and CrowdGP allows a pretrained mean function which needs to be trained on extra data containing ground truth relevance labels; therefore the corresponding training sets are used. The rest of the models are unsupervised and thus only the test set is used. For the CrowdGP model, we use all three task features and the pretrained mean function. We conducted 5-fold cross validation on both datasets. Table 2 shows the mean values of accuracy and F1 over the test sets of the 5-fold cross validation for all the models. First, MV is a strong baseline. The three PGMs (DS, LFC, and MACE) perform better than MV on CS2010 and worse than MV on CS2011, while GLAD performs worse than MV on CS2010 and better than MV on CS2011. These five models only take crowd annotations as input, without any auxiliary information of tasks. MV simply treats each worker equally while the four PGMs model tasks and annotators in different ways to improve the quality of the inferred labels; however, the performance of the four PGMs is not consistently better than MV. A similar finding can be found in [22, page 22], where MV and PGMs such as DS and GLAD are compared on 9 benchmark datasets.

Effect of Task Features and Mean Function (RQ3)

We first examine the effect of task features on the default setting of CrowdGP (we use the pretrained mean function). We run the experiment in a 5-fold cross validation setting on both CS2010 and CS2011 and report the accuracy and the F1 score on the test sets. Table 3 shows the performance of six different task feature combinations. It can be observed that on both CS2010 and CS2011 the semantic features perform better than the ranking features, and the ranking features perform better than the lexical features. The combination of the lexical and the semantic features performs best on CS2010, and the combination of the ranking and the semantic features performs best on CS2011. This indicates that the CrowdGP model is sensitive to task features, and therefore careful design of task features is necessary in application.
Similarly, we examine three different mean functions: a zero function, a linear function, and a pretrained ranking function introduced in Section 4.7. The other settings are the same as the default setting of CrowdGP (we use all the lexical, ranking, and semantic task features). We run the experiment in a 5-fold cross validation setting on both CS2010 and CS2011 and report the averaged accuracy and F1 scores on the test sets. Table 4 shows the performance of the three different mean functions. It can be observed that the linear mean function performs better than the zero function, and the pretrained function performs better than the linear one. This is expected because the pretrained function introduces prior knowledge of relevance, the linear function learns this knowledge from the given crowdsourcing annotations, and the zero function does not learn any prior knowledge of relevance at all.
To sum up, both the task features and the mean function affect the CrowdGP model, and the task features have a stronger effect than the mean function. The design of the task features and the mean function can be adapted accordingly in specific crowdsourcing applications.

Configuration Choices of CrowdGP (RQ4)
In this section we illustrate the learning behaviour of CrowdGP under different hyperparameters. The experiment is run on the test set of CS2010. Figure 5(a) shows the loss curves of two optimization methods, SGD and VEM. The actual number of training steps is 5000; for better visualisation we only present 1000 steps. It is observed that both optimization methods lead to convergence, and VEM converges faster and achieves a lower loss value. Figure 5(b) shows the loss curves for 5 different initial values of the variances of the Gaussian variables for tasks and annotators. We only study the initialization of the variances, as they are the key to determining task difficulty and annotator competence; we set the initial value of the means to 0 as there is no prior knowledge on the bias of a task or an annotator. A smaller σ² leads to slower convergence but a lower loss value, and a bigger σ² leads to faster convergence but a larger loss value.
To sum up, different model hyperparameters have very limited effect on model performance, but they do affect the convergence speed.

CONCLUSIONS AND FUTURE WORK
In this paper, we study the problem of relevance inference from noisy crowd annotations. We propose a new PGM named CrowdGP. It assumes a new annotation generation process and models the true relevance labels of tasks, the difficulty and bias of tasks, and the competence and bias of annotators by using a Gaussian process and multiple Gaussian variables.
We evaluate CrowdGP on two datasets, the crowdsourcing dataset of the TREC 2010 relevance feedback track and the crowdsourcing development dataset of the TREC 2011 crowdsourcing track. The CrowdGP model performs consistently better than majority voting in terms of inferring true relevance labels for tasks that have crowd annotations, while several state-of-the-art baselines, including DS, LFC, GLAD, and MACE, perform comparably with majority voting. The CrowdGP model is also effective in terms of selecting new tasks for crowdsourcing labelling, a new functionality which is not supported by DS, LFC, GLAD, MACE, etc. Moreover, ablation studies demonstrate that the effectiveness is attributable to the modelling of task correlation based on the auxiliary information of tasks and the prior relevance information of documents to queries.
One direction for future work is the Bayesian modelling of existing PGMs. For example, a GP prior can be integrated with DS in order to adapt to different crowdsourcing tasks. Another improvement may be achieved with a better task selection approach. In the current work, new tasks are selected for crowdsourcing labelling by cutting off an EI score list at a predefined threshold. It is interesting to study whether iterative search approaches like Bayesian optimization can select new tasks more effectively, similar to the work on multi-armed bandits by Losada et al. [25] and Rahman et al. [34], and whether this leads to biased or unbiased annotations [20].

The probability of task i being labelled relevant by annotator j is denoted by Φ(b_i^j); a crowd label y_i^j is generated from the Bernoulli distribution Bernoulli(Φ(b_i^j)).
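Assuming Φ is the standard normal CDF (a probit link, consistent with the Gaussian latent variables in the model), this generation step can be sketched as:

```python
import math
import random

def probit(b):
    """Standard normal CDF, mapping a latent score b to P(label = relevant)."""
    return 0.5 * (1.0 + math.erf(b / math.sqrt(2.0)))

def sample_crowd_label(b, rng):
    """Draw a crowd label y ~ Bernoulli(probit(b))."""
    return 1 if rng.random() < probit(b) else 0

rng = random.Random(0)
print(round(probit(0.0), 2))  # 0.5 -- an uninformative latent score
print(sample_crowd_label(1.5, rng))
```

A large positive b makes a relevant label almost certain; b near zero makes the annotation a coin flip.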

Figure 1 :
Figure 1: Graphical model for the CrowdGP model. Squares represent observed variables and circles represent latent variables. Figure 1(b) illustrates a set of fully connected nodes following a GP. Figure 1(a) illustrates the annotation generation process.

Figure 5 :
Figure 5: Loss curves of different model hyperparameters.
Different from most label aggregation models, where tasks are represented as a set of indicators {1, 2, . . . , N}, the covariance function k(x, x′) in CrowdGP requires the tasks to be represented as a set of vectors {x_1, x_2, . . . , x_N}. The vectors can incorporate any auxiliary information. In this work, we use lexical, semantic, and ranking features, as they are easy to acquire for most tasks and have been demonstrated to be effective in IR-related tasks.
4.6 Task Representation (1) a term frequency score measuring the frequency of a query in a document (denoted by Σ_{t∈q∩d} TF(t, d)), (2) an inverse document frequency score measuring how much information a query provides (Σ_{t∈q} IDF(t)), (3) a TF-IDF score measuring both (denoted by Σ_{t∈q∩d} TF(t, d) · IDF(t)),
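The three lexical scores can be sketched as follows. The paper does not give the exact TF and IDF definitions, so we assume raw term counts for TF and the common ln(N/df) form for IDF; the function name and toy data are ours.

```python
import math

def lexical_features(query_terms, doc_terms, doc_freq, n_docs):
    """TF, IDF, and TF-IDF scores for a query-document pair, as listed above.

    doc_freq maps a term to the number of documents containing it.
    """
    tf = sum(doc_terms.count(t) for t in set(query_terms) if t in doc_terms)
    idf = sum(math.log(n_docs / doc_freq[t]) for t in set(query_terms) if t in doc_freq)
    tfidf = sum(doc_terms.count(t) * math.log(n_docs / doc_freq[t])
                for t in set(query_terms) if t in doc_terms and t in doc_freq)
    return tf, idf, tfidf

q = ["gaussian", "process"]
d = ["a", "gaussian", "process", "is", "a", "process"]
tf, idf, tfidf = lexical_features(q, d, {"gaussian": 10, "process": 100}, 1000)
print(tf, round(idf, 3), round(tfidf, 2))  # 3 6.908 9.21
```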

Table 1 :
Statistics of two crowdsourcing datasets.

Table 3 :
The effect of task feature on label inference.

Table 4 :
The effect of mean function on label inference.