Grid-based Evaluation Metrics for Web Image Search

Compared to general web search engines, web image search engines display results in a different way. In web image search, results are typically placed in a grid-based manner rather than a sequential result list. In this scenario, users can view results not only in a vertical direction but also in a horizontal direction. Moreover, pagination is usually not (explicitly) supported on image search search engine result pages (SERPs), and users can view results by scrolling down without having to click a “next page” button. These differences lead to different interaction mechanisms and user behavior patterns, which, in turn, create challenges to evaluation metrics that have originally been developed for general web search. While considerable effort has been invested in developing evaluation metrics for general web search, there has been relatively little effort to construct grid-based evaluation metrics. To inform the development of grid-based evaluation metrics for web image search, we conduct a comprehensive analysis of user behavior so as to uncover how users allocate their attention in a grid-based web image search result interface. We obtain three findings: (1) “Middle bias”: Confirming previous studies, we find that image results in the horizontal middle positions may receive more attention from users than those in the leftmost or rightmost positions. (2) “Slower decay”: Unlike web search, users' attention does not decrease monotonically or dramatically with the rank position in image search, especially within a row. (3) “Row skipping”: Users may ignore particular rows and directly jump to results at some distance. Motivated by these observations, we propose corresponding user behavior assumptions to capture users' search interaction processes and evaluate their search performance. 
We show how to derive new metrics from these assumptions and demonstrate that they can be adopted to revise traditional list-based metrics like Discounted Cumulative Gain (DCG) and Rank-Biased Precision (RBP). To show the effectiveness of the proposed grid-based metrics, we compare them against a number of list-based metrics in terms of their correlation with user satisfaction. Our experimental results show that the proposed grid-based evaluation metrics better reflect user satisfaction in web image search.


INTRODUCTION
Image search has been shown to be very important within web search. Existing work shows that queries with an image search intent are the most popular on mobile phone devices and the second most popular on desktop and tablet devices [27]. In web image search a different type of search result placement is used compared to general web search, which results in differences in interaction mechanisms and user behavior. Let us consider the image search search engine result page (SERP) in Figure 1 to highlight three important differences: (1) An image search engine typically places results on a grid-based panel rather than in a one-dimensional ranked list. As a result, users can view results not only vertically but also horizontally. (2) Users can view results by scrolling down without having to click on the "next-page" button because the image search engine does not have an explicit pagination feature. (3) Instead of a snippet, i.e., a query-dependent abstract of the landing page, an image snapshot is shown together with metadata about the image, which is typically only available when a cursor hovers on the result.
Evaluation metrics encapsulate assumptions about user behavior [10,20] and, hence, differences in user behavior should lead to differences in the design of evaluation metrics in image search. Previous work on evaluation metrics [5,15,21] focuses on general web search scenarios where results are placed in a list. Among the evaluation metrics proposed, Rank-Biased Precision (RBP) [21] assumes that users will examine each result with a persistence probability p from top to bottom; users with a higher value of p are more patient when interacting with search results. Discounted Cumulative Gain (DCG) [15] measures the gain of a document based on its position in the result list; the gain is accumulated from the top of the result list to the bottom and is discounted at lower ranks. Although these models work well to assess a result list in general web search, it is not obvious how to adapt them to image search where results are placed in a grid-based manner. Zhang et al. [39] show that the performance of these evaluation metrics is not promising in image search environments in the sense that they do not correlate well with user satisfaction. While the importance of different presentation formats has been recognized [23], there have been very few attempts to construct grid-based evaluation metrics.
As a first step towards designing better evaluation metrics for web image search, we conduct a comprehensive user behavior analysis using data from a lab-based user study so as to obtain a deeper understanding of the underlying user behavior, especially how users allocate their attention.
To summarize, we have three major findings through the analysis: (1) Similar to the findings in [34], a middle position bias of users' attention is observed in the user study data. (2) The attention of image search users is not discounted monotonically and dramatically along with the rank positions, which means that more attention might not always be allocated to the higher rank positions. Also, the attention allocated to results within a row shows less variance than the attention among different rows. (3) Users display row-skipping behavior on image SERPs. They may directly jump to results at some distance and ignore particular rows. A two-stage model can be used to depict this process in which users will judge the whole row first, and then decide to skip this row or view the details of results in this row. Motivated by these observations, we propose corresponding user behavior assumptions to simulate users' interaction processes on SERPs. As shown in [3], evaluation metrics can be generalized as a function of gain and stopping probability, that is, the sum over all ranks of the gain (e.g., relevance) accumulated by examining that far, times the probability that this is where the user stops examining the results. The basic idea of our proposed assumptions is to revise the stopping probability by incorporating grid-based position information. We show how we derive new evaluation metrics from these assumptions and how to adopt them to revise well-known list-based metrics.
We conduct extensive experiments to test the proposed assumptions. By using a large-scale commercial image search log, we show that incorporating grid-based features can help user behavior models to better predict the stopping position. We also use data from a field study, in which users' explicit satisfaction feedback and assessors' relevance judgments are available, to measure the performance of the grid-based evaluation metrics. We demonstrate that in image search, existing list-based metrics do not correlate well with user satisfaction while the proposed grid-based evaluation metrics can better reflect user satisfaction.
In summary, we make the following contributions:
• We thoroughly investigate how users allocate their attention on a grid-based interface in image search. We have three major findings of user behavior, i.e., "Middle bias," "Slower decay," and "Row skipping."
• Motivated by our findings on how attention is allocated, we propose corresponding user behavior assumptions to simulate users' search processes. We then derive new grid-based evaluation metrics based on these assumptions.
• We conduct extensive experiments to test the performance of our proposed grid-based evaluation metrics. Experimental results demonstrate that they better reflect user satisfaction and the assumptions behind them are closer to practical user behavior than the assumptions underlying competing models.

RELATED WORK
Related work comes in two areas: image search and evaluation metrics.

Image search
As result placement and interaction mechanisms in image search are different from general web search, user behavior in image search is different from user behavior in general web search. There are a number of studies on user behavior analysis of image search engines. One line of prior research focuses on characterizing general user behavior based on search logs [2,12,28,32]. Compared with general web search, important differences in user behavior (e.g., shorter queries, a tendency to be more exploratory, and to browse deeper) have been observed. Another line of research investigates more fine-grained user interactions with image SERPs. Xie et al. [34] observe a different browsing model on image SERPs and show a middle position bias of users' examination behavior. The observation "Middle bias" in this paper accords with their findings. Also, interaction behavior such as cursor hovering has been shown to be a valuable additional signal for relevance [22,35]. User behavior that is unique to image search has motivated various attempts at user behavior modeling that aim to improve the performance of image search engines [14,35,37]. Differences in user behavior also have an impact on evaluation. Previous work on evaluation of image search mainly adopts existing list-based evaluation metrics to measure the performance of models developed for image search by simply joining results together [11,14]. Sanderson [24] introduces evaluation measures used in ImageCLEF, an evaluation forum for cross-language annotation and retrieval of images. However, these metrics still follow those in general web search. Zhang et al. [39] find that existing metrics in web search do not correlate well with user satisfaction in image search. The construction of evaluation metrics that do correlate well with user satisfaction in the context of grid-based interfaces for image search remains an open question and deserves more attention.

Evaluation metrics
Evaluation sits at the center of IR research. In order to approximate the system's performance and users' search satisfaction, two components are needed. One is a search result collection labeled with query-dependent relevance levels and the other is a well-designed user model used to simulate the search process [25]. A number of effective evaluation metrics have been designed for general web search [7]. These metrics mainly follow the assumption that users scan ranked results from top to bottom before they stop [9]. One of these, RBP [21], assumes that users examine the (i + 1)-th result after examining the i-th result with persistence p and will end their examination with probability 1 − p. Järvelin and Kekäläinen [15] propose a metric, DCG, that formalizes user gain from a result list as a discounting process. Besides considering the position impact, Expected Reciprocal Rank (ERR) [5] takes result relevance into consideration and defines the probability that a user is satisfied with a document to be related to the relevance of the document. More sophisticated measures have been developed recently. Zhang et al. [38] try to model the search process based on upper limits for both benefit and cost, and propose a Bejeweled Player Model. Also, Wicaksono and Moffat [30] provide a detailed discussion of continuation probabilities (e.g., the persistence p in RBP) in user behavior models that underlie evaluation metrics.
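As a concrete illustration of the two list-based metrics described above, the following minimal sketch computes RBP and DCG for a ranked list of relevance scores. The function names, the zero-based loop index k, and the default persistence value are choices made for this example only; the DCG discount is written as 1/log2(k + 2) for a zero-based position k.

```python
import math


def rbp(relevances, p=0.8):
    """Rank-Biased Precision: weight (1 - p) * p**k at zero-based position k."""
    return sum((1 - p) * p ** k * r for k, r in enumerate(relevances))


def dcg(relevances):
    """Discounted Cumulative Gain: weight 1 / log2(k + 2) at zero-based position k."""
    return sum(r / math.log2(k + 2) for k, r in enumerate(relevances))


# Hypothetical relevance scores for a three-result list.
example = [1, 0, 1]
rbp_score = rbp(example, p=0.5)
dcg_score = dcg(example)
```

With p = 0.5, a relevant result at the third position contributes a quarter of what the same result would contribute at the top of the list, which is the persistence-driven discount RBP is built on.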
In information retrieval, user satisfaction can be understood as the fulfillment of a specified desire or goal [16]. Satisfaction can be considered as the gold standard in search performance evaluation and is used to reflect users' actual feelings about the system [1,13]. Correlation with actual user satisfaction is often taken to be the ultimate test for newly proposed evaluation metrics. Indeed, there are a number of studies investigating different evaluation methods and the correlation between these methods and satisfaction [6,19,20,26]. In this paper, we follow the same principle and also measure the performance of the proposed evaluation metrics by considering their correlation with actual user satisfaction.
What we add on top of the work discussed above is the following. List-based metrics have shown their effectiveness in estimating users' search satisfaction and measuring the performance of general web search engines, but they assume a list-based presentation, whereas image search adopts a grid-based result placement. We show that considering grid-based position information as part of the design of evaluation metrics can be beneficial. No previous research has investigated grid-based evaluation metrics for web image search.

IMAGE SEARCH USER BEHAVIOR PATTERNS
In order to gain a better understanding of user behavior in image search we examine the attention allocation mechanisms of search users in image search. The findings of this examination will help us to formulate an image search user model that will underlie our proposed grid-based evaluation metric. We use two publicly available datasets, of image search and web search respectively, in this paper. The image search dataset has been created using data collected in a lab-based user study in image search scenarios [34]. A total of 40 participants have been recruited to complete 20 image search tasks in this study. A Tobii eye-tracker with default settings has been used to record the examination behavior of participants; the participants' fixation points and fixation dwell time were recorded, and the image being examined was identified by the built-in algorithms. The general web search dataset has been created using data collected in another user study conducted in general web search scenarios [18]. This dataset involves 32 participants who have been recruited to complete 30 web search tasks. Participants' fixation points on general web SERPs were recorded using the eye-tracker with the same settings and built-in algorithms as in the first dataset described above. Based on these two datasets, we can not only investigate examination behavior in image search but also compare image search with general web search.
We obtain three major findings of user examination behavior on image SERPs. They are "Middle bias," "Slower decay," and "Row skipping." The first one ("Middle bias") is mainly column-based and shares the same observations with [34]. Starting from reviewing this finding, we introduce two new observations ("Slower decay" and "Row skipping") which are mainly row-based.

"Middle bias"
In image search, results are placed in a grid-like manner. Hence, users can not only examine results vertically, as in web search, but also horizontally, within a row. It is important to investigate how users allocate their attention within a row. For the first dataset, similar to [34], we use the absolute position instead of the border of images to segment SERPs since the number of images in each row may be different (see the SERP example in Figure 1). Each SERP can be equally divided into 5 columns. We then draw a heat map with 10 rows and 5 columns of the distribution of examination durations (averaged over tasks and users); see Figure 2. Here, the examination duration of an image is defined as the dwell time during which a user gazes at the image. Gaze is the externally-observable indicator of human visual attention [17].
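The segmentation by absolute position can be sketched as follows. This is only an illustration under stated assumptions: fixations are given as hypothetical (x, y, dwell) tuples in page coordinates, rows are approximated by a fixed row height (the text only specifies the equal division into 5 columns), and the sketch accumulates dwell time without the averaging over tasks and users.

```python
def duration_heatmap(fixations, page_width, row_height, n_rows=10, n_cols=5):
    """Accumulate fixation dwell time into an n_rows x n_cols grid of cells."""
    grid = [[0.0] * n_cols for _ in range(n_rows)]
    col_width = page_width / n_cols  # equal-width columns by absolute position
    for x, y, dwell in fixations:
        if not (0 <= x <= page_width) or y < 0:
            continue  # fixation outside the result panel
        row = int(y // row_height)  # assumption: rows have a fixed height
        col = min(int(x // col_width), n_cols - 1)  # x == page_width goes to the last column
        if row < n_rows:
            grid[row][col] += dwell
    return grid


# Hypothetical fixations: (x in pixels, y in pixels, dwell time in seconds).
sample = [(100, 50, 0.3), (500, 50, 0.5), (510, 250, 0.2), (999, 50, 0.1)]
heatmap = duration_heatmap(sample, page_width=1000, row_height=200)
```

Binning by absolute position rather than by image borders keeps the number of columns fixed even when rows contain different numbers of images.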
By examining the heat map in Figure 2 we re-confirm the observations from [34]: the middle positions in each row receive more attention than other positions, i.e., the leftmost or rightmost positions.
Based on these observations, we propose our first hypothesis:

Hypothesis 1 - Middle bias
Image search results in the middle position may attract more attention from search users than results in the leftmost or rightmost position.
Hypothesis 1 is not new: Xie et al. [34] already apply a linear mixed model to show that the middle-position bias is statistically significant. That is, eye gaze behaviors are related to the location of an image within a row, and placing an image in the middle columns has a significant impact on fixation duration. However, they did not adopt it to construct new image search evaluation metrics.
After Hypothesis 1, which concerns user examination behavior within a row, we introduce two other new observations and hypotheses that concern inter-row examination behavior patterns of image search users.

"Slower decay"
In image search, users can view results by scrolling down without having to click the "next page" button, which brings less cost to users and results in more exploratory search and deeper browsing depths [32]. We use the eye-tracking user study datasets to investigate how users examine SERPs in image search and general web search.
As shown in [35,39], different within-row directions have little impact on user behavior modeling in image search. We therefore define the rank position in a grid by following the top-to-bottom and left-to-right order. We calculate the examination duration for each cell in the grid (in the same way as in Section 3.1) and plot the distribution of the top 10 rank positions of image results in Figure 3. For the second dataset, we calculate the examination duration for each result and also plot the duration distribution in Figure 3 for comparison with image search. From Figure 3, the first observation is that users' examination duration does not decrease dramatically with the rank position in image search, especially within the same row (note that there are five cells within a row). Also, the difference of values between positions at different rows is smaller than the difference in web search. The second observation is that the change in examination duration in image search is not always monotonic, which is also different from web search. Position 7 (0.694s) receives a longer fixation than position 4 (0.671s) and position 5 (0.505s). In the case of web search, attention decreases in a monotonic way and at a higher speed than in the case of image search.
This leads to our second hypothesis:

Hypothesis 2 - Slower decay
Users' attention does not decrease monotonically and dramatically with the rank position. In the case of image search attention decays at a slower speed than in general web search.
To verify Hypothesis 2, we first take "the two distributions in web search and image search are similar" as the null hypothesis and then use Pearson's chi-squared test, which determines whether there is a significant difference between the expected distribution and the observed distribution, to decide whether the null hypothesis can be rejected. The result shows that the p-value is less than 0.001. Hence, we can reject the null hypothesis and say that the difference in examination duration distribution between web search and image search in Figure 3 is significant. Also, we define "decay speed" as the result of dividing the examination duration in position i by the examination duration in position i + 1. We calculate the average decay speed based on the data shown in Figure 3. Results show that the average decay speed of image search (1.06) is much lower than that of general web search (1.48).
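The "decay speed" defined above is straightforward to compute from a list of per-position examination durations. The sketch below uses made-up durations purely for illustration; they are not the values behind Figure 3, and the function names are hypothetical.

```python
def average_decay_speed(durations):
    """Mean of duration[i] / duration[i + 1] over consecutive rank positions."""
    ratios = [a / b for a, b in zip(durations, durations[1:])]
    return sum(ratios) / len(ratios)


def is_monotonically_decreasing(durations):
    """True if no position receives more attention than the one before it."""
    return all(a >= b for a, b in zip(durations, durations[1:]))


# Hypothetical examination durations in seconds (NOT the data from Figure 3).
image_like = [0.70, 0.72, 0.69, 0.67, 0.60, 0.66]
web_like = [1.60, 1.00, 0.70, 0.45, 0.30, 0.20]

image_speed = average_decay_speed(image_like)
web_speed = average_decay_speed(web_like)
```

A value close to 1 means attention is nearly flat across positions, while larger values indicate a steeper drop from one position to the next.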

"Row skipping"
We look deeper into the examination sequences of search users using the eye-tracking data. We find that users do not examine every row one-by-one from top to bottom, which means they skip rows and examine results at some distance. This "Row skipping" behavior can be formalized as: right after a user examines results in the i-th row, she/he examines results in the j-th row where j > i + 1.
We define the probability of row-skipping behavior in a certain row (row i) as:

$$P_{\mathrm{skip}}(i) = \frac{S(i)}{S(i) + E(i)} \tag{1}$$

Here, E(i) is the number of cases where row i is examined right after row i − 1 has been examined, and S(i) is the number of cases in which users examine results at a row with a row number larger than i after examining row i − 1. We define "search begin" as the row before row 0. That is, row 0 being skipped means the first examined row is not row 0. We show the probability of row-skipping behavior in the first 10 rows in Figure 4. There exists row-skipping behavior in image search. The highest probability is about 12% in the 6-th row in the first dataset. Also, the row-skipping probability in the 0-th row is much smaller than in later rows, which means users rarely skip the first row in image SERPs. Assuming that participants in a lab-based user study are more patient than users in real-life environments, the probability of row-skipping in real life can be even higher. Thus, we propose our third hypothesis:

Hypothesis 3 - Row skipping
Users may ignore particular rows and directly jump to results at some distance.
We take "the frequency of cases in which row i is examined after row i − 1 has been examined (i.e., E(i)) accords with the frequency of all cases in which the previously examined row is i − 1 (i.e., S(i) + E(i))" as the null hypothesis. We again perform a chi-squared test and find that the p-value is less than 0.001. Therefore, we can reject the null hypothesis and say that row-skipping behavior does exist in the user examination process.
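The counts E(i) and S(i) can be collected from sequences of examined row indices, as in the following sketch. It assumes that each session is a hypothetical list of row indices in the order they were examined, that "search begin" is encoded as row −1, and that the skipping probability of row i is the share S(i) / (S(i) + E(i)); revisits and backward moves are simply ignored here.

```python
from collections import Counter


def row_skipping_probability(sequences):
    """Estimate, per row i, the share of transitions from row i - 1 that skip row i."""
    examined = Counter()  # E(i): row i examined right after row i - 1
    skipped = Counter()   # S(i): a row beyond i examined right after row i - 1
    for rows in sequences:
        prev = -1  # "search begin" acts as the row before row 0
        for row in rows:
            target = prev + 1
            if row == target:
                examined[target] += 1
            elif row > target:
                skipped[target] += 1
            prev = row
    all_rows = sorted(set(examined) | set(skipped))
    return {i: skipped[i] / (skipped[i] + examined[i]) for i in all_rows}


# Hypothetical examination sequences (row indices in viewing order).
sessions = [[0, 1, 2], [0, 2], [1, 2]]
probabilities = row_skipping_probability(sessions)
```

In this toy input, row 1 is skipped once (the jump from row 0 to row 2) and examined once, so its estimated skipping probability is 0.5.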
To sum up, we have presented three hypotheses concerning user behavior in image search based on the observations made during eye-tracking user studies. Statistical tests have been conducted to verify the hypotheses and show their significance. Although our first observation (i.e., "Middle bias") is not new, it has not been adopted in the design of image search evaluation metrics. We devote our attention to it as well as to the two other, new observations, since considering interaction processes both in the horizontal direction ("Middle bias") and in the vertical direction ("Slower decay" and "Row skipping") as part of the construction of a grid-based evaluation metric is beneficial in this two-dimensional environment.

GRID-BASED EVALUATION METRICS
In this paper, we construct grid-based evaluation metrics based on the user behavior hypotheses proposed in Section 3. We first introduce a uniform framework from which existing list-based evaluation metrics can be instantiated. We then propose three modeling assumptions motivated by the hypotheses in Section 3. Based on these assumptions, we derive new grid-based metrics by making revisions on the uniform structure.

Evaluation framework
Given a result set generated in response to a query, we can estimate users' satisfaction based on the relevance score of each query-result pair and a particular user model followed by users when they interact with this result set. Existing list-based evaluation metrics mainly follow an interaction process where users scan ranked results one-by-one from top to bottom before they stop. This interaction process can be regarded as a cascade model [9]. Following the cascade assumption, Moffat et al. [20] define a framework that captures a user's expected utility to generalize arbitrary list-based evaluation metrics (M) as:

$$M = \sum_{i=1}^{\infty} W_i \cdot R_i \tag{2}$$

where $R_i$ is the relevance score of the i-th result, and $W_i$ is the metric-specific weight at rank position i. For example, for RBP with persistence probability p, $W_i = (1 - p)p^{i-1}$ and for DCG, the metric-specific weight $W_i$ would be $1/\log_2(i + 2)$. Note that $W_\infty$ is set to 0 for existing metrics. Similar to work reported in [3,38], we construct a uniform framework by considering user continuation and stopping probability. That is, users have a continuation probability $C_i$ at position i to examine the (i + 1)-th result and with probability $S_i$ they stop at position i and leave this search or issue another query. Thus, $S_i$ can be represented as:

$$S_i = (1 - C_i) \prod_{j=1}^{i-1} C_j \tag{3}$$

As shown in [3], the conditional probability of continuing past the i-th result, i.e., $C_i$, relates to the metric-specific weight, which can be computed as:

$$C_i = \frac{W_{i+1}}{W_i} \tag{4}$$

We can transfer the framework mentioned in Eq. 2 to a uniform framework depicting user stopping behavior and accumulated gain (relevance) as:

$$M' = \sum_{i=1}^{\infty} S_i \cdot \sum_{j=1}^{i} R_j \tag{5}$$

We refer to $M'$ as the total user expected utility. Next, we show that $M'$ and $M$ are equivalent (i.e., $M' \sim M$):

$$M' = \sum_{i=1}^{\infty} S_i \sum_{j=1}^{i} R_j = \sum_{j=1}^{\infty} R_j \sum_{i=j}^{\infty} S_i = \sum_{j=1}^{\infty} R_j \prod_{k=1}^{j-1} C_k = \sum_{j=1}^{\infty} R_j \cdot \frac{W_j}{W_1} = \frac{M}{W_1} \sim M \tag{6}$$

The last equivalence holds because $W_1$ is a constant given a certain evaluation metric. The framework detailed in Eq. 5 can take the user interaction process into consideration more naturally than the framework depicted in Eq. 2, which mainly models the metric-specific weight and obtained gain for each rank.
We therefore make revisions on this framework by incorporating grid-based assumptions. For convenience, we use a triple (i, r(i), c(i)) to represent the index of an image result. As we discuss in Section 3.2, we predefine the examination order of search users in image search to be from top to bottom and from left to right. Based on this order, we can obtain the rank position i of a certain image which is in the r(i)-th row and c(i)-th column.
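The relation between the weight-based form and the stopping-based form of the framework can be checked numerically. The sketch below is a minimal illustration, assuming the continuation probability at rank i is the ratio of consecutive weights and the stopping probability is the chance of reaching rank i times the chance of not continuing; since a judged list is finite, users who continue past its end are assumed to collect the full accumulated gain. Function names are hypothetical.

```python
import math


def rbp_weight(i, p=0.8):
    """RBP weight at 1-based rank i."""
    return (1 - p) * p ** (i - 1)


def dcg_weight(i):
    """DCG weight in the form quoted in the text."""
    return 1 / math.log2(i + 2)


def weighted_sum(relevances, weight):
    """Weight-based form: sum of W_i * R_i over ranks i = 1, 2, ..."""
    return sum(weight(i) * r for i, r in enumerate(relevances, start=1))


def stopping_form(relevances, weight):
    """Stopping-based form: sum of S_i * (gain accumulated up to rank i)."""
    survive = 1.0  # probability of reaching rank i
    gain = 0.0     # gain accumulated up to and including rank i
    total = 0.0
    for i, r in enumerate(relevances, start=1):
        c_i = weight(i + 1) / weight(i)  # continuation probability
        s_i = survive * (1.0 - c_i)      # stopping probability
        gain += r
        total += s_i * gain
        survive *= c_i
    # Users who continue past the last judged result keep all the gain.
    return total + survive * gain


relevances = [1.0, 0.0, 0.5, 1.0]  # hypothetical relevance scores
ratio_rbp = stopping_form(relevances, rbp_weight) * rbp_weight(1)
ratio_dcg = stopping_form(relevances, dcg_weight) * dcg_weight(1)
```

For both weight functions, the stopping-based value multiplied by the first weight matches the weight-based value, which is the sense in which the two forms are equivalent up to a metric-specific constant.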
We are now in a position to introduce the grid-based modeling assumptions which are among the contributions of our work. The order in which we propose our assumptions is the same as the order used to present observations of user behavior patterns in Section 3, i.e., "Middle bias" followed by "Slower decay" and "Row skipping."

Middle bias assumption
The first assumption, named "Middle bias," focuses on the interaction within a single row, i.e., it is column-based. As mentioned in Section 3.1, users have a higher probability to examine results in the middle position. In this paper, we simulate this bias by considering users' continuation examination, in which we increase the stopping probability in the middle position and lower it in the leftmost or the rightmost positions. We assume that users will have a higher probability to finally stop at the middle position within a row. Hence, we can use a column-based function f(c) to modify the stopping probability $S_i$. For the image at rank position i with the column number c(i), we design the function f(c(i)) as follows: where (c(i)) is a normal distribution with mean µ and standard deviation σ as: where MP denotes the column index of the middle position in row r(i). We leave explorations of other functions (e.g., a quadratic function) as future work. In a normal distribution, the mean is the central tendency of the distribution; it defines the location of the peak. The standard deviation is a measure of variability; it defines the width of the normal distribution. Since we simulate users' middle bias in this assumption, we set µ to be the constant 0 to further simplify the parameter estimation process, which means the "location" of the normal distribution will be right in the middle of the column. Thus, σ is the only parameter that needs to be estimated in Eq. 8. Hence, based on the middle bias assumption, the total user expected utility (M) can be represented as:
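To make the idea of a column-based factor more tangible, the sketch below shows one way such a factor could be computed. It is an illustration under explicit assumptions, not the paper's definition: the factor is taken to be the normal density of the offset between the column index and the middle position MP (with µ = 0), rescaled so that the middle column gets the value 1. The rescaling, the zero-based column indices, the function names, and the example σ are all choices made for this example.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def middle_bias_factor(col, n_cols, sigma=1.0):
    """Illustrative column factor: peaks at the middle column, decays towards both edges."""
    middle = (n_cols - 1) / 2  # MP for zero-based column indices (assumption)
    # Assumed rescaling: divide by the peak density so the middle column maps to 1.
    return normal_pdf(col - middle, 0.0, sigma) / normal_pdf(0.0, 0.0, sigma)


# Factors for a hypothetical row with five columns.
factors = [middle_bias_factor(c, n_cols=5, sigma=1.0) for c in range(5)]
```

A smaller σ concentrates the factor on the middle column, while a larger σ flattens it towards a position-independent weight, which is why σ is the quantity to estimate once µ is fixed.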

Slower decay assumption
As mentioned in Section 3.2, the "Slower decay" observation shows that users are more patient in image search than in web search. Their attention decreases more slowly, especially on results within a row. Thus, simply adopting existing evaluation metrics, developed for web search, to image search scenarios is not promising. In this paper, we utilize the row information of image results. We assume that users' stopping probability will increase along with the row. Hence, we can revise the stopping probability $S_i$ in the proposed evaluation framework by multiplying $S_i$ with a row-based function I(r). Considering a result at rank position i with row number r(i) and column number c(i), the revised probability of stopping at this result can be computed as:

$$S'_{i,r(i),c(i)} = S_{i,r(i),c(i)} \cdot I(r(i)) \tag{10}$$

where $S_{i,r(i),c(i)}$ is the original stopping probability of a certain list-based metric; I(r) is a monotonically increasing function. In this paper, we define I(r) as an exponential function with a base β larger than 1, i.e., $I(r) = \beta^{r}$. Then, we can rewrite Eq. 5 as:

$$M_{SD} = \sum_{i=1}^{\infty} \beta^{r(i)} \cdot S_{i,r(i),c(i)} \cdot \sum_{j=1}^{i} R_j \tag{11}$$

Adding the parameter β can slow down the speed of decreasing the stopping probability along with rows, since users might still have a relatively high probability of examining results at a lower rank (see Figure 3). Also, the stopping probabilities of results within a row are multiplied by the same value according to Eq. 10, which attempts to control the variance between the stopping probabilities of results in the same row. When β = 1, Eq. 10 models the stopping probability of existing list-based metrics. We show how different values of β affect the estimated stopping probability distribution in Section 6.
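The row-based revision of the stopping probability can be sketched as follows. The example assumes RBP as the underlying list-based metric, a row factor equal to β raised to the zero-based row index, and no renormalization of the revised probabilities; these choices, like the function name, are assumptions made for illustration.

```python
def slower_decay_utility(relevances, rows, p=0.8, beta=1.1):
    """Expected utility with RBP stopping probabilities scaled by beta ** row."""
    total = 0.0
    gain = 0.0
    for i, (r, row) in enumerate(zip(relevances, rows), start=1):
        s_i = (1 - p) * p ** (i - 1)  # RBP stopping probability at 1-based rank i
        gain += r                     # gain accumulated up to rank i
        total += (beta ** row) * s_i * gain
    return total


# Hypothetical 2 x 2 grid: ranks 1-2 are in row 0, ranks 3-4 are in row 1.
relevances = [1, 0, 1, 0]
rows = [0, 0, 1, 1]
list_based = slower_decay_utility(relevances, rows, p=0.5, beta=1.0)
grid_based = slower_decay_utility(relevances, rows, p=0.5, beta=2.0)
```

With β = 1 the row factor disappears and the value reduces to the list-based computation; with β > 1, results in later rows regain some of the weight that the list-based discount takes away, while all results in the same row share one factor.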

Row skipping assumption
The third assumption is motivated by the "Row skipping" observation which suggests that users may skip particular rows and jump to results at some distance. In this paper, we model this process by considering a two-stage browsing process. In the first stage, users briefly browse the whole row; we can join image results within a row together into an imaginary "united image". By viewing this "united image," users will make a decision for the second stage where they either skip this row or examine results in this row in detail. We arrive at this two-stage browsing process motivated by a neuroimaging study [36], which gives important hints about the multistage mechanisms of visual perceptual learning in the brain.
We are now in a position to describe our row skipping evaluation metric ($M_{RS}$). We use a parameter γ to depict the probability with which users skip the next row after examining the current row; γ is also a trainable parameter with a value between 0 and 1. Then, the stopping probability of users at rank position i can be computed as follows: where N(k) is the number of images in the k-th row and S(k) is the total number of images before the k-th row. The first part of Eq. 12, before the multiplication sign, depicts the two-stage browsing assumption. We simply assume that with a probability (1 − γ), users will examine all the image results within this row. Since users stop at row r(i), they will not skip row r(i). Thus, there is the probability (1 − γ) in the second part after the multiplication sign in Eq. 12.
The row skipping assumption also has an impact on the accumulated gain (i.e., $\sum_{j=1}^{i} R_j$). Since users have a probability γ to skip a certain row, the gain received from this row should be discounted by multiplying it by (1 − γ). Hence, the total user expected utility (M) based on the row skipping assumption can be computed as:

In this section, we have proposed three grid-based assumptions. According to these assumptions, we revise the formula expressing the continuation probability, stopping probability and also the accumulated gain in the uniform evaluation framework (see Eq. 5). To sum up, we modify the stopping probability at different columns, increasing the value in the middle position, by considering a normal distribution according to the "Middle bias" assumption. We modify the stopping probability at each rank by increasing the value of the probability along with the row according to the "Slower decay" assumption. And based on the "Row skipping" assumption, we consider a two-stage browsing process in which users have a skipping probability to ignore a certain row. Thus, the accumulated gain of a certain row is also modified by multiplying it by the probability that users browse this row.

EXPERIMENTAL SETUP
We evaluate the proposed grid-based evaluation metrics using search logs from a commercial image search engine and data from a field study, in which query-level satisfaction feedback and assessors' relevance judgments for query-image pairs are available.
Since there is a user behavior model, which depicts the stopping behavior of search users, behind each proposed assumption, we first perform a sanity check, that is, an experiment to test if incorporating grid-based features can help the underlying user behavior model to better predict the stopping position (in terms of mean log-likelihood). As mentioned in Section 2, user satisfaction can be considered as the gold standard in search performance evaluation. In the same way as in [19,38], we compare our proposed grid-based evaluation metrics against existing list-based metrics in terms of their correlation with user satisfaction to show the effectiveness of the proposed grid-based assumptions.

Table 1: Statistics of the datasets used in our experiments ("#" refers to "number of").

Dataset       #Tasks   #Participants   #Queries   #Sessions
Search log    -        -               82,629     100,000
Field study   555      50              1,212      1,212
In this section we rst introduce the datasets and then describe the design of the two main experiments in this paper.

Datasets
Two image search datasets are used to conduct the experiments. Descriptive statistics of these two datasets can be found in Table 1.
The first dataset ("Search log") is randomly sampled from a search log in October 2017 from the Sogou image search engine, which is popular in China. In this dataset, the grid-based information (i.e., row and column number of image results) and user interaction behavior (i.e., click and cursor hovering) are available. We keep query sessions that have at least one click to make sure we can estimate the user's stopping position, since the last clicked rank can be used to approximate the user's actual stopping rank as shown in [3]. The number of search sessions used in this paper is 100K in total, with 80,000+ distinct queries. We split all query sessions into training and test sets at a ratio of 8:2.
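The preprocessing steps described above (keeping sessions with at least one click, taking the last click as the stopping rank, and splitting 8:2) might be sketched as follows. The session record layout, the field name `clicked_ranks`, and the reading of "last clicked rank" as the chronologically last click are assumptions made for illustration only.

```python
import random
from typing import Any, Dict, List, Tuple

# Hypothetical session record: {"query": str, "clicked_ranks": [int, ...]}
Session = Dict[str, Any]


def keep_clicked_sessions(sessions: List[Session]) -> List[Session]:
    """Keep only sessions that contain at least one click."""
    return [s for s in sessions if s["clicked_ranks"]]


def stopping_rank(session: Session) -> int:
    """Approximate the stopping rank by the last click.

    Assumption: clicked_ranks is stored in chronological order.
    """
    return session["clicked_ranks"][-1]


def split_sessions(
    sessions: List[Session], train_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[Session], List[Session]]:
    """Shuffle with a fixed seed and split into training and test sets."""
    shuffled = list(sessions)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Toy log: ten sessions with clicks and two without (hypothetical data).
toy_log = [{"query": f"q{i}", "clicked_ranks": [i, i + 3]} for i in range(10)]
toy_log += [{"query": "no-click-a", "clicked_ranks": []},
            {"query": "no-click-b", "clicked_ranks": []}]

clicked = keep_clicked_sessions(toy_log)
train, test = split_sessions(clicked)
print(len(clicked), len(train), len(test))
```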
The second dataset ("Field study") consists of data collected from a one-month field study, which is publicly available (see [31]). In this field study, participants are asked to provide explicit satisfaction feedback for their search experience. Note that they can decide for which query sessions they want to give explicit feedback, without having to annotate all search sessions; they are also asked to provide a description of the task they are conducting when issuing a specific query. Query-level satisfaction scores on a 5-point scale are gathered. Besides user behavior data recorded using a browser extension and explicit feedback from participants, relevance scores of query-image pairs are annotated by assessors on a crowdsourcing platform. Each query-image pair has at least five relevance annotation scores in the range of 0 to 100. We use the average of these annotation scores in our experiment as the label of a query-image pair. Also, assessors are recruited to assign a user intent tag to each task (i.e., the "Locate, Learn, Entertain" taxonomy proposed by Xie et al. [33]). Since image search users usually have deeper browsing depths, we test the performance of evaluation metrics at depths of 5 and 10, as in [39], and also at 15. Thus, we keep query sessions in which the number of the last browsed row is not less than 15, which leads to 1,212 query sessions in total in our dataset.

Experiment 1: Behavior prediction
Experiment 1 is aimed at testing whether the proposed grid-based user behavior assumptions (considering the continuation and stopping behavior in a grid-based interface) are closer to real-life user behavior than list-based assumptions. As mentioned in Section 4, the proposed grid-based assumptions revise users' stopping probability at different rank positions by incorporating row and column information. To validate the user behavior assumptions underlying the proposed evaluation metrics, we test the performance of these assumptions on predicting users' actual stopping positions. We use RBP as our baseline model, as it naturally takes users' continuation and stopping into consideration. In RBP, a persistence probability p is used to depict users' continuation probability at each rank. Based on the formulas introduced in Section 4, we can calculate the stopping probability at different ranks, estimated by RBP as well as by grid-based RBPs with the different proposed assumptions (i.e., "MD": Middle bias; "SD": Slower decay; "RS": Row skipping). For example, the stopping probability at rank position (i, r(i), c(i)) estimated by grid-based RBP with the "Slower decay" assumption according to Eq. 11 can be computed as:

We regard the last click position as the user's stopping position on SERPs, in the same way as in [3]. We use log-likelihood to show how well the stopping probability distributions estimated by different models approximate the actual user stopping behavior. We use a grid-search algorithm to estimate the best parameter(s) for each model to minimize the mean log-likelihood of the training data (80%) in our first dataset. We then test the performance of these models with the pre-trained hyper-parameter(s) on the test data (20%). We show the details of the bounds and discretization of the different parameters that need to be estimated using grid-search in Table 2.
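For the list-based RBP baseline, the fitting procedure described above can be sketched in a few lines of Python. This is a simplified illustration, not the code used in the paper: it assumes 0-indexed stopping ranks, the standard RBP user model in which the user stops at rank i with probability p^i (1 − p), and that the quantity being minimized is the negative mean log-likelihood. The candidate grid and the toy stopping ranks are hypothetical.

```python
import math
from typing import Iterable, List, Tuple


def rbp_stop_prob(rank: int, p: float) -> float:
    """Probability that an RBP user stops exactly at `rank` (0-indexed)."""
    return (p ** rank) * (1.0 - p)


def neg_mean_log_likelihood(stop_ranks: List[int], p: float) -> float:
    """Negative mean log-likelihood of observed stopping ranks under RBP."""
    total = sum(math.log(rbp_stop_prob(r, p)) for r in stop_ranks)
    return -total / len(stop_ranks)


def grid_search_p(
    stop_ranks: List[int], candidates: Iterable[float]
) -> Tuple[float, float]:
    """Return the candidate p with the lowest negative mean log-likelihood."""
    best_p = min(candidates, key=lambda p: neg_mean_log_likelihood(stop_ranks, p))
    return best_p, neg_mean_log_likelihood(stop_ranks, best_p)


# Toy training data: last-click ranks of seven sessions (hypothetical).
train_stops = [0, 2, 5, 9, 14, 3, 7]
candidates = [i / 100 for i in range(50, 100)]  # p in {0.50, ..., 0.99}
best_p, best_nll = grid_search_p(train_stops, candidates)
print(best_p, round(best_nll, 4))
```

A grid-based variant would replace `rbp_stop_prob` with a stopping probability that also depends on the row and column of the rank, and would search over the additional parameters in the same way.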

Experiment 2: Correlation with user satisfaction
In Experiment 2, we measure the performance of our grid-based assumptions by testing the correlation between grid-based evaluation metrics, derived from our assumptions, and user satisfaction. We first conduct experiments on RBP-based metrics. We show Pearson's correlation results of RBP with different assumptions (the original list-based and the proposed grid-based assumptions). We also construct a t-statistic to test the significance of the difference between two dependent correlation coefficients [8]. The p-value level is reported if a significant difference is observed. We then look deeper into the effect of different settings of our proposed assumptions (e.g., different starting rows for the row-skipping assumption, different numbers of rows of results being modeled in the evaluation metrics). After that, we report results of the grid-based evaluation metrics, under the best settings, based on other list-based prototype metrics (i.e., ERR and DCG). Comparisons are also made between different prototype metrics.
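The comparison described above can be sketched as follows. Pearson's correlation is computed directly from its definition. For the difference between two dependent correlations we use Hotelling's t-statistic (n − 3 degrees of freedom), which is one common choice; whether it is the exact statistic of [8] is an assumption. All data values are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def hotelling_t(r_ay: float, r_by: float, r_ab: float, n: int) -> float:
    """Hotelling's t for two dependent correlations sharing variable y.

    r_ay, r_by: correlations of metrics a and b with satisfaction y.
    r_ab: correlation between the two metrics. Degrees of freedom: n - 3.
    """
    det = 1 - r_ay ** 2 - r_by ** 2 - r_ab ** 2 + 2 * r_ay * r_by * r_ab
    return (r_ay - r_by) * math.sqrt((n - 3) * (1 + r_ab) / (2 * det))


# Hypothetical per-query values: satisfaction and two metric scores.
satisfaction = [1, 2, 3, 4, 5, 3, 2, 4]
grid_metric = [0.2, 0.3, 0.6, 0.7, 0.9, 0.5, 0.4, 0.8]
list_metric = [0.3, 0.2, 0.5, 0.9, 0.7, 0.6, 0.3, 0.6]

r_grid = pearson(grid_metric, satisfaction)
r_list = pearson(list_metric, satisfaction)
r_between = pearson(grid_metric, list_metric)
t_value = hotelling_t(r_grid, r_list, r_between, n=len(satisfaction))
print(round(r_grid, 3), round(r_list, 3), round(t_value, 3))
```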

RESULTS
We first report the results of Experiment 1, i.e., behavior prediction by user behavior models that are based on different grid-based assumptions. Then, for Experiment 2, we show the performance of grid-based evaluation metrics in terms of their correlation with user satisfaction. We compare the parameter selection in the two tasks and discuss the optimal settings for our proposed grid-based evaluation metrics. Additional comparisons are made between grid-based evaluation metrics derived from different prototype list-based evaluation metrics.

Evaluation of behavior prediction
Table 3 shows the minimized mean log-likelihood of each user behavior model as well as the values of the best parameters and the improvements over the baseline model (RBP). Here, the improvement of the log-likelihood of model A over model B is computed as:

We also perform pairwise t-tests to determine the significance of the observed differences between the grid-based models and the baseline model.

Table 3: Outcomes of Experiment 1. Minimized mean log-likelihood of user behavior models. **: significantly better than the RBP model with p-value < 0.01.
Model  Parameter(s)  Log-likelihood  Improvement

Compared against the baseline model (RBP), our grid-based models with the proposed assumptions achieve better performance on behavior prediction, i.e., predicting users' stopping behavior, in terms of mean log-likelihood. Also, all observed differences are significant. The best grid-based model, RBP-RS, obtains a 13.5% (significant) improvement over the list-based model RBP. Thus, incorporating grid-based information into the construction of a user behavior model is beneficial, and the results show that search user behavior in a grid-based environment differs from that in a list-based environment.
Compared to the "Slower decay" and "Row skipping" assumptions, both of which help RBP to better predict user stopping behavior, RBP with the "Middle bias" assumption has a smaller improvement over the baseline model on behavior prediction. The reason can be two-fold: (1) The method used to depict the middle position bias of search users may not be optimal. The actual distribution of users' stopping probability within a row may follow more complex distributions. We leave an investigation of methods to more accurately model "Middle bias" behavior as future work. (2) Users' stopping behavior correlates more with row information than with column information. Thus, the row-based assumptions ("Slower decay" and "Row skipping") achieve better results than the column-based assumption ("Middle bias") on behavior prediction. We also show the values of the best parameters in Table 3. We can see that the performance of the baseline model with a fixed continuation probability is not promising, which indicates that in image search users' continuation probability may be affected by other factors, such as the position of the currently examined result.
When considering the "Middle bias" assumption, the value of the best parameter σ is 2. For a normal distribution, a small standard deviation (σ) produces a tighter distribution. Thus, a difference in stopping probability between the middle position and other positions is observable.
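The effect of σ on the shape of the column weighting can be seen in a small Python sketch. It only evaluates a normal density centered on the middle column and normalizes it across the row; how exactly these weights enter the stopping probability is defined in Section 4 and is not reproduced here. The function name and the choice of five columns are illustrative assumptions.

```python
import math
from typing import List


def column_weights(num_columns: int, sigma: float) -> List[float]:
    """Normalized normal-density weights centered on the middle column."""
    center = (num_columns - 1) / 2.0
    raw = [
        math.exp(-((col - center) ** 2) / (2 * sigma ** 2))
        for col in range(num_columns)
    ]
    total = sum(raw)
    return [w / total for w in raw]


# A smaller sigma concentrates more weight on the middle column.
tight = column_weights(5, sigma=1.0)
wide = column_weights(5, sigma=4.0)
print([round(w, 3) for w in tight])
print([round(w, 3) for w in wide])
```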
By incorporating the additional parameter β (when considering the "Slower decay" assumption), we are able to consider the possibility that users' stopping probability will increase along with the row. In this setting, the probability of the results at lower ranks being examined will be higher than in the list-based setting, which might indicate that in image search users have deeper browsing depth (confirming [32]); the stopping probability will decrease slowly.
In RBP with the "Row skipping" assumption, we observe that the probability of ignoring a given row is 0.2. This observation accords with the results shown in Figure 4, although the row-skipping probability in the search log is slightly higher than in the user study data. This may be because participants in a lab-based user study are more patient, knowing that their behavior is being recorded. Thus, the row-skipping probability of real-life users can be higher.
In summary, Experiment 1 has shown that the grid-based assumptions proposed in this paper are closer to natural user behavior than list-based assumptions. User behavior models underlying the grid-based assumptions achieve better performance in predicting real-life user behavior, i.e., users' stopping behavior. The values of the estimated parameters of the grid-based assumptions further confirm the observations introduced in Section 3.

Evaluation of user satisfaction correlation
As explained in Section 5.3, we first consider RBP-based evaluation metrics at the top 10 rows, in the same way as in [39]. Table 4 shows the coefficients of Pearson's correlation between RBP-based metrics and user satisfaction. As shown in Figure 4, the row-skipping probability in the 0-th row is much smaller; we therefore also compare different settings of the row at which we start to apply "Row skipping" in this table. We can observe from Table 4 that with the help of the proposed grid-based assumptions, RBP-based evaluation metrics can achieve better correlation with user satisfaction than the original RBP that follows the list-based assumption.
Since the optimization target is different from the target in Section 6.1, we fit the best parameters of the different evaluation metrics to obtain the best correlation with user satisfaction in this experiment. We can observe from Table 3 and Table 4 that the best parameters in these two tasks are slightly different. The reason can be two-fold: (1) We consider a fixed number of rows in this experiment to calculate the correlation, since a predefined scale of results being measured is required for offline evaluation metrics [15,21]. However, in Experiment 1, we compute the log-likelihood based on users' actual stopping positions. (2) In the field study, participants can freely decide for which query sessions feedback is recorded by the browser extension. Thus, the search intent distribution may be slightly different between the field study data and the search log. This difference has previously been observed by [33]. Different search intents have an impact on user behavior and satisfaction [31].
We also investigate how the choice of the first row to which the "Row skipping" assumption is applied affects the performance of the grid-based evaluation metrics. Results are shown in Table 4; they indicate that applying the "Row skipping" assumption from the very beginning is not promising. When we apply the "Row skipping" assumption from the second row (i.e., RBP-RS(S@1)), we observe a better result, with a higher correlation with user satisfaction than list-based RBP. We also show the results of other RBP-RS variants with later starting rows (with a row number larger than 1) in Table 4. We find that although these metrics are better than the list-based metric, their improvement over the baseline decreases as the starting row number increases. RBP-RS(S@3) has no observable improvement. Thus, it is optimal to consider "Row skipping" starting from the second row. This finding accords with the results plotted in Figure 4, which shows that users rarely skip the first row on an image SERP.
Since the number of rows of results to be considered must be defined before a metric can be used, we also discuss the optimal setting of the row scale. For each query session, we test the performance of different RBP-based models at the top 5, 10 and 15 rows, respectively. Results are shown in Table 5. We have two findings from this table: (1) When only a small number of rows is considered, grid-based evaluation metrics with the "Row skipping" assumption, which mainly takes row-based information into consideration, cannot achieve improvements over the list-based competitor. However, the column-based evaluation metric (i.e., RBP-MD) is still better than the baseline model: RBP-MD mainly considers user behavior within a row, so changes in the number of rows have less effect on it. For the "Row skipping" assumption, the reason that we only observe small differences may be that the stopping probability at a lower rank is too small to be affected. Thus, the improvement over RBP obtained by considering "Row skipping" is achieved mainly from the top rows. (2) The more rows are considered in the evaluation metrics, the better the correlation with user satisfaction that can be achieved, for all evaluation metrics (see, e.g., how "Top 10 rows" compares with "Top 5 rows"). However, the difference between "Top 10 rows" and "Top 15 rows" is small, which indicates that there exists an upper bound on the performance. Hence, considering the annotation expense, we regard "Top 10 rows" as the best setting of the row scale.
Armed with the best settings ("S@1", "Top 10 rows") observed from the experiments conducted on the RBP-based evaluation metrics, we further test the effectiveness of our proposed grid-based assumptions on other list-based prototype metrics. We perform experiments on two other list-based prototype metrics, i.e., DCG and ERR. Recall that DCG is also a position-based model, like RBP.
The difference is that the continuation probability of the result at rank i in DCG is rank-dependent; it can be computed as:

In addition, we consider ERR. Unlike DCG and RBP, the stopping criterion of ERR is affected by the gain (G) of the currently examined result. Following [5], the probability that a user stops at rank i can be represented as:

where G_i is the gain that correlates with the relevance score of the current result at rank i, which has the following form:

where r is the relevance score of the i-th result. ERR and DCG have been used in previous evaluation tasks on image search [35,39].
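For orientation, the user models behind DCG and ERR can be sketched in Python using the forms commonly given in the literature. These are assumptions rather than a transcription of the equations in this paper: ranks are 1-indexed, the DCG continuation probability is log2(i + 1) / log2(i + 2), and the ERR gain is (2^r − 1) / 2^(r_max) for an integer relevance grade r. The mapping of the 0 to 100 relevance scores to grades is not specified here, so the grades below are hypothetical.

```python
import math
from typing import List


def dcg_continuation(rank: int) -> float:
    """Continuation probability after a 1-indexed rank under DCG."""
    return math.log2(rank + 1) / math.log2(rank + 2)


def err_gain(relevance: int, max_relevance: int) -> float:
    """ERR gain of a result with an integer relevance grade."""
    return (2 ** relevance - 1) / (2 ** max_relevance)


def err_stop_probs(relevances: List[int], max_relevance: int) -> List[float]:
    """Probability of stopping at each rank under the ERR cascade model."""
    probs = []
    reach = 1.0  # probability that the user reaches the current rank
    for relevance in relevances:
        gain = err_gain(relevance, max_relevance)
        probs.append(reach * gain)
        reach *= 1.0 - gain
    return probs


# DCG: the continuation probability grows with rank, independent of gains.
print([round(dcg_continuation(i), 3) for i in (1, 2, 5, 20)])
# ERR: stopping depends on the gains of the results seen so far.
print(err_stop_probs([4, 0, 2], max_relevance=4))
```

The contrast is visible in the signatures: `dcg_continuation` takes only the rank, whereas `err_stop_probs` takes only the relevance grades.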
We are now in a position to test the performance of our grid-based assumptions on these two evaluation metrics. The results are presented in Table 6. The proposed grid-based assumptions can help ERR and DCG to achieve better correlation with user satisfaction, although one exception is observed (ERR with the MD assumption). All ERR-based evaluation metrics obtain a poor correlation with user satisfaction, confirming a similar result by Zhang et al. [39]. The reason may be that ERR focuses more on the user gain than on the examined position. As shown in [11,35], users' judgments about image results depend largely on image attractiveness. Only considering the effect of relevance on user stopping may not be promising. Furthermore, since position information is not explicitly modeled in the stopping probability in ERR, a grid-based version of ERR cannot achieve promising results.

Table 6: Outcomes of Experiment 2. Pearson's correlation between evaluation metrics (DCG and ERR @top 10 rows) and user satisfaction in the field study dataset. "(u@0.9)" refers to "Upper bound of continuation probability is 0.9". All correlations are significant at the p < 0.001 level. ‡ (†): the difference is significant compared to the corresponding list-based metrics at the p < 0.01 (0.05) level.

For the DCG-based evaluation metrics, we observe the expected result that most grid-based DCG metrics perform better than the list-based DCG, demonstrating the effectiveness of our grid-based assumptions. We also observe a similar performance of DCG and DCG-SD. This may be explained by the fact that the original continuation probability, which is shown in Eq. 15, approaches 1 quickly as the rank increases, which results in a small stopping probability approaching 0. Thus, the parameter β of the "Slower decay" assumption has limited effect on the stopping probability.
We also consider an upper bound on the continuation probability of DCG-based evaluation metrics. The results are also shown in Table 6 (last row). All grid-based DCG metrics obtain better correlation with user satisfaction than the list-based DCG. Also, setting an upper bound on the continuation probability improves the performance of all DCG-based metrics, which confirms the observation that users' attention decays at a slower speed. Simply adopting the assumptions of list-based DCG is not promising in image search scenarios. Importantly, the best parameters of grid-based DCG are almost the same as for RBP shown in Table 4, i.e., σ (1), β (1.1) and γ (0.2), where the different setting for β may be caused by the different continuation probability settings between RBP and DCG. The results shown in Table 6 indicate that the proposed grid-based assumptions help increase the correlation of position-based models with user satisfaction (e.g., RBP and DCG).
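The "(u@0.9)" setting can be illustrated with a short Python sketch. How the bound is applied in the paper is not spelled out in this section, so we assume a simple cap via `min`, together with the commonly used DCG continuation probability log2(i + 1) / log2(i + 2) for 1-indexed ranks. Both the function names and this form are assumptions.

```python
import math


def dcg_continuation(rank: int) -> float:
    """Assumed DCG continuation probability after a 1-indexed rank."""
    return math.log2(rank + 1) / math.log2(rank + 2)


def capped_continuation(rank: int, upper_bound: float = 0.9) -> float:
    """Continuation probability with an upper bound (assumed to be a cap)."""
    return min(dcg_continuation(rank), upper_bound)


# Without the cap the stopping probability (1 - continuation) approaches 0
# at deeper ranks; with the cap it never drops below 1 - upper_bound.
for rank in (1, 5, 20, 50):
    plain_stop = 1.0 - dcg_continuation(rank)
    capped_stop = 1.0 - capped_continuation(rank)
    print(rank, round(plain_stop, 4), round(capped_stop, 4))
```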
In summary, Experiment 2 has shown that the proposed grid-based assumptions can help existing list-based evaluation metrics, especially position-based evaluation metrics (e.g., RBP and DCG), to better reflect user satisfaction. We find that: (1) applying the "Row skipping" assumption beginning at the second row rather than the first row can help RBP-RS to achieve better performance; and (2) a result grid limited to the top 10 rows in RBP-based evaluation metrics is optimal considering the trade-off between metric performance and annotation cost.

CONCLUSION AND FUTURE WORK
In this paper, we have conducted a comprehensive user behavior analysis using data from a lab-based user study so as to understand the attention allocation mechanisms of search users in image search.
We obtain three major findings through our analysis: (1) User attention follows a middle position bias within a row ("Middle bias"). (2) User attention in the case of image search decays more slowly than in general web search ("Slower decay"). (3) Users may skip particular rows and jump to results at some distance ("Row skipping").
We have proposed three grid-based assumptions. Our experimental results show that user behavior models underlying these grid-based assumptions are closer to real-life user behavior. Existing evaluation metrics (e.g., RBP and DCG) can achieve better performance in terms of correlation with user satisfaction by taking grid-based assumptions into consideration.
Our work is the first attempt to construct grid-based evaluation metrics for Web image search. The research outputs of this paper can guide the optimization of image search engines (e.g., in result ranking and UI design) and can also inform user behavior modeling in grid-based environments (not only image search but also video search and e-commerce).
The proposed grid-based assumptions have limitations that may guide future work: (1) The proposed grid-based assumptions mainly consider the effect of the position. It may be beneficial to also take appearance bias (the effect of image attractiveness) into consideration. (2) The way we model grid-based user behavior may not be optimal, e.g., using the normal distribution to simulate the "Middle bias." Methods to encode grid-based user behavior and combine different user behavior assumptions need further investigation [4]. (3) We test the performance of grid-based assumptions on a small group of evaluation metrics only. Experiments conducted on further evaluation metrics are called for. (4) As the effectiveness of evaluation metrics may vary with tasks [29], we will investigate the performance of the proposed grid-based evaluation metrics across search tasks and intents.

Code
To facilitate reproducibility of our results, we share the code used to run our experiments at https://github.com/THUxiexiaohui/grid-based-evaluation-metrics.