Quantifying Engagement with Citations on Wikipedia

Wikipedia, the free online encyclopedia that anyone can edit, is one of the most visited sites on the Web and a common source of information for many users. As an encyclopedia, Wikipedia is not a source of original information, but was conceived as a gateway to secondary sources: according to Wikipedia's guidelines, facts must be backed up by reliable sources that reflect the full spectrum of views on the topic. Although citations lie at the very heart of Wikipedia, little is known about how users interact with them. To close this gap, we built client-side instrumentation for logging all interactions with links leading from English Wikipedia articles to cited references during one month, and conducted the first analysis of readers' interaction with citations on Wikipedia. We find that overall engagement with citations is low: about one in 300 page views results in a reference click (0.29% overall; 0.56% on desktop; 0.13% on mobile). Matched observational studies of the factors associated with reference clicking reveal that clicks occur more frequently on shorter pages and on pages of lower quality, suggesting that references are consulted more commonly when Wikipedia itself does not contain the information sought by the user. Moreover, we observe that recent content, open access sources, and references about life events (births, deaths, marriages, etc.) are particularly popular. Taken together, our findings open the door to a deeper understanding of Wikipedia's role in a global information economy where reliability is ever less certain, and source attribution ever more vital.


INTRODUCTION
Wikipedia is the largest encyclopedia ever built, established through the collaborative effort of a large editor base, self-governed through agreed policies and guidelines [7,16]. Thanks to the tenacious work of the editor community, Wikipedia's content is generally up to date and of high quality [25,45], and is relied upon as a source of neutral, unbiased information [35]. Wikipedia's inline references, or citations, 1 are a key mechanism for monitoring and maintaining its high quality. Wikipedia's core content policies require that "people using the encyclopedia can check that the information comes from a reliable source", 2 and citations are the main way to connect a statement to its sources. A clearly distinctive feature of Wikipedia is the fact that many citations are actionable: they are often equipped with hyperlinks to the cited material available on the Web.
As a result, Wikipedia's role on the Web has been defined as the "bridge to the next layer of academic resources" [19], and the "gateway through which millions of people now seek access to knowledge" [11]. Nevertheless, a question remains open: to what extent do Wikipedia readers actually cross the bridge and access the broader knowledge referenced in the encyclopedia?
Given the collaborative and open nature of Wikipedia, being able to quantify readers' engagement with the content and its supporting sources is of crucial importance for the constant betterment of the encyclopedia and its role in fostering a self-critical society. By understanding readers' interactions with citations, we can better assess the role of Wikipedia editors and policies in maintaining a high quality of information, measure public demand for secondary sources, and provide insights and potential recommendations to increase the public's interest in references.
This paper takes a step in this direction by addressing, for the first time, the problem of quantifying and studying Wikipedia readers' engagement with citations. More specifically, we ask the following research questions:
RQ1 To what extent do users engage with citations when reading Wikipedia? (Sec. 4)
RQ2 What features of a page predict whether a reader will interact with a citation on the page? (Sec. 5)
RQ3 What features of a citation predict whether a reader will interact with it? (Sec. 6)
In order to answer these questions, we collect a large dataset comprising all citation-related events (96M) on the English Wikipedia for two months (October 2018, April 2019), including reference clicks, reference hovers, and downward and upward footnote clicks, as visualized in Fig. 1. By analyzing this dataset, 3 we make the following main contributions:
• We quantify users' engagement with citations and find that it is a relatively rare event (RQ1, Sec. 4): 93% of the links in citations are never clicked over a one-month period, and the fraction of page views that involve a click on a citation link is 0.29%.
• We gain insights into factors associated with seeking additional information via citation interactions, both at the page level (RQ2, Sec. 5) and at the link level (RQ3, Sec. 6). Through matched observational studies, we show that articles that are of higher quality, and thus also longer and more popular, are associated with a lower propensity of users to interact with citations. Using a logistic regression model trained on linguistic features, we show that more frequently clicked citation links tend to relate to social or life events.
We thus conclude that readers are more likely to use Wikipedia as a gateway on topics where Wikipedia is still wanting and where articles are of low quality and not sufficiently informative; and that Wikipedia tends to be the final destination in the large majority of cases where the information it contains is of sufficiently high quality.
Our work provides the first study aimed at understanding if and how users engage with citations on Wikipedia, thus paving the way for a broader and deeper understanding of Wikipedia's role in the global information ecosystem.

RELATED WORK
This paper is related to research on a number of different themes.
Characterizing Wikipedia readers. A substantial amount of prior work has focused on understanding user engagement with Wikipedia from the point of view of the editor community [2,41,43,68]. Studies on the behavior of Wikipedia readers have mostly considered interest in contents [31,49,63], content popularity [8,47,56], or event timing [37]. More recently, a study explored the question of why users read Wikipedia, by combining multiple-choice surveys with log-based analyses of user activity [53]. A similar design was used to study 14 languages other than English [32]. Little is known, however, about how users engage with Wikipedia's citations of external sources; ours is the first study on this subject.
Navigation in Wikipedia. Wikipedia citations are part of the hyperlink network connecting Wikipedia and the Web. Understanding citation usage can yield useful insights for improving this network [29,66]. The analysis, modeling, and prediction of human navigation inside Wikipedia has been considered in previous studies [13,18,21,30,52,62], largely relying on traces from the navigation games Wikispeedia [50,64,65] and WikiGame [12,26,54]. For our study, we collect instead a new, fine-grained dataset of user interactions with Wikipedia references to external content.
3 Notebooks with code at https://github.com/epfl-dlab/wikipedia-citation-engagement

Science in Wikipedia. A sizeable portion of citations on Wikipedia refer to scientific literature [40]. Consequently, Wikipedia is a fundamental gateway to scientific results and enables the public understanding of science [33,34,38,51,61]. The chance of a scientific reference being cited on Wikipedia varies with the impact factor of the publication venue and its open-access availability [58]. Being cited on Wikipedia can thus be considered an indicator of impact [27]. Despite the indirect influence that Wikipedia has on scientific progress [59], Wikipedia is in turn rarely acknowledged in the scientific literature [23,60].
Improving Wikipedia. Wikipedia content quality relies on the work of editors and their gradual improvement of articles [9,46]. Automated or semi-automated tools [17,39,44] can help improve user experience [29,69], content variety [42,67], and quality [1,20,28]. The reliability of Wikipedia can also be improved automatically, e.g., by finding potential citations [15] and Wikipedia statements in need of evidence [48]. The insights from our work can help improve Wikipedia via new citations with which users would be more likely to interact.
Quantifying Web user engagement. User engagement is crucial for the success of Web services, and numerous researchers have focused on quantifying how Web users engage with online content, e.g., in computational advertising [6,70], social media [5,10,22], or information retrieval [24,55]. While the body of work on readers' and editors' engagement with content within Wikipedia has been growing in recent years [36], we study here for the first time how Wikipedia readers engage with the broader outside knowledge linked from the online encyclopedia.

CITATION DATA COLLECTION
To study readers' engagement with citations, we collected data capturing where readers navigate and how they interact with citations in English Wikipedia.

Background: Citations in Wikipedia
Articles in Wikipedia are written by editors in wikicode, a markup language that is then translated to HTML by MediaWiki, the software that powers the website. There are different ways to add citations to sources in the text, summarized below. In all cases, the full reference descriptions are rendered as footnotes at the bottom of the page (in a dedicated section called References) with an automatically assigned footnote number that is added as a link anchor (e.g., " [1]") in the text of the article wherever the reference is cited (Fig. 1). Most references in the References section consist of text including the title of the source, the authors' names, the year of publication, and the source's publisher. For 80% of Wikipedia references, the source title is actionable via a clickable link to the source. Also, when reading a page, hovering over a reference's footnote number with the mouse cursor will display a reference tooltip, 4 a pop-up containing the reference text and a clickable link (when present), e.g., Daniel Nasaw (July 24, 2012). "Meet the 'bots' that edit Wikipedia". BBC News. When readers click on the reference's footnote number, they are sent to the reference description at the page bottom, from where they can jump back to the locations where the reference is cited by clicking on a small icon (e.g.,^).
The most common method to add a reference to an article, also recommended by the Wikipedia guidelines, is via an inline citation using a <ref/> tag directly in the context where the reference is first cited. In the tag, the editors can specify the reference details (text and links) by using a predefined template or plain wikicode. In addition to this standard method, some references are added automatically by templates included in the page, such as the geolocations present in the infobox. It is worth noting that a reference can be cited multiple times by assigning it a name and appending the tag to every sentence that should link to it. Given the numerous ways to use the <ref/> tag, and in order to have an accurate view of the article, we parsed pages from wikicode to HTML and extracted the information from the HTML code.

Logging citation and page load events
We make use of Wikimedia's EventLogging tool, 5 an extension of the MediaWiki software that performs client-side logging of specific types of events. We detect 5 main types of citation-related events and 1 page load event. In terms of citations, we capture the mouse events that involve any kind of reader interaction with the references (see Fig. 1 for a visual explanation):
refClick: a click on a hyperlink in an article's reference section.
extClick: a click on an external link outside the reference section.
fnHover: a hover over a footnote number in the text, logged when the reference tooltip is visible for more than 1 second.
fnClick: a click on a footnote number, which takes the user to the reference section at the bottom of the page.
upClick: the inverse of fnClick: a click on a reference's up arrow icon that takes the reader back to the part of text where the reference is cited.
pageLoad: in addition to the above citation-related events, this event is triggered whenever a Wikipedia article is loaded.
The EventLogging platform manages a so-called session token, a cookie-based identifier that allows us to group events that happened within the same browser tab. We henceforth refer to event sequences that occur with the same session token as sessions.
We collected 4 contiguous weeks 6 of Wikipedia mobile and desktop traffic data of citation-related events. We repeated the 4-week data collection over two periods: from September 26 to October 25, 2018, and from March 24 to April 21, 2019. In both cases, we collected all citation-related events (extClick, refClick, fnHover, fnClick, upClick) and (due to computational infrastructure constraints) sampled pageLoad events at the session level at a rate of 33%.
To ensure that the logs reflect reader, rather than editor, behavior, we exclusively retained data from users who in the 4 weeks of data collection acted only as anonymous readers, discarding all events generated by Wikipedia editors (logged-in users or users with anonymous edits) and by bots (which can be filtered out using a detector provided by the EventLogging tool).
Throughout the paper, we will mostly focus on the data from the second data collection period (April 2019) and only use the October 2018 data for a longitudinal study measuring the impact of article quality on readers' engagement with citations.

Definition of engagement metrics
Two key metrics in our analysis will be the citation click-through rate (CTR) and the footnote hover rate.
For each page p and each session s, let C(p, s) be the indicator function that is 1 if at least one reference was clicked on page p during session s by the respective user (refClick event), and 0 otherwise. Analogously, let H(p, s) indicate if the user hovered over at least one footnote (fnHover event). Furthermore, let N(p) be the number of sessions during which p was loaded (pageLoad event).
Global click-through rate. The global CTR measures overall reader engagement via reference clicks across Wikipedia. It is defined as the fraction of page views on which at least one reference click occurred (treating all views of the same page in the same session as one single event):
\[ \mathrm{gCTR} = \frac{\sum_{p} \sum_{s} C(p, s)}{\sum_{p} N(p)}, \tag{1} \]
where p ranges over the set of pages that contain at least one reference with a hyperlink.
Page-specific click-through rate. The page-specific CTR for page p is defined as the probability of observing at least one click on a reference in p during a session in which p was viewed:
\[ \mathrm{pCTR}(p) = \frac{\sum_{s} C(p, s)}{N(p)}. \tag{2} \]
Finally, we denote the average page-specific CTR over a set P of pages by
\[ \mathrm{pCTR}(P) = \frac{1}{|P|} \sum_{p \in P} \mathrm{pCTR}(p). \tag{3} \]
Note that pCTR(P) corresponds to a macro average where every page gets the same weight, whereas gCTR corresponds to a micro average where pages are weighted in proportion to the number of sessions in which they were viewed.
Footnote hover rates. In analogy to the above definitions, but when replacing the click indicator C(p, s) with the hover indicator H(p, s), we obtain the global and page-specific footnote hover rates:
\[ \mathrm{gHR} = \frac{\sum_{p} \sum_{s} H(p, s)}{\sum_{p} N(p)}, \qquad \mathrm{pHR}(p) = \frac{\sum_{s} H(p, s)}{N(p)}, \tag{4} \]
with the average pHR(P) over a set P of pages defined as in Eq. 3.
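To make the difference between the micro average (gCTR) and the macro average (pCTR(P)) concrete, the following minimal sketch computes both from a toy event log. The tuple layout `(session, page, event_type)` and the function name are assumptions made for illustration, as is the choice to count a click only when the corresponding page load was logged; the sketch also assumes the log is already restricted to pages with at least one hyperlinked reference.

```python
from collections import defaultdict


def click_through_rates(events):
    """Return (gCTR, {page: pCTR(p)}, macro-averaged pCTR over all pages).

    `events` is an iterable of (session, page, event_type) tuples; this
    layout is hypothetical and chosen only for the example.
    """
    loaded = defaultdict(set)   # page -> sessions with a pageLoad; N(p) is its size
    clicked = defaultdict(set)  # page -> sessions with at least one refClick
    for session, page, event_type in events:
        if event_type == "pageLoad":
            loaded[page].add(session)
        elif event_type == "refClick":
            clicked[page].add(session)

    page_ctr = {}
    total_clicks = 0
    total_loads = 0
    for page, sessions in loaded.items():
        # Assumption: a click counts only if the page load of that session was logged.
        n_clicks = len(clicked.get(page, set()) & sessions)
        page_ctr[page] = n_clicks / len(sessions)
        total_clicks += n_clicks
        total_loads += len(sessions)

    g_ctr = total_clicks / total_loads if total_loads else 0.0
    macro = sum(page_ctr.values()) / len(page_ctr) if page_ctr else 0.0
    return g_ctr, page_ctr, macro


# Toy log: page A is viewed in four sessions (one click), page B in one (one click).
events = [
    ("s1", "A", "pageLoad"), ("s1", "A", "refClick"),
    ("s2", "A", "pageLoad"), ("s3", "A", "pageLoad"), ("s4", "A", "pageLoad"),
    ("s5", "B", "pageLoad"), ("s5", "B", "refClick"),
]
g_ctr, page_ctr, macro = click_through_rates(events)
```

In this toy log the rarely viewed page B pulls the macro average well above the micro average, which is the same effect that rarely viewed pages have on the average page-specific CTR.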

Capturing event context
Each event is characterized by a set of features that capture information about three aspects of the event: the session in which the event happened, the page, and the reference.
Session: We collect the unique session token (cf. Sec. 3.2) that identifies the browser tab in which the event occurred.
Pages: At the article level, we store title, page id, text length of wikicode in characters, number of references, and popularity (number of pageLoad events during the data collection period). We also use the ORES drafttopic classifier [3] to label each Wikipedia article with a vector of topics, whose elements reflect the probability that the page belongs to one of the 44 topics from the highest level of the WikiProjects taxonomy. 7 We further use the ORES articlequality model [20] to label articles with a quality level, which can take the following values (from low to high quality): "Stub", "Start", "C-class", "B-class", "Good Article", "Featured Article".
References: For each reference clicked or hovered, we record its URL, the text in the reference, the text of the sentence in which the reference is cited, and the relative position (character offset from the start in plain text, divided by page length) in the page where the reference is cited. Since we associate references to their contexts, references to the same source appearing on different pages are treated as distinct.
Figure 2: Distribution of Wikipedia articles by (a) popularity (number of pageviews), (b) page length (number of characters in wikicode), and (c) quality (increasing from left to right; "GA" for "Good Article", "FA" for "Featured Article") (Sec. 3.5).
Wikipedia is dynamic by nature: articles are continuously updated, and their changes are tracked through revisions. To account for the evolution of articles over the 4 weeks of data collection, we aggregate individual revision-level metrics at the article level. To compute article-specific characteristics such as article length or number of references, we calculate their average over all revisions from the logging period. To quantify the amount of reader engagement with a given article (e.g., page loads, reference clicks), we sum all events recorded at each revision of the article.
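The aggregation rule just described (average the article characteristics, sum the engagement counts) can be sketched as follows. The record fields and the function name are hypothetical, and the sketch assumes an unweighted mean over revisions, which the text does not specify.

```python
from collections import defaultdict
from statistics import mean


def aggregate_revisions(revisions):
    """Collapse revision-level records into one record per article.

    Each revision is a dict with the hypothetical keys
    article, length, num_refs, page_loads, ref_clicks.
    """
    grouped = defaultdict(list)
    for rev in revisions:
        grouped[rev["article"]].append(rev)

    articles = {}
    for article, revs in grouped.items():
        articles[article] = {
            # Article characteristics: unweighted average over revisions (assumption).
            "length": mean(r["length"] for r in revs),
            "num_refs": mean(r["num_refs"] for r in revs),
            # Engagement counts: summed over revisions.
            "page_loads": sum(r["page_loads"] for r in revs),
            "ref_clicks": sum(r["ref_clicks"] for r in revs),
        }
    return articles


revisions = [
    {"article": "A", "length": 1000, "num_refs": 4, "page_loads": 30, "ref_clicks": 1},
    {"article": "A", "length": 1200, "num_refs": 6, "page_loads": 50, "ref_clicks": 0},
    {"article": "B", "length": 300, "num_refs": 1, "page_loads": 5, "ref_clicks": 0},
]
articles = aggregate_revisions(revisions)
```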

General statistics of English Wikipedia
By the end of the data collection, English Wikipedia contained 5.8M articles, 5.4M (95%) of which were loaded at least once in our data sample, in a total of 7.4M revisions. Out of these articles, 3.9M (73%) contain at least one citation, linking to a total of 24M distinct URLs.
Over the 4 weeks of data collection, we collected (at a 33% sampling rate) 1.5B pageLoad events (62% from the mobile site and the rest from the desktop site). In Fig. 2a we report the (complementary cumulative) popularity distribution for the Wikipedia pages that were viewed at least once during the data collection period. The distribution is heavily skewed, with approximately 83% of the articles loaded fewer than 100 times in the 33% random sample (cf. Sec. 3.2), or fewer than 300 times when extrapolating to all data.
We observe a similar uneven distribution of page length (Fig. 2b), with the majority of articles being very short. Fig. 2c shows that the distribution of article quality levels is also heavily skewed toward low quality levels: most articles are identified as "Stub" or "Start", and fewer than 300K articles are marked as "Good" or "Featured" articles. Finally (Fig. 3), we find that a majority of articles are about geography or "Language and literature" (the latter including biographies), followed by topics related to sports and science.

RQ1: PREVALENCE OF CITATION INTERACTIONS
After these preliminaries, we are now ready to address our first research question, which asks to what extent Wikipedia readers engage with citations.

Distribution of interaction types
We start by analyzing the relative frequency of the different citation events, as defined in Sec. 3.2. Over the month of data collection, we captured a total of 96M citation events. Fig. 4 shows how these events distribute over the 5 event types, broken down by device type (mobile vs. desktop). We observe that most interactions with citations happen on desktop rather than mobile devices, despite the fact that the majority of page loads (62%) are made from mobile. The interactions also distribute differently across types for mobile vs. desktop. By far the most common event on desktop is hovering over a footnote (fnHover) in order to display the reference text. Hovering requires a mouse, which is not available on most mobile devices; this explains the low incidence of fnHover on mobile. In order to reveal the reference text behind a footnote, mobile users instead need to click on the footnote, which presumably explains why fnClick is the most common event on mobile.
Clicking external links outside of the References section at the bottom of the page (extClick) is the second most common event on both desktop and mobile, followed by clicks on citations from the References section (refClick). Finally, the upClick action, which lets users jump back from the References section to the locations where the citation is used in the main text, is almost never used.

Citation click-through rates
We now focus on the two prevalent interactions with citations, hovering over footnotes (fnHover) and leaving Wikipedia by clicking on citation links (refClick). (We do not dwell on extClick events, as they do not concern citations but other external links; cf. Sec. 3.2.) First, we observe that, out of the 24M distinct URLs that are cited across all articles in English Wikipedia, 93% of the URLs are never clicked during our month of data collection.
Next, we note that the global click-through rate (CTR) across all pages with at least one citation (gCTR, Eq. 1) is 0.29%; i.e., clicks on references happen on fewer than 1 in 300 page loads. Breaking the analysis up by device type, we observe again substantial differences between desktop and mobile: on desktop the global CTR is 0.56%, over 4 times as high as on mobile, where it is only 0.13%.
The average page-specific CTR (pCTR, Eq. 3) is higher, at 1.1% for desktop and 0.52% for mobile. This is due to the fact that there are many rarely viewed pages (cf. Fig. 2a) with a noisy, high CTR. After excluding pages with fewer than 100 page views, the global CTR is 0.67% on desktop, and 0.21% on mobile.
Engagement via footnote hovering is slightly higher, at a global footnote hover rate (gHR, Eq. 4) of 1.4%. The average page-specific footnote hover rate (pHR, Eq. 4) is 0.68% when including all pages with at least one clickable reference, and 1.1% when excluding pages with fewer than 100 page views. 8 Given these numbers, we conclude that readers' engagement with citations is overall low.

Positional bias
Previous work has shown that users are more likely to click Wikipedia-internal links that appear at the top of a page [42]. To verify whether this also holds true for references, we sample one random page load with citation interactions per session and randomly sample one clicked and one unclicked reference for this page load. We then compute each reference's relative position in the page as the offset from the top of the page divided by the page length (in characters). Fig. 5, which shows the distribution of the relative position for clicked and unclicked references, reveals that users are more likely to click on references toward the top and (less extremely so) the bottom of the page.

Top clicked domains
Next, we investigate what are the most frequent domains at which users arrive upon clicking a citation. Initially, we found that the most frequently clicked domain is archive.org (Internet Archive), with 882K refClick events. Such URLs are usually snapshots of old Web pages archived by the Internet Archive's Wayback Machine. To handle such cases, we extract the original source domains from wrapping archive.org URLs.
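The unwrapping step can be illustrated with a short sketch. It assumes the common Wayback Machine URL shape `web.archive.org/web/<timestamp>/<original URL>`; the regular expression, the function name, and the choice to report the host name with a leading "www." stripped (rather than a registered domain such as google.com for books.google.com) are assumptions for illustration and not necessarily the procedure used for Fig. 7.

```python
import re
from urllib.parse import urlsplit

# Assumed shape: http(s)://web.archive.org/web/<digits><optional modifier>/<original URL>
WAYBACK = re.compile(r"^https?://web\.archive\.org/web/\d+[a-z_]*/(.+)$", re.IGNORECASE)


def source_host(url):
    """Return the host of the cited source, unwrapping Wayback Machine URLs."""
    match = WAYBACK.match(url)
    if match:
        url = match.group(1)
        if "://" not in url:
            # Archived URLs are sometimes stored without a scheme.
            url = "http://" + url
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


wrapped = "https://web.archive.org/web/20120724000000/http://www.bbc.co.uk/news/magazine-18892510"
host = source_host(wrapped)
```

Mapping host names to registered domains would additionally require a public-suffix list, which is omitted here.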
In Fig. 7 we report the top 15 domains by number of refClick events. The most clicked domain is google.com. Drilling deeper, we checked the main subdomains contributing to this statistic, finding that a significant proportion of clicks goes to books.google.com, which provides partial access to printed sources. The second most clicked domain is doi.org, the domain for all scholarly articles, reports, and datasets recorded with a Digital Object Identifier (DOI), followed by (mostly liberal) newspapers (The New York Times, The Guardian, etc.) and broadcasting channels (BBC).

Markovian analysis of citation interactions
Whereas the above analyses involved individual events, we now begin to look at sessions: sequences of events that occurred in the same browser tab (as indicated by the session token; Sec. 3.2). Every session starts with a pageLoad event, and we append a special END event after the last actual event in each session.
By counting event transitions within sessions, we construct the first-order Markov chain that specifies the probability P(j | i) of observing event j right after event i, where i and j can take values from the event set introduced in Sec. 3.2 (pageLoad, refClick, extClick, fnClick, upClick, fnHover) plus the special END event.
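As a minimal sketch of this construction, the transition probabilities P(j | i) can be estimated by counting consecutive event pairs in each session and normalizing per source event. The input format (a list of event-name lists, one per session) and the function name are assumptions for illustration.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate first-order transition probabilities P(next | current).

    `sessions` is an iterable of event-name sequences; an END marker is
    appended after the last actual event of each session.
    """
    counts = defaultdict(Counter)
    for events in sessions:
        sequence = list(events) + ["END"]
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1

    probabilities = {}
    for current, row in counts.items():
        total = sum(row.values())
        probabilities[current] = {nxt: n / total for nxt, n in row.items()}
    return probabilities


sessions = [
    ["pageLoad"],
    ["pageLoad", "pageLoad"],
    ["pageLoad", "fnHover", "fnClick", "refClick"],
]
P = transition_probabilities(sessions)
```

Each row of the resulting table sums to 1, so `P["pageLoad"]["END"]` reads directly as the probability that a session ends right after a page load.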
The transition probabilities are reported in Fig. 6. We observe that most reading sessions are made up of page views only: on both desktop and mobile, after loading a page, readers tend to end the session (with a probability of around 50%) or load another page in the same tab (47%). All citation-related events have a very low probability (at most 1.2%) of occurring right after loading a page.
On desktop, reference clicks become much more likely after footnote clicks (34%), and footnote clicks in turn become much more likely after footnote hovers (6.5%), hinting at a common 3-step motif (fnHover, fnClick, refClick), where the reader engages ever more deeply with the citation. Note, however, that this is not true for mobile devices, where, even after readers clicked on a footnote, the probability of also clicking on the citation stays low (0.5%).
Finally, reference clicks (refClick) are also common immediately after other reference clicks (8% on desktop, 13% on mobile). Note that for external links outside of the References section (extClick) we see a different picture: such external clicks are only rarely followed by interactions with citations (fnHover, fnClick, refClick), and in the majority of cases (59% on desktop, 53% on mobile) they conclude the session, suggesting that Wikipedia is in these cases commonly used as a gateway to external websites.

RQ2: PAGE-LEVEL ANALYSIS OF CITATION INTERACTIONS
We now proceed to our second research question, which asks what features of a Wikipedia page predict whether readers will engage with the references it contains.

Predictors of reference clicks
As a first step, we perform a regression analysis. We train a logistic regression classifier for predicting whether a given pageLoad event will eventually be followed by a refClick event. To assemble the training set, we first find sessions with at least one (positive) pageLoad followed by a refClick and at least one (negative) pageLoad not followed by a refClick, and make sure to include at most one such pair per session in order to avoid over-representing power users with extensive sessions. The dataset totals 938K pairs, which we split into 80% for training and 20% for testing. As predictors we use the article's topic vector (with entries from [0, 1]; Sec. 3.4) and the quality label (Sec. 3.4), which we also normalize to a score in the range [0, 1] using the mapping from a previous study [20]. We did not use the number of references and the length of the page, as they are important features in the quality model and would cause collinearity issues due to their high correlation with quality (Pearson correlation 0.81 and 0.75, respectively). The resulting regression model has an area under the ROC curve (AUC) of 0.6 on the testing set. A summary of the 10 most predictive positive and negative coefficients is given in Fig. 8. By far the most important predictor, with a large negative weight, is the article's quality. Moreover, some topics are positive predictors (e.g., "Language and literature", which also includes all biographies, as well as "Internet culture"), while others are negative predictors (e.g., "Media", "Information science").
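The pairing scheme used to assemble the training set can be sketched as below: each session contributes at most one (positive, negative) pair of page loads. The session representation, the uniform random choice within a session, and the function name are assumptions for illustration; the text does not state how the pair is picked when several candidates exist.

```python
import random


def sample_pairs(sessions, seed=0):
    """Draw at most one (clicked page, unclicked page) pair per session.

    `sessions` maps a session token to a list of (page, had_ref_click)
    tuples, one per page load; this layout is hypothetical.
    """
    rng = random.Random(seed)
    pairs = []
    for token in sorted(sessions):  # sorted for reproducibility
        positives = [page for page, clicked in sessions[token] if clicked]
        negatives = [page for page, clicked in sessions[token] if not clicked]
        if positives and negatives:
            # Assumption: pick uniformly at random among the candidates.
            pairs.append((rng.choice(positives), rng.choice(negatives)))
    return pairs


sessions = {
    "s1": [("A", True), ("B", False), ("C", False)],
    "s2": [("D", False)],  # no positive page load: contributes no pair
    "s3": [("E", True)],   # no negative page load: contributes no pair
}
pairs = sample_pairs(sessions)
```

Because every pair comes from a single session, user-level characteristics are held fixed within a pair, and long sessions cannot dominate the dataset.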
Given the importance of the quality feature in this first analysis, we now move to investigating its role in a more controlled study.

Effects of page quality
To come closer to a causal understanding of the impact of an article's quality on readers' clicking citations in the article, we perform a matched observational study. The ideal goal would be to compare the page-specific CTR (Eq. 2) for pairs of articles (one of high, the other of low quality) that are identical in all other aspects.
Propensity score. Finding such exact matches is unrealistic in practice, so we resort to propensity score matching [4], which provides a viable solution. The propensity score specifies the probability of being treated as a function of the observed (pre-treatment) covariates. Crucially, data points with equal propensity scores have the same distribution over the observed covariates, so matching treated to untreated points based on propensity scores will balance the distribution of observed covariates across treatment groups.
In our setting, we define being of high quality as the treatment and estimate propensity scores via a logistic regression that uses topics, length, number of citations, and popularity as observed covariates in order to predict quality as the binary treatment variable. We consider as low-quality all articles tagged as Stub or Start (74% of the total; Fig. 2c), and as high-quality the rest. Articles without a refClick or fewer than 100 pageLoad events are discarded in order to avoid noisy estimates of the page-specific CTR. This leaves us with 854K articles.
Matching. We compute a matching (comprising 198K pairs) that minimizes the total absolute difference of within-pair propensity scores, under the constraint that the length of matched pages should not differ by more than 10%. This constraint is necessary to ascertain balance on the page length feature because page length is so highly correlated with quality (Pearson correlation 0.81; cf. Sec. 5.1). After matching, we manually verify that all observed covariates, including page length, are balanced across groups.
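To illustrate matching on propensity scores under a length constraint, the sketch below pairs treated and untreated articles greedily, closest scores first. Note that this greedy procedure is a simplified stand-in: the matching described above minimizes the total absolute difference, which calls for an optimal assignment algorithm. The tuple layout, the function name, and the reading of "10%" as relative to the longer of the two pages are assumptions.

```python
def greedy_match(treated, control, max_length_gap=0.10):
    """Greedily pair articles by propensity score under a length constraint.

    `treated` and `control` are lists of (article_id, propensity, length)
    tuples (hypothetical layout). A pair is admissible only if the page
    lengths differ by at most `max_length_gap` of the longer page.
    """
    candidates = []
    for t_id, t_score, t_len in treated:
        for c_id, c_score, c_len in control:
            if abs(t_len - c_len) <= max_length_gap * max(t_len, c_len):
                candidates.append((abs(t_score - c_score), t_id, c_id))

    candidates.sort()  # smallest propensity gap first
    used_treated, used_control, pairs = set(), set(), []
    for _, t_id, c_id in candidates:
        if t_id not in used_treated and c_id not in used_control:
            pairs.append((t_id, c_id))
            used_treated.add(t_id)
            used_control.add(c_id)
    return pairs


treated = [("T1", 0.80, 1000), ("T2", 0.55, 5000)]
control = [("C1", 0.78, 1050), ("C2", 0.50, 5200), ("C3", 0.80, 3000)]
pairs = greedy_match(treated, control)
```

Here C3 has the same propensity score as T1 but is three times as long, so the length constraint rules it out and T1 is paired with C1 instead.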
Results. Fig. 9 visualizes the average page-specific CTR for articles of low (yellow) and high (blue) quality as a function of article popularity. We can observe that the CTR of low-quality articles significantly surpasses that of high-quality articles across all levels of popularity. In interpreting this result, it is important to recall that page length is one of the most important features in ORES [20], the quality-scoring model we use here. As we control for page length, the gap observed in Fig. 9 may be attributed to the remaining features used by ORES, such as the presence of an infobox, the number of images, and the number of sections and subsections. We hence dedicate our next, final page-level analysis to estimating the impact of page length alone on page-specific CTR.

Effects of page length
In order to measure the effect of page length on CTR, we take a two-pronged approach, first via a cross-sectional study using propensity scores, and second via a longitudinal study.
Cross-sectional study. First, we conduct a matched study based on propensity scores analogous to Sec. 5.2, but now with page length as the treatment variable (using the longest and the shortest 40% of articles as treatment groups), and all other features (except quality) as observed covariates. Matching yields 683K pairs, and we again manually verify covariate balance across treatment groups.
The average page-specific CTR of short articles (0.68%) is more than double that of long articles (0.27%; p ≪ 0.001 in a two-tailed Mann-Whitney U test). Moreover, as seen in Fig. 10, this relative difference obtains across all levels of article popularity.
Longitudinal study. While in the above cross-sectional study propensity score matching ensures that the covariates of long vs. short articles are indistinguishable at the aggregate treatment group level, it does not necessarily do so at the pair level. Also, we did not include as observed covariates features describing the users who read the respective articles, and it might indeed be the case that users with a liking for short, niche articles also have a higher probability of clicking citations. In order to mitigate the danger of such remaining potential confounds and achieve even finer control, we now conduct a longitudinal study to assess how a variation in length of the same article impacts its CTR.
To do so, we select all articles that grew in length between October 2018 and April 2019, our two data collection periods (Sec. 3.2). To control for the effect of page popularity, which was observed to negatively correlate with CTR (Fig. 9 and 10), we assign a popularity level to each article by binning page view counts into deciles and discard articles whose popularity level has changed between the two periods. This way, we obtain a set of 120K articles with matched long and short revisions.
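The selection procedure can be sketched as follows. The sketch assumes per-article page view counts and page lengths for both periods are available as dictionaries. The rank-based decile assignment, the tie-breaking rule, and all names are illustrative assumptions rather than a description of the actual pipeline.

```python
from typing import Dict


def decile_levels(views: Dict[str, int]) -> Dict[str, int]:
    """Assign each article a popularity level 0..9 by rank-based deciles.

    Ties in view counts are broken by article name, which is an arbitrary
    choice made only to keep the sketch deterministic.
    """
    ordered = sorted(views, key=lambda article: (views[article], article))
    n = len(ordered)
    return {article: (rank * 10) // n for rank, article in enumerate(ordered)}


def stable_growing_articles(views_t1: Dict[str, int], views_t2: Dict[str, int],
                            length_t1: Dict[str, int],
                            length_t2: Dict[str, int]) -> Dict[str, float]:
    """Keep articles that grew and stayed in the same popularity decile.

    Returns a mapping from article to the length ratio (late / early).
    """
    levels_t1 = decile_levels(views_t1)
    levels_t2 = decile_levels(views_t2)
    kept = {}
    for article in views_t1.keys() & views_t2.keys():
        grew = length_t2[article] > length_t1[article]
        same_level = levels_t1[article] == levels_t2[article]
        if grew and same_level:
            kept[article] = length_t2[article] / length_t1[article]
    return kept
```

The returned length ratio is the quantity by which articles would subsequently be grouped when comparing the CTR of long and short revisions.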
By grouping these articles by the length ratio of their two revisions and plotting this ratio against the CTR for the long (purple) vs. short (green) versions (Fig. 11), we provide a further strong indicator that page length causally decreases the prevalence of citation clicking. According to a Mann-Whitney U test, the CTR difference between long and short revisions is statistically significant with p < 0.05 starting from a length increase of 17%, and with p < 0.01 from 31%. In addition, to verify that the effect is not confounded by a concomitant change in article popularity, the inset plot in Fig. 11 shows that the popularity indeed stays constant between revisions.

RQ3: LINK-LEVEL ANALYSIS OF CITATION INTERACTIONS
Our final research question asks which features of a specific reference predict if readers will engage with it. Note that this is different from RQ2 (Sec. 5), where we operated at the page level and did not differentiate between different references on the same page.

Predictors of reference clicks
We begin with a regression analysis to detect which features predict whether a reference will be clicked. We select all references with external links, and we carefully rule out a host of confounds by sampling pairs of clicked and unclicked references from the same page view, thus controlling for situational features such as the page, user, information need, etc. As we saw in Fig. 5, references at the top and bottom of pages are a priori more likely to be clicked. Thus, to exclude position as a confound and maximize the probability that the user saw both references in a pair, we pick as the unclicked reference in a pair the one that appears closest in the page to the clicked reference. To make sure we sample references associated with a sentence, we discard all footnotes in tables, infoboxes, and images, and keep only those within the article text. Finally, we again sample only one pair per session in order to avoid over-representing readers who are more prone to click on references. This process yields 1.8M reference pairs. As predictors we use the words in the sentence that cites the respective reference, as well as the words in the reference text (cf. Sec. 3.1), represented as binary indicators specifying for each of the 1K most frequent words whether the word appears in the sentence. 9 Using these features as predictors, we train a logistic regression to predict the binary click indicator.
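The pairing and featurization steps can be sketched as follows. The sketch assumes each reference on a page view is described by its position in the page and a click flag, and it tokenizes text by lowercasing and splitting on whitespace. The tokenization, the tie-breaking rule, and all names are assumptions made for illustration; the logistic regression itself is omitted.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def pick_pair(ref_positions: Sequence[int], clicked_flags: Sequence[bool],
              clicked_index: int) -> Optional[Tuple[int, int]]:
    """Pair a clicked reference with the nearest unclicked one on the page.

    Returns (clicked_index, unclicked_index), or None if every reference
    on the page view was clicked. Distance ties go to the earlier reference.
    """
    candidates = [i for i, flag in enumerate(clicked_flags) if not flag]
    if not candidates:
        return None
    target = ref_positions[clicked_index]
    nearest = min(candidates,
                  key=lambda i: (abs(ref_positions[i] - target), i))
    return clicked_index, nearest


def build_vocabulary(texts: Sequence[str], size: int = 1000) -> List[str]:
    """Return the `size` most frequent whitespace-separated tokens."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return [word for word, _ in counts.most_common(size)]


def indicator_vector(text: str, vocabulary: Sequence[str]) -> List[int]:
    """Binary indicators: 1 if the vocabulary word occurs in the text."""
    present = set(text.lower().split())
    return [1 if word in present else 0 for word in vocabulary]
```

Each sampled pair then contributes one positive and one negative example, with `indicator_vector` applied to the citing sentence and to the reference text.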
We perform this analysis on the full above-described dataset, as well as on subsets consisting only of page views from each of 4 broad categories (derived by aggregating the 44 WikiProjects categories from Sec. 3.4): "Culture" (1.3M pairs), "STEM" (436K), "Geography" (530K), and "History and Society" (467K). The model achieves a testing AUC of around 0.55 across these 5 settings.
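To make the AUC figure concrete, the following sketch computes AUC directly from its probabilistic definition: the chance that a randomly drawn clicked reference receives a higher model score than a randomly drawn unclicked one, with ties counted as one half. A value of 0.5 corresponds to random guessing. The function name and inputs are illustrative, and the quadratic loop is meant for exposition rather than for datasets of the size used here.

```python
from typing import Sequence


def auc(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """Probability that a positive example outscores a negative example."""
    wins = 0.0
    for p in scores_pos:
        for q in scores_neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(scores_pos) * len(scores_neg))
```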
Figure 12: Empath [14] topics most strongly (anti-)associated with citation events (cf. Sec. 6.2 for description). Panel (c): hover event (sentence text). Reference text not studied for hover event (Sec. 6.3) because unlikely to be visible to user before hovering.
The words with the largest and smallest coefficients are displayed in Table 1, where we observe that, for all article topics except for
On STEM-related pages, open-access references seem to receive more clicks than others, with words like "free" and "pdf" among the top predictors, whereas words related to traditionally closed-access libraries such as JSTOR appear among the negative predictors, in line with previous findings [58].

Topical correlates of reference clicks
For a higher-level view, we perform a topical analysis of citing sentences and reference texts, separately for the clicked vs. the unclicked references from the paired dataset of Sec. 6.1.
To extract topics, we use Empath [14], which comes with a pretrained model for labeling input text with a distribution over 200 wide-ranging topics. After applying the model to each data point, we compute the average topic distribution for clicked and unclicked references, respectively, and sort topics by the signed difference between their probability for clicked vs. unclicked references.
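The aggregation step can be sketched as follows, assuming each citing sentence or reference text has already been mapped to a topic distribution, represented here as a dictionary from topic name to probability. The topic model itself is not invoked, and the function names and toy topic labels are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

TopicDist = Dict[str, float]  # topic name -> probability (assumed layout)


def mean_distribution(rows: List[TopicDist]) -> TopicDist:
    """Average the per-text topic distributions of one group."""
    totals = defaultdict(float)
    for row in rows:
        for topic, prob in row.items():
            totals[topic] += prob
    return {topic: total / len(rows) for topic, total in totals.items()}


def rank_topics(clicked: List[TopicDist],
                unclicked: List[TopicDist]) -> List[Tuple[str, float]]:
    """Sort topics by signed difference: mean(clicked) - mean(unclicked)."""
    mean_clicked = mean_distribution(clicked)
    mean_unclicked = mean_distribution(unclicked)
    topics = set(mean_clicked) | set(mean_unclicked)
    diffs = {t: mean_clicked.get(t, 0.0) - mean_unclicked.get(t, 0.0)
             for t in topics}
    return sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)
```

The head of the returned list holds the topics most associated with clicked references, and the tail holds those most associated with unclicked ones.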
The topics with the largest positive and negative differences are listed in Fig. 12a and 12b for citing sentences and reference texts, respectively. The results corroborate those from Sec. 6.1, with human factors (wedding, family, sex, death) being more prominent among clicked references, whereas career-related topics such as competitions or achievements receive less attention. Among the most prominent topics for reference texts (Fig. 12b), topics related to technology and the Internet also emerge.

Predictors of footnote hovering
The analyses of Sec. 6.1 and 6.2 considered engagement via reference clicks. As we observed in Fig. 4, on desktop devices, hovering over a footnote to reveal the reference text in a tooltip is an even more common way to interact with references. We hence replicated the above analyses with the fnHover instead of the refClick event (8.7M reference pairs), with the only difference that we excluded words from reference texts as features, since the user is unlikely to have seen those words before hovering over the footnote. The results echo those of Sec. 6.1 and 6.2, so for space reasons we do not discuss the regression analysis for footnote hovering (cf. Sec. 6.1) and focus on the topical analysis instead (cf. Sec. 6.2). Inspecting Fig. 12c, we observe a stronger tendency of fnHover events, compared to refClick events, to be elicited by words that are related to both positive and negative emotions.

Predictors of reference clicks after hovering
Once a user hovers over a footnote (fnHover), the text of the corresponding reference is revealed in a so-called reference tooltip (Fig. 1). At this point, the user has the choice to either click through to the citation URL (refClick) or to stay on the article page. In the final analysis of the paper, we are interested in understanding what words in the reference text influence the user when making this decision.
We create a dataset by selecting the page loads with at least two footnote hover events, where one converted to a refClick (positive), whereas the other did not (negative). As in the previous studies, we selected at most one random pair per session, giving rise to a dataset of 440K pairs of hover events.
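The construction of this dataset can be sketched as follows. The sketch assumes each hover event is a record with a session identifier, a page load identifier, a reference identifier, and a flag indicating whether the hover converted to a click. These field names, the fixed random seed, and the function name are hypothetical and do not describe the actual logging schema.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def sample_hover_pairs(events: List[dict], seed: int = 0) -> Dict[str, Tuple[str, str]]:
    """Sample at most one (clicked, unclicked) hover pair per session.

    A pair is formed only within a single page load that contains at least
    one hover that converted to a click and one that did not.
    """
    rng = random.Random(seed)
    by_load = defaultdict(list)
    for event in events:
        by_load[(event["session_id"], event["page_load_id"])].append(event)

    candidates = defaultdict(list)  # session_id -> list of candidate pairs
    for (session_id, _), hovers in by_load.items():
        positives = [h for h in hovers if h["clicked"]]
        negatives = [h for h in hovers if not h["clicked"]]
        for pos in positives:
            for neg in negatives:
                candidates[session_id].append((pos["ref_id"], neg["ref_id"]))

    # One random pair per session avoids over-representing heavy users.
    return {session_id: rng.choice(pairs)
            for session_id, pairs in sorted(candidates.items())}
```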
Similar to the study in Sec. 6.1, we represent reference texts as 1K-dimensional word indicator vectors and use them as predictors in a logistic regression to predict refClick events (testing AUC 0.54).
The strongest coefficients are summarized in Table 2, painting a picture consistent with the previous analyses: readers, after seeing a reference preview via the tooltip, are more likely to click on the cited link when the reference text mentions social and life aspects ("wife", "baby", "instagram", etc.). The strongest negative coefficients suggest that readers tend to not click through to dictionary entries, book catalogs (ISBN, OCLC), and information in languages other than English: manual inspection revealed that "spanish" is mainly due to the note "In Spanish", "le" is the French article common in French newspaper names (e.g., Le Monde), and "die" is a German article.

DISCUSSION AND CONCLUSIONS
Our analysis provides important insights regarding the role of Wikipedia as a gateway to information on the Web. We found that in most cases Wikipedia is the final destination of a reader's journey: fewer than 1 in 300 page views lead to a citation click. In our analysis, we focused on the fraction of users who engage with references, and characterized how Wikipedia is used as a gateway to external knowledge. Our findings suggest the following.
• We engage with citations in Wikipedia when articles do not satisfy our information need. Sec. 5 showed that readers are more likely to click citations on shorter and lower-quality articles. Although this result seemed counter-intuitive at first, since higher-quality articles actually contain more references that could potentially be clicked, it is in line with the finding that citations to sources reporting atomic facts that are typically available in Wikipedia articles (e.g., awards, career paths) are also generally less engaging (Sec. 6). Collectively, these results suggest that readers are inclined to seek content beyond Wikipedia when the encyclopedia itself does not satisfy their information needs.
• Citations on less engaging articles are more engaging. Throughout Sec. 5, we found that citation click-through rates decrease with the popularity of an article. While this may follow from the previous point because long, high-quality articles tend to be more popular, it may also suggest that less popular articles are visited with a specific information need in mind. Previous work indeed suggests that popular articles are more likely to be viewed by users who are randomly exploring the encyclopedia [53].
• We engage with content about people's lives. We clearly saw that readers' interest is particularly high in references about people and their social and private lives (Sec. 6). This is especially true for hovers, a less cognitively demanding form of engagement with citations. Hover events are also more likely to be elicited by words that are related to emotions, both positive and negative.
• Recent content is more engaging. We found that references about recent events (whose text includes "2019") are more engaging, both in terms of hovering and clicking.
• Open content is more engaging. Finally, we saw that references in Wikipedia pages about science and technology, especially if they point to open-access sources (e.g., having "free" or "pdf" in the reference text), are also more likely to be clicked.
Theoretical implications. Our findings furnish novel insights about Web users and their information needs through the lens of the largest online encyclopedia. For the first time, by characterizing Wikipedia citation engagement, we are able to quantify the value of Wikipedia as a gateway to the broader Web. Our findings enable researchers to develop novel theories about readers' information needs and the possible barriers separating knowledge within and outside of the encyclopedia. Our research can also guide the broader community of Web contributors in prioritizing efforts towards improving information reliability: we found that people especially rely on cited sources when seeking information about recent events and biographies, which suggests that Web content in these areas should be especially well curated and verified. Finally, the fact that readers engage more with freely accessible sources highlights the importance of open access and open science initiatives.
Practical implications. Quantifying Wikipedia article completeness has proven to be a non-trivial task [45]. The notion that article completeness is highly related to readers' engagement with Wikipedia references opens up ideas for novel applications to help satisfy Web users' information needs, including models that quantify lack of information in an article by incorporating signals related to reference click-through rate. Our findings will also help prioritize areas of content to be checked for citation quality by Wikipedia editors: in areas of content where Wikipedia acts as a major gateway, the quality and reliability of sources that readers visit become even more crucial. Finally, the data we collected could empower a model that, given a sentence missing a citation (i.e., with a citation needed tag), could quantify how likely readers are to be interested in accessing the corresponding information and thereby help Wikipedia editors prioritize the backlog of unsolved missing-reference cases.
Limitations and future work. The overall low AUC (0.54 to 0.6) of the regression models (Sec. 5-6) emphasizes the inherent unpredictability of reader behavior. While the significantly above-chance performance renders the models useful for analyzing the impact of various predictors, their performance is currently too low to make them useful as practical predictive tools. Future work should hence invest in more powerful sequence models to improve accuracy.
By focusing on English Wikipedia only, the present analysis provides a limited view of the broader Wikipedia project, which is available in almost 300 languages and accessed by users all over the world. In our future work, we therefore plan to replicate this study for other language editions. So far, we also omitted any user characteristics from our study, such as more global behavioral traits beyond the page-view level, as well as geographic information, which are known to play an important role in user behavior [32,57]. Future work should incorporate such signals.
We will also investigate reader intents more closely. While click and hover logs reflect the extent to which readers are interested in knowing more about a given topic, they cannot tell us about the specific circumstances that led the user to engage by clicking or hovering, nor about the level of satisfaction achieved by following up on a reference. In the future, we plan to better understand these aspects via qualitative methods such as surveys and interviews.
Further, whereas our analysis focused on links in the References section of articles, future work should also study the role of other types of external links (cf. Fig. 1) in satisfying readers' information needs.
Finally, as exogenous events strongly affect Wikipedia users' information needs [53], future work should go beyond studying Wikipedia as an isolated platform and analyze how citation interaction patterns are warped by breaking news and events with uncertain information. This will sharpen our picture of Wikipedia as a gateway to global information.