Leveraging Context for Neural Question Generation in Open-domain Dialogue Systems

Question generation in open-domain dialogue systems is a challenging but less-explored task. It aims to enhance the interactivity and persistence of human-machine interactions. Previous work mainly focuses on question generation in the setting of single-turn dialogues, or investigates it as a data augmentation method for machine comprehension. We propose a Context-augmented Neural Question Generation (CNQG) model that leverages the conversational context to generate questions for promoting interactivity and persistence of multi-turn dialogues. More specifically, we formulate the task of question generation as a two-stage process. First, we employ an encoder-decoder framework to predict a question pattern, which denotes a set of representative interrogatives, and identify the potential topics from the conversational context by employing point-wise mutual information. Then, we generate the question by decoding the concatenation of the current dialogue utterance, the pattern, and the topics with an attention mechanism. To the best of our knowledge, ours is the first work on question generation in multi-turn open-domain dialogue systems. Our experimental results on two publicly available multi-turn conversation datasets show that CNQG outperforms the state-of-the-art baselines in terms of BLEU-1, BLEU-2, Distinct-1 and Distinct-2. In addition, we find that CNQG allows one to efficiently distill useful features from long contexts, and maintain robust effectiveness even for short contexts.


INTRODUCTION
Question Generation (QG) aims to generate a relevant question for a given input. It has been used to automatically create large-scale training data for machine reading comprehension [20] and question answering [17,22]. In the field of open-domain dialogue systems, question generation, also known as learning to ask, serves as an essential communication skill to help solicit feedback from users and to extend current conversational topics or start new ones, which can enhance the interactivity and persistence of dialogues [26].
Most previous work on question generation uses neural methods that adopt a sequence-to-sequence framework (Seq2Seq, also called encoder-decoder framework) [10,21]. For instance, Serban et al. [17] apply a Seq2Seq framework to generate factoid questions from a structured knowledge base. Du et al. [4] change the modality of the input data and generate questions based on given text passages and answers, which has inspired follow-up work that includes [5,20,22]. In addition, Mostafazadeh et al. [14] focus on the novel task of visual question generation (VQG) that involves generating a natural question for a given image. However, QG in open-domain dialogue systems is still challenging. First, the main purpose of QG is to achieve interactive and persistent dialogues [26], which is substantially different from the traditional QG tasks, where questions are generated to enhance machine comprehension and usually can be answered by the given input. In addition, colloquial and short texts in conversational corpora are often creative in the expressions they use and semantically ambiguous, which increases the difficulty of QG. For instance, a phrase like "I don't know" frequently occurs in dialogues [6], which often has a negative impact on the informativeness and diversity of generated questions [11,27].
To address the above issue, Li et al. [12] explore how a chatbot can return an appropriate answer by asking questions in a movie-specific domain. However, this model resorts to a specific knowledge base, which restricts their solution to dialogues in a closed domain. Wang et al. [26] focus on QG in the setting of open-domain dialogue systems and propose a typed decoder based on a Seq2Seq architecture, which takes the latest dialogue utterance as input. Their solution ignores the previous conversational context, which limits its application to single-turn dialogues. For multi-turn conversations, generating a random or free-style question without considering its conversational context is not useful for enhancing interactivity and persistence of the dialogue. Hence, we argue that a good question should contain relevant topics that may appear in previous utterances [24].
In this paper, we investigate the task of Question Generation (QG) in the setting of multi-turn open-domain dialogue systems and propose a Context-augmented Neural Question Generation (CNQG) model that leverages the conversational context in dialogues to generate appropriate and informative questions. In particular, we formulate the question generation task as a two-stage process that is implemented within an encoder-decoder framework: (1) We first encode the latest dialogue utterance, referred to as the "post," into a hidden vector representation, which is then used to predict the question pattern that denotes a set of representative interrogatives. We then use point-wise mutual information to identify the topics from the preceding conversational context as well as the post. (2) We employ an encoder-decoder framework with an attention mechanism to generate the final question by decoding the concatenation of the post, the pattern and the topics as input. To evaluate the performance of CNQG, we conduct experiments on two publicly available benchmark datasets, i.e., the DailyDialog dataset and the Cornell Movie-Dialog dataset ("Cornell" for short), which are both collections of multi-turn dialogues extracted from human-to-human conversations. Experimental results show that CNQG outperforms the state-of-the-art baselines in terms of BLEU-1, BLEU-2, Distinct-1 and Distinct-2, which demonstrates its effectiveness at generating appropriate and informative questions. In addition, we find that CNQG can efficiently avoid interference from long contexts so as to prevent digressions, and maintain robust effectiveness even for short contexts, which are usually ambiguous.
The main contributions of our work are the following.
• To the best of our knowledge, ours is the first work on question generation in multi-turn open-domain dialogue systems. We leverage the conversational context to generate appropriate and informative questions.
• We propose a context-augmented neural question generation model (CNQG) that models question generation as a two-stage process and follows an encoder-decoder framework to generate questions.
• We analyze the effectiveness of CNQG on two conversational datasets and find that it significantly beats the state-of-the-art baselines in terms of BLEU-1 and BLEU-2.

APPROACH
We provide a high-level overview of the Context-augmented Neural Question Generation (CNQG) model in Fig. 1. CNQG consists of four main components, i.e., a post encoder (§2.1), a pattern predictor (§2.2), a topic identifier (§2.3), and a question generator (§2.4). We first detail the task of question generation in multi-turn open-domain dialogue systems. We take a $d$-turn ($d \geq 3$) dialogue session as a sequence $\{U_1, \ldots, U_d\}$, which is then represented by a triple $(C, X, Y)$, where $C$ denotes the conversational context consisting of $d-2$ utterances $\{U_1, \ldots, U_{d-2}\}$, $X$ is the post $U_{d-1}$, and $Y$ represents the target question $U_d$. The purpose of question generation in multi-turn open-domain dialogue systems is to compute the probability $P(Y \mid X, C)$ of generating a question $Y$ given the conversational context $C$ and post $X$.
We assume that the target question $Y$ consists of a sequence of $T$ words, i.e., $Y = (y_1, \ldots, y_T)$, and is an implicit combination of a question pattern $Z$ and topics $K$. The question pattern $Z = (z_1, \ldots, z_L)$ comprises $L$ representative interrogatives. The topics $K = \{k_1, \ldots, k_M\}$ are a set of words semantically related to the post and the conversational context. Thus, question generation can be regarded as a two-stage process. First, use the post to predict the question pattern $P(Z \mid X)$ and leverage the post plus the context to obtain the question topics $P(K \mid X, C)$. Second, decode the concatenation of the post, pattern and topics to generate the final question word-by-word as

$$P(Y \mid X, Z, K) = \prod_{t=1}^{T} P(y_t \mid X, Z, K, y_{<t}), \tag{1}$$

where $y_t$ is the word to be generated at the $t$-th step, and $y_{<t}$ represents the previously generated words before the $t$-th step.

Post encoding
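Given an $N$-length post $X = (x_1, \ldots, x_N)$, we use a GRU-based encoder to convert the post sentence into a sequence of hidden vectors as:

$$h^X_{n+1} = \mathrm{GRU}(h^X_n, e_{x_{n+1}}), \tag{2}$$

where $0 \leq n < N$ and $e_{x_{n+1}}$ is the embedding of word $x_{n+1}$. The GRU is parameterized as follows:

$$\begin{aligned} z &= \sigma_g(W_z x_{n+1} + U_z h_n), \\ r &= \sigma_g(W_r x_{n+1} + U_r h_n), \\ s &= \sigma_h(W_s x_{n+1} + U_s (r \circ h_n)), \\ h_{n+1} &= (1 - z) \circ h_n + z \circ s, \end{aligned} \tag{3}$$

where $x_{n+1}$ is the input vector and is assigned as $e_{x_{n+1}}$ here; $z$ and $r$ are the update gate vector and reset gate vector, respectively; $W_z$, $U_z$, $W_r$, $U_r$, $W_s$, $U_s$ are the weight matrices; $\circ$ represents the operation of element-wise multiplication; $\sigma_g$ and $\sigma_h$ are the activation functions.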
We convert a post $X$ into a sequence of hidden states $(h^X_1, \ldots, h^X_N)$, which is fed to the following decoders for pattern prediction and question generation, respectively.
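As a rough illustration of the post encoder, the following NumPy sketch runs a single-layer, bias-free GRU over a sequence of word embeddings. The gate-interpolation convention, the choice of sigmoid and tanh for the two activation functions, the random initialization and the tiny dimensions are all assumptions made for the example; it is not the authors' implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class GRUCell:
    """Minimal bias-free GRU cell (assumed: sigma_g = sigmoid, sigma_h = tanh)."""

    def __init__(self, d_in, d_h, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.normal(scale=0.1, size=shape)
        self.W_z, self.U_z = init(d_h, d_in), init(d_h, d_h)
        self.W_r, self.U_r = init(d_h, d_in), init(d_h, d_h)
        self.W_s, self.U_s = init(d_h, d_in), init(d_h, d_h)

    def step(self, h, x):
        z = sigmoid(self.W_z @ x + self.U_z @ h)         # update gate
        r = sigmoid(self.W_r @ x + self.U_r @ h)         # reset gate
        s = np.tanh(self.W_s @ x + self.U_s @ (r * h))   # candidate state
        return (1.0 - z) * h + z * s                     # assumed interpolation convention

def encode_post(cell, embeddings):
    """Return the hidden states h_1..h_N for one post, starting from a zero state."""
    h = np.zeros(cell.U_z.shape[0])
    states = []
    for e in embeddings:
        h = cell.step(h, e)
        states.append(h)
    return np.stack(states)

cell = GRUCell(d_in=4, d_h=3)
post_embeddings = np.random.default_rng(1).normal(size=(5, 4))  # toy post with N = 5 words
H = encode_post(cell, post_embeddings)
print(H.shape)
```

The matrix `H` holds one hidden state per word of the post, which is what both decoders attend over.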

Pattern prediction
Most naturally occurring questions in human conversations feature one of a small set of interrogatives [5]. For instance, a question "What is your nationality?" features the interrogative what. Following [5], we identify 8 types of question pattern: yes/no, what, why, how, who, where, when and which. Each pattern is expressed by one or several interrogatives, e.g., the pattern who has the interrogatives who, whose, whom.

Figure 1: Overview of the Context-augmented Neural Question Generation (CNQG) framework.
We first collect commonly used interrogatives to construct a pattern vocabulary, then adopt an attention-augmented encoder-decoder framework to generate question pattern-related interrogatives $Z = (z_1, \ldots, z_L)$, as follows:

$$P(Z \mid X) = \prod_{l=1}^{L} P(z_l \mid z_{<l}, X).$$

The word probability distribution at each decoding position is computed as follows:

$$P(z_l = v \mid z_{<l}, X) = g(s^Z_l)_v,$$

where $v$ is a word from the pattern vocabulary $V_Z$; $g$ is the projection function implemented by a fully-connected layer with a softmax function; $s^Z_l$ is the pattern decoder hidden state at step $l$ and is computed as:

$$s^Z_l = \mathrm{GRU}(s^Z_{l-1}, [c^Z_l, e_{z_{l-1}}]),$$

where the GRU for pattern prediction is similar to Eq. 3 but has different parameters; $[c^Z_l, e_{z_{l-1}}]$ denotes a concatenation of $e_{z_{l-1}}$ and $c^Z_l$, where $e_{z_{l-1}}$ is the embedding of word $z_{l-1}$; $c^Z_l$ is a weighted mixture vector computed by attentively reading the output (see Eq. 2) of the post encoder as:

$$c^Z_l = \sum_{n=1}^{N} \alpha^Z_{ln} h^X_n,$$

where the weight $\alpha^Z_{ln}$ is defined by

$$\alpha^Z_{ln} = \frac{\exp(\eta(s^Z_{l-1}, h^X_n))}{\sum_{n'=1}^{N} \exp(\eta(s^Z_{l-1}, h^X_{n'}))}.$$

Here, $\eta$ is implemented by a multi-layer perceptron model with tanh as the activation function.
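The attentive read over the post encoder output can be sketched as follows. The scoring function is written as a one-hidden-layer perceptron with tanh, in the style of additive attention; the parameter names `W_a`, `U_a` and `v`, the shapes and the random inputs are hypothetical and only serve the example.

```python
import numpy as np

def additive_attention(s_prev, memory, W_a, U_a, v):
    """Score each memory row against the previous decoder state.

    s_prev: (d_s,); memory: (N, d_h); W_a: (d_a, d_s); U_a: (d_a, d_h); v: (d_a,).
    Returns the attention weights (N,) and the context vector (d_h,).
    """
    scores = np.tanh(s_prev @ W_a.T + memory @ U_a.T) @ v  # eta(s_prev, h_n) for every n
    scores = scores - scores.max()                          # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()         # softmax over positions
    context = weights @ memory                              # weighted mixture of rows
    return weights, context

rng = np.random.default_rng(0)
N, d_h, d_s, d_a = 5, 3, 3, 4
memory = rng.normal(size=(N, d_h))            # stands in for the post encoder states
weights, context = additive_attention(
    rng.normal(size=d_s), memory,
    rng.normal(size=(d_a, d_s)), rng.normal(size=(d_a, d_h)), rng.normal(size=d_a),
)
print(weights.sum(), context.shape)
```

The same routine could serve the question decoder by passing a longer memory that also contains the pattern and topic vectors.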

Context-augmented topic identification
To maintain consistency with previous utterances in a given dialogue, we propose a topic identification scheme to find the potential topics in the conversational context as well as the post. We first locate the nouns and verbs from the context as well as the post, and then identify their topics using point-wise mutual information (PMI) [2] matrices. PMI is often used to measure similarity between two items and previous studies [7,25,26] have shown its effectiveness in natural language processing. Instead of using the traditional PMI, we introduce a part-of-speech (POS) feature to guide the computation process and obtain two POS-based PMI matrices, i.e., one corresponding to the noun-type PMI and the other corresponding to the verb-type PMI. More specifically, we first apply POS tagging to identify nouns and verbs in the context as well as in the post; we refer to nouns and verbs in the context and post as triggers, and to those in the ground-truth questions as targets. Then, the PMI scores of pairs of trigger and target nouns and of pairs of trigger and target verbs are calculated as follows:

$$\mathrm{PMI}(w_1, w_2) = \log \frac{p_{\langle trigger, target \rangle}(w_1, w_2)}{p_{trigger}(w_1)\, p_{target}(w_2)},$$

where $p_{\langle trigger, target \rangle}(w_1, w_2)$ is the co-occurrence probability of $w_1$ occurring in triggers and $w_2$ occurring in targets, simultaneously; $p_{trigger}(w_1)$ and $p_{target}(w_2)$ denote the independent probabilities of $w_1$ occurring as a trigger and $w_2$ as a target, respectively. Given a post $X$ and a conversational context $C$, we determine the relevance score for a word $k_m$ (a "topic") as a sum of its PMI scores:

$$\mathrm{Rel}(k_m, X, C) = \sum_{i=1}^{N_{noun}} \mathrm{PMI}_{noun}(w_i, k_m) + \sum_{j=1}^{N_{verb}} \mathrm{PMI}_{verb}(w_j, k_m),$$

where $w_i$ ranges over nouns from the $N_{noun}$-length noun set extracted from the post and the context and $w_j$ is a verb from the $N_{verb}$-length verb set. Finally, we select the top-$M$ words $k_1, \ldots, k_M$ with the highest relevance scores as the topics for the given post $X$ and context $C$.
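The topic identification step can be sketched in plain Python. The sketch assumes that probabilities are estimated as sample-level relative frequencies, that unseen trigger/target pairs contribute a score of zero, and that POS tagging has already split words into nouns and verbs; the toy data and the function names `build_pmi` and `top_topics` are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def build_pmi(samples):
    """samples: list of (trigger_words, target_words) pairs for one POS type."""
    n = len(samples)
    trig, targ, pair = Counter(), Counter(), Counter()
    for triggers, targets in samples:
        trig.update(set(triggers))
        targ.update(set(targets))
        pair.update(product(set(triggers), set(targets)))
    return {
        (w1, w2): math.log((c / n) / ((trig[w1] / n) * (targ[w2] / n)))
        for (w1, w2), c in pair.items()
    }

def top_topics(pmi_noun, pmi_verb, nouns, verbs, candidates, m):
    """Rank candidate topic words by their summed PMI with the trigger nouns and verbs."""
    def relevance(k):
        return (sum(pmi_noun.get((w, k), 0.0) for w in nouns)     # unseen pairs score 0 (assumption)
                + sum(pmi_verb.get((w, k), 0.0) for w in verbs))
    return sorted(candidates, key=relevance, reverse=True)[:m]

# Toy training pairs: (nouns or verbs in context+post, nouns or verbs in the question).
noun_samples = [(["job", "office"], ["salary"]), (["job"], ["salary", "boss"]),
                (["soda"], ["beer"]), (["soda", "job"], ["beer"])]
verb_samples = [(["drink"], ["drink"]), (["work"], ["pay"])]
pmi_noun, pmi_verb = build_pmi(noun_samples), build_pmi(verb_samples)

topics = top_topics(pmi_noun, pmi_verb, nouns=["soda"], verbs=["drink"],
                    candidates=["beer", "salary", "boss", "drink", "pay"], m=2)
print(topics)
```

In this toy setup the trigger noun "soda" pulls in "beer" and the trigger verb "drink" pulls in "drink", since those are the only candidates with positive PMI.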

Question generation
The question decoder is similar to the pattern decoder and takes a vector as input and generates the question word-by-word with an attention mechanism. Here, the input to the question decoder is a concatenated vector $\Psi$ of three sources, namely the post, the pattern and the topics, which is obtained as follows:

$$\Psi = [h^X_1, \ldots, h^X_N, p_{z_1}, \ldots, p_{z_L}, t_{k_1}, \ldots, t_{k_M}], \qquad p_{z_l} = W e_{z_l}, \qquad t_{k_m} = W e_{k_m},$$

where $p_{z_l}$ and $t_{k_m}$ are the transformed vectors; $e_{z_l}$ and $e_{k_m}$ are the embeddings of the generated interrogative $z_l$ (see §2.2) and identified topic $k_m$ (see §2.3), respectively; $W \in \mathbb{R}^{d_h \times d_e}$ is used to transform the embedding vectors (e.g., $e_{z_l}$ and $e_{k_m}$). Given the concatenated vector $\Psi$, the GRU for question generation has a structure similar to Eq. 3, but is assigned as:

$$s^Y_t = \mathrm{GRU}(s^Y_{t-1}, [c^Y_t, e_{y_{t-1}}]),$$

where $[c^Y_t, e_{y_{t-1}}]$ is the concatenation of $e_{y_{t-1}}$ and $c^Y_t$; $e_{y_{t-1}}$ is the embedding of the generated word at step $t-1$; $c^Y_t$ is a weighted sum vector obtained from the attention mechanism as follows:

$$c^Y_t = \sum_{i=1}^{N+L+M} \alpha^Y_{ti} \varphi_i,$$

where $\varphi_i \in \Psi$ and the weight coefficient $\alpha^Y_{ti}$ is computed as

$$\alpha^Y_{ti} = \frac{\exp(\eta(s^Y_{t-1}, \varphi_i))}{\sum_{i'=1}^{N+L+M} \exp(\eta(s^Y_{t-1}, \varphi_{i'}))}.$$

Here, $\eta$ is defined similarly to the one in the question pattern decoder. The probability $P(y_t \mid X, Z, K, y_{<t})$ of word $y_t$ is obtained as follows:

$$P(y_t = w \mid X, Z, K, y_{<t}) = g(s^Y_t)_w,$$

where $w$ is a word from the pre-defined vocabulary $V_Y$.
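A minimal sketch of how the decoder input might be assembled: the post encoder states are stacked with the linearly transformed pattern and topic embeddings so that the attention mechanism can read all three sources through one memory. The purely linear transform, the function name `build_decoder_memory` and the toy dimensions are assumptions for illustration.

```python
import numpy as np

def build_decoder_memory(H_post, E_pattern, E_topics, W):
    """Stack post states with transformed pattern and topic embeddings.

    H_post: (N, d_h); E_pattern: (L, d_e); E_topics: (M, d_e); W: (d_h, d_e).
    Returns Psi with shape (N + L + M, d_h).
    """
    P = E_pattern @ W.T   # p_{z_l} = W e_{z_l}, one row per interrogative
    T = E_topics @ W.T    # t_{k_m} = W e_{k_m}, one row per topic
    return np.concatenate([H_post, P, T], axis=0)

rng = np.random.default_rng(0)
N, L, M, d_h, d_e = 5, 2, 3, 4, 6
H_post = rng.normal(size=(N, d_h))
E_pattern = rng.normal(size=(L, d_e))
E_topics = rng.normal(size=(M, d_e))
W = rng.normal(size=(d_h, d_e))

Psi = build_decoder_memory(H_post, E_pattern, E_topics, W)
print(Psi.shape)
```

Projecting the embeddings to the hidden size is what allows rows from the three sources to share one attention scoring function.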
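In the training phase, the proposed model is trained by minimizing the negative log-likelihood of the training question $Y$, where the loss function $L_\theta = L_\theta(Z) + L_\theta(Y)$ has two components:

$$L_\theta(Z) = -\sum_{l=1}^{L} \log P(z_l \mid z_{<l}, X), \qquad L_\theta(Y) = -\sum_{t=1}^{T} \log P(y_t \mid X, Z, K, y_{<t}),$$

where $L_\theta(Z)$ and $L_\theta(Y)$ are losses from the pattern decoder and the question decoder, respectively; $\theta$ denotes the parameter set.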
Here, L θ (Z ) provides an additional supervised signal for pattern prediction.
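The two-part training objective can be illustrated with a small pure-Python sketch. It assumes each decoder exposes, for every step, a probability distribution over its vocabulary as a dictionary; the toy distributions and the function names are hypothetical.

```python
import math

def sequence_nll(step_distributions, target_tokens):
    """Negative log-likelihood of a target sequence under per-step distributions."""
    return -sum(math.log(dist[tok]) for dist, tok in zip(step_distributions, target_tokens))

def joint_loss(pattern_dists, pattern, question_dists, question):
    """Sum of the pattern decoder loss and the question decoder loss."""
    return sequence_nll(pattern_dists, pattern) + sequence_nll(question_dists, question)

# Toy example: one interrogative, two question words.
pattern_dists = [{"what": 0.5, "how": 0.5}]
question_dists = [{"what": 0.25, "how": 0.75}, {"job": 0.5, "name": 0.5}]
loss = joint_loss(pattern_dists, ["what"], question_dists, ["what", "job"])
print(loss)
```

Because the two terms are simply added, a poorly predicted pattern raises the total loss even when the question words themselves are likely.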

EXPERIMENTAL SETUP
In this section, we detail our experimental setup. We focus on three research questions. (RQ1) Does CNQG outperform competitive baselines on question generation? (RQ2) How does CNQG perform on predicting question patterns? (RQ3) What is the impact of context length in our model on question generation?

Datasets
We conduct experiments on two multi-turn conversational datasets, i.e., the DailyDialog dataset [13] and the Cornell Movie-Dialog dataset [3]. DailyDialog is collected from human-to-human talks in daily life. It contains 11,318 human-written dialog sessions and covers various topics such as culture, education, tourism and health. Cornell is extracted from movie scripts and includes 220,579 conversational exchanges between 10,292 pairs of movie characters.
To train CNQG, we perform several pre-processing steps on the raw text. We first generate triples (C, X , Y ), i.e., three-turn dialogues between two interlocutors where C is the context, X is the post, and Y is the target response. Then, with the help of hand-crafted rules, we pick triples where the response is in the form of a question. These rules include the presence of a question mark and a list of interrogatives. We identify the pattern for each question based on the classification method proposed in [5]. Finally, we obtain 28,769 triples from DailyDialog and 49,689 triples from Cornell; for each dataset, 2,000 triples are randomly selected for validation and another 2,000 for testing; the remainder is used for training. The statistics of the datasets we use are shown in Table 1. Clearly, the distributions of different patterns are quite unbalanced. Moreover, both datasets feature a broad range of context lengths.
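The triple extraction and question filtering described above could look roughly like the sketch below. The interrogative list is a short illustrative stand-in for the hand-crafted rules, the two rules are combined with a logical OR, and every preceding utterance is kept as context, following the task definition; all of these are assumptions rather than the exact pre-processing used in the experiments.

```python
# Illustrative list only; the actual hand-crafted rule set is not reproduced here.
INTERROGATIVES = {"what", "why", "how", "who", "whose", "whom", "where", "when",
                  "which", "do", "does", "did", "is", "are", "can", "could"}

def is_question(utterance):
    """Heuristic: ends with a question mark or starts with an interrogative."""
    tokens = utterance.lower().rstrip("?!. ").split()
    return utterance.strip().endswith("?") or (bool(tokens) and tokens[0] in INTERROGATIVES)

def extract_triples(session):
    """Turn a list of utterances into (context, post, question) triples."""
    triples = []
    for d in range(2, len(session)):
        if is_question(session[d]):
            # context = all utterances before the post; post = utterance right before the question
            triples.append((session[:d - 1], session[d - 1], session[d]))
    return triples

session = ["I lost my job.", "I am sorry to hear that.", "What will you do now?",
           "I do not know.", "Have you looked online?"]
triples = extract_triples(session)
for context, post, question in triples:
    print(len(context), "|", post, "|", question)
```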

Baselines and metrics
Baselines.
We compare the performance of CNQG against three state-of-the-art baselines for question generation: (1) NQG [4]: an attention-based sequence-to-sequence learning model that encodes sentences from a text passage to generate a question. Similar approaches can be found in [20,26]. Here, we set the post as the input sentence.

Metrics.
Following [4,5,17,20,26], we adopt four metrics to evaluate the performance of CNQG and the baselines, i.e., BLEU-1 [15], BLEU-2 [15], Distinct-1 [11], Distinct-2 [11]. BLEU-1 and BLEU-2 are the most frequently used metrics for question generation; they measure the word-overlap between the generated question and the ground-truth. A higher BLEU score indicates that the generated question is closer to the ground-truth. Distinct-1 and Distinct-2 respectively evaluate the number of distinct unigrams and bigrams in the generated questions, which are often used to measure the questions in terms of sentence diversity.
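To make the metrics concrete, the sketch below computes Distinct-n as the ratio of distinct n-grams to all generated n-grams, and the clipped n-gram precision that forms the core of BLEU. The normalization of Distinct-n by the total n-gram count, the whitespace tokenization and the omission of BLEU's brevity penalty are simplifying assumptions; the scores in the experiments may have been computed with a different toolkit.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def distinct_n(sentences, n):
    """Distinct n-grams divided by total n-grams over all generated sentences."""
    all_ngrams = [g for s in sentences for g in ngrams(s.split(), n)]
    return len(set(all_ngrams)) / len(all_ngrams) if all_ngrams else 0.0

def clipped_precision(hypothesis, reference, n):
    """Modified n-gram precision (no brevity penalty), the building block of BLEU-n."""
    hyp = Counter(ngrams(hypothesis.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(hyp.values())
    matched = sum(min(count, ref[g]) for g, count in hyp.items())
    return matched / total if total else 0.0

generated = ["what is your job", "what is your name"]
print(distinct_n(generated, 1), distinct_n(generated, 2))
print(clipped_precision("what is your job", "what is your new job", 2))
```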

Implementation details
In our experiments, we manually collect 36 interrogatives as the pattern vocabulary. We adopt the NLTK tool for POS tagging and lemmatization. In total, 30 topics are identified for each dialogue. Like [4,20], the word embedding is initialized by pre-trained GloVe 6B word vectors with 300 dimensions. We use the original vocabulary consisting of 16,578 unique words in DailyDialog for decoding and choose the 20,000 most common words as our vocabulary for Cornell. All out-of-vocabulary words are replaced by the symbol ⟨UNK⟩. The GRU unit has a 1-layer structure with 512 hidden cells. The parameters of the CNQG model are updated by the Adam optimizer [8] with gradient clipping. We train all models for at most 20 epochs. The learning rate is set to 0.002 and the mini-batch size is fixed to 64. We use the Bahdanau attention mechanism [1] for decoding.

RESULTS AND DISCUSSION

Performance on question generation
To answer RQ1, we investigate the appropriateness and informativeness of the questions generated by CNQG and the baselines in terms of BLEU-1, BLEU-2, Distinct-1 and Distinct-2. We also use a significance test for the difference between the performance of CNQG and the performance of the best performing baseline in terms of BLEU-1 and BLEU-2. The results are presented in Table 2.
In general, CNQG consistently achieves the best performance on both datasets in terms of all metrics, which demonstrates its effectiveness for generating appropriate and informative questions. In particular, the improvements of CNQG over the best performing baseline in terms of BLEU-1 and BLEU-2 are statistically significant. On the DailyDialog dataset, context-sensitive approaches, like DCGM-I, HRED and CNQG, achieve clearly higher Distinct scores than NQG, which indicates that the conversational context helps to generate different words and leads to more informative questions in dialogues. However, DCGM-I and HRED achieve very different results in terms of BLEU scores. Many questions generated by HRED are logically reasonable but quite different from the ground-truth, which may explain its poor performance in terms of BLEU scores. We can observe similar results on Cornell. However, for all discussed models, the performance on Cornell in terms of the Distinct scores is worse than on DailyDialog. This may be attributed to the fact that the sentences in Cornell tend to have more uninformative expressions than in DailyDialog, which makes it harder to generate informative and diverse questions [11].

Table 2: Performance of different question generation models. The results produced by the best baseline and the best performing model in each column are underlined and boldfaced, respectively; * denotes significantly better than the best baseline in a paired t-test (p ≤ 0.01); "DD" is short for DailyDialog.
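For readers unfamiliar with the significance test mentioned in the caption of Table 2, the following sketch computes the paired t statistic from per-sample scores of two systems. It assumes the pairing is done per test sample and only returns the statistic; turning it into a p-value additionally requires the t distribution with n − 1 degrees of freedom. The scores are made up for illustration.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """t statistic of the paired differences between two systems on the same samples."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same samples")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical per-sample scores for two systems on three test samples.
system_a = [0.5, 0.6, 0.7]
system_b = [0.4, 0.4, 0.4]
t = paired_t_statistic(system_a, system_b)
print(round(t, 4))
```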

Performance on pattern prediction
To answer RQ2, we zoom in on a comparison between CNQG and the baselines in terms of variety and consistency with the ground-truth of generated question patterns. On the test sets, we calculate the question quantity of each pattern for various models as well as for the ground-truth. The results are plotted in Fig. 2.
As shown in Fig. 2, focusing on the pattern variety of generated questions, we see that CNQG and HRED generate more diverse patterns than DCGM-I and NQG. Especially for infrequent patterns like when, why, where, who and which, DCGM-I and NQG fail to generate those patterns; they are restricted to a single pattern, for instance, NQG only generates what patterns on DailyDialog. As for the consistency with the ground-truth, CNQG covers almost all varieties of patterns that exist in the ground-truth, while HRED generates many what patterns, with large gaps on Cornell. On both datasets, NQG and DCGM-I lack many patterns that are present in the ground-truth. In addition, for CNQG, HRED and DCGM-I, we can find some instances of the others pattern that are not identified as questions. By manual inspection, we also found that most of these instances actually correspond to the yes/no pattern, which has the most ambiguous interrogatives. For instance, a generated question like "you have a company?" does not have any explicit interrogatives, so it is hard to identify its pattern automatically.
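The per-pattern counts behind this comparison can be approximated with a simple rule-based classifier, sketched below. Only the interrogatives of the who pattern are given in the text; the remaining word lists, the rule that yes/no questions must start with an auxiliary, and the fallback label "others" are assumptions for illustration and do not reproduce the classification method of [5].

```python
from collections import Counter

# Wh-words mapped to their pattern; "who" covers who, whose, whom as in the text.
WH_WORD_TO_PATTERN = {"what": "what", "why": "why", "how": "how", "who": "who",
                      "whose": "who", "whom": "who", "where": "where",
                      "when": "when", "which": "which"}
# Assumed auxiliaries that open a yes/no question.
YES_NO_AUXILIARIES = {"do", "does", "did", "is", "are", "was", "were",
                      "can", "could", "will", "would", "have", "has"}

def classify_pattern(question):
    tokens = [t.strip(",.?!") for t in question.lower().split()]
    for token in tokens:                      # the first wh-word decides the pattern
        if token in WH_WORD_TO_PATTERN:
            return WH_WORD_TO_PATTERN[token]
    if tokens and tokens[0] in YES_NO_AUXILIARIES:
        return "yes/no"
    return "others"                           # no explicit interrogative found

questions = ["What is your nationality?", "Do you like beer?",
             "Whose car is this?", "you have a company?"]
distribution = Counter(classify_pattern(q) for q in questions)
print(distribution)
```

The last example shows the failure mode discussed above: without an explicit interrogative in initial position, a yes/no question falls into the others bucket.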

Impact of context length
To answer RQ3, we analyze the performance of CNQG and the context-sensitive baselines, i.e., DCGM-I and HRED, on test samples with varying context lengths (measured in number of words). For brevity, we only present our experimental results on the DailyDialog dataset as qualitatively similar phenomena can be found on the Cornell dataset.
We split the test samples into groups according to their context length and present the distribution of test samples by context length in Table 3. The majority of the test samples have a short context of fewer than 20 words, which is more likely to be ambiguous. Next, we evaluate the model performance in terms of BLEU-1, BLEU-2, Distinct-1 and Distinct-2, respectively, and plot the results in Fig. 3. In general, CNQG outperforms the baselines at every context length in terms of all metrics, except for Distinct-2 at lengths of more than 30 words, which confirms the robustness of CNQG across different context lengths. In particular, for contexts of fewer than 10 words, CNQG outperforms the baselines by a larger margin than for other lengths, demonstrating its effectiveness for short contexts. DCGM-I is the best baseline in terms of BLEU scores while HRED is the best in terms of Distinct scores; this is consistent with their overall performance shown in Table 2. Additionally, as the context length increases, all models show an increase in terms of the Distinct scores and a decrease in terms of BLEU scores. This indicates that it is increasingly hard for question generation to balance sentence diversity and similarity to the ground-truth when the context length grows, since a long context may introduce various topics while injecting noise at the same time. CNQG uses the semantic information contained in long contexts to achieve high Distinct scores, while it manages to filter out diverging topics so as to maintain a good performance in terms of BLEU scores.
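Grouping test samples by context length can be sketched as follows. The bucket boundaries of 10, 20 and 30 words are inferred from the lengths mentioned in the discussion and may not match the exact groups of Table 3; the whitespace word count and the toy samples are likewise assumptions.

```python
from bisect import bisect_right
from collections import defaultdict

def context_length(context):
    """Number of whitespace-separated words over all context utterances."""
    return sum(len(utterance.split()) for utterance in context)

def group_by_context_length(samples, boundaries=(10, 20, 30)):
    """Map bucket index -> samples; bucket 0 is <10 words, 1 is 10-19, 2 is 20-29, 3 is >=30."""
    groups = defaultdict(list)
    for context, post, question in samples:
        bucket = bisect_right(boundaries, context_length(context))
        groups[bucket].append((context, post, question))
    return dict(groups)

samples = [
    (["i lost my job"], "i do not know.", "what will you do ?"),
    (["one two three four five six", "seven eight nine ten eleven twelve"], "ok", "why ?"),
    ([" ".join(["word"] * 35)], "hmm", "which one ?"),
]
groups = group_by_context_length(samples)
print({bucket: len(items) for bucket, items in sorted(groups.items())})
```

Each group can then be scored separately with the evaluation metrics to obtain one point per context-length range.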

Case study
To obtain a better understanding of the models discussed, we perform a case study by randomly sampling three examples from the datasets we use in our experiments; see Table 4.
In Example 1, the post "i do not know." appears in a dialogue, which is a common but meaningless expression; CNQG, DCGM-I and HRED are able to generate more informative questions than NQG. This could be attributed to the use of the conversational context. As for appropriateness, CNQG performs best among the four models as it can accurately generate a pivotal topic in the question (job), which has appeared in the conversational context. In Example 2, all questions generated by the models seem reasonable given the post and context. However, judged against the pattern of the ground-truth question, CNQG obtains a more appropriate question pattern (why) than the baseline models. In Example 3, we see that the post only has a single meaningless word while the context provides a useful topic (soda). Based on the limited amount of information that is available, the three baseline models fail to generate relevant or informative questions with the correct topic. However, CNQG successfully introduces a highly related topic (beer) and gives the dialogue a positive turn; this confirms the effectiveness of CNQG at avoiding breakdown of the dialogue.

CONCLUSIONS
To the best of our knowledge, ours is the first work on question generation in the setting of multi-turn open-domain dialogue systems. In this paper, we have proposed a context-augmented neural question generation model CNQG that leverages the conversational context to generate appropriate and informative questions. Experiments on two publicly available conversational datasets provide experimental evidence for the effectiveness of our proposal, showing that CNQG outperforms state-of-the-art question generation baselines in terms of BLEU-1, BLEU-2, Distinct-1 and Distinct-2. CNQG is able to extract useful features from long conversational contexts while maintaining robust performance on short contexts. As to future work, we want to exploit knowledge bases to enrich interactions in a question-based manner, while maintaining semantic coherence [23]. Also, for dialogues in an e-commerce context we aim to enrich question generation with contrastive questions so as to increase diversity, especially for short contexts [9,16].