{"title": "Q-LDA: Uncovering Latent Patterns in Text-based Sequential Decision Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 4977, "page_last": 4986, "abstract": "In sequential decision making, it is often important and useful for end users to understand the underlying patterns or causes that lead to the corresponding decisions. However, typical deep reinforcement learning algorithms seldom provide such information due to their black-box nature. In this paper, we present a probabilistic model, Q-LDA, to uncover latent patterns in text-based sequential decision processes. The model can be understood as a variant of latent topic models that are tailored to maximize total rewards; we further draw an interesting connection between an approximate maximum-likelihood estimation of Q-LDA and the celebrated Q-learning algorithm. We demonstrate in the text-game domain that our proposed method not only provides a viable mechanism to uncover latent patterns in decision processes, but also obtains state-of-the-art rewards in these games.", "full_text": "Q-LDA: Uncovering Latent Patterns in Text-based Sequential Decision Processes

Jianshu Chen*, Chong Wang†, Lin Xiao*, Ji He‡, Lihong Li† and Li Deng‡
*Microsoft Research, Redmond, WA, USA
{jianshuc,lin.xiao}@microsoft.com
†Google Inc., Kirkland, WA, USA*
{chongw,lihong}@google.com
‡Citadel LLC, Seattle/Chicago, USA
{Ji.He,Li.Deng}@citadel.com

Abstract

In sequential decision making, it is often important and useful for end users to understand the underlying patterns or causes that lead to the corresponding decisions. However, typical deep reinforcement learning algorithms seldom provide such information due to their black-box nature. In this paper, we present a probabilistic model, Q-LDA, to uncover latent patterns in text-based sequential decision processes.
The model can be understood as a variant of latent topic models that are tailored to maximize total rewards; we further draw an interesting connection between an approximate maximum-likelihood estimation of Q-LDA and the celebrated Q-learning algorithm. We demonstrate in the text-game domain that our proposed method not only provides a viable mechanism to uncover latent patterns in decision processes, but also obtains state-of-the-art rewards in these games.

1 Introduction

Reinforcement learning [21] plays an important role in solving sequential decision making problems, and has seen considerable successes in many applications [16, 18, 20]. With these methods, however, it is often difficult to understand or examine the underlying patterns or causes that lead to the sequence of decisions. Being more interpretable to end users can provide more insights into the problem itself and be potentially useful for downstream applications based on these results [5].

To investigate new approaches to uncovering the underlying patterns of a text-based sequential decision process, we use text games (also known as interactive fictions) [11, 19] as the experimental domain. Specifically, we focus on choice-based and hypertext-based games studied in the literature [11], where both the action space and the state space are characterized in natural language. At each time step, the decision maker (i.e., the agent) observes one text document (i.e., the observation text) that describes the current observation of the game environment, and several text documents (i.e., action texts) that characterize the different possible actions that can be taken. Based on the history of these observations, the agent selects one of the provided actions and the game transitions to a new state with an immediate reward.
This game continues until the agent reaches a final state and receives a terminal reward.

In this paper, we present a probabilistic model called Q-LDA that is tailored to maximize total rewards in a decision process. Specifically, observation texts and action texts are characterized by two separate topic models, which are variants of latent Dirichlet allocation (LDA) [4]. In each topic model, topic proportions are chained over time to model the dependencies for actions or states, and these proportions are partially responsible for generating the immediate/terminal rewards. We also show an interesting connection between the maximum-likelihood parameter estimation of the model and the Q-learning algorithm [22, 18]. We empirically demonstrate that our proposed method not only provides a viable mechanism to uncover latent patterns in decision processes, but also obtains state-of-the-art performance in these text games.

*The work was done while Chong Wang, Ji He, Lihong Li and Li Deng were at Microsoft Research.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Contribution. The main contribution of this paper is to seamlessly integrate topic modeling with Q-learning to uncover the latent patterns and interpretable causes in text-based sequential decision-making processes. Contemporary deep reinforcement learning models and algorithms can seldom provide such information due to their black-box nature. To the best of our knowledge, there is no prior work that can achieve this and learn the topic model in an end-to-end fashion to maximize the long-term reward.

Related work. Q-LDA uses variants of LDA to capture observation and action texts in text-based decision processes.
In this model, the dependence of the immediate reward on the topic proportions is similar to supervised topic models [3], and the chaining of topic proportions over time to model long-term dependencies on previous actions and observations is similar to dynamic topic models [6]. The novelty in our approach is that the model is estimated in a way that aims to maximize long-term reward, thus producing near-optimal policies; hence it can also be viewed as a topic-model-based reinforcement-learning algorithm. Furthermore, we show an interesting connection to the DQN variant of Q-learning [18]. The text-game setup used in our experiment is most similar to previous work [11] in that both observations and actions are described by natural languages, leading to challenges in both representation and learning. The main difference from that previous work is that those authors treat observation texts as Markovian states. In contrast, our model is more general, capturing both partial observability and long-term dependence on observations, which are common in many text-based decision processes such as dialogues. Finally, the choice of reward function in Q-LDA shares similarity with that in Gaussian process temporal difference methods [9].

Organization. Section 2 describes the details of our probabilistic model, and draws a connection to the Q-learning algorithm. Section 3 presents an end-to-end learning algorithm that is based on mirror descent back-propagation. Section 4 demonstrates the empirical performance of our model, and we conclude with discussions and future work in Section 5.

2 A Probabilistic Model for Text-based Sequential Decision Processes

In this section, we first describe text games as an example of sequential decision processes. Then, we describe our probabilistic model, and relate it to a variant of Q-learning.

2.1 Sequential decision making in text games

Text games are episodic tasks that proceed in discrete time steps t ∈ {1, . . . , T}, where the length T may vary across different episodes. At time step t, the agent receives a text document of N words describing the current observation of the environment: w^S_t ≜ {w^S_{t,n}}^N_{n=1}.² We call these words the observation text. The agent also receives A_t text documents, each of which describes a possible action that the agent can take. We denote them by w^a_t ≜ {w^a_{t,n}}^N_{n=1} with a ∈ {1, . . . , A_t}, where A_t is the number of feasible actions and it could vary over time. We call these texts action texts. After the agent takes one of the provided actions, the environment transitions to time t + 1 with a new state and an immediate reward r_t; both the dynamics and the reward generation may be stochastic and unknown. The new state then reveals a new observation text w^S_{t+1} and several action texts w^a_{t+1} for a ∈ {1, . . . , A_{t+1}}. The transitions continue until the end of the game at step T, when the agent receives a terminal reward r_T. The reward r_T depends on the ending of the story in the text game: a good ending leads to a large positive reward, while bad endings lead to negative rewards.

The goal of the agent is to maximize its cumulative reward by acting optimally in the environment. At step t, given all observation texts w^S_{1:t}, all action texts w^A_{1:t} ≜ {w^a_{1:t} : ∀a}, previous actions a_{1:t−1} and rewards r_{1:t−1}, the agent is to find a policy, π(a_t | w^S_{1:t}, w^A_{1:t}, a_{1:t−1}, r_{1:t−1}), a conditional

²For notation simplicity, we assume all texts have the same length N.

Figure 1: Graphical model representation for the studied sequential decision process.
The bottom section shows the observation topic models, which share the same topics Φ^S, but whose topic proportions θ^S_t change with time t. The top section shows the action topic models, which share the same action topics Φ^A, but with time-varying topic proportions θ^a_t for each a ∈ A_t. The middle section shows the dependence of variables between consecutive time steps. There are no plates for the observation text (bottom part of the figure) because there is only one observation text document at each time step. We follow the standard notation for graphical models by using shaded circles as observables. Since the topic proportions θ^S_t and θ^a_t and the Dirichlet parameters α^S_t and α^A_t (except α^S_1 and α^A_1) are not observable, we need to use their MAP estimates to make end-to-end learning feasible; see Section 3 for details. The figure characterizes the general case where rewards appear at each time step, while in our experiments the (non-zero) rewards only appear at the end of the games.

probability of selecting action a_t, that maximizes the expected long-term reward E{Σ_{τ=t}^T γ^{τ−t} r_τ}, where γ ∈ (0, 1) is a discount factor. In this paper, for simplicity of exposition, we focus on problems where the reward is nonzero only at the final step T. While our algorithm can be generalized to the general case (with greater complexity), this special case is an important case of RL (e.g., [20]). As a result, the policy is independent of r_{1:t−1} and its form simplifies to π(a_t | w^S_{1:t}, w^A_{1:t}, a_{1:t−1}).

The problem setup is similar to previous work [11] in that both observations and actions are described by natural languages. For actions described by natural languages, the action space is inherently discrete and large due to the exponential complexity with respect to sentence length.
This is different from most reinforcement learning problems, where the action spaces are either small or continuous. Here, we take a probabilistic modeling approach to this challenge: the observed variables (observation texts, action texts, selected actions, and rewards) are assumed to be generated from a probabilistic latent variable model. By examining these latent variables, we aim to uncover the underlying patterns that lead to the sequence of decisions. We then show how the model is related to Q-learning, so that estimation of the model leads to reward maximization.

2.2 The Q-LDA model

The graphical representation of our model, Q-LDA, is depicted in Figure 1. It has two instances of topic models, one for observation texts and the other for action texts. The basic idea is to chain the topic proportions (the θs in the figure) in a way such that they can influence the topic proportions in the future, thus capturing long-term effects of actions. Details of the generative model are as follows.

For the observation topic model, we use the columns of Φ^S ∼ Dir(β^S)³ to denote the topics for the observation texts. For the action topic model, we use the columns of Φ^A ∼ Dir(β^A) to denote the topics for the action texts. We assume these topics do not change over time. Given the initial topic proportion Dirichlet parameters, α^S_1 and α^A_1 for observation and action texts respectively, Q-LDA proceeds sequentially from t = 1 to T as follows (see Figure 1 for all latent variables).

³Φ^S is a word-by-topic matrix. Each column is drawn from a Dirichlet distribution with hyper-parameter β^S, representing the word-emission probabilities of the corresponding topic. Φ^A is similarly defined.

1.
Draw the observation text w^S_t as follows:
(a) Draw observation topic proportions θ^S_t ∼ Dir(α^S_t).
(b) Draw all words for the observation text, w^S_t ∼ LDA(w^S_t | θ^S_t, Φ^S), where LDA(·) denotes the standard LDA generative process given its topic proportions θ^S_t and topics Φ^S [4]. The latent variable z^S_{t,n} indicates the topic for the word w^S_{t,n}.

2. For a = 1, . . . , A_t, draw the action text w^a_t as follows:
(a) Draw action topic proportions θ^a_t ∼ Dir(α^A_t).
(b) Draw all words for the a-th action text using w^a_t ∼ LDA(w^a_t | θ^a_t, Φ^A), where the latent variable z^a_{t,n} indicates the topic for the word w^a_{t,n}.

3. Draw the action: a_t ∼ π_b(a_t | w^S_{1:t}, w^A_{1:t}, a_{1:t−1}), where π_b is an exploration policy for data collection. It could be chosen in different ways, as discussed in the experiment Section 4. After model learning is finished, a greedy policy may be used instead (cf. Section 3).

4. The immediate reward r_t is generated according to a Gaussian distribution with mean function µ_r(θ^S_t, θ^{a_t}_t, U) and variance σ_r²:

r_t ∼ N(µ_r(θ^S_t, θ^{a_t}_t, U), σ_r²).   (1)

Here, we defer the definitions of µ_r(θ^S_t, θ^{a_t}_t, U) and its parameter U to the next section, where we draw a connection between likelihood-based learning and Q-learning.

5. Compute the topic proportion Dirichlet parameters for the next time step t + 1 as

α^S_{t+1} = σ(W_SS θ^S_t + W_SA θ^{a_t}_t + α^S_1),   α^A_{t+1} = σ(W_AS θ^S_t + W_AA θ^{a_t}_t + α^A_1),   (2)

where σ(x) ≜ max{x, ε} with ε being a small positive number (e.g., 10⁻⁶), a_t is the action selected by the agent at time t, and {W_SS, W_SA, W_AS, W_AA} are the model parameters to be learned. Note that, besides θ^S_t, the only topic proportion from {θ^a_t}^{A_t}_{a=1} that influences α^S_{t+1} and α^A_{t+1} is θ^{a_t}_t, i.e., the one corresponding to the chosen action a_t. Furthermore, since θ^S_t and θ^{a_t}_t are generated according to Dir(α^S_t) and Dir(α^A_t), respectively, α^S_{t+1} and α^A_{t+1} are (implicitly) chained over time via θ^S_t and θ^{a_t}_t (cf. Figure 1).

This generative process defines a joint distribution p(·) among all random variables depicted in Figure 1. Running this generative process (steps 1 to 5 above for T steps until the game ends) produces one episode of the game. Now suppose we already have M episodes. In this paper, we choose to directly learn the conditional distribution of the rewards given the other observations. By learning the model in a discriminative manner [2, 7, 12, 15, 23], we hope to make better predictions of the rewards for different actions, from which the agent can obtain the best policy for taking actions. This can be obtained by applying the Bayes rule to the joint distribution defined by the generative process. Let Θ denote all model parameters: Θ = {Φ^S, Φ^A, U, W_SS, W_SA, W_AS, W_AA}. We have the following loss function

min_Θ { −ln p(Θ) − Σ_{i=1}^M ln p(r_{1:T_i} | w^S_{1:T_i}, w^A_{1:T_i}, a_{1:T_i}, Θ) },   (3)

where p(Θ) denotes a prior distribution of the model parameters (e.g., Dirichlet priors over Φ^S and Φ^A), and T_i denotes the length of the i-th episode. Let K_S and K_A denote the numbers of topics for the observation texts and action texts, and let V_S and V_A denote the vocabulary sizes for the observation texts and action texts, respectively. Then, the total number of learnable parameters for Q-LDA is: V_S × K_S + V_A × K_A + K_A × K_S + (K_S + K_A)².

We note that a good model learned through Eq. (3) may predict the values of the rewards well, but might not imply the best policy for the game.
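The chaining of topic proportions in step 5 is the mechanism that carries information across time steps. Below is a minimal sketch of this step under assumed toy dimensions; the W matrices are random stand-ins for learned parameters, and all sizes and names are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(x, eps=1e-6):
    # sigma(x) = max(x, eps), applied elementwise, as in Eq. (2)
    return np.maximum(x, eps)

def next_dirichlet_params(theta_S, theta_a, W_SS, W_SA, W_AS, W_AA,
                          alpha_S1, alpha_A1):
    """Chain topic proportions into next-step Dirichlet parameters, Eq. (2)."""
    alpha_S = sigma(W_SS @ theta_S + W_SA @ theta_a + alpha_S1)
    alpha_A = sigma(W_AS @ theta_S + W_AA @ theta_a + alpha_A1)
    return alpha_S, alpha_A

# Hypothetical sizes: K_S = 3 observation topics, K_A = 2 action topics.
K_S, K_A = 3, 2
theta_S = rng.dirichlet(np.ones(K_S))   # theta^S_t ~ Dir(alpha^S_t)
theta_a = rng.dirichlet(np.ones(K_A))   # theta^{a_t}_t of the chosen action
W_SS = rng.normal(size=(K_S, K_S)); W_SA = rng.normal(size=(K_S, K_A))
W_AS = rng.normal(size=(K_A, K_S)); W_AA = rng.normal(size=(K_A, K_A))
alpha_S, alpha_A = next_dirichlet_params(theta_S, theta_a,
                                         W_SS, W_SA, W_AS, W_AA,
                                         np.ones(K_S), np.ones(K_A))
```

The clipping σ(x) = max{x, ε} is what keeps both parameter vectors strictly positive, as the Dirichlet distribution requires, even when the linear combination goes negative.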
Next, we show that by defining the appropriate mean function for the rewards, µ_r(θ^S_t, θ^{a_t}_t, U), we can achieve both. This closely resembles Q-learning [21, 22], allowing us to effectively learn the policy in an iterative fashion.

2.3 From Q-LDA to Q-learning

Before relating Q-LDA to Q-learning, we first give a brief introduction to the latter. Q-learning [22, 18] is a reinforcement learning algorithm for finding an optimal policy in a Markov decision process (MDP) described by (S, A, P, r, γ), where S is a state space, A is an action space, and γ ∈ (0, 1) is a discount factor. Furthermore, P defines a transition probability p(s′|s, a) for going to the next state s′ ∈ S from the current state s ∈ S after taking action a ∈ A, and r(s, a) is the immediate reward corresponding to this transition. A policy π(a|s) in an MDP is defined to be the probability of taking action a at state s. Let s_t and a_t be the state and action at time t, and let r_t = r(s_t, a_t) be the immediate reward at time t. An optimal policy is one that maximizes the expected long-term reward E{Σ_{t=1}^{+∞} γ^{t−1} r_t}. Q-learning seeks to find the optimal policy by estimating the Q-function, Q(s, a), defined as the expected long-term discounted reward for taking action a at state s and then following an optimal policy thereafter. It satisfies the Bellman equation [21]

Q(s, a) = E{ r(s, a) + γ · max_b Q(s′, b) | s, a },   (4)

and directly gives the optimal action for any state s: arg max_a Q(s, a). Q-learning solves for Q(s, a) iteratively based on observed state transitions. Basic Q-learning [22] requires storing and updating the values of Q(s, a) for all state–action pairs in S × A, which is not practical when S and A are large. This is especially true in our text games, where they can be exponentially large.
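For concreteness, the tabular scheme that becomes impractical for large spaces can be sketched as follows, on a toy two-state chain. This is a generic illustration of Q-learning with an epsilon-greedy behavior policy, not code from the paper; the environment and all names are hypothetical.

```python
import random

def q_learning_episode(Q, env_step, actions, s0, alpha=0.1, gamma=0.9, eps=0.1):
    """Run one episode of tabular Q-learning.

    Q maps (state, action) -> value; env_step(s, a) returns (next_state,
    reward, done). Updates Q in place toward the Bellman target.
    """
    s, done = s0, False
    while not done:
        if random.random() < eps:
            a = random.choice(actions)                      # explore
        else:
            a = max(actions, key=lambda b: Q.get((s, b), 0.0))  # exploit
        s2, r, done = env_step(s, a)
        # Target: r for terminal transitions, r + gamma * max_b Q(s', b) otherwise.
        target = r if done else r + gamma * max(Q.get((s2, b), 0.0) for b in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
        s = s2
    return Q

# Toy chain: state 0 -> state 1 -> terminal, reward 1.0 on the final transition.
def env_step(s, a):
    return (1, 0.0, False) if s == 0 else (2, 1.0, True)

random.seed(0)
Q = {}
for _ in range(500):
    q_learning_episode(Q, env_step, ["go"], 0)
```

After enough episodes, Q(1, go) approaches the terminal reward 1.0 and Q(0, go) approaches its discounted value 0.9, matching the Bellman equation (4).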
Hence, Q(s, a) is usually approximated by a parametric function Q_θ(s, a) (e.g., neural networks [18]), in which case the model parameter θ is updated by:

θ ← θ + η · ∇_θ Q_θ · (d_t − Q_θ(s_t, a_t)),   (5)

where d_t ≜ r_t + γ · max_{a′} Q_{θ′}(s_{t+1}, a′) if s_{t+1} is nonterminal and d_t ≜ r_t otherwise, and θ′ denotes a delayed version of the model parameter that is updated periodically [18]. The update rule (5) may be understood as applying stochastic gradient descent (SGD) to a regression loss function J(θ) ≜ E[d_t − Q_θ(s, a)]². Thus, d_t is the target, computed from r_t and Q_{θ′}, for the prediction Q_θ(s_t, a_t).

We are now ready to define the mean reward function µ_r in Q-LDA. First, we model the Q-function by Q(θ^S_t, θ^a_t) = (θ^a_t)^T U θ^S_t, where U is the same parameter as the one in (1).⁴ This is different from typical deep RL approaches, where black-box models like neural networks are used. In order to connect our probabilistic model to Q-learning, we define the mean reward function as follows,

µ_r(θ^S_t, θ^{a_t}_t, U) = Q(θ^S_t, θ^{a_t}_t) − γ · E{ max_b Q(θ^S_{t+1}, θ^b_{t+1}) | θ^S_t, θ^{a_t}_t }.   (6)

Note that µ_r remains a function of θ^S_t and θ^{a_t}_t, since the second term in the above expression is a conditional expectation given θ^S_t and θ^{a_t}_t. The definition of the mean reward function in Eq. (6) has a strong relationship with the Bellman equation (4) in Q-learning; it relates the long-term reward Q(θ^S_t, θ^{a_t}_t) to the mean immediate reward µ_r in the same manner as the Bellman equation (4). To see this, we move the second term on the right-hand side of (6) to the left, and make the identification that µ_r corresponds to E{r(s, a)}, since both of them represent the mean immediate reward. The resulting equation shares the same form as the Bellman equation (4). With the mean function µ_r defined above, we show in Appendix B that the loss function (3) can be approximated by the one below, using the maximum a posteriori (MAP) estimates of θ^S_t and θ^{a_t}_t (denoted θ̂^S_t and θ̂^{a_t}_t, respectively):

min_Θ { −ln p(Φ^S|β^S) − ln p(Φ^A|β^A) + Σ_{i=1}^M Σ_{t=1}^{T_i} (1/2σ_r²) [d_t − Q(θ̂^S_t, θ̂^{a_t}_t)]² },   (7)

where d_t = r_t + γ max_b Q(θ̂^S_{t+1}, θ̂^b_{t+1}) for t < T_i and d_t = r_t for t = T_i. Observe that the first two terms in (7) are regularization terms coming from the Dirichlet priors over Φ^S and Φ^A, and the third term shares a similar form with the cost J(θ) in Q-learning; it can also be interpreted as a regression problem for estimating the Q-function, where the target d_t is constructed in a similar manner as in Q-learning. Therefore, optimizing the discriminative objective (3) leads to a variant of Q-learning. After learning is finished, we can obtain the greedy policy by taking the action that maximizes the Q-function estimate in any given state.

We also note that we have used the MAP estimates of θ^S_t and θ^{a_t}_t due to the intractable marginalization of the latent variables [14]. Other more advanced approximation techniques, such as Markov chain Monte Carlo (MCMC) [1] and variational inference [13], can also be used, and we leave these explorations as future work.

3 End-to-end Learning by Mirror Descent Back Propagation

⁴The intuition of choosing Q(·, ·) to be this form is that we want θ^S_t to be aligned with the θ^a_t of the correct action (large Q-value), and to be misaligned with the θ^a_t of the wrong actions (small Q-value).
The introduction of U allows the number and the meaning of the topics for the observations and actions to be different.

Algorithm 1 The training algorithm by mirror descent back propagation
1: Input: D (number of experience replays), J (number of SGD updates), and the learning rate.
2: Randomly initialize the model parameters.
3: for m = 1, . . . , D do
4:   Interact with the environment using a behavior policy π^m_b(a_t | x^S_{1:t}, x^A_{1:t}, a_{1:t−1}) to collect M episodes of data {w^S_{1:T_i}, w^A_{1:T_i}, a_{1:T_i}, r_{1:T_i}}^M_{i=1} and add them to D.
5:   for j = 1, . . . , J do
6:     Randomly sample an episode from D.
7:     For the sampled episode, compute θ̂^S_t, θ̂^a_t and Q(θ̂^S_t, θ̂^a_t) with a = 1, . . . , A_t and t = 1, . . . , T_i according to Algorithm 2.
8:     For the sampled episode, compute the stochastic gradients of (7) with respect to Θ using back propagation through the computational graph defined in Algorithm 2.
9:     Update {U, W_SS, W_SA, W_AS, W_AA} by stochastic gradient descent and update {Φ^S, Φ^A} using stochastic mirror descent.
10:   end for
11: end for

Algorithm 2 The recursive MAP inference for one episode
1: Input: α^S_1, α^A_1, L, γ, x^S_t, {x^a_t : a = 1, . . . , A_t} and a_t, for all t = 1, . . . , T_i.
2: Initialization: α̂^S_1 = α^S_1 and α̂^A_1 = α^A_1.
3: for t = 1, . . . , T_i do
4:   Compute θ̂^S_t by repeating θ̂^S_t ← (1/C) θ̂^S_t ⊙ exp( (Φ^S)^T [x^S_t / (Φ^S θ̂^S_t)] + (α̂^S_t − 1) / θ̂^S_t ) for L times with initialization θ̂^S_t ∝ 1, where C is a normalization factor and the operations are elementwise.
5:   Compute θ̂^a_t for each a = 1, . . . , A_t by repeating θ̂^a_t ← (1/C) θ̂^a_t ⊙ exp( (Φ^A)^T [x^a_t / (Φ^A θ̂^a_t)] + (α̂^A_t − 1) / θ̂^a_t ) for L times with initialization θ̂^a_t ∝ 1, where C is a normalization factor.
6:   Compute α̂^S_{t+1} and α̂^A_{t+1} from θ̂^S_t and θ̂^{a_t}_t according to (11).
7:   Compute the Q-values: Q(θ̂^S_t, θ̂^a_t) = (θ̂^a_t)^T U θ̂^S_t for a = 1, . . . , A_t.
8: end for

In this section, we develop an end-to-end learning algorithm for Q-LDA by minimizing the loss function given in (7). As shown in the previous section, solving (7) leads to a variant of Q-learning, so our algorithm can be viewed as a reinforcement-learning algorithm for the proposed model. We consider learning our model with experience replay [17], a widely used technique in recent state-of-the-art systems [18]. Specifically, the learning process consists of multiple stages, and at each stage, the agent interacts with the environment using a fixed exploration policy π_b(a_t | x^S_{1:t}, x^A_{1:t}, a_{1:t−1}) to collect M episodes of data {w^S_{1:T_i}, w^A_{1:T_i}, a_{1:T_i}, r_{1:T_i}}^M_{i=1} and saves them into a replay memory D. (We will discuss the choice of π_b in Section 4.) Under the assumption of the generative model Q-LDA, our objective is to update our estimates of the model parameters in Θ using D; the updating process may take several randomized passes over the data in D. A stage of such a learning process is called one replay. Once a replay is done, we let the agent use a new behavior policy π′_b to collect more episodes, add them to D, and continue to update Θ from the augmented D.
This process repeats for multiple stages, and the model parameters learned from the previous stage are used as the initialization for the next stage. Therefore, we can focus on learning at a single stage, which was formulated in Section 2 as solving the optimization problem (7). Note that the objective (7) is a function of the MAP estimates of θ^S_t and θ^{a_t}_t. Therefore, we start with a recursion for computing θ̂^S_t and θ̂^{a_t}_t and then introduce our learning algorithm for Θ.

3.1 Recursive MAP inference by mirror descent

The MAP estimates, θ̂^S_t and θ̂^a_t, for the topic proportions θ^S_t and θ^a_t are defined as

(θ̂^S_t, θ̂^a_t) = arg max_{θ^S_t, θ^a_t} p(θ^S_t, θ^a_t | w^S_{1:t}, w^A_{1:t}, a_{1:t−1}).   (8)

Solving for the exact solution is, however, intractable. We instead develop an approximate algorithm that recursively estimates θ̂^S_t and θ̂^a_t. To develop the algorithm, we rely on the following result, whose proof is deferred to Appendix A.

Proposition 1. The MAP estimates in (8) can be approximated by recursively solving the problems:

θ̂^S_t = arg max_{θ^S_t} [ ln p(x^S_t | θ^S_t, Φ^S) + ln p(θ^S_t | α̂^S_t) ],   (9)

θ̂^a_t = arg max_{θ^a_t} [ ln p(x^a_t | θ^a_t, Φ^A) + ln p(θ^a_t | α̂^A_t) ],   a ∈ {1, . . . , A_t},   (10)

where x^S_t and x^a_t are the bag-of-words vectors for the observation text w^S_t and the a-th action text w^a_t, respectively. To compute α̂^S_t and α̂^A_t, we begin with α̂^S_1 = α^S_1 and α̂^A_1 = α^A_1 and update their values for the next time step t + 1 according to

α̂^S_{t+1} = σ(W_SS θ̂^S_t + W_SA θ̂^{a_t}_t + α^S_1),   α̂^A_{t+1} = σ(W_AS θ̂^S_t + W_AA θ̂^{a_t}_t + α^A_1).   (11)

Note from (9)–(10) that, for given α̂^S_t and α̂^A_t, the solution of θ^S_t and θ^a_t now becomes A_t + 1 decoupled sub-problems, each of which has the same form as the MAP inference problem of Chen et al. [8]. Therefore, we solve each sub-problem in (9)–(10) using their mirror descent inference algorithm, and then use (11) to compute the Dirichlet parameters at the next time step. The overall MAP inference procedure is summarized in Algorithm 2. We further remark that, after obtaining θ̂^S_t and θ̂^a_t, the Q-value for the t-th step is readily estimated by:

E[ Q(θ^S_t, θ^a_t) | w^S_{1:t}, w^A_{1:t}, a_{1:t−1} ] ≈ Q(θ̂^S_t, θ̂^a_t),   a ∈ {1, . . . , A_t},   (12)

where we approximate the conditional expectation using the MAP estimates. After learning is finished, the agent may extract a greedy policy for any state s by taking the action arg max_a Q(θ̂^S, θ̂^a). It is known that if the learned Q-function is close to the true Q-function, such a greedy policy is near-optimal [21].

3.2 End-to-end learning by backpropagation

The training loss (7) for each learning stage has the form of a finite sum over M episodes.
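The per-document multiplicative updates used for the MAP sub-problems (lines 4 and 5 of Algorithm 2) can be sketched as follows. This is a simplified illustration under assumed toy dimensions; the step size `eta` and the max-subtraction for numerical stability are choices made here, not details from the paper.

```python
import numpy as np

def map_topic_proportions(x, Phi, alpha_hat, L=20, eta=0.5):
    """Mirror-descent MAP inference for one document (sketch).

    x: bag-of-words count vector (V,); Phi: word-by-topic matrix (V, K)
    whose columns lie on the simplex; alpha_hat: Dirichlet parameter (K,).
    """
    K = Phi.shape[1]
    theta = np.full(K, 1.0 / K)              # initialization proportional to 1
    for _ in range(L):
        grad = Phi.T @ (x / (Phi @ theta)) + (alpha_hat - 1.0) / theta
        grad -= grad.max()                   # stabilize exp; normalization cancels the shift
        theta = theta * np.exp(eta * grad)   # multiplicative (mirror descent) step
        theta /= theta.sum()                 # the 1/C normalization back to the simplex
    return theta

rng = np.random.default_rng(1)
V, K = 6, 2
Phi = rng.dirichlet(np.ones(V), size=K).T    # columns are topic word distributions
x = np.array([3.0, 0.0, 1.0, 0.0, 2.0, 0.0]) # toy word counts
theta = map_topic_proportions(x, Phi, alpha_hat=np.full(K, 1.1))
```

Because the update is multiplicative followed by renormalization, the iterate stays strictly positive and on the probability simplex at every step, which is exactly what a topic-proportion estimate requires.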
Each term inside the summation depends on θ̂^S_t and θ̂^{a_t}_t, which in turn depend on all the model parameters in Θ via the computational graph defined by Algorithm 2 (see Appendix E for a diagram of the graph). Therefore, we can learn the model parameters in Θ by sampling an episode from the data, computing the corresponding stochastic gradient in (7) by back-propagation on the computational graph given in Algorithm 2, and updating Θ by stochastic gradient/mirror descent. More details are found in Algorithm 1, and Appendix E.4 gives the gradient formulas.

4 Experiments

In this section, we use two text games from [11] to evaluate our proposed model and demonstrate the idea of interpreting the decision-making process: (i) "Saving John" and (ii) "Machine of Death" (see Appendix C for a brief introduction of the two games).⁵ The action spaces of both games are defined by natural languages and the feasible actions change over time, which is the setting that Q-LDA is designed for. We choose to use the same experimental setup as [11] in order to have a fair comparison with their results. For example, at each m-th experience-replay learning stage (see Algorithm 1), we use the softmax action selection rule [21, pp. 30–31] as the exploration policy to collect data (see Appendix E.3 for more details). We collect M = 200 episodes of data (about 3K time steps in "Saving John" and 16K in "Machine of Death") at each of D = 20 experience replays, which amounts to a total of 4,000 episodes. At each experience replay, we update the model with 10 epochs before the next replay. Appendix E provides additional experimental details.

We first evaluate the performance of the proposed Q-LDA model by the long-term rewards it receives when applied to the two text games. Similar to [11], we repeat our experiments five times with different random initializations.
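The softmax (Boltzmann) action-selection rule used for exploration can be sketched generically as follows; the per-action Q-value estimates would come from Eq. (12), and the temperature is an assumed tuning parameter, not a value reported in the paper.

```python
import numpy as np

def softmax_policy(q_values, temperature=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    q = np.asarray(q_values, dtype=float) / temperature
    q -= q.max()                              # numerical stability
    p = np.exp(q) / np.exp(q).sum()
    rng = rng if rng is not None else np.random.default_rng()
    return int(rng.choice(len(p), p=p)), p

a, p = softmax_policy([2.0, 0.0], temperature=0.5, rng=np.random.default_rng(0))
```

Lowering the temperature sharpens the distribution toward the greedy action, while a high temperature approaches uniform random exploration, which is why this rule is a common exploration policy for data collection.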
Table 1 summarizes the means and standard deviations of the rewards on the two games.

⁵The simulators are obtained from https://github.com/jvking/text-games

Table 1: The average rewards (higher is better) and standard deviations of different models on the two tasks. For DRRN and MA-DQN, the number of topics becomes the number of hidden units per layer.

Tasks              # topics   Q-LDA        DRRN (1-layer)   DRRN (2-layer)   MA-DQN (2-layer)
Saving John        20         18.8 (0.3)   17.1 (0.6)       18.4 (0.1)       4.9 (3.2)
                   50         18.6 (0.6)   18.3 (0.2)       18.5 (0.3)       9.0 (3.2)
                   100        19.1 (0.6)   18.2 (0.2)       18.7 (0.4)       7.1 (3.1)
Machine of Death   20         19.9 (0.8)   7.2 (1.5)        9.2 (2.1)        2.8 (0.9)
                   50         18.7 (2.1)   8.4 (1.3)        10.7 (2.7)       4.3 (0.9)
                   100        17.5 (2.4)   8.7 (0.9)        11.2 (0.6)       5.2 (1.2)

We include the results of the Deep Reinforcement Relevance Network (DRRN) proposed in [11] with different numbers of hidden layers. In [11], there are several variants of DQN (deep Q-network) baselines, among which MA-DQN (max-action DQN) is the best performing one. We therefore only include the results of MA-DQN. Table 1 shows that Q-LDA outperforms all other approaches on both tasks, especially "Machine of Death", where Q-LDA beats the DRRN models by a large margin. The gain of Q-LDA on "Saving John" is smaller, as both Q-LDA and DRRN are approaching the upper bound of the reward, which is 20. "Machine of Death" was believed to be a more difficult task due to its stochastic nature and larger state and action spaces [11]; its upper bound on the reward is 30. (See Tables 4–5 for the definition of the rewards for different story endings.) Therefore, Q-LDA gets much closer to the upper bound than any other method, although there may still be room for improvement. Finally, our experiments follow the standard online RL setup: after a model is updated based on the data observed so far, it is tested on newly generated episodes.
Therefore, the numbers reported in Table 1 are not evaluated on the training dataset, so they truthfully reflect the actual average reward of the learned models.
We now proceed to demonstrate the analysis of the latent patterns of the decision-making process using one example episode of "Machine of Death". In this episode, the game starts with the player wandering in a shopping mall after the peak hour has ended. The player approaches a machine that prints a death card after a coin is inserted. The death card hints at how the player will die in the future. In one of the story developments, the player's death is related to a man called Bon Jovi. The player is so scared that he tries to fight a cardboard standee of Bon Jovi. He reveals his concern to a friend named Rachel, and with her help he finally overcomes his fear and maintains his friendship. This episode reaches a good ending and receives the highest possible reward of 30 in this game.
In Figure 2, we show the evolution of the topic proportions for the four most active topics (listed in Table 2)⁶ for both the observation texts and the selected actions' texts. We note from Figure 2 that the most dominant observation topic and action topic at the beginning of the episode are "wander at mall" and "action at mall", respectively, which is not surprising since the episode starts in a mall scenario. The topics related to "mall" quickly die off after the player starts the death machine. Afterwards, the most salient observation topic becomes "meet Bon Jovi" and then "combat" (t = 8). This is because, after the activation of the death machine, the story enters a scenario where the player tries to fight a cardboard standee. 
Towards the end of the episode, the observation topic "converse w/ rachel" and the topic "kitchen & chat" corresponding to the selected action reach their peaks and then decay right before the end of the story, where the action topic "relieve" climbs to its peak. This is consistent with the story ending, where the player chooses to overcome his fear after chatting with Rachel. In Appendix D, we show the observation and the action texts at the above stages of the story.
Finally, another interesting observation concerns the matrix U. Since the Q-function value is computed from $[\hat{\theta}^{a}_t]^T U \hat{\theta}^S_t$, the (i, j)-th element of the matrix U measures the positive/negative correlation between the i-th action topic and the j-th observation topic. In Figure 2(c), we show the values of the learned matrix U for the four observation topics and the four action topics in Table 2. Interestingly, the largest value (39.5) of U is its (1, 2)-th element, meaning that the action topic "relieve" and the observation topic "converse w/ rachel" make a strong positive contribution to a high long-term reward, which is what happens at the end of the story.

⁶ In practice, we observe that some topics are never or rarely activated during the learning process. This is especially true when the number of topics becomes large (e.g., 100). Therefore, we only show the most active topics. 
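To make the role of U concrete, the bilinear scoring above can be sketched as follows. The toy matrix, topic proportions, and one-hot action vectors below are illustrative stand-ins for the learned quantities, not the paper's actual values (except the single entry 39.5 quoted in the text).

```python
import numpy as np

def q_value(theta_a, U, theta_s):
    """Bilinear Q-function: Q = theta_a^T U theta_s, so the (i, j)-th entry
    of U weighs the interaction between action topic i and observation
    topic j."""
    return float(theta_a @ U @ theta_s)

# Toy setup with 4 action topics and 4 observation topics.
U = np.zeros((4, 4))
U[0, 1] = 39.5                  # "relieve" (action) x "converse w/ rachel" (obs.)
theta_s = np.array([0.05, 0.85, 0.05, 0.05])   # observation mass on topic 2
candidates = np.eye(4)          # one-hot stand-ins for candidate action topics
scores = [q_value(a, U, theta_s) for a in candidates]
best = int(np.argmax(scores))   # index 0, i.e. the "relieve" action topic
```

Because the score is linear in both topic vectors, a single large positive entry of U is enough to make the corresponding action dominate whenever the matching observation topic is active, which is exactly the interpretability lever the text describes.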
This might also explain why the performance improvement is marginal when the number of topics grows.

Table 2: The four most active topics for the observation texts and the action texts, respectively.

Observation Topics
1: combat                minutes, lights, firearm, shoulders, whiff, red, suddenly, huge, rendition
2: converse w/ rachel    rachel, tonight, grabs, bar, towards, happy, believing, said, moonlight
3: meet Bon Jovi         small, jovi, bon, door, next, dog, insists, room, wrapped, standees
4: wander at mall        ended, catcher, shopping, peak, wrapped, hanging, attention, door

Action Topics
1: relieve               leave, get, gotta, go, hands, away, maybe, stay, ability, turn, easy, rachel
2: kitchen & chat        wait, tea, look, brisk, classics, oysters, kitchen, turn, chair, moment
3: operate the machine   coin, insert, west, cloth, desk, apply, dollars, saying, hands, touch, tell
4: action at mall        alarm, machine, east, ignore, take, shot, oysters, win, gaze, bestowed

Figure 2: The evolution of the most active topics in "Machine of Death." (a) Observation topic $\theta^S_t$; (b) selected action topic $\theta^{a_t}_t$; (c) learned values of the matrix U.

5 Conclusion

We proposed a probabilistic model, Q-LDA, to uncover latent patterns in text-based sequential decision processes. The model can be viewed as a latent topic model, which chains the topic proportions over time. Interestingly, by modeling the mean function of the immediate reward in a special way, we showed that discriminative learning of Q-LDA using its likelihood is closely related to Q-learning. 
Thus, our approach could also be viewed as a Q-learning variant for sequential topic models. We evaluated Q-LDA on two text-game tasks, demonstrating state-of-the-art rewards in these games. Furthermore, we showed that our method provides a viable approach to finding interesting latent patterns in such decision processes.

Acknowledgments

The authors would like to thank all the anonymous reviewers for their constructive feedback.

References

[1] Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1):5–43, 2003.
[2] C. M. Bishop and J. Lasserre. Generative or discriminative? Getting the best of both worlds. Bayesian Statistics, 8:3–24, 2007.
[3] D. M. Blei and J. D. McAuliffe. Supervised topic models. In Proc. NIPS, pages 121–128, 2007.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
[5] David M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
[6] David M. Blei and John D. Lafferty. Dynamic topic models. In Proc. ICML, pages 113–120. ACM, 2006.
[7] G. Bouchard and B. Triggs. The tradeoff between generative and discriminative classifiers. In Proc. COMPSTAT, pages 721–728, 2004.
[8] Jianshu Chen, Ji He, Yelong Shen, Lin Xiao, Xiaodong He, Jianfeng Gao, Xinying Song, and Li Deng. End-to-end learning of LDA by mirror-descent back propagation over a deep architecture. In Proc. NIPS, pages 1765–1773, 2015.
[9] Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement learning with Gaussian processes. In Proc. ICML, pages 201–208, 2005.
[10] Matthew Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs. In Proc.
AAAI-SDMIA, November 2015.
[11] Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. Deep reinforcement learning with a natural language action space. In Proc. ACL, 2016.
[12] A. Holub and P. Perona. A discriminative framework for modelling object classes. In Proc. IEEE CVPR, volume 1, pages 664–671, 2005.
[13] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. In Learning in Graphical Models, pages 105–161. Springer, 1998.
[14] Michael I. Jordan. Learning in Graphical Models, volume 89. Springer Science & Business Media, 1998.
[15] S. Kapadia. Discriminative Training of Hidden Markov Models. PhD thesis, University of Cambridge, 1998.
[16] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[17] Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, DTIC Document, 1993.
[18] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
[19] Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. Language understanding for text-based games using deep reinforcement learning. In Proc. EMNLP, 2015.
[20] David Silver, Aja Huang, Chris J.
Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 2016.
[21] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.
[22] Christopher Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3–4):279–292, 1992.
[23] Oksana Yakhnenko, Adrian Silvescu, and Vasant Honavar. Discriminatively trained Markov model for sequence classification. In Proc. IEEE ICDM, 2005.