Kernel-based Reinforcement Learning for Traffic Signal Control with Adaptive Feature Selection

Tianshu Chu^1, Jie Wang^1, and Jian Cao^2

^1 Tianshu Chu and Jie Wang are with the Department of Civil and Environmental Engineering, Stanford University, CA 94305, USA {cts1988, jiewang}@stanford.edu
^2 Jian Cao is with the Department of Computer Science and Engineering, Shanghai JiaoTong University, Shanghai, China [email protected]

Abstract— Reinforcement learning in a large-scale system is computationally challenging due to the curse of dimensionality. One approach is to approximate the Q-function as a function of a state-action related feature vector and then learn the parameters instead. Although prior knowledge can help propose an appropriate feature vector, selecting a biased one that insufficiently represents the system usually leads to poor learning performance. To avoid this disadvantage, this paper introduces kernel methods to implicitly propose a learnable feature vector instead of a pre-selected one. More specifically, the feature vector is estimated from a reference set that contains all critical state-action pairs observed so far, and it can be updated by either adding a new pair or replacing an existing one in the reference set. Thus the approximate Q-function keeps adjusting itself as knowledge about the system accumulates through observations. Our algorithm is designed in both batch mode and online mode in the context of traffic signal control. In addition, the convergence of this algorithm is experimentally supported. Furthermore, some regularization methods are proposed to avoid overfitting of the Q-function on noisy observations. Finally, a simulation of traffic signal control at a single intersection is provided, and the performance of this algorithm is compared with Q-learning, in which the Q-function is numerically estimated for each state-action pair without approximation.

I. INTRODUCTION

Reinforcement learning (RL) aims to learn the optimal policy from interactions with the environment. RL is formalized in the framework of the Markov decision process (MDP), where the learner gains decision-supporting knowledge about the underlying structures of the environment from a sequence of observations [1]. In online mode, the learner updates its knowledge after each observation with the temporal difference (TD); in batch mode, it learns in a single step once enough observations are collected. Q-learning is a typical RL algorithm in which the learner accumulates knowledge about the underlying Q-function (or action-value function) to determine the optimal policy [2].

A. Preliminaries

In an MDP, the environment is represented as a four-tuple $(\mathcal{X}, \mathcal{A}, g, p)$, where $\mathcal{X}$ and $\mathcal{A}$ are the state and action spaces respectively, $p(y \mid x, a)$ is the probability of transitioning to the next state $y \in \mathcal{X}$ given the current state-action pair (joint state) $(x, a)$, and $g(y \mid x, a)$ is the corresponding stage cost. When $g(y \mid x, a)$ is bounded, the optimal cost function $J^*(x)$ of the MDP satisfies Bellman's equation $J^* = TJ^*$, where $T$ is the dynamic programming (DP) operator [3]. Equivalently, if we define the Q-function as
$$Q(x, a) := \sum_{y \in \mathcal{X}} p(y \mid x, a)\big(g(y \mid x, a) + \alpha J(y)\big),$$
where $\alpha \in (0, 1)$ is the discount rate, then $J^*(x) = Q^*(x, \mu^*(x))$ with the optimal policy $\mu^*(x) \in \arg\min_{a \in \mathcal{A}} Q^*(x, a)$, and $Q^*$ can be obtained by solving $Q = FQ$ with
$$(FQ)(x, a) = \sum_{y \in \mathcal{X}} p(y \mid x, a)\big(g(y \mid x, a) + \alpha \min_{b \in \mathcal{A}} Q(y, b)\big)$$
[4].
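As a toy illustration of this fixed-point characterization, the following Python sketch applies the operator $F$ repeatedly on a small, randomly generated two-state, two-action MDP of our own invention (the numbers are not from this paper); since $F$ is an $\alpha$-contraction, the iteration approaches $Q^*$, from which $\mu^*$ is read off.

```python
import numpy as np

# Toy finite MDP (randomly generated numbers, purely illustrative).
n_x, n_a, alpha = 2, 2, 0.8
rng = np.random.default_rng(0)
P = rng.random((n_x, n_a, n_x))
P /= P.sum(axis=2, keepdims=True)        # p(y | x, a)
g = rng.random((n_x, n_a, n_x))          # stage cost g(y | x, a)

Q = np.zeros((n_x, n_a))
for _ in range(200):
    # (FQ)(x, a) = sum_y p(y|x,a) * (g(y|x,a) + alpha * min_b Q(y, b))
    Q = np.einsum('xay,xay->xa', P, g + alpha * Q.min(axis=1)[None, None, :])

mu_star = Q.argmin(axis=1)               # optimal policy mu*(x)
```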
In stochastic control, the agent knows all required information, so we call it the expert, or the oracle. In contrast, a learner does not know the underlying environmental structures, such as the transition probability. Thus it has to gain knowledge about them by observing the sampled trajectory. As a result, the optimal cost of an expert usually gives a lower bound on that of a learner. In Q-learning, the knowledge of the learner is represented as the Q-function. For finite MDPs, the values $Q(x, a)$ can be numerically stored at each joint state, while for infinite (continuous) MDPs, the Q-function is usually approximated as the inner product $\tilde{Q}(x, a) = \langle w, \phi(x, a) \rangle$, where $w$ is the weight vector to be learned and $\phi(x, a)$ is a pre-selected feature vector. Furthermore, $w$ depends on critical historical observations in the reference set $\mathcal{R}$, so it can be rewritten as a linear combination of reference feature vectors: $w = \langle \theta, \phi((x, a)_\mathcal{R}) \rangle$, where $\theta$ is the weight vector to be learned. Then the approximate Q-function can be represented as
$$\tilde{Q}_\mathcal{R}(x, a) = \langle \theta, k((x, a)_\mathcal{R}, (x, a)) \rangle, \qquad (1)$$
where $k((x, a)_\mathcal{R}, (x, a))$ is an $|\mathcal{R}|$-dimensional vector in which each element is a kernel $k((y, b), (x, a))$ implicitly representing the inner product $\langle \phi(y, b), \phi(x, a) \rangle$ for any $(y, b) \in \mathcal{R}$.

B. Related Work

A common approach for implementing Q-learning in infinite MDPs is to approximate $Q(x, a)$ as $\tilde{Q}(x, a)$ and update $w$ with observations to obtain a more "correct" $\hat{Q}^*$. Specifically, in online mode such as SARSA learning [5], $w$ is updated with the gradient descent of the TD error, while in batch mode such as fitted Q iteration [6], it is updated with regressions on $\hat{Q}^*$ estimated from Monte Carlo sampling of observations. Therefore, the key procedure in approximate Q-learning is to design a "good" feature vector, as in supervised learning [7]. In the online state aggregation algorithm, $\phi(x, a)$ was an indicator vector mapping joint states to clusters, and $w$ contained the Q-values of clusters on this "induced MDP"; the idea of online classification was then applied [8]. In a further modified version, $w$ represented the Q-values of some critical states, and $\phi$ was designed as an exponentiated "distance function" [9]. Others applied batch-mode classification or regression. For classification on $\hat{\mu}^*$, Bagnell et al. suggested a 0/1 loss classifier for policy search with a "baseline distribution" [10], while Lagoudakis et al. proposed other classifiers such as SVMs and neural networks [11]. For regression of $\hat{Q}^*$, Ormoneit et al. maintained a training set $S_a$ for each action and suggested $k_a(x_i, x) = k((x_i, a), (x, a))$ as a normalized local weight [12], while Ernst et al. suggested a tree-based kernel for each joint state [13]. Q-learning was first introduced to traffic signal control by Wiering [14], and was then integrated with other techniques, such as genetic algorithms [15], neural networks [16], and neuro-fuzzy systems [17]. Approximate Q-learning was introduced later [18]. However, all of these studies implement existing algorithms in the traffic signal control domain, with feature vectors carefully designed from real-time traffic conditions. This paper proposes a more adaptive kernel-based Q-learning algorithm, making the feature vector learnable from accumulated observations.
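Before turning to our algorithm, the following minimal Python sketch shows how the representation in Eq. (1) is evaluated once $\theta$ and the reference set are given. The RBF form of the kernel and the encoding of a joint state $(x, a)$ as a numeric vector $z$ are illustrative assumptions, not choices made in this paper.

```python
import numpy as np

def rbf_kernel(z1, z2, sigma2=1.0):
    # k((y, b), (x, a)) = exp(-||z1 - z2||^2 / (2 sigma2)); joint states are
    # assumed to be encoded as numeric vectors z1, z2.
    return np.exp(-np.sum((z1 - z2) ** 2) / (2.0 * sigma2))

def q_tilde(theta, reference, z, kernel=rbf_kernel):
    # Eq. (1): Q~_R(x, a) = <theta, k((x, a)_R, (x, a))>.
    k_R = np.array([kernel(z_ref, z) for z_ref in reference])
    return float(theta @ k_R)
```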
In our algorithm, the learner initializes a small reference set (or even an empty set if no prior knowledge is available) and then keeps expanding or adjusting it to reduce the approximation bias of the Q-function. The learner also maintains the corresponding parameters for the reference set, such as $\hat{Q}^*$ and the kernel matrix, so it can efficiently learn and update the control actions. This algorithm is especially useful in traffic signal control, since a rich information stream about traffic conditions can be obtained online from sensors to support a better understanding of the environment. In this paper, we first propose the algorithm in both batch and online modes. Furthermore, we suggest some regularization methods to reduce the estimation variance of the algorithm. Finally, we apply this algorithm to a traffic signal control problem.

II. ALGORITHMS

In this section, we develop an adaptive Q-learning algorithm with the kernel method so that the learner can keep adjusting the feature vector with accumulated observations. First, we construct an adaptive reference set and define a parameter to control its growth. Second, the learning algorithm is provided in batch and online modes given the kernel matrix of the reference set. In addition, the convergence issues of this algorithm are discussed. Finally, we propose some context-based regularizations with additional assumptions on the MDP.

A. Adaptive Reference Set

In Q-learning, we can design the features as similarities between the current joint state and those in a fixed reference set developed from some prior knowledge. In other words, we can directly replace $(w, \phi)$ with $(\theta, k)$, because whenever $\phi$ appears it is in the form $\langle w, \phi \rangle$, as mentioned in Section I-A. With this representation, the reference set does not have to be fixed, since we can compute the kernel function between any two joint states for prediction. In particular, the learner can adaptively attach a new joint state to the reference set or replace an old one during the learning process. Usually, the attaching/replacing decision is made to minimize the estimation variance of $\hat{Q}^*_\mathcal{R}$. Typically, the variance increases when the joint states in the reference set are correlated. Here we use a single parameter $\delta \in [0, 1)$ and another kernel function $k_0$ to determine whether a new joint state is distinguishable from those in the reference set, as shown in Algorithm 1. Here $k_\mathcal{R}(x, a)$ is shorthand for $k((x, a)_\mathcal{R}, (x, a))$ in Eq. (1); the estimated Q-values $\hat{Q}^*_\mathcal{R}$ and the kernel matrix $K_\mathcal{R}$ of the reference set are also updated.

Algorithm 1 ARS$((x, a), \mathcal{R}, \hat{Q}^*_\mathcal{R}, K_\mathcal{R}, k, k_0, \delta)$
1: $[\hat{k}_0, (\hat{y}, \hat{b})] \leftarrow \max_{(y, b) \in \mathcal{R}} k_0((y, b), (x, a))$
2: if $\hat{k}_0 < \delta$ then
3:   $\mathcal{R} \leftarrow \mathcal{R} \cup (x, a)$
4:   $\hat{Q}^*_\mathcal{R} \leftarrow \hat{Q}^*_\mathcal{R} \cup \hat{Q}^*(x, a)$
5:   $K_\mathcal{R} \leftarrow [K_\mathcal{R},\ k_\mathcal{R}(x, a);\ k_\mathcal{R}^T(x, a),\ k((x, a), (x, a))]$
6: else
7:   $(\hat{y}, \hat{b}) \leftarrow (x, a)$ in $\mathcal{R}$
8:   $\hat{Q}^*(\hat{y}, \hat{b}) \leftarrow \hat{Q}^*(x, a)$ in $\hat{Q}^*_\mathcal{R}$
9:   $K_\mathcal{R}((\hat{y}, \hat{b}), \cdot) \leftarrow k_\mathcal{R}^T(x, a)$
10:  $K_\mathcal{R}(\cdot, (\hat{y}, \hat{b})) \leftarrow k_\mathcal{R}(x, a)$
11:  $K_\mathcal{R}((\hat{y}, \hat{b}), (\hat{y}, \hat{b})) \leftarrow k((x, a), (x, a))$
12: end if

B. Adaptive Batch Learning

Batch learning updates the policy after a collection of observations. The algorithm, within the framework of fitted Q iteration, is shown in Algorithm 2. The idea of this algorithm originates from kernel-based supervised learning algorithms, but instead of i.i.d. observations, the observations here are sampled from the MDP with some unknown transition probability.
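For concreteness, here is a minimal Python sketch of the ARS update in Algorithm 1; joint states are assumed to be encoded as numeric vectors, and the container types (a list for $\mathcal{R}$, NumPy arrays for $\hat{Q}^*_\mathcal{R}$ and $K_\mathcal{R}$) are our own choices.

```python
import numpy as np

def ars(z, q_hat, R, Q_R, K_R, k, k0, delta):
    # Sketch of Algorithm 1. R: list of encoded joint states; Q_R: their
    # estimated Q-values (1-D array); K_R: their kernel matrix; k, k0:
    # prediction and selection kernels; delta: similarity threshold in [0, 1).
    sims = np.array([k0(y, z) for y in R])
    k_vec = np.array([k(y, z) for y in R])
    if len(R) == 0 or sims.max() < delta:
        # Attach: z is sufficiently distinguishable from every reference point.
        R.append(z)
        Q_R = np.append(Q_R, q_hat)
        n = len(R)
        K_new = np.empty((n, n))
        K_new[:n - 1, :n - 1] = K_R
        K_new[n - 1, :n - 1] = k_vec
        K_new[:n - 1, n - 1] = k_vec
        K_new[n - 1, n - 1] = k(z, z)
        K_R = K_new
    else:
        # Replace the most similar reference point and its stored quantities.
        j = int(sims.argmax())
        R[j], Q_R[j] = z, q_hat
        K_R[j, :] = k_vec
        K_R[:, j] = k_vec
        K_R[j, j] = k(z, z)
    return R, Q_R, K_R
```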
Formally, $M$ is the underlying MDP, $k$ and $k_0$ are the kernel functions used for prediction and reference set updating, respectively, $n$ is the size of the collection of observations, $l$ is the loss function, $\eta$ is the smoothing rate, $\delta$ is the similarity threshold from Algorithm 1, $\lambda$ is the regularization parameter, and $\epsilon$ is used for the stopping condition.

Algorithm 2 ABL$(M, \mathcal{R}, k, k_0, n, l, \eta, \delta, \lambda, \epsilon)$
1: $\mathcal{R}(0) \leftarrow \mathcal{R}$; $\hat{Q}^*_{\mathcal{R}(0)} \leftarrow 0$; $S, \hat{Q}^*_S \leftarrow \emptyset$
2: $m, t \leftarrow 0$; $K_\theta \leftarrow I$
3: repeat
4:   for $t = 1 \to n$ do
5:     $\{g_t, y_{t+1}\} \leftarrow M(x_t, a_t)$
6:     $S \leftarrow S \cup (x_t, a_t)$
7:     $K_t \leftarrow k_{\mathcal{R}(m)}(y_{t+1}, \mathcal{A})$
8:     $\tilde{Q}_{\mathcal{R}(m)}(y_{t+1}, \mathcal{A}) \leftarrow \mathrm{norm}(K_t K_\theta)\,\hat{Q}^*_{\mathcal{R}(m)}$
9:     $\hat{Q}^*_S \leftarrow \hat{Q}^*_S \cup \big[\eta\big(g_t + \alpha\min_{b \in \mathcal{A}}\tilde{Q}_{\mathcal{R}(m)}(y_{t+1}, b) - \tilde{Q}_{\mathcal{R}(m)}(x_t, a_t)\big) + \tilde{Q}_{\mathcal{R}(m)}(x_t, a_t)\big]$
10:  end for
11:  $m \leftarrow m + 1$
12:  $\mathcal{R}(m) \leftarrow \mathcal{R}(m - 1)$
13:  $\hat{Q}^*_{\mathcal{R}(m)} \leftarrow \hat{Q}^*_{\mathcal{R}(m-1)}$
14:  for $j = 1 \to n$ do
15:    ARS$((x, a)_{S_j}, \mathcal{R}(m), \hat{Q}^*_{\mathcal{R}(m)}, K_{\mathcal{R}(m)}, k, k_0, \delta)$
16:  end for
17:  $\theta \leftarrow \arg\min_{\theta'} \sum_{j \in \mathcal{R}(m)} l(\tilde{Q}_j, \hat{Q}^*_j) + \frac{\lambda}{2}\|\tilde{Q}_{\mathcal{R}(m)}\|^2$
18:  $K_\theta \leftarrow \theta / \hat{Q}^*_{\mathcal{R}(m)}$
19:  $S, \hat{Q}^*_S \leftarrow \emptyset$
20: until $\|\tilde{Q}_{\mathcal{R}(m)}(S) - \tilde{Q}_{\mathcal{R}(m-1)}(S)\|_\infty < \epsilon$

Now we explain Algorithm 2 more closely. First, we estimate the next-step kernel-based Q-function $\tilde{Q}_\mathcal{R}$ from the learned parameter $\theta$ for acting (lines 7-8). Specifically, we (1) estimate the kernel vectors $K_t$ for all actions, representing the similarities between $(y_{t+1}, b)$ and $(x, a)_\mathcal{R}$, $\forall b \in \mathcal{A}$; (2) estimate the equivalent "transition matrix" $\hat{P}$ as $K_t K_\theta$ and normalize it by the maximum row sum so that the sum of each row is not greater than 1; and (3) estimate $\tilde{Q}_\mathcal{R}$ as $\hat{P}\hat{Q}^*_\mathcal{R}$. Second, we estimate the sample Q-values using a one-step projection with observations, $F\tilde{Q}$ (line 9). Third, when enough samples are collected, lines 12-16 determine the attaching/replacing of the reference set based on Algorithm 1. Fourth, the new parameter $\theta$ is learned by minimizing the empirical difference between the kernel-based Q-functions and the sample Q-values (line 17), and is then adjusted by dividing by the estimated sample Q-values. Finally, a stopping condition decides when the Q-function has converged (line 20). Obviously, when the same sample sets are used in each iteration, the reference set remains unchanged and this algorithm reduces to fitted Q iteration.

1) Loss Functions: The most important step in Algorithm 2 is to update $\theta$ so that it better explains the new observations, i.e., we should define appropriate loss functions so that $\tilde{Q}_\mathcal{R}$ is a good regression on $\hat{Q}^*_\mathcal{R}$. Popular loss functions for regression are the quadratic loss, the absolute-value loss, and the $\epsilon$-insensitive loss, while those for classification are the 0/1 loss, the hinge loss, and the sigmoid loss. Most loss functions admit a closed-form solution for $\theta$. In particular, we rewrite line 17 in matrix form:
$$\theta \in \arg\min_{\theta'} \sum_{j \in \mathcal{R}} l\big((K_\mathcal{R}\theta')_j, \hat{Q}^*_j\big) + \frac{\lambda}{2}\theta'^T K_\mathcal{R}\theta', \qquad (2)$$
where the second term is the $L_2$-norm regularization, and $K_\mathcal{R}$ here is the new kernel matrix from the reference set updated (line 15) with the new observations $\{(x, a)_S, \hat{Q}^*_S\}$. Taking the quadratic loss $l_j = \frac{1}{2}\big((K_\mathcal{R}\theta')_j - \hat{Q}^*_j\big)^2$ as an example, the solution can be easily computed as $\theta = (K_\mathcal{R} + \lambda I_{|\mathcal{R}|})^{-1}\hat{Q}^*_\mathcal{R}$, i.e., $K_\theta = (K_\mathcal{R} + \lambda I_{|\mathcal{R}|})^{-1}$. For classification, however, we also need to estimate $\hat{\mu}^*$. A simple way is to replace $\tilde{Q}_\mathcal{R}$ with $\hat{Q}^*$ for all samples and then estimate the optimal policy again.

2) Convergence Issues: It is known that not every gradient method guarantees convergence in fitted Q iteration [19]. Here, we apply the normalization in line 8 to make $\hat{P}$ a stochastic matrix. It can then easily be proved that $F$ is a contraction mapping from the definition $F\tilde{Q}_\mathcal{R} = g + \alpha\hat{P}\tilde{Q}_\mathcal{R}$ (a similar proof can be found in the appendix of [13]), i.e., our algorithm does converge once $\mathcal{R}$ is a fixed reference set. We can also verify the convergence under the extreme conditions $\delta \to 0$ and $\delta \to 1$. This implies that the algorithm converges almost surely once the iteration number is sufficiently large.
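To make the two computations above concrete, the sketch below shows the quadratic-loss closed form for $K_\theta$ (line 17) and the normalized prediction of lines 7-8; normalizing by the largest absolute row sum is our reading of the "maximum row summation" normalization.

```python
import numpy as np

def ridge_K_theta(K_R, lam):
    # Quadratic loss in line 17: theta = (K_R + lam*I)^(-1) Q*_R,
    # i.e. K_theta = (K_R + lam*I)^(-1).
    return np.linalg.inv(K_R + lam * np.eye(K_R.shape[0]))

def predict_next_q(K_t, K_theta, Q_R):
    # Lines 7-8: P_hat = norm(K_t K_theta); dividing by the largest row sum
    # keeps every row sum of P_hat at most 1, which underlies the contraction
    # argument for F Q~_R = g + alpha * P_hat Q~_R.
    P_hat = K_t @ K_theta
    P_hat = P_hat / max(np.abs(P_hat).sum(axis=1).max(), 1e-12)
    return P_hat @ Q_R      # Q~_R(y_{t+1}, a) for every action a
```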
C. Adaptive Online Learning

Online learning updates the policy after each observation in a sequence. It is more elegant and natural than ABL, since learning and acting occur at the same frequency. Our algorithm (Algorithm 3) is developed by introducing the kernel trick into the framework of SARSA learning, with $T$ the planning horizon and $\eta$ the learning rate. In online learning, $(w, \phi)$ is preferred in the stochastic gradient update rule, so we generate the pseudo feature matrix $\Phi := \phi((x, a)_\mathcal{R})$ from $K_\mathcal{R} := \Phi\Phi^T$ with the Cholesky decomposition. This operation automatically performs regularization, since $\Phi$ is a lower triangular matrix in which orthogonal components are used and information fades along the least important principal components. Therefore the Cholesky decomposition is simple yet useful in signal control problems with richly informative observations. AOL shares the convergence properties of SARSA learning when $T$ is sufficiently large.

Now we examine Algorithm 3 closely. First, lines 3-10 implement ARS without the replacing operation. This is because $w$ is updated from the previous reference set; replacing an old reference joint state would also throw away its weight. Second, line 11 estimates the feature vector of the newly observed joint state in two different ways, depending on whether it is already inside the reference set (second formula) or not (first formula). Finally, lines 12-13 estimate the kernel-based Q-function and use it to select the optimal action, and line 15 updates $w$ with the TD.

Algorithm 3 AOL$(M, \mathcal{R}, k, k_0, T, \delta, \eta)$
1: $\mathcal{R}$, $K_\mathcal{R}$; $\Phi \leftarrow \mathrm{Chol}(K_\mathcal{R})$; $w \leftarrow 0$
2: for $t = 1 \to T$ do
3:   $\hat{k}_0 \leftarrow \max_{(x, a) \in \mathcal{R}} k_0((x, a), (x_t, a_t))$
4:   if $\hat{k}_0 < \delta$ then
5:     $\mathcal{R} \leftarrow \mathcal{R} \cup (x_t, a_t)$
6:     $K_\mathcal{R} \leftarrow [K_\mathcal{R},\ k_\mathcal{R}(x_t, a_t);\ k_\mathcal{R}^T(x_t, a_t),$
7:         $k((x_t, a_t), (x_t, a_t))]$
8:     $\Phi \leftarrow \mathrm{Chol}(K_\mathcal{R})$
9:     $w \leftarrow w \cup 0$
10:  end if
11:  $\tilde{\phi}(x_t, a) \leftarrow \{\Phi^{-1}k_\mathcal{R}(x_t, a);\ \phi(x_t, a)\}$, $\forall a \in \mathcal{A}$
12:  $\tilde{Q}(x_t, a) \leftarrow w^T\tilde{\phi}(x_t, a)$, $\forall a \in \mathcal{A}$
13:  $a_t \leftarrow \arg\min_{a \in \mathcal{A}} \tilde{Q}(x_t, a)$
14:  $\{g_t, y_{t+1}\} \leftarrow M(x_t, a_t)$
15:  $w \leftarrow \eta\tilde{\phi}(x_t, a_t)\big(g_t + \alpha\min_{b \in \mathcal{A}}\tilde{Q}(y_{t+1}, b) - \tilde{Q}(x_t, a_t)\big) + w$
16: end for

D. Context-based Refinements

Several methods can improve and simplify the general ABL and AOL algorithms given additional assumptions. First, if we have prior knowledge about the joint state space, we can import an appropriate initial reference set $\mathcal{R}$. Second, if the action space is discrete and contains categorical rather than numerical variables (e.g., signal colors), then a more natural approach is to treat each action as a "class label" and to use $k(x)$ and $\{\theta_a\}_{a \in \mathcal{A}}$ (or $\phi(x)$ and $\{w_a\}_{a \in \mathcal{A}}$ for AOL) instead of $k(x, a)$ and $\theta$ (or $\phi(x, a)$ and $w$ for AOL). We do not need to maintain a separate set for each action if all actions share the same initial reference set $\mathcal{R} \subseteq \mathcal{X}$. Furthermore, if the state space is also discrete, we can calculate $K$ for all $x \in \mathcal{X}$ off-line, so the learner only needs to look up $K_\mathcal{R}$, $\forall \mathcal{R} \subseteq \mathcal{X}$, during online learning. Another refinement is to apply nonparametric approaches to estimate a weighted $\theta$ as well. For example, if we use locally weighted regression and set $\delta \to 0$, $\lambda = 0$, then the loss function becomes
$$l_j = \frac{1}{2}K_w(j)\big((K_\mathcal{R}\theta)_j - \hat{Q}^*((x, a)_j)\big)^2, \qquad (3)$$
$$K_w(j) = \exp\left(-\frac{\|K_t - K_\mathcal{R}(j, \cdot)\|_2^2}{2b^2}\right), \qquad (4)$$
where $K_w(j)$ is the weight kernel. Now if we define a diagonal matrix $W$ such that $W_{jj} = K_w(j)$, then the loss becomes $\frac{1}{2}(K_\mathcal{R}\theta - \hat{Q}^*_\mathcal{R})^T W (K_\mathcal{R}\theta - \hat{Q}^*_\mathcal{R})$, and we obtain a closed-form expression for $\theta$:
$$\theta = (K_\mathcal{R}^T W K_\mathcal{R})^{-1} K_\mathcal{R}^T W \hat{Q}^*_\mathcal{R}. \qquad (5)$$
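A minimal Python sketch of the locally weighted solution in Eqs. (3)-(5); here $K_t$ is taken to be the kernel vector of the query joint state against the reference set, and the bandwidth $b$ is left as a free parameter.

```python
import numpy as np

def locally_weighted_theta(K_R, Q_R, K_t, b):
    # Eq. (4): weight each reference point j by the closeness of its kernel
    # row K_R[j, :] to the query kernel vector K_t.
    w = np.exp(-np.sum((K_t - K_R) ** 2, axis=1) / (2.0 * b ** 2))
    W = np.diag(w)
    # Eq. (5): theta = (K_R^T W K_R)^(-1) K_R^T W Q*_R.
    return np.linalg.solve(K_R.T @ W @ K_R, K_R.T @ W @ Q_R)
```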
III. APPLICATIONS

In this section, we apply the ABL and AOL algorithms to a traffic signal control problem at a single intersection, with four roads connected to it and incoming traffic flow probabilities $p^{\mathrm{in}}_t \in [0, 1]^4$. Each road has a main lane with maximum capacity $M_s$, and a left-turn lane near the intersection with maximum capacity $M_l < M_s$. Each car takes an action in $\{\mathrm{goStraight}, \mathrm{turnLeft}, \mathrm{turnRight}\}$, sampled from a probability simplex $\Delta^{\mathrm{out}}_t(i)$. Therefore the MDP can be defined as follows.

Action space: for each signal, $\mathcal{A}_i = \{\mathrm{Green}(\uparrow), \mathrm{Green}(\leftarrow), \mathrm{Green}(\uparrow, \leftarrow), \mathrm{Red}\}$, while for the intersection, $\mathcal{A} = \{\mathrm{Green}(\uparrow)_{1,3}, \mathrm{Green}(\leftarrow)_{1,3}, \mathrm{Green}(\uparrow)_{2,4}, \mathrm{Green}(\leftarrow)_{2,4}, \mathrm{Green}(\uparrow, \leftarrow)_{1:4}\}$.

State space: $\mathcal{X} = \{0, \ldots, M_s\}^4 \times \{0, \ldots, M_l\}^4 = \prod_{i=1}^{4}\mathcal{X}_i$.

Transition matrix: there are three types of random variables in this system: incoming cars $d_t(i) \sim p^{\mathrm{in}}_t(i)$, lane-changing cars $e_t(i) \sim \Delta^{\mathrm{out}}_t(i, 2)$, and outgoing cars $s_{t, a_t}(i) \sim \Delta^{\mathrm{out}}_t(i) \mid a_t$. We can sample the next state with them. For example, $x_{s,t+1}(i) = x_{s,t}(i) + d_t(i) - e_t(i) - s_{t, a_t}(i)$ if no boundary conditions need to be considered. Generally speaking, $x_{s,t+1}(i) - x_{s,t}(i) \in \{-2, \ldots, 1\}$ and $x_{l,t+1}(i) - x_{l,t}(i) \in \{-1, \ldots, 1\}$.

Stage cost: $g_t(x_t, a_t) = g_a^T x_{s,t} + g_b^T r_{t,a}$, where the second term counts the number of rejected incoming cars. For example, $r_{s,t,a}(i) = \big(x_{s,t}(i) + d_t(i) - e_t(i) - s_{s,t,a_t}(i) - M_s\big)^+$.

For the following experiments, the default settings are: $M_l = 1$, $M_s = 3$, $l$ = quadratic loss, $\eta = 0.01$, $\lambda = 0.1$, $\mathcal{R}(0) = \emptyset$, $k = k_0 = $ RBF (GK), $\alpha = 0.8$, $g_a = [1, 2, 4, 3]^T$, $g_b = [M_s, M_l]$, $p^{\mathrm{in}}_t = [0.5, 0.6, 0.3, 0.2]^T$, and $\Delta^{\mathrm{out}}_t(i) = [1/3, 1/3, 1/3]$, $\forall t$.

A. Convergence

The convergence of AOL should be equivalent to that of SARSA learning, and we ran several sample trajectories to confirm this in this application. The main analysis in this section tests the convergence speed of ABL. We collect sample observations online based on the current $\tilde{Q}$ and fix $\eta$, since it hardly influences the convergence speed of $\tilde{Q}$. We then test the convergence of $\tilde{Q}$ at a specific joint state $(x, a) = (0, 2)$. First, we verify the influence of the sample size $n$ by fixing $\delta = 0.8$. From Figure 1(a), we can see that $\tilde{Q}$ converges faster when $n$ is smaller. Another disadvantage of a large $n$ in online sample collection is that the learner may get stuck in a set of states that is insufficient for the next update. A simple way to avoid this situation is to add some random exploration to ABL. Next, we fix $n = 1$ and verify how $\delta$ influences the convergence speed. This setting differs from AOL because of the different loss functions. From Figure 1(b), we can see that for larger $\delta$, $\tilde{Q}$ updates more stably, because of the high attaching/replacing rate of the reference set.

Fig. 1: Experimental convergence speeds of ABL (Q-value versus time). (a) Influence of n on convergence. (b) Influence of δ on convergence.

B. Performance

Online learning is well suited to traffic signal control, so we evaluate the performance of AOL with different kernels, namely the Gaussian kernel (GK), the Laplacian kernel (LK), and the polynomial kernel (PK), and compare it with traditional Q-learning. Here we use $\phi(x)$ and $w_a$ instead of $\phi(x, a)$ and $w$ because the actions are categorical variables. First, we fix $\mathcal{R} = \mathcal{X}$, $\delta = 0$ to verify the performance of the kernel method without the adaptive reference set. By selecting appropriate parameters for the different kernels¹, we obtain the plots in Figure 2(a).

¹All learners: $\eta = 0.001$; GK: $\sigma^2 = 10^5$; LK: $\sigma^2 = 50$; PK: $p = 3$.
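For reference, a sketch of the three kernels compared in Figure 2(a), with the parameter values from the footnote; the exact functional forms (and the numeric encoding of states they act on) are our assumptions, since they are not spelled out here.

```python
import numpy as np

def gaussian_kernel(x, y, sigma2=1e5):      # GK, sigma^2 = 10^5
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2.0 * sigma2))

def laplacian_kernel(x, y, sigma2=50.0):    # LK, sigma^2 = 50
    return np.exp(-np.sum(np.abs(np.asarray(x) - np.asarray(y))) / sigma2)

def polynomial_kernel(x, y, p=3):           # PK, degree p = 3
    return (1.0 + float(np.dot(np.asarray(x), np.asarray(y)))) ** p
```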
Here the performance is estimated as the mean time-average cost of 100 sample trajectories:
$$\bar{g}(t) = \frac{1}{100}\,\frac{1}{t}\sum_{i=1}^{100}\sum_{\tau=1}^{t} g(\tau)_i.$$
We can see that all the kernel-learners perform better than the Q-learner. In particular, the performance of the PK-learner is almost as low as that of the DP expert. Interestingly, the kernel-learners use approximate Q-functions, yet they perform even better than Q-learning. One possible explanation is that they apply restricted exploitation to avoid getting stuck in undesired local minima due to insufficient observations. We also plot $\bar{g}(t)$ of the kernel-learners directly using $(\theta, k)$ (dashed lines), revealing that the Cholesky decomposition may either increase or decrease the performance of the kernel-learners.

Second, we set $k$ to PK and verify the influence of $\delta$ with different initial reference sets $\mathcal{R} = \mathcal{X}_{1:N}$. Specifically, we estimate $\bar{g}(t)$ with $\delta \in \{0, 0.9\}$ and $N \in \{500, 1000\}$. We use a high $\delta$ since both $M_s$ and $M_l$ are very small, leading to high similarities overall. From Figure 2(b), we can see that for the fixed reference set, $\bar{g}$ is reduced slightly when $N$ is increased from 500 to 1000. However, for the adaptive reference set, $N$ has almost no influence on the performance. In other words, compared to a fixed reference set, especially one with high bias, an adaptive reference set can improve the performance of the learner.

Fig. 2: Performances of AOL with different kernels (time-average cost versus time). (a) Learners with maximum reference sets. (b) PK-learner with adaptive reference sets.

IV. DISCUSSION AND EXTENSIONS

This paper develops a kernel-based adaptive Q-learning algorithm in both batch and online modes. The idea is (1) to maintain a learnable reference set of joint states, and (2) to select appropriate kernel functions to approximate the Q-function with the desired exploitation/exploration tradeoff. In addition, several refinements are suggested for general kernel-based algorithms, such as generating feature vectors from the kernel matrix through the Cholesky decomposition, and casting the RL problem with categorical actions as a multi-class classification problem. The convergence and effectiveness of this algorithm are demonstrated in an application to traffic signal control at a single intersection. For future research, improvements can be made in the following directions: (1) Currently only a single parameter is used to determine the attaching/replacing of the reference set; more sophisticated criteria could be applied to achieve a better variance/bias tradeoff. (2) Once the size of the reference set becomes large, the computational cost of searching the reference set and updating the kernel matrix is high, so appropriate data structures could be used to maintain this flexible data and increase computational efficiency.
(3) In this application, the performance of some kernel-based learners is boosted by the Cholesky decomposition; we will investigate how the orthogonality principle influences feature design in different kernel-learners.

REFERENCES

[1] R. S. Sutton and A. G. Barto, "Reinforcement learning: An introduction," IEEE Transactions on Neural Networks, vol. 9, no. 5, pp. 1054–1054, 1998.
[2] J. N. Tsitsiklis, "Asynchronous stochastic approximation and Q-learning," Machine Learning, vol. 16, no. 3, pp. 185–202, 1994.
[3] R. Bellman, "A Markovian decision process," DTIC Document, Tech. Rep., 1957.
[4] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," arXiv preprint cs/9605103, 1996.
[5] G. A. Rummery, "Problem solving with reinforcement learning," Ph.D. dissertation, University of Cambridge, 1995.
[6] G. J. Gordon, "Approximate solutions to Markov decision processes," Robotics Institute, p. 228, 1999.
[7] C. Szepesvári, "Algorithms for reinforcement learning," Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 4, no. 1, pp. 1–103, 2010.
[8] D. P. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific, Belmont, 1995, vol. 1.
[9] C. Szepesvári and W. D. Smart, "Interpolation-based Q-learning," in Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 2004, p. 100.
[10] J. A. Bagnell, S. Kakade, A. Y. Ng, and J. Schneider, "Policy search by dynamic programming," 2004.
[11] M. Lagoudakis and R. Parr, "Reinforcement learning as classification: Leveraging modern classifiers," in Machine Learning: International Workshop then Conference, vol. 20, no. 1, 2003, p. 424.
[12] D. Ormoneit and S. Sen, "Kernel-based reinforcement learning," Machine Learning, vol. 49, no. 2-3, pp. 161–178, 2002.
[13] D. Ernst, P. Geurts, and L. Wehenkel, "Tree-based batch mode reinforcement learning," Journal of Machine Learning Research, vol. 6, 2005.
[14] M. Wiering, J. Van Veenen, J. Vreeken, and A. Koopman, "Intelligent traffic light control," Institute of Information and Computing Sciences, Utrecht University, 2004.
[15] S. Mikami and Y. Kakazu, "Genetic reinforcement learning for cooperative traffic signal control," in Proceedings of the First IEEE Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence. IEEE, 1994, pp. 223–228.
[16] D. Srinivasan, M. C. Choy, and R. L. Cheu, "Neural networks for real-time traffic signal control," IEEE Transactions on Intelligent Transportation Systems, vol. 7, no. 3, pp. 261–272, 2006.
[17] E. Bingham, "Reinforcement learning in neurofuzzy traffic signal control," European Journal of Operational Research, vol. 131, no. 2, pp. 232–241, 2001.
[18] C. Cai, C. K. Wong, and B. G. Heydecker, "Adaptive traffic signal control using approximate dynamic programming," Transportation Research Part C: Emerging Technologies, vol. 17, no. 5, pp. 456–474, 2009.
[19] J. N. Tsitsiklis and B. Van Roy, "Feature-based methods for large scale dynamic programming," Machine Learning, vol. 22, no. 1-3, pp. 59–94, 1996.