Kernel-based Reinforcement Learning for Traffic Signal Control
with Adaptive Feature Selection
Tianshu Chu¹, Jie Wang¹, and Jian Cao²
¹Tianshu Chu and Jie Wang are with the Department of Civil and Environmental Engineering, Stanford University, CA 94305, USA {cts1988, jiewang}@stanford.edu
²Jian Cao is with the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China [email protected]
Abstract— Reinforcement learning in a large-scale system is computationally challenging due to the curse of dimensionality. One approach is to approximate the Q-function as a function of a state-action related feature vector and then learn the parameters instead. Although prior knowledge can potentially suggest an appropriate feature vector, selecting a biased one that insufficiently represents the system usually leads to poor learning performance. To avoid this disadvantage, this paper introduces kernel methods to implicitly propose a learnable feature vector instead of a pre-selected one. More specifically, the feature vector is estimated from a reference set that contains all critical state-action pairs observed so far, and it can be updated by either adding a new pair or replacing an existing one in the reference set. Thus the approximate Q-function keeps adjusting itself as knowledge about the system accumulates through observations. Our algorithm is designed in both batch mode and online mode in the context of traffic signal control. In addition, the convergence of this algorithm is experimentally supported. Furthermore, some regularization methods are proposed to avoid overfitting the Q-function to noisy observations. Finally, a simulation of traffic signal control at a single intersection is provided, and the performance of this algorithm is compared with Q-learning, in which the Q-function is numerically estimated for each state-action pair without approximation.
I. INTRODUCTION
Reinforcement learning (RL) aims to learn the optimal policy from interactions with the environment. RL is formalized in the framework of the Markov decision process (MDP), where the learner gains decision-supporting knowledge about the underlying structures of the environment from a sequence of observations [1]. In online mode, the learner updates its knowledge after each observation with the temporal difference (TD), while in batch mode it learns in a single step once enough observations are collected. Q-learning is a typical RL algorithm in which the learner accumulates knowledge about the underlying Q-function (or action function) to determine the optimal policy [2].
A. Preliminaries
In an MDP, the environment is represented as a four-tuple $(\mathcal{X}, \mathcal{A}, g, p)$, where $\mathcal{X}$ and $\mathcal{A}$ are the state and action spaces respectively, $p(y \mid x, a)$ is the transition probability of moving to the next state $y \in \mathcal{X}$ given the current state-action pair (joint state) $(x, a)$, and $g(y \mid x, a)$ is the corresponding stage cost. When $g(y \mid x, a)$ is bounded, the optimal cost function $J^*(x)$ of the MDP satisfies Bellman's equation $J^* = TJ^*$, where $T$ is the dynamic programming (DP) operator [3]. Equivalently, if we define the Q-function as
$$Q(x,a) := \sum_{y \in \mathcal{X}} p(y \mid x, a)\big(g(y \mid x, a) + \alpha J(y)\big),$$
where $\alpha \in (0,1)$ is the discount rate, then $J^*(x) = Q^*(x, \mu^*(x))$ with the optimal policy $\mu^*(x) \in \arg\min_{a \in \mathcal{A}} Q^*(x,a)$, and $Q^*$ can be obtained by solving $Q = FQ$ with
$$(FQ)(x,a) = \sum_{y \in \mathcal{X}} p(y \mid x, a)\Big(g(y \mid x, a) + \alpha \min_{b \in \mathcal{A}} Q(y,b)\Big)$$
[4]. In stochastic control, the agent knows all required information, so we call it the expert, or the oracle. In contrast, a learner does not know the underlying environmental structures, such as the transition probability. Thus it has to gain knowledge about them from observations of the sampled trajectory. As a result, the optimal cost of an expert usually gives a lower bound on that of a learner. In Q-learning, the knowledge of the learner is represented by the Q-function. For finite MDPs, the values of the Q-function $Q(x,a)$ can be stored numerically for each joint state, while for infinite (continuous) MDPs, the Q-function is usually approximated as the inner product $\tilde{Q}(x,a) = \langle w, \phi(x,a) \rangle$, where $w$ is the weight vector to be learned and $\phi(x,a)$ is a pre-selected feature vector. Furthermore, $w$ depends on critical historical observations in the reference set $R$, so it can be rewritten as a linear combination of reference feature vectors, $w = \langle \theta, \phi((x,a)_R) \rangle$, where $\theta$ is the weight vector
to be learned. Then the approximate Q-function can be represented as
$$\tilde{Q}_R(x,a) = \langle \theta, k((x,a)_R, (x,a)) \rangle, \qquad (1)$$
where $k((x,a)_R, (x,a))$ is an $|R|$-dimensional vector in which each element is a kernel $k((y,b),(x,a))$ implicitly representing the inner product $\langle \phi(y,b), \phi(x,a) \rangle$, for any $(y,b) \in R$.
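As a concrete illustration of Eq. (1), the following minimal Python sketch (our own, not part of the paper; the Gaussian kernel, variable names, and data layout are assumptions) evaluates the approximate Q-value of a joint state against a reference set:

```python
import numpy as np

def gaussian_kernel(z1, z2, sigma2=1.0):
    """RBF kernel between two joint-state vectors (state and action stacked)."""
    diff = np.asarray(z1, dtype=float) - np.asarray(z2, dtype=float)
    return np.exp(-diff @ diff / (2.0 * sigma2))

def approx_q(z, reference, theta, kernel=gaussian_kernel):
    """Eq. (1): Q~_R(z) = <theta, k(R, z)>, one kernel value per reference pair."""
    k_vec = np.array([kernel(zr, z) for zr in reference])
    return float(theta @ k_vec)

# Hypothetical usage: two reference joint states in R and learned weights theta.
R = [np.array([0.0, 1.0, 2.0]), np.array([1.0, 0.0, 1.0])]
theta = np.array([0.5, -0.2])
print(approx_q(np.array([0.5, 0.5, 1.5]), R, theta))
```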
B. Related Work
A common approach for implementing Q-learning in infinite MDPs is to approximate $Q(x,a)$ as $\tilde{Q}(x,a)$ and update $w$ with observations to obtain a more "correct" $\hat{Q}^*$. Specifically, in online mode such as SARSA learning [5], $w$ is updated with the gradient descent of the TD error, while in batch mode such as fitted Q iteration [6], it is updated with regressions on $\hat{Q}^*$ estimated from Monte Carlo on sampled observations. Therefore, the key procedure in approximate Q-learning is to design a "good" feature vector, as in supervised learning [7]. In the online state aggregation algorithm, $\phi(x,a)$ was an indicator vector mapping joint states to clusters, and $w$ contained the Q-values of clusters on this "induced MDP"; the idea of online classification was then applied [8]. In a further modified version, $w$ represented the Q-values of some critical states, and $\phi$ was designed as an exponentiated "distance function" [9]. Others applied batch-mode classification or regression. For classification on $\hat{\mu}^*$, Bagnell et al. suggested a 0/1 loss classifier for policy search with a "baseline distribution" [10], while Lagoudakis et al. proposed other classifiers such as SVM and neural networks [11]. For regression of $\hat{Q}^*$, Ormoneit et al. maintained a training set $S_a$ for each action and suggested $k_a(x_i, x) = k((x_i,a),(x,a))$ as a normalized local weight [12], while Ernst et al. suggested a tree-based kernel for each joint state [13].
Q-learning was first introduced to traffic signal control by Wiering [14], and was later integrated with other techniques, such as genetic algorithms [15], neural networks [16], and neuro-fuzzy control [17]. Approximate Q-learning was introduced later [18]. However, all these studies apply existing algorithms to the domain of traffic signal control, with properly designed feature vectors built from real-time traffic conditions. This paper proposes a more adaptive kernel-based Q-learning algorithm, making the feature vector learnable from accumulated observations. In our algorithm, the learner initializes a small reference set (or even an empty set if no prior knowledge is available) and then keeps expanding or adjusting it to reduce the approximation bias of the Q-function. The learner also maintains the corresponding parameters for the reference set, such as $\hat{Q}^*$ and the kernel matrix, so it can efficiently learn and update the control actions. This algorithm is especially useful in traffic signal control, since a rich stream of traffic-condition information can be obtained online from sensors to support a better understanding of the environment.
In this paper, we first propose the algorithm in both batch and online modes. We then suggest some regularization methods to reduce the estimation variance of the algorithm. Finally, we apply the algorithm to a traffic signal control problem.
II. ALGORITHMS
In this section, we develop an adaptive Q-learning algorithm with the kernel method, so that the learner can keep adjusting the feature vector with accumulated observations. First, we construct an adaptive reference set and define a parameter to control its growth. Second, the learning algorithm is provided in batch and online modes, given the kernel matrix of the reference set. In addition, the convergence issues of this algorithm are discussed. Finally, we propose some context-based regularizations with additional assumptions on the MDP.
A. Adaptive Reference Set
In Q-learning, we can design the features as similarities between the current joint state and those in a fixed reference set, which is developed with some prior knowledge. In other words, we can directly replace $(w, \phi)$ with $(\theta, k)$, because whenever $\phi$ appears, it is in the form $\langle w, \phi \rangle$, as mentioned in Section I-A. With this representation, the reference set does not have to be fixed, since we can compute the kernel function between any two joint states for prediction. In particular, the learner can adaptively attach a new joint state or replace an old one in the reference set during the learning process. Usually, the attaching/replacing decision is made to minimize the estimation variance of $\hat{Q}^*_R$. Typically, the variance increases when the joint states in the reference set are correlated. Here we use a single parameter $\delta \in [0,1)$ and another kernel function $k_0$ to determine whether a new joint state is distinguishable from those in the reference set, as shown in Algorithm 1. Here $k_R(x,a)$ is shorthand for $k((x,a)_R,(x,a))$ in Eq. (1); the estimated Q-values $\hat{Q}^*_R$ and the kernel matrix $K_R$ of the reference set should also be updated.
B. Adaptive Batch Learning
Batch learning updates the policy after a collection of observations. The algorithm, within the framework of fitted Q iteration, is shown in Algorithm 2. The idea of this algorithm originates from kernel-based supervised learning algorithms; however, instead of being i.i.d., the observations here are sampled from the MDP with some unknown transition probability. Formally, $M$ is the underlying MDP, $k$ and $k_0$ are the kernel functions used for prediction and reference-set updating respectively, $n$ is the size of the collection of observations, $l$ is the loss function, $\eta$ is the smoothing rate, $\lambda$ is the regularization parameter, and $\epsilon$ is used for the stopping condition.

Algorithm 1 ARS$((x,a), R, \hat{Q}^*_R, K_R, k, k_0, \delta)$
1: $[\hat{k}_0, (\hat{y},\hat{b})] \leftarrow \max_{(y,b)\in R} k_0((y,b),(x,a))$
2: if $\hat{k}_0 < \delta$ then
3: $R \leftarrow R \cup (x,a)$
4: $\hat{Q}^*_R \leftarrow \hat{Q}^*_R \cup \hat{Q}^*(x,a)$
5: $K_R \leftarrow [K_R, k_R(x,a); k_R^T(x,a), k((x,a),(x,a))]$
6: else
7: $(\hat{y},\hat{b}) \leftarrow (x,a)$ in $R$
8: $\hat{Q}^*(\hat{y},\hat{b}) \leftarrow \hat{Q}^*(x,a)$ in $\hat{Q}^*_R$
9: $K_R((\hat{y},\hat{b}), \cdot) \leftarrow k_R^T(x,a)$
10: $K_R(\cdot, (\hat{y},\hat{b})) \leftarrow k_R(x,a)$
11: $K_R((\hat{y},\hat{b}), (\hat{y},\hat{b})) \leftarrow k((x,a),(x,a))$
12: end if
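To make the reference-set update concrete, here is a minimal Python sketch of Algorithm 1 (our own rendering; the list-based bookkeeping and data structures are assumptions, not the authors' implementation):

```python
import numpy as np

def ars(z, q_hat, ref, q_ref, K_ref, k, k0, delta):
    """Adaptive reference set update (a sketch of Algorithm 1).

    z     : new joint state (x, a) as a flat vector; q_hat: its estimated Q*-value
    ref   : list of reference joint states; q_ref: list of their estimated Q*-values
    K_ref : |R| x |R| kernel matrix of the reference set
    k, k0 : prediction kernel and the kernel used for the novelty test
    """
    k0_vals = [k0(zr, z) for zr in ref]
    j = int(np.argmax(k0_vals)) if ref else -1
    k_vec = np.array([k(zr, z) for zr in ref])
    if not ref or k0_vals[j] < delta:       # attach: z is distinguishable from R
        ref.append(z)
        q_ref.append(q_hat)
        n = len(ref)
        K_new = np.empty((n, n))
        K_new[:n - 1, :n - 1] = K_ref
        K_new[:n - 1, n - 1] = k_vec
        K_new[n - 1, :n - 1] = k_vec
        K_new[n - 1, n - 1] = k(z, z)
        K_ref = K_new
    else:                                   # replace the most similar reference pair
        ref[j], q_ref[j] = z, q_hat
        new_col = np.array([k(zr, z) for zr in ref])
        K_ref[j, :] = new_col
        K_ref[:, j] = new_col
    return ref, q_ref, K_ref
```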
Algorithm 2 ABL$(M, R, k, k_0, n, l, \eta, \delta, \lambda, \epsilon)$
1: $R(0) \leftarrow R$; $\hat{Q}^*_{R(0)} \leftarrow 0$; $S, \hat{Q}^*_S \leftarrow \emptyset$
2: $m, t \leftarrow 0$; $K_\theta \leftarrow I$
3: repeat
4: for $t = 1 \to n$ do
5: $\{g_t, y_{t+1}\} \leftarrow M(x_t, a_t)$
6: $S \leftarrow S \cup (x_t, a_t)$
7: $K_t \leftarrow k_{R(m)}(y_{t+1}, \mathcal{A})$
8: $\tilde{Q}_{R(m)}(y_{t+1}, \mathcal{A}) \leftarrow \mathrm{norm}(K_t K_\theta)\, \hat{Q}^*_{R(m)}$
9: $\hat{Q}^*_S \leftarrow \hat{Q}^*_S \cup \big[\eta\big(g_t + \alpha \min_{b\in\mathcal{A}} \tilde{Q}_{R(m)}(y_{t+1}, b) - \tilde{Q}_{R(m)}(x_t, a_t)\big) + \tilde{Q}_{R(m)}(x_t, a_t)\big]$
10: end for
11: $m \leftarrow m + 1$
12: $R(m) \leftarrow R(m-1)$
13: $\hat{Q}^*_{R(m)} \leftarrow \hat{Q}^*_{R(m-1)}$
14: for $j = 1 \to n$ do
15: ARS$((x,a)_{S_j}, R(m), \hat{Q}^*_{R(m)}, K_{R(m)}, k, k_0, \delta)$
16: end for
17: $\theta \leftarrow \arg\min_{\theta'} \sum_{j \in R(m)} l(\tilde{Q}_j, \hat{Q}^*_j) + \frac{\lambda}{2}\|\tilde{Q}_{R(m)}\|^2$
18: $K_\theta \leftarrow \theta / \hat{Q}^*_{R(m)}$
19: $S, \hat{Q}^*_S \leftarrow \emptyset$
20: until $\|\tilde{Q}_{R(m)}(S) - \tilde{Q}_{R(m-1)}(S)\|_\infty < \epsilon$
Now we explain Algorithm 2 more closely. First, we estimate the next-step kernel-based Q-function $\tilde{Q}_R$ from the learned parameter $\theta$ for acting (lines 7-8). Specifically, we (1) estimate the kernel vectors $K_t$ for all actions, representing the similarities between $(y_{t+1}, b)$ and $(x,a)_R$ for all $b \in \mathcal{A}$; (2) estimate the equivalent "transition matrix" $\hat{P}$ as $K_t K_\theta$ and normalize it by the maximum row sum so that the sum of each row is not greater than 1; and (3) estimate $\tilde{Q}_R$ as $\hat{P} \hat{Q}^*_R$. Second, we estimate the sample Q-values using a one-step projection with observations, $F\tilde{Q}$ (line 9). Third, when enough samples are collected, lines 12-16 determine the attaching/replacing of the reference set based on Algorithm 1. Fourth, the new parameter $\theta$ is learned by minimizing the empirical difference between the kernel-based Q-functions and the sample Q-values (line 17), and is then adjusted by dividing by the estimated sample Q-values. Finally, a stopping condition decides when the Q-function has converged (line 20). Clearly, when the same sample sets are used in each iteration, the reference set remains unchanged, and this algorithm reduces to fitted Q iteration.
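As an illustration of the prediction step in lines 7-8, the following sketch (our own; the array shapes and the row-sum normalization rule are assumptions based on the description above) estimates $\tilde{Q}_R(y_{t+1}, \cdot)$ from the learned $K_\theta$:

```python
import numpy as np

def estimate_q_next(K_t, K_theta, q_ref):
    """Estimate Q~_R(y_{t+1}, b) for all actions b, following lines 7-8 of Algorithm 2.

    K_t:     (|A| x |R|) kernel values between (y_{t+1}, b) and the reference pairs
    K_theta: (|R| x |R|) matrix mapping reference Q-values to weights
    q_ref:   (|R|,) estimated optimal Q-values of the reference pairs
    """
    P_hat = K_t @ K_theta                       # equivalent "transition matrix"
    max_row_sum = np.abs(P_hat).sum(axis=1).max()
    if max_row_sum > 1.0:                       # normalize so every row sums to at most 1
        P_hat = P_hat / max_row_sum
    return P_hat @ q_ref                        # Q~_R(y_{t+1}, .)
```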
1) Loss Functions: The most important step in Algorithm 2 is to update $\theta$ so that it better explains the new observations; i.e., we should define appropriate loss functions so that $\tilde{Q}_R$ gives a good regression on $\hat{Q}^*_R$. Popular loss functions for regression are the quadratic loss, the absolute-value loss, and the $\epsilon$-insensitive loss, while those for classification are the 0/1 loss, the hinge loss, and the sigmoid loss. Most loss functions admit a closed-form solution for $\theta$. In particular, we rewrite line 17 in matrix form:
$$\theta \in \arg\min_{\theta'} \sum_{j \in R} l\big((K_R \theta')_j, \hat{Q}^*_j\big) + \frac{\lambda}{2} \theta'^T K_R \theta', \qquad (2)$$
where the second term is the $L_2$-norm regularization, and $K_R$ here is the new kernel matrix from the reference set updated (line 15) with the new observations $\{(x,a)_S, \hat{Q}^*_S\}$. Taking the quadratic loss as an example, the solution can be easily computed as
$$\theta = (K_R + \lambda I_{|R|})^{-1} \hat{Q}^*_R, \quad \text{for } l_j = \tfrac{1}{2}\big((K_R\theta')_j - \hat{Q}^*_j\big)^2,$$
i.e., $K_\theta = (K_R + \lambda I_{|R|})^{-1}$. For classification, however, we also need to estimate $\hat{\mu}^*$. A simple way is to replace $\tilde{Q}_R$ with $\hat{Q}^*$ for all samples and then to estimate the optimal policy again.
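A minimal sketch of the quadratic-loss closed form above (our own helper; names and the toy numbers are assumptions) follows the kernel ridge-regression pattern:

```python
import numpy as np

def fit_theta_quadratic(K_R, q_ref, lam=0.1):
    """Closed-form solution of Eq. (2) under the quadratic loss:
    theta = (K_R + lam*I)^{-1} q_ref, so that K_theta = (K_R + lam*I)^{-1}."""
    n = K_R.shape[0]
    K_theta = np.linalg.inv(K_R + lam * np.eye(n))
    return K_theta @ q_ref, K_theta

# Hypothetical usage with a 3-pair reference set.
K_R = np.array([[1.0, 0.4, 0.1],
                [0.4, 1.0, 0.3],
                [0.1, 0.3, 1.0]])
q_ref = np.array([0.2, 0.5, 0.1])
theta, K_theta = fit_theta_quadratic(K_R, q_ref)
```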
2) Convergence Issues: It is known that not every gradient method guarantees convergence in fitted Q iteration [19]. Here, we apply the normalization in line 8 to make $\hat{P}$ a (sub)stochastic matrix. It can then be easily shown that $F$ is a contraction mapping from the definition $F\tilde{Q}_R = g + \alpha \hat{P} \tilde{Q}_R$ (a similar proof can be found in the appendix of [13]); i.e., our algorithm does converge once $R$ is a fixed reference set. We can also verify the convergence under the extreme conditions $\delta \to 0$ and $\delta \to 1$. This suggests that the algorithm converges almost surely once the number of iterations is sufficiently large.
C. Adaptive Online Learning
Online learning updates the policy after each observation in a sequence. It is more natural than ABL, since learning and acting occur at the same frequency. Our algorithm (Algorithm 3) is developed by introducing the kernel trick into the framework of SARSA learning, with $T$ the planning horizon and $\eta$ the learning rate. In online learning, $(w, \phi)$ is preferred in the stochastic gradient update rule, so we generate a pseudo feature matrix $\Phi := \phi((x,a)_R)$ from $K_R := \Phi\Phi^T$ with the Cholesky decomposition. This operation automatically performs regularization, since $\Phi$ is a lower triangular matrix in which orthogonal components are used and information fades along the least important principal components. Therefore, the Cholesky decomposition is simple yet useful in signal control problems with rich, informative observations. AOL shares the same convergence properties as SARSA learning when $T$ is sufficiently large. Now we examine Algorithm 3 more closely. First, lines 3-10 implement ARS without the replacing operation; this is because $w$ is updated from the previous reference set, and replacing an old reference joint state would discard its weight as well. Second, line 11 estimates the feature vector of the newly observed joint state in two different ways, depending on whether it is already inside the reference set (second formula) or not (first formula). Finally, lines 12-13 estimate the kernel-based Q-function and use it to select the optimal action, and line 15 updates $w$ with the TD.

Algorithm 3 AOL$(M, R, k, k_0, T, \delta, \eta)$
1: $R, K_R$; $\Phi \leftarrow \mathrm{Chol}(K_R)$; $w \leftarrow 0$
2: for $t = 1 \to T$ do
3: $\hat{k}_0 \leftarrow \max_{(x,a)\in R} k_0((x,a),(x_t,a_t))$
4: if $\hat{k}_0 < \delta$ then
5: $R \leftarrow R \cup (x_t,a_t)$
6: $K_R \leftarrow [K_R, k_R(x_t,a_t);$
7: $\quad k_R(x_t,a_t)^T, k((x_t,a_t),(x_t,a_t))]$
8: $\Phi \leftarrow \mathrm{Chol}(K_R)$
9: $w \leftarrow w \cup 0$
10: end if
11: $\tilde{\phi}(x_t,a) \leftarrow \{\Phi^{-1} k_R(x_t,a);\ \phi(x_t,a)\}$, $\forall a \in \mathcal{A}$
12: $\tilde{Q}(x_t,a) \leftarrow w^T \tilde{\phi}(x_t,a)$, $\forall a \in \mathcal{A}$
13: $a_t \leftarrow \arg\min_{a\in\mathcal{A}} \tilde{Q}(x_t,a)$
14: $\{g_t, y_{t+1}\} \leftarrow M(x_t,a_t)$
15: $w \leftarrow \eta\, \tilde{\phi}(x_t,a_t)\big(g_t + \alpha \min_{b\in\mathcal{A}} \tilde{Q}(y_{t+1},b) - \tilde{Q}(x_t,a_t)\big) + w$
16: end for
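As an illustration of the Cholesky-based pseudo features, the sketch below (our own, with assumed names; the small jitter term for numerical stability is an addition the paper does not discuss) recovers $\Phi$ from $K_R$ and computes the pseudo feature of a new joint state as $\Phi^{-1} k_R(x,a)$, as in line 11 of Algorithm 3:

```python
import numpy as np

def pseudo_features(K_R, jitter=1e-8):
    """Lower-triangular Phi with K_R = Phi @ Phi.T (Cholesky decomposition).
    A small jitter keeps K_R positive definite when reference pairs are very similar."""
    n = K_R.shape[0]
    return np.linalg.cholesky(K_R + jitter * np.eye(n))

def pseudo_feature_of(Phi, k_vec):
    """Pseudo feature of a new joint state: solve Phi z = k_R(x, a) for z = Phi^{-1} k_R(x, a)."""
    return np.linalg.solve(Phi, k_vec)

# Hypothetical usage: Q~(x, a) = w @ pseudo_feature_of(Phi, k_R(x, a)).
```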
D. Context-based Refinements
Several methods can improve and simplify the general ABL and AOL algorithms given additional assumptions. First, if we have prior knowledge about the joint state space, we can import an appropriate initial reference set $R$. Second, if the action space is discrete and contains categorical rather than numerical variables (e.g., signal colors), then a more natural approach is to treat each action as a "class label" and to use $k(x)$ and $\{\theta_a\}_{a \in \mathcal{A}}$ (or $\phi(x)$ and $\{w_a\}_{a \in \mathcal{A}}$ for AOL) instead of $k(x,a)$ and $\theta$ (or $\phi(x,a)$ and $w$ for AOL). We do not need to maintain a separate set for each action if all actions share the same initial reference set $R \subseteq \mathcal{X}$. Furthermore, if the state space is also discrete, we can calculate $K$ for all $x \in \mathcal{X}$ offline, so the learner just needs to search $K_R$ for $R \subseteq \mathcal{X}$ during online learning.
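A minimal sketch of this per-action treatment (our own illustration with assumed names; the RBF kernel is just a placeholder) keeps one weight vector per categorical action over a shared, state-only kernel:

```python
import numpy as np

def rbf(x1, x2, sigma2=1.0):
    d = np.asarray(x1, float) - np.asarray(x2, float)
    return np.exp(-d @ d / (2.0 * sigma2))

def q_per_action(x, reference_states, weights_by_action, kernel=rbf):
    """Per-action Q-values over a shared, state-only reference set:
    Q~(x, a) = <theta_a, k(X_R, x)> for each categorical action a."""
    k_vec = np.array([kernel(xr, x) for xr in reference_states])
    return {a: float(theta_a @ k_vec) for a, theta_a in weights_by_action.items()}

def greedy_action(x, reference_states, weights_by_action, kernel=rbf):
    """Pick the action minimizing the estimated cost (stage costs, so argmin)."""
    q = q_per_action(x, reference_states, weights_by_action, kernel)
    return min(q, key=q.get)
```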
Another refinement is to apply nonparametric approaches to estimate the weighted $\theta$ as well. For example, if we use locally weighted regression and set $\delta \to 0$, $\lambda = 0$, then the loss function becomes
$$l_j = \tfrac{1}{2} K_w(j)\big((K_R\theta)_j - \hat{Q}^*((x,a)_j)\big)^2, \qquad (3)$$
$$K_w(j) = \exp\left(-\frac{\|K_t - K_R(j,\cdot)\|_2^2}{2b^2}\right), \qquad (4)$$
where $K_w(j)$ is the weight kernel. Now if we define a diagonal matrix $W$ such that $W_{jj} = K_w(j)$, then the loss becomes $\tfrac{1}{2}(K_R\theta - \hat{Q}^*_R)^T W (K_R\theta - \hat{Q}^*_R)$, and we obtain a closed formula for $\theta$:
$$\theta = (K_R^T W K_R)^{-1} K_R^T W \hat{Q}^*_R. \qquad (5)$$
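A small sketch of the locally weighted fit in Eqs. (3)-(5) (our own; the variable names and the use of a least-squares solver in place of an explicit inverse are assumptions):

```python
import numpy as np

def fit_theta_locally_weighted(K_R, q_ref, k_query, b=1.0):
    """Closed form of Eq. (5): theta = (K_R^T W K_R)^{-1} K_R^T W q_ref,
    with W_jj = exp(-||k_query - K_R[j, :]||_2^2 / (2 b^2)) as in Eq. (4)."""
    w_diag = np.exp(-np.sum((K_R - k_query) ** 2, axis=1) / (2.0 * b ** 2))
    W = np.diag(w_diag)
    A = K_R.T @ W @ K_R
    rhs = K_R.T @ W @ q_ref
    theta, *_ = np.linalg.lstsq(A, rhs, rcond=None)   # lstsq guards against a singular A
    return theta
```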
III. APPLICATIONS
In this section, we apply the ABL and AOL algorithms to the traffic signal control problem at a single intersection, with four roads connected to it and probabilities of incoming traffic flows $p^{in}_t \in [0,1]^4$. Each road has a main lane with maximum capacity $M_s$, and a left-turn lane near the intersection with maximum capacity $M_l < M_s$. Each car takes an action in $\{goStraight, turnLeft, turnRight\}$ sampled from a probability simplex $\Delta^{out}_t(i)$. Therefore the MDP can be defined as follows.
Action space: for each signal, $\mathcal{A}_i = \{Green(\uparrow), Green(\leftarrow), Green(\uparrow,\leftarrow), Red\}$, while for the intersection, $\mathcal{A} = \{Green(\uparrow)_{1,3}, Green(\leftarrow)_{1,3}, Green(\uparrow)_{2,4}, Green(\leftarrow)_{2,4}, Green(\uparrow,\leftarrow)_{1:4}\}$.
State space: $\mathcal{X} = \{0,\dots,M_s\}^4 \times \{0,\dots,M_l\}^4 = \prod_{i=1}^{4} \mathcal{X}_i$.
Transition matrix: there are three types of random variables in this system: incoming cars $d_t(i) \sim p^{in}_t(i)$, lane-changing cars $e_t(i) \sim \Delta^{out}_t(i,2)$, and outgoing cars $s_{t,a_t}(i) \sim \Delta^{out}_t(i)\,|\,a_t$. We can sample the next state from them. For example, $x_{s,t+1}(i) = x_{s,t}(i) + d_{s,t}(i) - e_t(i) - s_{t,a_t}(i)$ if no boundary conditions need to be considered. Generally speaking, $x_{s,t+1}(i) - x_{s,t}(i) \in \{-2,\dots,1\}$ and $x_{l,t+1}(i) - x_{l,t}(i) \in \{-1,\dots,1\}$.
Stage cost: $g_t(x_t,a_t) = g_a^T x_{s,t} + g_b^T r_{t,a}$, where the second term counts the number of rejected incoming cars. For example, $r_{s,t,a}(i) = (x_{s,t}(i) + d_t(i) - e_t(i) - s_{s,t,a_t}(i) - M_s)^+$.
For the following experiments, the default settings are: $M_l = 1$, $M_s = 3$, $l$ = quadratic, $\eta = 0.01$, $\lambda = 0.1$, $R(0) = \emptyset$, $k = k_0 = $ RBF (GK), $\alpha = 0.8$, $g_a = [1,2,4,3]^T$, $g_b = [M_s, M_l]$, $p^{in}_t = [0.5, 0.6, 0.3, 0.2]^T$, and $\Delta^{out}_t(i) = [1/3, 1/3, 1/3]$ for all $t$.
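To make the transition model concrete, a rough simulation sketch (our own; the boundary clipping and which movements a green phase releases per step are assumptions the paper does not spell out) samples one step of a single approach:

```python
import numpy as np

rng = np.random.default_rng(0)
M_S, M_L = 3, 1  # default main-lane and left-turn-lane capacities

def step_one_road(x_s, x_l, p_in, delta_out, green_straight, green_left):
    """Sample one transition of a single approach i under the MDP sketched above.

    x_s, x_l  : queue lengths on the main and left-turn lanes
    p_in      : probability of an incoming car this step, p_in_t(i)
    delta_out : probabilities over (goStraight, turnLeft, turnRight)
    green_*   : whether the current phase releases the through / left-turn movement
    """
    d = int(rng.random() < p_in)                       # incoming car d_t(i)
    e = int(x_s > 0 and rng.random() < delta_out[1])   # lane-changing car e_t(i)
    s = int(green_straight and (x_s - e) > 0)          # released through car s_{t,a_t}(i)
    s_l = int(green_left and (x_l + e) > 0)            # released left-turn car
    rejected = max(x_s + d - e - s - M_S, 0)           # r_{s,t,a}(i) = (.)^+
    x_s_next = min(max(x_s + d - e - s, 0), M_S)
    x_l_next = min(max(x_l + e - s_l, 0), M_L)
    return x_s_next, x_l_next, rejected

# Hypothetical stage cost for one road: g_a * x_s_next + g_b * rejected.
```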
A. Convergence
The convergence of AOL should be equivalent to that of SARSA learning, and we ran several sample trajectories to confirm this in our application. The main analysis in this section tests the convergence speed of ABL. We collect sample observations online based on the current $\tilde{Q}$, and fix $\eta$ since it hardly influences the convergence speed of $\tilde{Q}$. We then test the convergence of $\tilde{Q}$ on a specific joint state $(x,a) = (0,2)$. First, we verify the influence of the sample size $n$ by fixing $\delta = 0.8$. From Figure 1(a), we can see that when $n$ is smaller, $\tilde{Q}$ converges faster. Another disadvantage of a large $n$ in online sample collection is that the learner may get stuck in a set of states that is not sufficient for the next update. A simple way to avoid this situation is to add some random exploration in ABL. Next, we fix $n = 1$ and verify how $\delta$ influences the convergence speed. This setting differs from AOL due to the different loss functions. From Figure 1(b), we can see that for larger $\delta$, $\tilde{Q}$ updates more stably, because of the higher attaching/replacing rate of the reference set.

[Fig. 1: Experimental convergence speeds of ABL. (a) Influence of n on convergence (Q-value vs. time for n = 1, 5, 20, 40). (b) Influence of δ on convergence (Q-value vs. time for Q-learning and δ = 0.8, 0.5, 0.2).]
B. Performance
Online learning is well suited to traffic signal control, so we estimate the performance of AOL with different kernels, namely the Gaussian kernel (GK), the Laplacian kernel (LK), and the polynomial kernel (PK), and compare it with traditional Q-learning. Here we use $\phi(x)$ and $w_a$ instead of $\phi(x,a)$ and $w$, because the actions are categorical variables. First, we fix $R = \mathcal{X}$ and $\delta = 0$ to verify the performance of the kernel method without the adaptive reference set. By selecting appropriate parameters for the different kernels (all learners: $\eta = 0.001$; GK: $\sigma^2 = 10^5$; LK: $\sigma^2 = 50$; PK: $p = 3$), we get the plots
in Figure 2(a). Here the performance is estimated as the mean time-average cost over 100 sample trajectories:
$$\bar{g}(t) = \frac{1}{100\,t}\sum_{i=1}^{100}\sum_{\tau=1}^{t} g(\tau)_i.$$
We can see that all the kernel-learners perform better than the Q-learner. In particular, the cost of the PK-learner is almost as low as that of the DP expert. Interestingly, the kernel-learners use approximate Q-functions, yet they perform even better than Q-learning. One possible explanation is that they apply restricted exploitation, which avoids getting stuck in undesired local minima due to insufficient observations. We also plot $\bar{g}(t)$ for kernel-learners directly using $(\theta, k)$ (dashed lines), revealing that the Cholesky decomposition may either increase or decrease the performance of kernel-learners. Second, we set $k$ to PK and verify the influence of $\delta$ with different initial reference sets $R = \mathcal{X}_{1:N}$. Specifically, we estimate $\bar{g}(t)$ with $\delta \in \{0, 0.9\}$ and $N \in \{500, 1000\}$. We use a high $\delta$ since both $M_s$ and $M_l$ are very small, leading to high similarities overall. From Figure 2(b), we can see that for the fixed reference set, $\bar{g}$ is reduced slightly when $N$ is increased from 500 to 1000. However, for the adaptive reference set, $N$ has almost no influence on the performance. In other words, compared to a fixed reference set, especially one with high bias, an adaptive reference set can improve the performance of the learner.
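For completeness, a small sketch of how the mean time-average cost above can be computed from simulated cost trajectories (our own helper; the array layout is an assumption):

```python
import numpy as np

def mean_time_average_cost(costs):
    """g_bar(t) = (1 / (num_traj * t)) * sum_i sum_{tau<=t} g(tau)_i
    for a (num_traj, T) array of per-step stage costs."""
    costs = np.asarray(costs, dtype=float)      # shape: (number of trajectories, T)
    cum = np.cumsum(costs, axis=1)              # running cost sum over time per trajectory
    t = np.arange(1, costs.shape[1] + 1)
    return cum.mean(axis=0) / t                 # average over trajectories, then divide by t
```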
[Fig. 2: Performances of AOL with different kernels (time-average cost/regret vs. time). (a) Learners with maximum reference sets: Q-learner, GK-k, GK-φ, LK-k, LK-φ, PK-k, PK-φ, and the DP expert. (b) PK-learner with adaptive reference sets: N = 500 and N = 1000, each with δ = 0 and δ = 0.9.]
IV. DISCUSSION AND EXTENSIONS
This paper develops a kernel-based adaptive Q-learning algorithm in both batch and online modes. The idea is (1) to maintain a learnable reference set of joint states, and (2) to select appropriate kernel functions to approximate the Q-function with the desired exploitation/exploration tradeoff. In addition, several refinements are suggested for general kernel-based algorithms, such as generating feature vectors from the kernel matrix through the Cholesky decomposition, and casting the RL problem with categorical actions as a multi-class classification problem. The convergence and effectiveness of this algorithm are demonstrated in an application to traffic signal control at a single intersection.
For future research, improvements can be made in the following directions: (1) Currently, only a single parameter is used to determine the attaching/replacing of the reference set; more sophisticated criteria could be applied to achieve a better bias/variance tradeoff. (2) Once the size of the reference set becomes large, the computational cost of searching the reference set and updating the kernel matrix is high, so appropriate data structures could be implemented to maintain the growing reference set and increase computational efficiency. (3) In this application, the performance of particular kernel-based learners is boosted by the Cholesky decomposition; we will investigate how the orthogonality principle influences feature design in different kernel-learners.
REFERENCES
[1] R. S. Sutton and A. G. Barto, “Reinforcement learning: an
introduction,” Neural Networks, IEEE Transactions on, vol. 9,
no. 5, pp. 1054–1054, 1998.
[2] J. N. Tsitsiklis, “Asynchronous stochastic approximation and q-learning,” Machine Learning, vol. 16, no. 3, pp. 185–202, 1994.
[3] R. Bellman, “A markovian decision process,” DTIC Document,
Tech. Rep., 1957.
[4] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” arXiv preprint cs/9605103, 1996.
[5] G. A. Rummery, “Problem solving with reinforcement learning,”
Ph.D. dissertation, University of Cambridge, 1995.
[6] G. J. Gordon, “Approximate solutions to markov decision processes,” Robotics Institute, p. 228, 1999.
[7] C. Szepesvári, “Algorithms for reinforcement learning,” Synthesis Lectures on Artificial Intelligence and Machine Learning,
vol. 4, no. 1, pp. 1–103, 2010.
[8] D. P. Bertsekas, Dynamic programming and optimal control. Athena Scientific, Belmont, 1995, vol. 1, no. 2.
[9] C. Szepesvári and W. D. Smart, “Interpolation-based q-learning,”
in Proceedings of the twenty-first international conference on
Machine learning. ACM, 2004, p. 100.
[10] J. A. Bagnell, S. Kakade, A. Y. Ng, and J. Schneider, “Policy
search by dynamic programming,” 2004.
[11] M. Lagoudakis and R. Parr, “Reinforcement learning as classification: Leveraging modern classifiers,” in Machine Learning - International Workshop then Conference, vol. 20, no. 1, 2003, p. 424.
[12] D. Ormoneit and S. Sen, “Kernel-based reinforcement learning,”
Machine learning, vol. 49, no. 2-3, pp. 161–178, 2002.
[13] D. Ernst, P. Geurts, and L. Wehenkel, “Tree-based batch mode
reinforcement learning,” Journal of Machine Learning Research,
vol. 6, 2005.
[14] M. Wiering, J. Van Veenen, J. Vreeken, and A. Koopman,
“Intelligent traffic light control,” Institute of Information and
Computing Sciences. Utrecht University, 2004.
[15] S. Mikami and Y. Kakazu, “Genetic reinforcement learning for
cooperative traffic signal control,” in Evolutionary Computation,
1994. IEEE World Congress on Computational Intelligence.,
Proceedings of the First IEEE Conference on. IEEE, 1994,
pp. 223–228.
[16] D. Srinivasan, M. C. Choy, and R. L. Cheu, “Neural networks
for real-time traffic signal control,” Intelligent Transportation
Systems, IEEE Transactions on, vol. 7, no. 3, pp. 261–272, 2006.
[17] E. Bingham, “Reinforcement learning in neurofuzzy traffic signal
control,” European Journal of Operational Research, vol. 131,
no. 2, pp. 232–241, 2001.
[18] C. Cai, C. K. Wong, and B. G. Heydecker, “Adaptive traffic
signal control using approximate dynamic programming,” Transportation Research Part C: Emerging Technologies, vol. 17,
no. 5, pp. 456–474, 2009.
[19] J. N. Tsitsiklis and B. Van Roy, “Feature-based methods for large scale dynamic programming,” Machine Learning, vol. 22, no. 1-3, pp. 59–94, 1996.