A Multi-Agent Adaptive Traffic Signal Control System Using Swarm Intelligence and Neuro-Fuzzy Reinforcement Learning

Wei Lu, Yunlong Zhang and Yuanchang Xie

Abstract— This research develops and evaluates a new multi-agent adaptive traffic signal control system based on swarm intelligence and the neuro-fuzzy actor-critic reinforcement learning (NFACRL) method. The proposed method combines the better attributes of swarm intelligence and NFACRL. Two scenarios are used to evaluate the method, and the new NFACRL-Swarm method is compared with its NFACRL counterpart. First, the proposed control model is applied to adaptive signal control at an isolated intersection to evaluate its learning performance. Then, the control system is implemented for signal coordination on a typical arterial. At the isolated intersection, the proposed hybrid method outperforms its predecessor by improving the learning speed and is shown to be insensitive to the reward function parameters. In the network, by introducing a coordination scheme inspired by swarm intelligence, the proposed method improves performance by up to 12% and learns faster.

I. INTRODUCTION

Traffic signal operation in urban areas is a difficult task faced by traffic engineers. Because of the heavy vehicular volume during peak hours and the stochastic nature of traffic flow, the task requires proper and sophisticated timings, without which significant delay and congestion may occur. Conventional signal control systems are either pre-timed or actuated. Pre-timed control is the easiest to implement, but it has the least flexibility to respond to short-term traffic and traffic-pattern changes. Actuated control is designed to address this problem by extending green phases in response to real-time traffic. However, this strategy is sometimes myopic in that it considers only the movements being served, and thus can cause long queues on other movements when traffic is heavy.
Adaptive control uses traffic information collected in real time from on-street detectors [1]. Many adaptive signal control systems have been proposed and shown to outperform both pre-timed and actuated control [2]. However, there are some general limitations associated with these adaptive control systems. One is applicability: many existing adaptive control systems achieve optimization by dynamic programming, which may be restricted by the problem formulation and solution procedures. Another limitation is the computational burden: adaptive control systems usually impose a heavy computational load, especially in centralized systems. Though some authors [3] proposed approximate dynamic programming to overcome this computational burden, applicability remains a problem and its use is limited to isolated intersections. In addition, the hardware required for communication is a heavy cost in adaptive control systems, especially centralized ones. We can therefore identify the need for a distributed adaptive control system with high computational efficiency that can be implemented in traffic networks. More recently, Xie and Zhang developed a Neuro-Fuzzy Actor-Critic Reinforcement Learning (NFACRL) control method [1]. Satisfying and promising results were reported in its evaluation after tuning and training. However, they did not explicitly consider coordination among the signal controllers at different intersections, nor did they conduct a comprehensive study of how the parameters and training influence the results.

(W. Lu and Y. Zhang are with the Zachry Department of Civil Engineering, Texas A&M University, TX 77843-3136, USA, [email protected], [email protected]. Y. Xie is with the Civil and Mechanical Engineering Technology, South Carolina State University, Orangeburg, SC 29117, USA, [email protected].)
As the authors mention, the learning results may be sensitive to the reward function parameters. As we will see in Section IV, this sensitivity can render the reinforcement learning inefficient and thus compromise its applicability in large traffic networks. This drawback stems from the assumption that the parameters in the reward function have been tuned, whereas different traffic demands may call for different optimal reward parameters. Since finding the optimal reward parameters is a sub-problem in its own right, and the number of experiments needed to achieve that goal is huge, this is a real problem in practice and needs to be resolved. We resolve it by introducing swarm intelligence into the NFACRL method.

Swarm intelligence has recently been applied to traffic signal operations. In 2009, Putha et al. [4] developed an Ant Colony Optimization algorithm to solve the oversaturated network traffic signal coordination problem. In 2006 and 2007, Bazzan and Oliveira [5], [6] used metaphors of pheromone accumulation and task allocation in social insects to model traffic light agents. They treated each intersection (plus its signal light) as a social insect, which grounds its decisions on the pheromone trail (which accumulates while vehicles are waiting) in the environment. They realized coordination in a network through this distributed approach. However, each of their intersections has only two signal plans to choose from, which is not flexible enough to realize local optimization. Moreover, their over-simplified road network and traffic data make the results less convincing and less practical. In this research, a method combining the desirable attributes of the NFACRL method and swarm intelligence is developed.
The proposed hybrid method is named Neuro-Fuzzy Actor-Critic Reinforcement Learning-Swarm (NFACRL-Swarm); it incorporates the simple and effective task allocation and learning process of swarm intelligence while maintaining the accuracy and flexibility of the NFACRL method.

2011 IEEE Forum on Integrated and Sustainable Transportation Systems, Vienna, Austria, June 29 - July 1, 2011. 978-1-4577-0992-0/11/$26.00 ©2011 IEEE.

The remainder of the paper is organized as follows. First, the literature on existing approaches to traffic signal coordination, reinforcement learning methods and applications of swarm intelligence is reviewed. Then, a signal control method combining NFACRL and swarm intelligence is proposed. The simulation scenarios in which the proposed model is validated are described, and performance results are presented and analyzed. Concluding remarks follow.

II. LITERATURE REVIEW

A. Traffic Signal Control and Synchronization

Current signal control can be classified into pre-timed, actuated and adaptive control. Most current traffic control systems, such as pre-timed or actuated control, are limited by their inability to respond to short-term traffic demand and pattern changes [2]. Pre-timed timing plans are determined from off-line average traffic volume data. Techniques such as mixed integer programming, genetic algorithms and stochastic assignment are used to optimize the timing plans [7]. This type of control is weak in handling traffic flow fluctuations and pattern changes, which are common in modern cities where business centers are no longer located exclusively downtown [5]. Actuated control can partially resolve this problem. However, it grounds its phase decisions primarily on the immediate arrivals of the movements being served. This myopia may result in unsatisfactory control performance when traffic demand is heavy [5], [8].
Adaptive traffic control systems, whether or not they have fixed cycle lengths and phase sequences, determine the basic operation parameters in real time based on existing and predicted traffic flow [2], [3]. Adaptive traffic control has attracted significant attention, and researchers have developed several programs using different optimization techniques and hierarchical architectures. Some well-known adaptive control packages are summarized in Table I, characterized by their traffic data availability, signal coordination, optimization objective and servo mechanism.

B. Microscopic Simulation and Multi-agent Systems

Microscopic traffic simulation models have received increasing attention. These models enable researchers to study and evaluate traffic system performance in scenarios that cannot be addressed by traditional approaches because of their complexity (such as congested traffic conditions and incident management). A simulation platform based on cellular automata is proposed and described in [9]. In addition, many commercial microscopic simulation packages have been developed in the last decade, such as MITSIM, PARAMICS, AIMSUN-2, CORSIM and VISSIM.

TABLE I. REVIEW OF ADAPTIVE CONTROL PACKAGES

Program | Traffic Data       | Signal Coordination      | Obj. for Opt.            | Servo Mechanism
OPAC    | Online, upstream   | Through profile          | Delay                    | Decentralized
UTOPIA  | Online, upstream   | With offset optimization | Stops & delay            | Centralized
SCATS   | Online, downstream | With offset optimization | Capacity                 | Centralized
SCOOT   | Online, upstream   | With offset optimization | Stops, delay, congestion | Centralized
PRODYN  | Online, upstream   | Possible                 | Total delay              | Decentralized
MOVA    | Online, upstream   | Nil                      | Stops, delay, capacity   | Decentralized
DDYPIC  | Off-line           | Nil                      | Delay                    | Decentralized

Though all these packages model vehicles in an object-oriented manner, none of them is a strictly defined agent-based simulation system, because of the essential differences between an object and an agent.
Multi-agent systems have recently attracted the attention of traffic researchers and have been used in traffic operations, and the multi-agent traffic system has been modeled in different ways. In 2002, Logi and Ritchie [10] proposed a multi-agent architecture for cooperative congestion management. In 2005, Ossowski et al. [11] outlined design guidelines for the construction of a multi-agent decision support system and presented a demonstration. Kosonen [12] developed a real-time traffic signal control system combining a multi-agent system and fuzzy logic control in 2003. Later, Dresner and Stone [13], [14] proposed a multi-agent traffic management scheme using a reservation-based intersection control mechanism in which the vehicles are controlled by agents. They showed that their method could produce results that closely approximate an overpass and are a hundred times better than those of traffic lights.

III. MODEL DEVELOPMENT

The NFACRL with variable phase sequences (NFACRL-V) structure, reward function and learning procedure used in [2] are also used in this paper; for more details about NFACRL, readers can refer to [1], [2]. However, we replace the greedy action selection method with a different one, and use a reinforcement learning method inspired by swarm intelligence. In addition, coordination among the signal controllers is explicitly considered.

Bonabeau et al. [15] proposed a mathematical model of how labor is divided in social insect colonies. The task distribution depends on each insect's response threshold and the stimulus associated with each task; interactions among the insects and the environment produce a dynamic distribution of tasks. These concepts are used in our research as follows: each agent (the traffic signal controller at one intersection) acts like a social insect.
It has different tendencies to execute each of its actions (one of its available phases) according to the environmental stimulus, which is related to the action values from NFACRL and to thresholds that are updated in a self-reinforced way.

A. Computation of Stimulus

In NFACRL, the action value V(A_j, S(t)) describes the tendency to execute action j at state S(t), so it is a good indicator of the stimulus from the environment. After the action values are obtained, at time t, the stimulus of executing action q is

  st_{q,t} = (1 - α) [V(A_q, S(t)) - V(A_j, S(t))] + α a_q / |A_s|,    (1)

where j = argmin_j V(A_j, S(t)) is the action with the least tendency at time t. The first term ensures that the stimulus is nonnegative. The second term takes the neighboring signals into consideration: α is the influence coefficient, a_q is the number of agents in the area performing action q, and |A_s| is the number of agents in the area that can perform it. In the isolated intersection scenario, α equals 0; in the arterial network, α can take any value between 0 and 1, and in this research α is set to 0.5. The intuitive explanation of the second term is that it encourages controllers to execute the same action, so that they can achieve synchronization. By inserting the neighbors' influence into an agent's decision, the aim is to gain the advantages of group formation based on direct communication (prioritizing global optimization) while still considering V(A_j, S(t)), which focuses on local optimization.

B. Task Allocation

Equation (2) defines the response function, i.e., the probability that signal agent i selects action j, as a function of the stimulus intensity st_j. Let k = argmax_j V(A_j, S(t)), the action with the largest tendency at time t.
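For concreteness, the stimulus computation in (1) can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the action values are assumed to come from the NFACRL module, and the function name and signature are ours:

```python
def stimulus(action_values, q, alpha=0.0, agents_performing_q=0, num_agents=0):
    """Sketch of the stimulus st_{q,t} of executing action q, per Eq. (1).

    action_values       : dict {action index: V(A_j, S(t))} from NFACRL
    q                   : candidate action (phase) index
    alpha               : influence coefficient (0 at an isolated
                          intersection, 0.5 in the arterial experiments)
    agents_performing_q : number of agents in the area currently
                          performing action q (a_q)
    num_agents          : number of agents in the area that can
                          perform the action (|A_s|)
    """
    # The first term subtracts the least tendency at time t, so it is
    # guaranteed to be nonnegative.
    v_min = min(action_values.values())
    local_term = (1 - alpha) * (action_values[q] - v_min)
    # The second term rewards choosing the action that neighboring
    # controllers are already executing, promoting synchronization.
    social_term = alpha * agents_performing_q / num_agents if num_agents else 0.0
    return local_term + social_term
```

With α = 0 the stimulus reduces to the purely local NFACRL term, which matches the isolated intersection scenario.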
  T_{θ_ij}(st_j) = st_j^2 / (st_j^2 + θ_ij^2),                                      if j = k
  T_{θ_ij}(st_j) = [1 - st_k^2 / (st_k^2 + θ_ik^2)] · st_j^2 / (st_j^2 + θ_ij^2),   otherwise    (2)

where θ_ij is the response threshold for agent i to execute action j, and st_j is the stimulus associated with action j.

C. Reinforcement Learning

The threshold is updated in a self-reinforced way. Each signal agent i in the model has one threshold value for each action j:

  θ_ij(t+1) = θ_ij(t) - l_ij δt    (3)

where δt is the decision time interval, set to 3 seconds in this research, and l_ij is the learning coefficient defined by (4):

  l_ij = 1 - 2 σ_ij(t)    (4)

where σ_ij(t) is an indicator of learning success at time step t, defined by (5):

  σ_ij(t) = sqrt{ [1/(n-1)] Σ_{k=1}^{n} [w_ik(t) - w_ik(t-1)]^2 }    (5)

where w_ik(t) denotes the action weight at time step t for the k-th action output of agent i.

IV. EXPERIMENTAL RESULTS

In this section we conduct two series of experiments. First, we compare the NFACRL-Swarm method with the NFACRL-V method in terms of learning ability, sensitivity to reward function parameters and performance; these experiments are conducted on an isolated intersection. Then the NFACRL-Swarm method is applied to an arterial to test the coordination module, and its performance is again compared with its NFACRL-V counterpart. Note that the NFACRL-V method has previously been shown to outperform traditional traffic control algorithms. The data set used in this section is the same as that in [2], collected on a segment of FM 2818 (Harvey Mitchell Parkway) in College Station, Texas, during the morning peak period (7:00 A.M. to 8:00 A.M.) on October 7, 2004. The proposed method is coded as a .dll (dynamic link library) file and implemented in VISSIM through the signal control interface. All runs are conducted on VISSIM 5.10 on a desktop computer with a Core 2 CPU @ 3.00 GHz and 8 GB RAM.
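The task allocation and threshold learning steps in (2)-(5) can be sketched as follows. This is a minimal illustration under our own naming, assuming per-agent lists of stimuli, thresholds and action weights:

```python
import math

def response_probabilities(st, theta, k):
    """Response function of Eq. (2): probability of agent i selecting
    each action j, given stimuli st[j] and thresholds theta[j].
    k is the index of the action with the largest tendency; it keeps the
    plain threshold response, while the other actions are scaled by the
    probability that k is NOT selected."""
    base = [st[j] ** 2 / (st[j] ** 2 + theta[j] ** 2) for j in range(len(st))]
    return [base[j] if j == k else (1 - base[k]) * base[j]
            for j in range(len(st))]

def update_thresholds(theta, w_now, w_prev, dt=3.0):
    """Self-reinforced threshold update of Eqs. (3)-(5). dt is the
    3-second decision interval; sigma measures how much the action
    weights changed between successive time steps."""
    n = len(w_now)
    sigma = math.sqrt(sum((w_now[k] - w_prev[k]) ** 2 for k in range(n))
                      / (n - 1))
    l = 1 - 2 * sigma                      # learning coefficient, Eq. (4)
    return [th - l * dt for th in theta]   # Eq. (3)
```

An action can then be drawn stochastically from these probabilities (for example with `random.choices`) rather than greedily, which is the action selection change the model development section describes.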
Table II summarizes the basic model input parameters, i.e., β1, β2, β3, β4 and β5, which are nonnegative coefficients of the variables in the reward function of the learning process. The variables in the reward function are: the number of vehicles that have passed the intersection from approaches given a green signal, the number of vehicles in queue, the number of vehicles newly added to queues, the number of vehicles in approaches given a green signal, and the number of vehicles stopped when the signal switches from green to red. For more details of the reward function and the reinforcement learning process, readers are referred to [2].

A. Experiments on an Isolated Intersection

A three-approach intersection of Rio Grande Boulevard and FM 2818 (Fig. 1) is used in this evaluation. To test the controllers' sensitivity to the crucial parameters, both are evaluated with different combinations of reward function parameters, as shown in Table II. For each combination, 90 training runs with random seeds are conducted. In this evaluation, average delay per vehicle (hereinafter referred to as delay) is chosen as the performance criterion. The corresponding performances of the three configurations are shown in Table III. In Fig. 2, the performances of the two methods are compared together with the best performance value that can be obtained by NFACRL-V; the best value is the average of 30 runs after 90 training runs. Note that 30, 60 and 90 in Table III and Fig. 2 represent the averaged performance over runs 1-30, 31-60 and 61-90, respectively.

Fig. 1. A three-approach intersection

TABLE II. β PARAMETERS

Configuration | β1 | β2   | β3   | β4 | β5
1             | 3  | 0.77 | 0.25 | 3  | 16
2             | 3  | 1    | 0.5  | 3  | 8
3             | 3  | 0.75 | 0.5  | 3  | 8

From Table III and Fig. 2, the following observations can be made:
1) In configurations 1 and 3, the NFACRL-V method fails to achieve the goal of reinforcement learning, since the delay goes up after 60 training runs.
2) In all configurations, the NFACRL-Swarm method learns well.
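Purely as an illustration of how the β coefficients weight the five variables, a linear reward could look like the sketch below. The signs are our assumption (served vehicles rewarded, queued and stopped vehicles penalized), not taken from the source; the exact functional form is specified in [2]:

```python
def reward(passed, queued, newly_queued, on_green, stopped_at_switch,
           beta=(3, 1, 0.5, 3, 8)):
    """Hypothetical linear reward combining the five variables with
    coefficients beta1..beta5 (defaults: configuration 2 of Table II).
    The signs here are illustrative assumptions only; see [2] for the
    actual reward function used by NFACRL."""
    b1, b2, b3, b4, b5 = beta
    return (b1 * passed                # vehicles that passed on green
            - b2 * queued              # vehicles in queue
            - b3 * newly_queued        # vehicles newly added to queues
            + b4 * on_green            # vehicles in approaches with green
            - b5 * stopped_at_switch)  # vehicles caught by the switch to red
```

Swapping in the Table II configurations via `beta` is how the sensitivity experiments vary the reward while holding everything else fixed.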
Since the performance of the NFACRL-Swarm method tends to be steady after 30 training runs, it generally learns faster than NFACRL-V.
3) In two of the three configurations, the performance of NFACRL-Swarm obtained after 30 runs is even better than that of the NFACRL-V method after 90 runs.

TABLE III. SUMMARY OF DELAY OF THE TWO METHODS

Delay (s/veh.)                | 30    | 60    | 90
NFACRL-V, Configuration 1     | 16.6  | 16.05 | 16.87
NFACRL-V, Configuration 2     | 15.87 | 15.22 | 15.06
NFACRL-V, Configuration 3     | 16.18 | 15.74 | 16.01
NFACRL-Swarm, Configuration 1 | 15.63 | 14.24 | 14.19
NFACRL-Swarm, Configuration 2 | 14.48 | 13.67 | 13.67
NFACRL-Swarm, Configuration 3 | 16.44 | 14.8  | 14.74

Fig. 2. Delay comparison of the three configurations

B. Experiments on an Arterial

The proposed NFACRL-Swarm method is also evaluated on the arterial network shown in Fig. 3 to test its coordination performance. Again, 90 training runs with random seeds are conducted, and the performance is compared with the original NFACRL-V method. In addition to delay, stopped delay and number of stops per vehicle, speed is also considered in this evaluation. The simulation results and comparison are presented in Table IV and Fig. 4. From Fig. 4 the following observations can be made:
1) The NFACRL-Swarm method consistently outperforms NFACRL-V on all four criteria throughout the training process.
2) Compared to the best values obtained by NFACRL-V (the average of 30 runs after 90 training runs), the NFACRL-Swarm method improves speed, delay and stopped delay by 2.2%, 5.5% and 12.2% respectively, with the only exception being the number of stops per vehicle.
3) In general, the NFACRL-Swarm method learns faster than its NFACRL-V counterpart.

V. CONCLUSIONS

This research investigates the application of swarm intelligence to distributed adaptive traffic signal control. A new method based on swarm intelligence and neuro-fuzzy actor-critic reinforcement learning is developed and evaluated at an isolated intersection and on an arterial.
Compared to previous studies, this hybrid method combines the simple and effective task allocation and learning process inspired by swarm intelligence with the elaborate and realistic phase configurations of the NFACRL method. The proposed method also considers signal controller coordination in an arterial scenario. A comprehensive comparison of the proposed NFACRL-Swarm method with its NFACRL-V counterpart is conducted based on VISSIM simulation. This research identifies the sensitivity of the NFACRL method to its reward function parameters and shows that the new NFACRL-Swarm method overcomes this problem. In the evaluation on an isolated intersection, the proposed method produces considerably less delay than NFACRL-V for all parameter combinations. Some bad parameter combinations cause NFACRL-V to fail at reinforcement learning, while the NFACRL-Swarm method learns well in all tests. Compared to the lowest delays obtained by the NFACRL-V method after parameter tuning and 90 training runs, the new method produces lower delay after only 30 training runs. This increase in learning speed can extend the new method's practical applicability and viability.

Fig. 3. A segment of FM 2818

TABLE IV. SIMULATION RESULTS OF THE TWO METHODS AT DIFFERENT TRAINING STAGES

Model           | Range    | Speed | Delay (s/veh.) | Stopped Delay (s/veh.) | # of Stops/veh.
NFACRL-Swarm    | 30       | 24.88 | 57.42          | 36.18                  | 1.3
                | 60       | 25.91 | 51.36          | 29.75                  | 1.3
                | 90       | 26.16 | 50.01          | 28.19                  | 1.32
NFACRL-V        | 30       | 23.79 | 64.96          | 41.9                   | 1.41
                | 60       | 24.47 | 60.82          | 36.87                  | 1.44
                | 90       | 25.48 | 53.67          | 30.35                  | 1.46
                | best     | 25.6  | 52.9           | 32.1                   | 1.25
Improvement     | 30       | 1.09  | 7.54           | 5.72                   | 0.11
                | 60       | 1.45  | 9.47           | 7.13                   | 0.13
                | 90       | 0.68  | 3.66           | 2.16                   | 0.14
                | vs. best | 0.56  | 2.89           | 3.91                   | -0.07
Improvement (%) | 30       | 4.6%  | 11.6%          | 13.6%                  | 7.8%
                | 60       | 5.9%  | 15.6%          | 19.3%                  | 9.3%
                | 90       | 2.6%  | 6.8%           | 7.1%                   | 9.5%
                | vs. best | 2.2%  | 5.5%           | 12.2%                  | -5.9%

(The best value is the average of 30 runs after 90 training runs.)
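To make the "Improvement" rows of Table IV explicit: each entry is the gap between the NFACRL-Swarm and NFACRL-V values at the same training stage (taken in the direction of improvement, since higher speed but lower delay is better), and each percentage is that gap relative to the NFACRL-V value. A quick check, with a helper of our own naming:

```python
def improvement(swarm_value, v_value, higher_is_better=True):
    """Absolute and relative improvement of NFACRL-Swarm over NFACRL-V
    for one measure at one training stage. Speed uses higher-is-better;
    delay, stopped delay and stops use lower-is-better."""
    diff = swarm_value - v_value if higher_is_better else v_value - swarm_value
    return diff, diff / v_value
```

For example, at the 30-run stage the speed values 24.88 vs. 23.79 give an improvement of 1.09, or about 4.6%, and the delay values 57.42 vs. 64.96 give 7.54, or about 11.6%, matching the corresponding rows of Table IV.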
In the evaluation on an arterial, by introducing a coordination scheme inspired by swarm intelligence, the proposed NFACRL-Swarm method outperforms its NFACRL-V counterpart in terms of delay, stopped delay and arterial speed, and the new method also learns faster than the original one. Encouraging results are obtained from this research. To further improve the performance and applicability of NFACRL-Swarm control, future work will include a clearer explanation of the swarm intelligence mechanism implemented in this research and a more comprehensive evaluation of the parameters. It would also be interesting to extend the modeling framework to different traffic network configurations (e.g., networks with more than two arterials that need to be coordinated).

REFERENCES

[1] Y. Xie, Y. Zhang, and L. Li, "Neuro-fuzzy reinforcement learning for adaptive intersection traffic signal control," in TRB Annual Meeting Compendium of Papers, Transportation Research Board Annual Meeting, Jan. 2010.

Fig. 4. NFACRL-Swarm vs. NFACRL-V

[2] Y. Xie, "Development and evaluation of an arterial adaptive traffic signal control system using reinforcement learning," Ph.D. dissertation, Texas A&M University, College Station, TX, 2007.
[3] C. Cai, C. K. Wong, and B. G. Heydecker, "Adaptive traffic signal control using approximate dynamic programming," Transportation Research Part C: Emerging Technologies, vol. 17, no. 5, pp. 456-474, 2009.
[4] R. Putha, L. Quadrifoglio, and E. Zechman, "Comparing ant colony optimization and genetic algorithm approaches for solving traffic signal coordination under oversaturation conditions," Computer-Aided Civil and Infrastructure Engineering, 2011.
[5] A. L. C. Bazzan, "A distributed approach for coordination of traffic signal agents," Autonomous Agents and Multi-Agent Systems, vol. 10, pp. 131-164, 2005.
[6] D. de Oliveira, P. Ferreira, A. Bazzan, and F. Klügl, "A swarm-based approach for selection of signal plans in urban scenarios," in Ant Colony Optimization and Swarm Intelligence, ser. Lecture Notes in Computer Science, vol. 3172. Springer Berlin / Heidelberg, 2004, pp. 143-156.
[7] E. Cascetta, M. Gallo, and B. Montella, "Models and algorithms for the optimization of signal settings on urban networks with stochastic assignment models," Annals of Operations Research, vol. 144, pp. 301-328, 2006.
[8] G. Abu-Lebdeh and R. Benekohal, "Signal coordination and arterial capacity in oversaturated conditions," Transportation Research Record: Journal of the Transportation Research Board, vol. 1727, pp. 68-76, 2000.
[9] B. da Silva, A. Bazzan, G. Andriotti, F. Lopes, and D. de Oliveira, "ITSUMO: An intelligent transportation system for urban mobility," in Innovative Internet Community Systems, ser. Lecture Notes in Computer Science, vol. 3473. Springer Berlin / Heidelberg, 2006, pp. 224-235.
[10] F. Logi and S. G. Ritchie, "A multi-agent architecture for cooperative inter-jurisdictional traffic congestion management," Transportation Research Part C: Emerging Technologies, vol. 10, no. 5-6, pp. 507-527, 2002.
[11] S. Ossowski, A. Fernández, J. Serrano, J. Pérez-de-la-Cruz, M. Belmonte, J. Hernández, A. García-Serrano, and J. Maseda, "Designing multiagent decision support systems for traffic management," in Applications of Agent Technology in Traffic and Transportation, ser. Whitestein Series in Software Agent Technologies and Autonomic Computing. Birkhäuser Basel, 2005, pp. 51-67.
[12] I. Kosonen, "Multi-agent fuzzy signal control based on real-time simulation," Transportation Research Part C: Emerging Technologies, vol. 11, no. 5, pp. 389-403, 2003.
[13] K. Dresner and P. Stone, "Multiagent traffic management: a reservation-based intersection control mechanism," in Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2004), 2004, pp. 530-537.
[14] K. Dresner and P. Stone, "Multiagent traffic management: an improved intersection control mechanism," in Proceedings of the Fourth International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS '05). New York, NY, USA: ACM, 2005, pp. 471-477.
[15] E. Bonabeau, M. Dorigo, and G. Theraulaz, Swarm Intelligence: From Natural to Artificial Systems. New York, NY, USA: Oxford University Press, 1999.