Introduction
The research on robot path planning originates from the demand for automation and intelligence, aiming to solve the core problem of autonomous and safe movement of robots in complex environments. Its development has gone through various stages, from early static environment planning methods such as visual mapping and grid methods, to random sampling algorithms suitable for high-dimensional spaces, and now to intelligent planning methods that integrate deep learning and reinforcement learning. The significance of this research is significant as it marks a crucial leap for robots from passive execution to active perception and decision-making, directly driving practical applications in many fields such as autonomous driving, unmanned aerial vehicle surveying, intelligent warehousing and logistics, and service robots. By optimizing the efficiency, safety, and robustness of paths, path planning technology is not only a technical prerequisite for robots to be deployed in real work, but also one of the core driving forces for improving social production efficiency and automation level. Considering that in continuous path planning problems, compared to discrete action spaces, the continuous action space is larger and the exploration problem faced by robots is more complex. Robots need to face more choices in the path planning process, which increases randomness and avoids getting stuck in local optimal solutions. Therefore, this paper studies the implementation of further enhancing its exploration ability to better explore and learn continuous actions in the environment, considering the convergence speed and stability of the algorithm, and designing a reasonable reward function to encourage the robot to move towards the target position, punish the robot for colliding with obstacles, and enable the robot to have good path planning ability in different scenarios. At the same time, the research on robot path planning is moving from traditional environment abstract calculation to a new stage of deep integration with embodied intelligence. Embodied intelligence emphasizes the core role of the interaction between the "body" and the environment in intelligent behavior, which means that path planning is no longer an isolated optimal solution calculation, but a embodied decision-making process that relies on the specific form, physical characteristics, and multimodal perception of robots..
RELATED WORK
A
The research on robot path planning originates from the demand for automation and intelligence, aiming to solve the core problem of autonomous and safe movement of robots in complex environments. In this section, we will provide an overview and analysis of these studies. In recent years, traditional path planning algorithms have been widely studied and applied, Khatib et al. proposed the Artificial Potential Field (APF) method, which treats robots as ppapers and introduces potential energy and repulsion functions to describe the motion state of robots and the forces of surrounding obstacles
1.Yang et al. proposed an improved artificial potential field method, which solves the problem of local minima by modifying the direction and influence range of the gravitational field, adding virtual targets and evaluation functions, and solving the problem of unreachable targets by increasing gravity
2.Wang et al. proposed an artificial driving vehicle that adapts to different environments based on the characteristics of the driver and the artificial potential field method. Compared with the traditional artificial potential field method, the improved local path planning algorithm obtains more reasonable path curves
3.Li et al. proposed an OSM robot navigation method that combines road network information and local perception information to address the problem of global path inconsistency with the real environment caused by map errors in OpenStreetMap (OSM). OSM provides road network information and obtains the global path through Dijkstra's algorithm
4.Lee et al. proposed a method called Finite Distribution Estimation of Dynamic Window (FDEDWA), which avoids dynamic obstacles by estimating the overall distribution of obstacles. The FDEDWA algorithm uses the Frobenius matrix factorization method to estimate the distribution of obstacles and predict their future positions, enabling real-time perception of dynamic environments
5.Yuan et al. proposed a mobile robot path planning method that combines bat algorithm and DWA algorithm to achieve global optimization and dynamic obstacle avoidance goals
6.Ma et al. proposed a CGAN-RRT algorithm consisting of a conditional generative adversarial network generation model and an improved RRT, which can generate probability distributions of effective feasible paths given map information
7.Jiang et al. proposed an improved neural network based on DQN architecture. On a 30 × 30 map, DQN ensures that the optimal path between any two points can be successfully found within approximately 9000 rounds
8.Xin et al. proposed a ladder reward mechanism to replace traditional simple reward functions and introduced it into experience replay based DQN, which significantly improved its adaptability to dynamic environments
9.Huang et al. proposed a weighted reward function to balance obstacle avoidance and approach to the target, using the weighted reward function to balance obstacle avoidance and approach to the target. Two reward thresholds were used to modify abnormal rewards to study path planning for mobile robots in unknown dynamic environments
10.Yang et al. proposed an N-step priority dual DQN NPDDQN path planning algorithm and experience filtering mechanism, which screens the positive experiences mined and improves their reuse rate, effectively achieving obstacle avoidance in complex environments
11.Kretzschmar et al. proposed a DDPG algorithm that utilizes deep neural networks based on the DPG algorithm. The DDPG algorithm combines Q-Learning and deterministic policy gradient methods, and approximates the value function and policy function using deep neural networks to achieve planning and control of continuous action spaces
12.Tai et al. proposed an asynchronous DDPG algorithm to improve the training efficiency of deep reinforcement learning algorithms
13.Li et al. proposed a method to replace the neural network optimization algorithm in DDPG algorithm with Radam algorithm, and added priority experience replay to improve the success rate and convergence speed of the algorithm
14.Jiang et al. proposed an interactive deep deterministic policy gradient method based on action guidance to solve the problem of low sample efficiency in traditional deep reinforcement learning
15.Zhou et al. designed a dual noise mechanism to explore the environment more effectively, embedding LSTM network based on DDPG algorithm to ensure that the model obtains information from previous navigation experience, thereby optimizing the model's decisions
16.Yang et al. provided a comprehensive overview of the current progress in embodied intelligence, achieving autonomous learning capabilities for robots through deep reinforcement learning, enabling robots to continuously optimize path planning strategies in their interaction with the environment
17.
Methods
Considering that in continuous path planning problems, compared to discrete action spaces, the continuous action space is larger and the exploration problem faced by robots is more complex. Robots need to face more choices in the path planning process, which increases randomness and avoids getting stuck in local optimal solutions. Therefore, it is necessary to further enhance its exploration ability to better explore and learn continuous actions in the environment, considering the convergence speed and stability of the algorithm, and designing a reasonable reward function to encourage the robot to move towards the target position, punish the robot for colliding with obstacles, and enable the robot to have good path planning ability in different scenarios. For this purpose, the DDAPG algorithm is designed in this paper. Firstly, the robot is designed as a Markov decision process in continuous path planning tasks. A simulation robot is used to collect environmental information around it through laser ranging sensors, and a deep neural network is built. Considering the issue of sparse rewards, the reward function should provide smooth and continuous rewards in a continuous action space. A potential energy based reward reshaping method was designed. In order to improve exploration intensity, an adaptive round OU noise model was designed. Finally, a simulation environment was built based on Gazebo to compare the algorithm performance. In this chapter, the DDAPG algorithm was proposed to accelerate convergence speed and reduce collision rate compared to the baseline algorithm, and its adaptive ability was verified in different scenarios.
Design of Continuous Path Planning Process
The simulation environment built by Gazebo is more in line with the operation of robots in real scenes, and can use higher precision sensors to obtain environmental information, such as using sensors such as LiDAR, thereby improving the accuracy and flexibility of path planning. Turtlebot3 is one of the default robot models in Gazebo, which is easy to set up and integrate with other models and plugins. The advantage of using Turtlebot3 to simulate robots is that it provides a realistic simulation of physical robots, including dynamics, sensors, and actuators, which can be used to test and validate algorithms in controlled and safe environments. Easy to customize: Turtlebot3 is highly customizable, allowing researchers to modify and add different sensors, actuators, and control mechanisms to test and evaluate various path planning algorithms. ROS Integration: Turtlebot3 is essentially integrated with ROS and can be integrated with other ROS software packages and nodes. This makes it easier to develop and test path planning algorithms using Gazebo simulation software. Turtlebot3 has two main models in the simulation environment: Hamburg style and waffle style.
In the subsequent simulation experiments, the Turtlebot3 Hamburg style simulation robot was selected as the experimental object, and its motion mode is based on differential drive wheel movement. It uses two wheels to control the movement and rotation of the robot, where the speed of each wheel can be achieved by controlling the speed of the motor. Its kinematic model can be described using basic kinematic equations:
(1)
Among them,
represents the linear velocity of the robot,
represents the angular velocity of the robot,
and
represents the velocity of the right and left wheels,
representing the distance between the two wheels.These equations describe the motion of robots at different wheel speeds and can be used to control the path planning of robots in continuous simulation environments. It should be noted that the Turtlebot3 burger style robot also has incompleteness, which means it cannot achieve independent control in some directions of motion, such as being unable to move directly in a direction perpendicular to the two wheels. Therefore, when conducting path planning and control, it is necessary to consider the non integrity characteristics of the robot to avoid problems during its movement.The linear motion of the Turtlebot3 hamburger style robot is controlled by two differential drive wheels, which control the forward and backward movement of the robot.The curved motion of the Turtlebot3 hamburger style robot can be achieved by controlling the angular velocity of the robot. When the speed of the left and right driving wheels of the robot is not equal, the robot will rotate. The motion of a robot on a curve can be expressed as:
Then, Gazebo in ROS was used to build a simulation environment for robot motion, and performance testing was conducted using Turtlebot3 simulation robot. The goal of robot path planning is to avoid obstacles from the starting point and reach the endpoint. In Gazebo, simulation environments can be designed by adding objects, terrain, light sources, etc., and real-world scenes can be simulated by adding obstacles, target points, sensors, etc.
Laser ranging sensors were selected to obtain environmental information around the Turtlebot3 simulation robot. Compared to image sensors, laser ranging sensors can accurately measure in various complex environments and obtain information about the surrounding environment through rotational scanning, with high flexibility. Robots can measure the distance and direction between surrounding obstacles and the robot through laser ranging sensors, providing accurate environmental information. Compared to other sensors, laser ranging sensors are not affected by external factors such as light, climate, dust, etc., and can provide stable performance in complex environments.
The robot selects actions to interact with the environment based on its current state and determines the current state. If the maximum number of steps per turn is exceeded, the training environment and the position of the robot are reset. If the robot encounters an obstacle, it will receive a negative reward and continue to learn in the environment. If the robot reaches the target point, it receives a positive reward and then determines whether the deep reinforcement learning algorithm has converged. If it has converged, the program ends. Otherwise, the training environment and the robot's position are reset, and the robot continues to interact with it in the environment until the algorithm converges.
DDAPG Algorithm Design
The simulation environment built by Gazebo is more in line with the operation of robots in real scenes, and can use higher precision sensors to obtain environmental information, such as using sensors such as LiDAR, thereby improving the accuracy and flexibility of path planning.
This paper studies the problem of continuous path planning, where the robot's actions are continuous. Firstly, the problem is modeled as a Markov decision process, and a reward reshaping method based on potential energy and an adaptive turn OU noise mechanism are designed in combination with the DDPG algorithm, which is named the DDAPG algorithm. The deep neural network is also set up.
State space: The experiment is based on ROS and uses Gazebo to build a continuous simulation environment. The robot needs to complete the path planning task from the starting point to the target point without colliding with obstacles. Therefore, the state space of the robot is related to the position information of the robot to the target point and obstacles around the robot. In this chapter, the state space of the robot is defined as follows:
Among them, they respectively represent the current position of the robot, the angle and direction of the target point recorded by the robot, the position information from the current position of the robot to the nearest obstacle, and the position information from the current position of the robot to the target point.
Action Space: It refers to the actions performed by a robot in a time step, and its output actions are continuous. By outputting continuous action values, the robot can try various combinations of actions to find the optimal path.
Reward Function: The reward function is used to measure the quality of a robot's execution of a certain action. The robot observes the state of the environment at each time step and decides the next action to take based on the current state and behavior. It also receives a reward value indicating the quality of the current action. At each time step, the robot's reward depends on the actions of state
and execution
.
However, the reward function designed in the passable area has a reward value of 1, which can easily lead to the robot needing a long time of training to find a path to reach the target point, making it very difficult for the robot to complete the path planning task and possibly falling into a local optimal situation.To address this issue, this paper proposes a reward reshaping method based on potential energy. For the robot path planning problem, its main task is to approach the target and avoid obstacles. Therefore, this section designs a reward function from the above two perspectives. In the passable area, the idea of potential energy is used to reshape the reward function, which is divided into rewards
and
for approaching the target and rewards
for avoiding obstacles.
This paper proposes OU noise for adaptive rounds. Considering that introducing noise into the DRL algorithm can increase the exploration ability of robots, guide them to explore more possibilities in the action space, avoid local optima, and enhance their generalization ability to different environments.A noise impact factor
was designed to reduce the exploration level of the robot in the environment as the number of training epochs increased. In the initial training stage, the noise impact factor is relatively large, which means that the amplitude of OU noise is larger, and the robot's actions are more random, which helps the robot explore the environment. As the training progresses, the noise impact factor gradually decreases, resulting in a decrease in the amplitude of OU noise, and the robot's actions tend to be more stable and steady.
In order to train the network in DDAPG, the algorithm used target networks, namely the target policy network and the target Q-value network. The parameters of these target networks are updated through soft updates, rather than using conventional hard updates, in which the parameters of the target network are periodically fully replicated from the estimation network. Soft updates use a smooth approach to update the parameters of the target network. Only a small portion of the weights are updated each time, not all of them.
Among them,
is the rate of soft updates, which causes the parameters of the target network to slowly approach the parameters of the estimation network over a period of time. Soft updates can improve the stability of algorithms.
At the beginning of the DDAPG algorithm, the relevant initialization work is carried out by clearing the experience replay area and randomly initializing Q network
and policy network
. The target Q network
is set to be the same as
, and the parameter
of the target policy network is set to be the same as
.Using
to represent the size of the sampling small batch, the robot's learning process is divided into
rounds, each round consisting of
steps. After algorithm initialization, the loop begins. At the beginning of each round, for all rounds, the initialization state is
and the OU noise
of the action strategy is calculated to obtain
, and all obstacles, target points, and other information are initialized. Then the algorithm will perform a loop iteration process, using an adaptive round OU noise model at each step
of the robot to solve the exploration and utilization problems of the robot, and avoid getting stuck in local optimal solutions.
When it chooses to execute
, the robot receives a reward value and enters
, placing the experience value in the experience replay pool
. After
rounds of training, the learning process is carried out. For the experience replay pool
, a random sampling method is used to train the Q network and policy network using small batches of
. The Q network is updated using stochastic gradient descent, the strategy network is updated using stochastic gradient ascent (SGA), and the Adam optimizer is used. Compared to the DQN algorithm, the DDAPG algorithm adopts the dual network mechanism of Q network and policy network, and therefore uses a soft update method to update the target network smoothly to update the weights of the neural network. Finally, the trained network is used for decision-making. By taking the current state as input and outputting the action in that state, the path in the current 3D environment map can be obtained
13.
Discussion
In order to evaluate the effectiveness of the proposed DDAPG algorithm, other DRL algorithms were compared. For the designed DDAPG algorithm, all neural networks have one input layer, three hidden layers, and one output layer. All hidden layers are fully connected layers, and the reward function is designed based on the potential energy reward reshaping method. In the exploration phase, the algorithm adopts the designed adaptive round OU noise strategy. The baseline algorithms for comparison mainly include: 1) DDPG algorithm combined with Gaussian noise. 2) The dDDPG algorithm designed an adaptive exploration method based on incorporating OU noise
18. 3) The D-DDPG algorithm integrates the dueling network into the critic network and divides the Q-value into two parts for calculation, adding random noise
19. 4) The LSTM-DDPG algorithm
20 integrates LSTM into the DDPG algorithm framework. For all DRL algorithms, the input is 24 dimensional perception information of laser ranging, the current position of the robot, the angle and direction of the target point recorded by the robot, the information from the current position to the nearest obstacle position, and a total of 28 dimensions of information from the current position to the target point position. The output is angular velocity, with a range value of
.
In order to evaluate the performance and convergence of the robot path planning algorithm in this study, the following indicators were defined for comparative analysis of the algorithms.
Total reward: The total reward value of the robot in each round.
Collision rate: The proportion of robot encounters with obstacles in the total number of turns.
Success rate: The proportion of turns in which the robot does not collide during the process from the starting point to the target point in the total number of turns.
Maximum Reward: The highest total reward value for the number of turns in which the robot does not collide during the process from the starting point to the target point.
Considering the impact of discount factor
on the performance of DRL algorithm, experiments were designed to evaluate the performance of DDAPG algorithm under different discount factors
. Firstly, the discount factors
in the algorithm were set to 0.80, 0.90, and 0.99, respectively.
When the Turtlebot3 robot Hamburg style was performing path planning tasks, it moved according to the DDAPG algorithm. The black dots were the starting position of the robot, the wooden boards and walls in the environment were obstacles, and the red square was the target position of the robot.
The DDAPG algorithm can better adapt to the training needs of different stages, balancing the relationship between exploration and utilization throughout the entire training process. In addition, the reward reshaping method based on task characteristics solves the problem of sparse rewards, making it easier for robots to understand the essence of the task, guiding them towards the target and avoiding obstacles, which helps to improve the convergence speed of the algorithm and reduce collision rates.
A
Table 1
Comparison of algorithm Performance.
| | Collision rate | Success rate | Maximum Reward |
|---|
DDPG | 44.75% | 55.25% | 261 |
D-DDPG | 38.25% | 61.75% | 259 |
dDDPG | 43.5% | 56.5% | 259 |
LSTM-DDPG | 18.25% | 81.75% | 259 |
DDAPG | 7.625% | 92.375% | 261 |
This paper focuses on the path planning problem of robots in continuous simulation environments. Firstly, a simulation environment was built in Gazebo based on ROS, and Turtlebot3 was used to simulate the robot. The environment information was sensed by laser ranging sensors. Modeling the continuous path planning problem of robots as a Markov decision process, there is usually a problem of sparse rewards when applying deep reinforcement learning algorithms to robot path planning, which leads to slow or even inability to converge the robot. Therefore, a potential energy based reward reshaping method was designed to solve this problem, dividing the reward function into rewards towards the target and rewards for avoiding obstacles, in order to better motivate the robot to move towards the target and avoid collisions.In order to improve the exploration of robots in path planning, adaptive round OU noise was designed and introduced into the action selection stage. A corresponding deep neural network was built and named DDAPG algorithm. Finally, the performance and adaptability of the DDAPG algorithm designed in this paper were verified through simulation experiments.