Research on a path optimization algorithm based on enhanced DDPG for continuous dynamic obstacle environments

Zhiyu Huang¹, Xiaoming Wu²,Huaning Song³,^*,and Xiaoliang Wang¹,^*

¹Hunan University of Science and Technology, School of Computer Science and Technology, Xiangtan, 411100, China

²Xiangtan Central Hospital, Xiangtan, 411100, China

³Hechi University, School of Artificial Intelligence and Manufacturing, Hechi, 547000, China

^*shn_2008@163.com, fengwxl@163.com

Abstract

This paper proposes a Continuous Dynamic Obstacle Environment Path Optimization Algorithm (DDAPG) based on Enhanced DDPG, which focuses on solving the problem of path planning for robots in continuous action spaces. The algorithm effectively guides the robot to move towards the target position and punishes collision behavior by designing a reward reshaping method based on potential energy. At the same time, an adaptive noise model is used to enhance exploration ability and avoid falling into local optima. The experimental results show that the DDAPG algorithm outperforms the baseline algorithm in terms of convergence speed, collision rate reduction, and adaptability, and can achieve efficient and safe path planning in different scenarios. This study provides strong support for the application of embodied intelligence in the field of robot path planning and promotes the development of autonomous decision-making technology for robots.

Keywords DDPG, Path optimization, reward reshaping method, continuous action space, Embodied Intelligence

Introduction

The research on robot path planning originates from the demand for automation and intelligence, aiming to solve the core problem of autonomous and safe movement of robots in complex environments. Its development has gone through various stages, from early static environment planning methods such as visual mapping and grid methods, to random sampling algorithms suitable for high-dimensional spaces, and now to intelligent planning methods that integrate deep learning and reinforcement learning. The significance of this research is significant as it marks a crucial leap for robots from passive execution to active perception and decision-making, directly driving practical applications in many fields such as autonomous driving, unmanned aerial vehicle surveying, intelligent warehousing and logistics, and service robots. By optimizing the efficiency, safety, and robustness of paths, path planning technology is not only a technical prerequisite for robots to be deployed in real work, but also one of the core driving forces for improving social production efficiency and automation level. Considering that in continuous path planning problems, compared to discrete action spaces, the continuous action space is larger and the exploration problem faced by robots is more complex. Robots need to face more choices in the path planning process, which increases randomness and avoids getting stuck in local optimal solutions. Therefore, this paper studies the implementation of further enhancing its exploration ability to better explore and learn continuous actions in the environment, considering the convergence speed and stability of the algorithm, and designing a reasonable reward function to encourage the robot to move towards the target position, punish the robot for colliding with obstacles, and enable the robot to have good path planning ability in different scenarios. At the same time, the research on robot path planning is moving from traditional environment abstract calculation to a new stage of deep integration with embodied intelligence. Embodied intelligence emphasizes the core role of the interaction between the "body" and the environment in intelligent behavior, which means that path planning is no longer an isolated optimal solution calculation, but a embodied decision-making process that relies on the specific form, physical characteristics, and multimodal perception of robots..

RELATED WORK

The research on robot path planning originates from the demand for automation and intelligence, aiming to solve the core problem of autonomous and safe movement of robots in complex environments. In this section, we will provide an overview and analysis of these studies. In recent years, traditional path planning algorithms have been widely studied and applied, Khatib et al. proposed the Artificial Potential Field (APF) method, which treats robots as ppapers and introduces potential energy and repulsion functions to describe the motion state of robots and the forces of surrounding obstacles¹.Yang et al. proposed an improved artificial potential field method, which solves the problem of local minima by modifying the direction and influence range of the gravitational field, adding virtual targets and evaluation functions, and solving the problem of unreachable targets by increasing gravity².Wang et al. proposed an artificial driving vehicle that adapts to different environments based on the characteristics of the driver and the artificial potential field method. Compared with the traditional artificial potential field method, the improved local path planning algorithm obtains more reasonable path curves³.Li et al. proposed an OSM robot navigation method that combines road network information and local perception information to address the problem of global path inconsistency with the real environment caused by map errors in OpenStreetMap (OSM). OSM provides road network information and obtains the global path through Dijkstra's algorithm⁴.Lee et al. proposed a method called Finite Distribution Estimation of Dynamic Window (FDEDWA), which avoids dynamic obstacles by estimating the overall distribution of obstacles. The FDEDWA algorithm uses the Frobenius matrix factorization method to estimate the distribution of obstacles and predict their future positions, enabling real-time perception of dynamic environments⁵.Yuan et al. proposed a mobile robot path planning method that combines bat algorithm and DWA algorithm to achieve global optimization and dynamic obstacle avoidance goals⁶.Ma et al. proposed a CGAN-RRT algorithm consisting of a conditional generative adversarial network generation model and an improved RRT, which can generate probability distributions of effective feasible paths given map information⁷.Jiang et al. proposed an improved neural network based on DQN architecture. On a 30 × 30 map, DQN ensures that the optimal path between any two points can be successfully found within approximately 9000 rounds⁸.Xin et al. proposed a ladder reward mechanism to replace traditional simple reward functions and introduced it into experience replay based DQN, which significantly improved its adaptability to dynamic environments⁹.Huang et al. proposed a weighted reward function to balance obstacle avoidance and approach to the target, using the weighted reward function to balance obstacle avoidance and approach to the target. Two reward thresholds were used to modify abnormal rewards to study path planning for mobile robots in unknown dynamic environments¹⁰.Yang et al. proposed an N-step priority dual DQN NPDDQN path planning algorithm and experience filtering mechanism, which screens the positive experiences mined and improves their reuse rate, effectively achieving obstacle avoidance in complex environments¹¹.Kretzschmar et al. proposed a DDPG algorithm that utilizes deep neural networks based on the DPG algorithm. The DDPG algorithm combines Q-Learning and deterministic policy gradient methods, and approximates the value function and policy function using deep neural networks to achieve planning and control of continuous action spaces¹².Tai et al. proposed an asynchronous DDPG algorithm to improve the training efficiency of deep reinforcement learning algorithms¹³.Li et al. proposed a method to replace the neural network optimization algorithm in DDPG algorithm with Radam algorithm, and added priority experience replay to improve the success rate and convergence speed of the algorithm¹⁴.Jiang et al. proposed an interactive deep deterministic policy gradient method based on action guidance to solve the problem of low sample efficiency in traditional deep reinforcement learning¹⁵.Zhou et al. designed a dual noise mechanism to explore the environment more effectively, embedding LSTM network based on DDPG algorithm to ensure that the model obtains information from previous navigation experience, thereby optimizing the model's decisions¹⁶.Yang et al. provided a comprehensive overview of the current progress in embodied intelligence, achieving autonomous learning capabilities for robots through deep reinforcement learning, enabling robots to continuously optimize path planning strategies in their interaction with the environment¹⁷.

Methods

Considering that in continuous path planning problems, compared to discrete action spaces, the continuous action space is larger and the exploration problem faced by robots is more complex. Robots need to face more choices in the path planning process, which increases randomness and avoids getting stuck in local optimal solutions. Therefore, it is necessary to further enhance its exploration ability to better explore and learn continuous actions in the environment, considering the convergence speed and stability of the algorithm, and designing a reasonable reward function to encourage the robot to move towards the target position, punish the robot for colliding with obstacles, and enable the robot to have good path planning ability in different scenarios. For this purpose, the DDAPG algorithm is designed in this paper. Firstly, the robot is designed as a Markov decision process in continuous path planning tasks. A simulation robot is used to collect environmental information around it through laser ranging sensors, and a deep neural network is built. Considering the issue of sparse rewards, the reward function should provide smooth and continuous rewards in a continuous action space. A potential energy based reward reshaping method was designed. In order to improve exploration intensity, an adaptive round OU noise model was designed. Finally, a simulation environment was built based on Gazebo to compare the algorithm performance. In this chapter, the DDAPG algorithm was proposed to accelerate convergence speed and reduce collision rate compared to the baseline algorithm, and its adaptive ability was verified in different scenarios.

Design of Continuous Path Planning Process

The simulation environment built by Gazebo is more in line with the operation of robots in real scenes, and can use higher precision sensors to obtain environmental information, such as using sensors such as LiDAR, thereby improving the accuracy and flexibility of path planning. Turtlebot3 is one of the default robot models in Gazebo, which is easy to set up and integrate with other models and plugins. The advantage of using Turtlebot3 to simulate robots is that it provides a realistic simulation of physical robots, including dynamics, sensors, and actuators, which can be used to test and validate algorithms in controlled and safe environments. Easy to customize: Turtlebot3 is highly customizable, allowing researchers to modify and add different sensors, actuators, and control mechanisms to test and evaluate various path planning algorithms. ROS Integration: Turtlebot3 is essentially integrated with ROS and can be integrated with other ROS software packages and nodes. This makes it easier to develop and test path planning algorithms using Gazebo simulation software. Turtlebot3 has two main models in the simulation environment: Hamburg style and waffle style.

In the subsequent simulation experiments, the Turtlebot3 Hamburg style simulation robot was selected as the experimental object, and its motion mode is based on differential drive wheel movement. It uses two wheels to control the movement and rotation of the robot, where the speed of each wheel can be achieved by controlling the speed of the motor. Its kinematic model can be described using basic kinematic equations:

$v=\left( {{v_r}+{v_l}} \right)/2$

(1)

$w=\left( {{v_r} - {v_l}} \right)/L$

Among them,

represents the linear velocity of the robot,

represents the angular velocity of the robot,

${v_r}$

and

${v_l}$

represents the velocity of the right and left wheels,

representing the distance between the two wheels.These equations describe the motion of robots at different wheel speeds and can be used to control the path planning of robots in continuous simulation environments. It should be noted that the Turtlebot3 burger style robot also has incompleteness, which means it cannot achieve independent control in some directions of motion, such as being unable to move directly in a direction perpendicular to the two wheels. Therefore, when conducting path planning and control, it is necessary to consider the non integrity characteristics of the robot to avoid problems during its movement.The linear motion of the Turtlebot3 hamburger style robot is controlled by two differential drive wheels, which control the forward and backward movement of the robot.The curved motion of the Turtlebot3 hamburger style robot can be achieved by controlling the angular velocity of the robot. When the speed of the left and right driving wheels of the robot is not equal, the robot will rotate. The motion of a robot on a curve can be expressed as:

$x(t)={x_0}+\frac{v}{\omega }\left( {\sin \left( {\omega t+{\theta _0}} \right) - \sin {\theta _0}} \right)$

$y(t)={y_0}+\frac{v}{\omega }\left( {\cos {\theta _0} - \cos \left( {\omega t+{\theta _0}} \right)} \right)$

Then, Gazebo in ROS was used to build a simulation environment for robot motion, and performance testing was conducted using Turtlebot3 simulation robot. The goal of robot path planning is to avoid obstacles from the starting point and reach the endpoint. In Gazebo, simulation environments can be designed by adding objects, terrain, light sources, etc., and real-world scenes can be simulated by adding obstacles, target points, sensors, etc.

Laser ranging sensors were selected to obtain environmental information around the Turtlebot3 simulation robot. Compared to image sensors, laser ranging sensors can accurately measure in various complex environments and obtain information about the surrounding environment through rotational scanning, with high flexibility. Robots can measure the distance and direction between surrounding obstacles and the robot through laser ranging sensors, providing accurate environmental information. Compared to other sensors, laser ranging sensors are not affected by external factors such as light, climate, dust, etc., and can provide stable performance in complex environments.

The robot selects actions to interact with the environment based on its current state and determines the current state. If the maximum number of steps per turn is exceeded, the training environment and the position of the robot are reset. If the robot encounters an obstacle, it will receive a negative reward and continue to learn in the environment. If the robot reaches the target point, it receives a positive reward and then determines whether the deep reinforcement learning algorithm has converged. If it has converged, the program ends. Otherwise, the training environment and the robot's position are reset, and the robot continues to interact with it in the environment until the algorithm converges.

DDAPG Algorithm Design

This paper studies the problem of continuous path planning, where the robot's actions are continuous. Firstly, the problem is modeled as a Markov decision process, and a reward reshaping method based on potential energy and an adaptive turn OU noise mechanism are designed in combination with the DDPG algorithm, which is named the DDAPG algorithm. The deep neural network is also set up.

State space

: The experiment is based on ROS and uses Gazebo to build a continuous simulation environment. The robot needs to complete the path planning task from the starting point to the target point without colliding with obstacles. Therefore, the state space of the robot is related to the position information of the robot to the target point and obstacles around the robot. In this chapter, the state space of the robot is defined as follows:

$S=\left\{ {\begin{array}{*{20}{l}} {\left( {{x_r},{y_r}} \right)} \\ \theta \\ {\left( {{x_o} - {x_r}} \right),\left( {{y_o} - {y_r}} \right)} \\ {\left( {{x_a} - {x_r}} \right),\left( {{y_a} - {y_r}} \right)} \end{array}} \right.$

Among them, they respectively represent the current position of the robot, the angle and direction of the target point recorded by the robot, the position information from the current position of the robot to the nearest obstacle, and the position information from the current position of the robot to the target point.

Action Space

: It refers to the actions performed by a robot in a time step, and its output actions are continuous. By outputting continuous action values, the robot can try various combinations of actions to find the optimal path.

$R=\left\{ {\begin{array}{*{20}{l}} {{R_a}=200} \\ {{R_o}= - 150} \\ {{R_p}=1} \end{array}} \right.$

Reward Function

: The reward function is used to measure the quality of a robot's execution of a certain action. The robot observes the state of the environment at each time step and decides the next action to take based on the current state and behavior. It also receives a reward value indicating the quality of the current action. At each time step, the robot's reward depends on the actions of state

$S\left( {\text{t}} \right)$

and execution

$A\left( t \right)$

$R=\left\{ {\begin{array}{*{20}{l}} {{R_a}=200} \\ {{R_o}= - 150} \\ {{R_p}={r_\theta } * {r_a} * \alpha +{r_o}} \end{array}} \right.$

However, the reward function designed in the passable area has a reward value of 1, which can easily lead to the robot needing a long time of training to find a path to reach the target point, making it very difficult for the robot to complete the path planning task and possibly falling into a local optimal situation.To address this issue, this paper proposes a reward reshaping method based on potential energy. For the robot path planning problem, its main task is to approach the target and avoid obstacles. Therefore, this section designs a reward function from the above two perspectives. In the passable area, the idea of potential energy is used to reshape the reward function, which is divided into rewards

${r_\theta }$

and

${r_a}$

for approaching the target and rewards

${r_o}$

for avoiding obstacles.

This paper proposes OU noise for adaptive rounds. Considering that introducing noise into the DRL algorithm can increase the exploration ability of robots, guide them to explore more possibilities in the action space, avoid local optima, and enhance their generalization ability to different environments.A noise impact factor

$\lambda$

was designed to reduce the exploration level of the robot in the environment as the number of training epochs increased. In the initial training stage, the noise impact factor is relatively large, which means that the amplitude of OU noise is larger, and the robot's actions are more random, which helps the robot explore the environment. As the training progresses, the noise impact factor gradually decreases, resulting in a decrease in the amplitude of OU noise, and the robot's actions tend to be more stable and steady.

$\lambda =\frac{{W - {W_e}}}{W}$

In order to train the network in DDAPG, the algorithm used target networks, namely the target policy network and the target Q-value network. The parameters of these target networks are updated through soft updates, rather than using conventional hard updates, in which the parameters of the target network are periodically fully replicated from the estimation network. Soft updates use a smooth approach to update the parameters of the target network. Only a small portion of the weights are updated each time, not all of them.

$\theta ^{\prime} \leftarrow \tau \theta +(1 - \tau )\theta ^{\prime}$

$\omega ^{\prime} \leftarrow \tau \omega +(1 - \tau )\omega ^{\prime}$

Among them,

$\tau$

is the rate of soft updates, which causes the parameters of the target network to slowly approach the parameters of the estimation network over a period of time. Soft updates can improve the stability of algorithms.

Fig. 1

Schematic diagram of DDAPG algorithm

At the beginning of the DDAPG algorithm, the relevant initialization work is carried out by clearing the experience replay area and randomly initializing Q network

$\theta$

and policy network

$\omega$

. The target Q network

$\theta ^{\prime}$

is set to be the same as

$\theta$

, and the parameter

$\omega ^{\prime}$

of the target policy network is set to be the same as

$\omega$

.Using

${D_{\text{b}}}$

to represent the size of the sampling small batch, the robot's learning process is divided into

rounds, each round consisting of

steps. After algorithm initialization, the loop begins. At the beginning of each round, for all rounds, the initialization state is

$S\left( 1 \right)$

and the OU noise

of the action strategy is calculated to obtain

$\lambda$

, and all obstacles, target points, and other information are initialized. Then the algorithm will perform a loop iteration process, using an adaptive round OU noise model at each step

of the robot to solve the exploration and utilization problems of the robot, and avoid getting stuck in local optimal solutions.

When it chooses to execute

$A\left( t \right)$

, the robot receives a reward value and enters

$S\left( {t+1} \right)$

, placing the experience value in the experience replay pool

. After

${W_s}$

rounds of training, the learning process is carried out. For the experience replay pool

, a random sampling method is used to train the Q network and policy network using small batches of

${D_{\text{b}}}$

. The Q network is updated using stochastic gradient descent, the strategy network is updated using stochastic gradient ascent (SGA), and the Adam optimizer is used. Compared to the DQN algorithm, the DDAPG algorithm adopts the dual network mechanism of Q network and policy network, and therefore uses a soft update method to update the target network smoothly to update the weights of the neural network. Finally, the trained network is used for decision-making. By taking the current state as input and outputting the action in that state, the path in the current 3D environment map can be obtained¹³.

Discussion

In order to evaluate the effectiveness of the proposed DDAPG algorithm, other DRL algorithms were compared. For the designed DDAPG algorithm, all neural networks have one input layer, three hidden layers, and one output layer. All hidden layers are fully connected layers, and the reward function is designed based on the potential energy reward reshaping method. In the exploration phase, the algorithm adopts the designed adaptive round OU noise strategy. The baseline algorithms for comparison mainly include: 1) DDPG algorithm combined with Gaussian noise. 2) The dDDPG algorithm designed an adaptive exploration method based on incorporating OU noise¹⁸. 3) The D-DDPG algorithm integrates the dueling network into the critic network and divides the Q-value into two parts for calculation, adding random noise¹⁹. 4) The LSTM-DDPG algorithm²⁰ integrates LSTM into the DDPG algorithm framework. For all DRL algorithms, the input is 24 dimensional perception information of laser ranging, the current position of the robot, the angle and direction of the target point recorded by the robot, the information from the current position to the nearest obstacle position, and a total of 28 dimensions of information from the current position to the target point position. The output is angular velocity, with a range value of

$w \in \left[ { - 1.5rad/s,1.5rad/s} \right]$

In order to evaluate the performance and convergence of the robot path planning algorithm in this study, the following indicators were defined for comparative analysis of the algorithms.

Fig. 2

Total reward value of DDAPG with different discount factors.

Total reward: The total reward value of the robot in each round.

Collision rate: The proportion of robot encounters with obstacles in the total number of turns.

Success rate: The proportion of turns in which the robot does not collide during the process from the starting point to the target point in the total number of turns.

Maximum Reward: The highest total reward value for the number of turns in which the robot does not collide during the process from the starting point to the target point.

Considering the impact of discount factor

$\gamma$

on the performance of DRL algorithm, experiments were designed to evaluate the performance of DDAPG algorithm under different discount factors

$\gamma$

. Firstly, the discount factors

$\gamma$

in the algorithm were set to 0.80, 0.90, and 0.99, respectively.

When the Turtlebot3 robot Hamburg style was performing path planning tasks, it moved according to the DDAPG algorithm. The black dots were the starting position of the robot, the wooden boards and walls in the environment were obstacles, and the red square was the target position of the robot.

Fig. 3

The total reward of the algorithms in continuous environment.

The DDAPG algorithm can better adapt to the training needs of different stages, balancing the relationship between exploration and utilization throughout the entire training process. In addition, the reward reshaping method based on task characteristics solves the problem of sparse rewards, making it easier for robots to understand the essence of the task, guiding them towards the target and avoiding obstacles, which helps to improve the convergence speed of the algorithm and reduce collision rates.

Table 1
Comparison of algorithm Performance.
	Collision rate	Success rate	Maximum Reward
DDPG	44.75%	55.25%	261
D-DDPG	38.25%	61.75%	259
dDDPG	43.5%	56.5%	259
LSTM-DDPG	18.25%	81.75%	259
DDAPG	7.625%	92.375%	261

This paper focuses on the path planning problem of robots in continuous simulation environments. Firstly, a simulation environment was built in Gazebo based on ROS, and Turtlebot3 was used to simulate the robot. The environment information was sensed by laser ranging sensors. Modeling the continuous path planning problem of robots as a Markov decision process, there is usually a problem of sparse rewards when applying deep reinforcement learning algorithms to robot path planning, which leads to slow or even inability to converge the robot. Therefore, a potential energy based reward reshaping method was designed to solve this problem, dividing the reward function into rewards towards the target and rewards for avoiding obstacles, in order to better motivate the robot to move towards the target and avoid collisions.In order to improve the exploration of robots in path planning, adaptive round OU noise was designed and introduced into the action selection stage. A corresponding deep neural network was built and named DDAPG algorithm. Finally, the performance and adaptability of the DDAPG algorithm designed in this paper were verified through simulation experiments.

Discussion

The continuous path planning algorithm based on improved DDPG (DDAPG) proposed in this paper is highly compatible with the concept of embodied intelligence. Through the design of a reward reshaping method based on potential energy and an adaptive noise model, the DDAPG algorithm has achieved efficient path planning for robots in continuous motion spaces, demonstrating strong self-learning, adaptability, and robustness. The experimental results have verified the advantages of DDAPG algorithm in convergence speed, collision rate, and adaptability, providing strong support for the application of embodied intelligence in the field of robot path planning. Future research can further explore the application of DDAPG algorithm in more complex environments and combine it with other embodied intelligence technologies (such as transfer learning, multimodal perception, etc.) to enhance the overall intelligence level of robots.

Data Availability

The datasets used and/or analysed during the current study available from corresponding author on reasonable request.

References

Khatib, O. Real-time obstacle avoidance for manipulators and mobile robots. 2(500–505) (1985). (1985).

Yang, W. L. et al. Improved Artificial Potential Field and Dynamic Window Method for Amphibious Robot Fish Path Planning. APPLIED SCIENCES-BASEL 11(5)(2021). (2021).

Wang, S. B. et al. Autonomous Vehicle Path Planning Based on Driver Characteristics Identification and Improved Artificial Potential Field. ACTUATORS 11(2)(2022). (2022a).

Li, J., Qin, H., Wang, J. Z. & Li, J. H. OpenStreetMap-Based Autonomous Navigation for the Four Wheel-Legged Robot Via 3D-Lidar and CCD Camera. IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS 69(3): 2708–2717(2022). (2022).

Lee, D. H., Lee, S. S., Ahn, C. K., Shi, P. & Lim, C. C. Finite Distribution Estimation-Based Dynamic Window Approach to Reliable Obstacle Avoidance of Mobile Robot. IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS 68(10): 9998–10006(2021). (2021).

Yuan, X., Yuan, X. W. & Wang, X. H. Path Planning for Mobile Robot Based on Improved Bat Algorithm. SENSORS 21(13)(2021). (2021).

Ma, N. C., Wang, J. K., Liu, J. B. & Meng, M. Q. H. Conditional Generative Adversarial Networks for Optimal Path Planning. IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS 14(2): 662–671(2022). (2022).

Jiang, J. X., Zeng, X. Y., Guzzetti, D. & You, Y. Y. Path planning for asteroid hopping rovers with pre-trained deep reinforcement learning architectures. ACTA ASTRONAUTICA 171(265–279)(2020). (2020).

Xin, B. & He, C. DRL-Based Improvement for Autonomous UAV Motion Path Planning in Unknown Environments. In 2022 7th International Conference on Control and Robotics Engineering (ICCRE); 102–105(2022). (2022).

10.

Huang, R., Qin, C., Li, J. L. & Lan, X. Path planning of mobile robot in unknown dynamic continuous environment using reward-modified deep Q‐network. Optimal Control Applications and Methods 44(3): 1570–1587(2023). (2023).

11.

Yang, J., Ni, J., Xi, M., Wen, J. & Li, Y. Intelligent Path Planning of Underwater Robot Based on Reinforcement Learning. IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING 20(3): 1983–1996(2023). (2023).

12.

Kretzschmar, H., Spies, M., Sprunk, C. & Burgard, W. Socially compliant mobile robot navigation via inverse reinforcement learning. INTERNATIONAL JOURNAL OF ROBOTICS RESEARCH 35(11): 1352–1370(2016). (2016).

13.

Tai, L., Paolo, G. & Liu, M. Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); 31–36(2017). (2017).

14.

Li, P., Ding, X. & Ren, W. Research on Path Planning of Cloud Robot in Dynamic Environment Based on Improved DDPG Algorithm. In 2021 China Automation Congress (CAC); 3561–3566(2021). (2021b).

15.

Jiang, D., Fang, Z., Cheng, C., He, B. & Li, G. Action Guidance-Based Deep Interactive Reinforcement Learning for AUV Path Planning. In 2022 International Conference on Machine Learning, Control, and Robotics (MLCR); 158–165(2022). (2022).

16.

Zhou, Q., Lyu, L. & Liu, H. Deep Reinforcement Learning with Long-Time Memory Capability for Robot Mapless Navigation. In 2022 IEEE 25th International Conference on Computer Supported Cooperative Work in Design (CSCWD); 1215–1220(2022). (2022b).

17.

Liu, Y. et al. Aligning Cyber Space With Physical World: A Comprehensive Survey on Embodied AI, in IEEE/ASME Transactions on Mechatronics, (2025). 10.1109/TMECH.2025.3574943

18.

Dong, Y. & Zou, X. Mobile Robot Path Planning Based on Improved DDPG Reinforcement Learning Algorithm. In 2020 IEEE 11th International Conference on Software Engineering and Service Science (ICSESS); 52–56(2020). (2020).

19.

Zhao, P., Zheng, J., Zhou, Q., Lyu, C. & Lyu, L. A Dueling-DDPG Architecture for Mobile Robots Path Planning Based on Laser Range Findings. In PRICAI 2021: Trends in Artificial Intelligence, Pham, DN, Theeramunkong, T, Governatori, G, Liu, F (eds). Springer International Publishing: Cham; 154–168(2021). (2021a).

20.

Chao, W., Han, D., Jie, X. & Engineering, C. 2 Multi-Rotor UAV Autonomous Tracking and Obstacle Avoidance Based on Improved DDPG. In 2021 nd International Conference on Artificial Intelligence and (ICAICE); 261–267(2021). (2021).

Competing interests

(mandatory)

The authors declare no competing interests.

Funding

This work is supported by the Scientific Research Fund of National Natural Science Foun35 dation of China (Grant 62372168), the Hunan Provincial Natural Science Foundation 36 of China (Grants 2023JJ30266), Key Laboratory of AI and Information Processing, Education Department of Guangxi Zhuang Autonomous Region (Hechi University)(Grants 2024GXZDSY010)..

Author Contribution

Z.H. conceived the experiment(s). X.W. and H.S. conducted the experiment(s), and X.W. performed statistical analysis and figure generation. All authors reviewed the manuscript.

Yes