1. Introduction
With the advancement of intelligent transportation systems, the operating environments of autonomous vehicles are gradually extending from structured roads to unstructured and diverse complex road scenarios. Variations in road surface types directly affect tire–road interaction, vehicle dynamic response, and driving stability. These factors serve as critical inputs for adaptive path planning and suspension control strategies [1],[2]. Therefore, accurate identification of road surface types enables real-time adjustment of control parameters such as vehicle speed, steering angle, and suspension stiffness during operation, thereby enhancing both driving safety and ride comfort [3].
Recent advances in road surface recognition have increasingly leveraged vision-based and multimodal perception frameworks, integrating RGB cameras, LiDAR, or radar sensors to enhance environmental awareness [4],[5]. These approaches have demonstrated promising performance under favorable sensing conditions and have benefited from rapid progress in deep learning architectures and large-scale visual datasets [6],[7],[8].
However, despite these advances, vision-dominated and multimodal methods still exhibit inherent limitations in real-world deployment. LiDAR-based approaches may suffer from signal attenuation and scattering under adverse environmental conditions, leading to unstable or incomplete surface perception [9]. Similarly, visual sensors are highly sensitive to illumination changes, shadows, weather conditions such as rain, fog, and snow, and occlusions caused by surrounding traffic or road contamination [10],[11],[12]. These limitations significantly constrain the robustness and continuity of road surface recognition in unstructured or degraded driving environments.
In contrast, inertial measurement units (IMUs) provide a fundamentally different sensing modality by capturing high-frequency physical interaction signals between the vehicle and the road surface. The vibration responses recorded by accelerometers and gyroscopes directly reflect tire–road contact dynamics and structural transmission characteristics, and are largely immune to environmental disturbances such as lighting variation or visual occlusion. As a result, inertial sensing offers continuous, reliable, and cost-effective perception capability, making it particularly suitable for road surface classification in complex and dynamically changing environments [13],[14],[15].
Inertial sensor-based road surface classification methods primarily focus on two aspects: road quality classification and road surface type classification. Road quality classification focuses on assessing the condition of the road surface, such as determining the International Roughness Index (IRI) and identifying issues like potholes, cracks, or damage. This is crucial for road maintenance and safe driving of vehicles. In road quality classification research, Aleadelat et al. [16] utilized signal processing and pattern recognition techniques to predict the IRI, categorizing road quality into three levels: good, average, and poor. Guo et al. [17] developed a CatBoost algorithm with sequence model optimization based on vibration acceleration data. This approach addressed the challenge of distinguishing road surface features caused by the highly nonlinear vehicle-coupled vibrations, improving the accuracy of road quality classification. Tiwari et al. [18] proposed a deep learning method called RoadCare, which successfully classifies road quality into three categories: good, average, and poor. Lekshmipathy et al. [19] trained an artificial neural network (ANN) model to identify road conditions such as cracks, potholes, and patches, thereby enabling road quality classification. Carlos et al. [20] established a dataset featuring potholes, cracks, patches, and speed bumps to enhance the accuracy of road quality classification.
Compared to road quality classification, road surface type classification focuses more on identifying the categories of surface materials and structures, such as asphalt, concrete, and dirt. Accurate road surface type information helps intelligent vehicles promptly adjust planning and control strategies, improving driving smoothness and safety [21]. Therefore, this paper focuses on road surface type classification. In road surface type classification research, Bajic et al. [22] developed a machine learning method based on vertical acceleration data, achieving a classification accuracy of 0.67 and a recall rate of 0.76. To enhance the accuracy of road surface type classification, Kim et al. [23] proposed a data preprocessing method that extracts frequency band features from inertial sensor data. They then built an ANN, which successfully classified three types of road surfaces. Wu et al. [24] proposed a DeepSense neural network, which, combined with specific preprocessing and feature extraction methods, was able to distinguish three road types. However, its accuracy still needs to be improved. Varona et al. [25] developed a convolutional neural network (CNN) for classifying road surface types using acceleration data from inertial sensors. While CNNs can effectively capture the static features of data, they overlook the long-term dependencies inherent in time-series data.
Recent architectures such as transformers and graph neural networks have been widely applied to time-series analysis tasks, but their high computational cost poses challenges in scenarios involving small datasets or embedded deployment. In contrast, CNN and LSTM networks offer lightweight structures with effective temporal modeling capabilities, and they demonstrate greater stability when handling vibration variations and long-term dependencies in inertial sensor data. For example, Yin et al. [26] proposed a CNN-LSTM network to improve classification accuracy for samples with long-term temporal dependencies. In addition, Hadj-Attou et al. [27] applied preprocessing techniques such as the Fourier transform and wavelet transform to acceleration data from inertial sensors and used the processed data to train a CNN-LSTM model, leveraging the temporal modeling capability of LSTM to improve classification accuracy.
Although CNN-LSTM networks based on inertial sensors have demonstrated strong performance in road surface classification, existing methods still face several technical limitations. On the one hand, some studies use a single-point IMU mounted on the vehicle body, which fails to capture the diverse vibration characteristics exhibited by components such as the suspension and tires across different frequency bands. This results in incomplete surface perception. On the other hand, current approaches typically adopt fixed network structures without tailored processing for distinct frequency or temporal features, and lack automated hyperparameter tuning mechanisms. Consequently, their generalization ability and classification accuracy remain limited. This is particularly evident when distinguishing between highly similar surfaces such as asphalt and concrete, where the misclassification rate remains high, significantly hindering practical deployment in real-world autonomous driving applications. To address these challenges, this study proposes a road surface type classification framework, which leverages multi-point inertial sensing to enhance surface perception. The main contributions of this work are summarized as follows:
1. A spatially distributed inertial sensing paradigm is proposed to enhance road surface perception beyond conventional single-point IMU configurations. By simultaneously collecting vibration responses from tires, suspension, and vehicle chassis, the proposed framework captures heterogeneous dynamic characteristics across multiple frequency bands. This multi-level perception strategy significantly improves the discriminative representation of highly similar road surfaces, which remains a persistent challenge in existing inertial-based methods.
2. A multi-head attention–enhanced CNN-LSTM architecture is developed to adaptively model the complex spatiotemporal dependencies inherent in inertial vibration signals. Unlike conventional CNN-LSTM networks with uniform feature weighting, the introduced attention mechanism enables the model to dynamically emphasize informative time segments and sensor channels across different frequency responses. This design effectively strengthens feature selection under mixed and perturbed vibration conditions, leading to improved robustness and classification consistency.
3. An automated hyperparameter optimization strategy based on the Kepler Optimization Algorithm (KOA) is incorporated to jointly optimize key architectural parameters of the CNN-LSTM and multi-head attention modules. By replacing empirical and fixed parameter selection with physics-inspired global optimization, this framework enhances model generalization and stability under limited training data. This coordinated optimization scheme provides a systematic solution to performance degradation commonly observed in inertial-based deep learning models when handling similar surface categories.
The remainder of this paper is organized as follows. Section 2 describes the sensor configuration scheme used for road surface dataset collection. Section 3 provides a detailed explanation of the proposed road surface classification method, including the network architecture of each component. Section 4 presents the experimental results on different datasets. Section 5 concludes the paper.
2. Sensor configuration for data acquisition
2.1. Sensor configuration scheme
In this study, vehicle speed was maintained within a controlled operational range during data acquisition to minimize excessive excitation variability. The inertial sensors were installed on the chassis, suspension, and tires to capture vibration features across high, medium, and low frequency bands [28]. Compared to conventional methods that rely on a single IMU mounted on the chassis, the proposed sensing configuration establishes a multi-frequency and multi-level perception architecture. This approach significantly improves the detection of subtle vibration signals and enhances the accuracy of road condition perception.
Based on the model shown in Fig. 1, the vehicle body transmits the characteristic response of the tires through the suspension. The motion equation of the vehicle body is given by formula (1):

$$M_b \ddot{x}_b + C_s(\dot{x}_b - \dot{x}_w) + K_s(x_b - x_w) = 0 \quad (1)$$

where $M_b$ is the mass of the vehicle body, $x_b$ is the vehicle body displacement, $\dot{x}_b$ is the vehicle body velocity, $\ddot{x}_b$ is the vehicle body acceleration, and $C_s$ and $K_s$ are the damping coefficient and spring stiffness of the suspension.
The vibration of the tire is induced by road surface irregularities and transmitted through the vehicle vibration. Its acceleration dynamics are given by formula (2):

$$M_w \ddot{x}_w + C_s(\dot{x}_w - \dot{x}_b) + K_s(x_w - x_b) + K_t(x_w - q) = 0 \quad (2)$$

where $M_w$ is the mass of the tire, $x_w$ is the tire displacement, $\dot{x}_w$ is the tire velocity, $\ddot{x}_w$ is the tire acceleration, $K_t$ is the spring stiffness of the tire, and $q$ is the external excitation caused by road surface unevenness. The model reveals the vibration propagation path from the road surface to the tire and subsequently to the vehicle body, providing a theoretical basis for the placement of inertial sensors.
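To make the propagation path concrete, the quarter-car dynamics of Eqs. (1)-(2) can be simulated numerically. The sketch below is an illustration only: the masses, stiffnesses, damping coefficient, time step, and road step height are assumed values, not parameters from this study.

```python
import numpy as np

def quarter_car_step(state, q, dt, Mb=300.0, Mw=40.0,
                     Cs=1500.0, Ks=16000.0, Kt=160000.0):
    """Advance the two-mass model of Eqs. (1)-(2) by one explicit Euler step.
    state = (xb, vb, xw, vw); q is the road profile input.
    All parameter values are illustrative assumptions."""
    xb, vb, xw, vw = state
    # Eq. (1): body acceleration from suspension spring and damper forces
    ab = (-Cs * (vb - vw) - Ks * (xb - xw)) / Mb
    # Eq. (2): tire acceleration adds the tire-spring force from road input q
    aw = (Cs * (vb - vw) + Ks * (xb - xw) - Kt * (xw - q)) / Mw
    return (xb + vb * dt, vb + ab * dt, xw + vw * dt, vw + aw * dt)

# Simulate a 5 mm step bump and record the body velocity response
state = (0.0, 0.0, 0.0, 0.0)
body_vel = []
for k in range(2000):
    q = 0.005 if k > 100 else 0.0   # road step after 10 ms
    state = quarter_car_step(state, q, dt=1e-4)
    body_vel.append(state[1])
```

Running such a simulation shows the tire responding quickly to the road input while the body follows with a slower, attenuated motion, which is exactly the multi-band behavior the distributed sensor layout is meant to capture.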
It should be emphasized that the vehicle dynamic equations presented in Eqs. (1) and (2) are not directly embedded into the proposed deep learning architecture, nor are they used for explicit parameter estimation or physics-based supervision. Instead, these equations serve as a theoretical foundation for sensor placement and vibration signal interpretation in the proposed framework. The tire is in direct contact with the road surface and most accurately reflects road-induced dynamics. The inertial sensor mounted on the tire (Sensor 1) captures high-frequency, high-energy vibration signals caused by surface irregularities, retaining fine-grained features critical for classification. The suspension connects the tire to the vehicle body and transmits and attenuates vibrations. The sensor mounted on the suspension (Sensor 2) records medium-frequency data that, after being filtered by the spring and damper system, still captures coupling characteristics between mechanical structures and road variations. The vehicle body is sensitive to low-frequency vibrations.
The sensor placed on the chassis (Sensor 3) captures these low-frequency, large-scale variations, providing stable information on gradual road condition changes. This multi-band, multi-level sensor layout effectively expands the perceptual dimension and forms a comprehensive vibration information fusion system. It ensures the completeness and diversity of the acquired data, offering richer and more robust features to support the classification performance of the deep learning network.
The data acquisition platform is shown in Fig. 2. The system consists of six IMUs mounted on the left and right tires, suspension, and chassis of the vehicle. During data collection, a central processing unit synchronously records IMU signals, while a smartphone is used to log vehicle speed and trajectory information. Additionally, a camera and a dashcam capture real-time road images to support subsequent data annotation and validation.
2.2. Coordinate system settings
According to Society of Automotive Engineers (SAE) standards, the vehicle coordinate system is defined as shown in Fig. 1: the X-axis points in the forward direction, the Y-axis points to the right side, and the Z-axis points vertically downward [29]. The coordinate system of the inertial sensor is also shown in Fig. 1. To ensure consistency of the collected data, Euler angle conversion is used to re-align the inertial sensor data. When the vehicle is stationary, the acceleration values can be assumed to be $a_x = 0\,\mathrm{m/s^2}$, $a_y = 0\,\mathrm{m/s^2}$, and $a_z = 9.81\,\mathrm{m/s^2}$. In the reference coordinate system, rotation around the Z-axis is defined as the yaw angle $\psi$, rotation around the X-axis as the roll angle $\phi$, and rotation around the Y-axis as the pitch angle $\theta$. The transformation matrix for Euler angle rotations can be expressed as formula (3):

$$R = R_z(\psi)\, R_y(\theta)\, R_x(\phi) \quad (3)$$
where $R_x(\phi)$ is the rotation around X, $R_y(\theta)$ is the rotation around Y, and $R_z(\psi)$ is the rotation around Z. The rotation matrices are expressed as formulas (4)-(6):

$$R_x(\phi) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\phi & -\sin\phi \\ 0 & \sin\phi & \cos\phi \end{bmatrix} \quad (4)$$

$$R_y(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix} \quad (5)$$

$$R_z(\psi) = \begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad (6)$$
Given the Euler angles $\phi$, $\theta$, $\psi$ and the acceleration $a^s$ measured by the inertial sensor, the global acceleration $a^g$ can be obtained by formula (7):

$$a^g = R\, a^s \quad (7)$$
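As a sketch of formulas (3)-(7), the rotation matrices and the sensor-to-global transformation can be written as follows. The $R_z R_y R_x$ composition order is an assumption consistent with the yaw/pitch/roll definitions above and should be checked against the paper's Eq. (3).

```python
import numpy as np

def rotation_matrix(phi, theta, psi):
    """R = Rz(psi) @ Ry(theta) @ Rx(phi), cf. formulas (3)-(6).
    phi = roll (X), theta = pitch (Y), psi = yaw (Z)."""
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(phi), -np.sin(phi)],
                   [0, np.sin(phi),  np.cos(phi)]])
    Ry = np.array([[ np.cos(theta), 0, np.sin(theta)],
                   [0, 1, 0],
                   [-np.sin(theta), 0, np.cos(theta)]])
    Rz = np.array([[np.cos(psi), -np.sin(psi), 0],
                   [np.sin(psi),  np.cos(psi), 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx

# Formula (7): re-express a sensor-frame acceleration in the global frame
a_sensor = np.array([0.0, 0.0, 9.81])
a_global = rotation_matrix(0.0, 0.0, 0.0) @ a_sensor  # identity rotation
```

With zero Euler angles the transformation is the identity, so a stationary sensor should report the assumed rest acceleration unchanged.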
3. Road surface type classification method
To overcome the limitations of traditional deep learning models in feature extraction and parameter optimization, this study focuses on deep learning network construction, feature extraction, and parameter optimization. As shown in Fig. 3, this paper introduces a deep learning framework named KCLMAnet for road surface type classification. The first component is the data acquisition module, which collects multi-band, multi-level data from inertial sensors installed at different positions on the vehicle and uses a sliding window to segment the data into samples.
The second component is the feature extraction module, which constructs a CNN-LSTM network to capture local features and long-term dependencies from the time-series data. By incorporating a multi-head attention mechanism, the features are divided into multiple subspaces for parallel processing. This allows the model to better focus on the multidimensional characteristics of the input data, enhancing feature extraction and ultimately improving the model's generalization ability. The output of the network is the road surface type. The third component is the parameter optimization module. Leveraging small-sample data, the Kepler optimization algorithm is used to identify the optimal parameter combination for the CNN-LSTM. This process involves fine-tuning the network structure, optimizing the parameters of the multi-head attention mechanism, and reducing overfitting risks to improve the model's classification performance.
3.1. CNN-LSTM model
First, a sliding window approach is applied to segment the time-series data into fixed-length samples of size $T$. If each sensor provides $D$-dimensional features, the resulting input forms a tensor $X \in \mathbb{R}^{N \times T \times D}$, where $N$ denotes the number of sensors. These segmented samples are then fed into the CNN-LSTM network for feature extraction and temporal modeling [30]. In road surface type classification, the CNN extracts local features from time-series data, effectively capturing short-term vibration variations and spatial information within temporal patterns. The LSTM network models the temporal dependencies, capturing long-term relationships within the data. This makes it particularly well suited for classification tasks involving sequential data.
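The segmentation step described above can be sketched as follows. The window length T = 128 and stride of 64 are illustrative choices, not the values used in this study, and the stream shape assumes six 3-axis IMUs flattened into one feature dimension.

```python
import numpy as np

def sliding_window(data, T, stride):
    """Segment a (num_timesteps, features) stream into
    (num_windows, T, features) fixed-length samples."""
    windows = []
    for start in range(0, data.shape[0] - T + 1, stride):
        windows.append(data[start:start + T])
    return np.stack(windows)

# e.g. six IMUs with 3-axis acceleration, flattened to 18 feature channels
stream = np.random.randn(1000, 6 * 3)
samples = sliding_window(stream, T=128, stride=64)
```

An overlapping stride (here 50%) yields more training samples from the same recording, which is a common choice for small inertial datasets.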
The structure is shown in Fig. 4. The CNN component consists of convolutional layers, activation functions, batch normalization layers, spatial dropout and residual connections, and global average pooling layers. When processing time-series data, the core operation of a convolutional layer is to extract local features from the data using filters. A filter is a small weight matrix, and the number of filters determines the variety of features that can be learned. This convolution operation helps reduce computational overhead, ensuring the network maintains high efficiency when processing large-scale data. The size of each filter is determined by the kernel size. The convolution operation for a single filter is expressed by formula (8):

$$y_i = \sum_{j=1}^{k} w_j\, x_{(i-1)s+j} + b \quad (8)$$
where $y_i$ is the output value generated by the filter, $x_{(i-1)s+j}$ is the corresponding element of the input sequence $X$, $w_j$ is the weight of the convolution kernel, and $b$ is the bias.
For each filter with kernel size $k$, stride $s$, and input data length $T$, the length of the output feature map after convolution is $L = \lfloor (T-k)/s \rfloor + 1$. Assuming there are $f$ filters, the output features form a matrix with $f$ columns, where each column represents the convolution output of one filter, so the matrix size is $L \times f$. The feature representation output by the network can be expressed as formula (9).
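The output-length computation can be checked directly; the kernel size and stride values below are illustrative.

```python
def conv_output_length(T, k, s):
    """Length L of the feature map after a valid 1-D convolution:
    L = floor((T - k) / s) + 1."""
    return (T - k) // s + 1

# e.g. a window of T = 128 samples, kernel size k = 6, stride s = 1
L = conv_output_length(128, 6, 1)
# with f filters, the CNN output for one window is an L x f matrix
```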
After convolution, LeakyReLU is applied to extract nonlinear features from the data. In road surface classification tasks, fine-grained features are crucial for determining the surface type. Therefore, instead of directly applying pooling, this study applies batch normalization after convolution. This helps reduce the loss of critical feature information, maintain efficient learning from the data, and improve network stability. Advanced network architectures such as ResNet [31] and DenseNet [32] have demonstrated that employing batch normalization can enhance network performance. The first step of batch normalization is calculating the batch mean $\mu_j$ and variance $\sigma_j^2$, as shown in formula (10). The second step is normalizing the input features, as shown in formula (11). The final step is scaling and shifting to enhance the flexibility of the normalized data, as shown in formula (12).
$$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_{i,j}, \quad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_{i,j}-\mu_j\right)^2 \quad (10)$$

$$\hat{x}_{i,j} = \frac{x_{i,j}-\mu_j}{\sqrt{\sigma_j^2+\epsilon}} \quad (11)$$

$$y_{i,j} = \gamma\,\hat{x}_{i,j} + \beta \quad (12)$$

where $x_{i,j}$ is the value of the $i$-th sample on the $j$-th feature dimension, $\hat{x}_{i,j}$ is the normalized value, $\epsilon$ is a small constant, $y_{i,j}$ is the output, and $\gamma$ and $\beta$ are the learnable parameters.
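A minimal NumPy sketch of the batch normalization steps in formulas (10)-(12):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Per-feature batch normalization, formulas (10)-(12)."""
    mu = x.mean(axis=0)                     # (10) batch mean per feature
    var = x.var(axis=0)                     # (10) batch variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)   # (11) normalization
    return gamma * x_hat + beta             # (12) learnable scale and shift

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
```

With unit scale and zero shift, each output feature has approximately zero mean and unit variance across the batch.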
To address the vanishing gradient problem, spatial dropout and residual connections were incorporated into the CNN network. These ensure that the network can effectively extract features and make accurate predictions even when handling large-scale data. Finally, global average pooling is employed for dimensionality reduction and global feature extraction. This approach not only reduces the model's parameter count but also effectively extracts features from each channel, capturing the overall trend of the entire sequence data and improving the classification accuracy.
The output feature sequence $Y = \{y_1, y_2, \ldots, y_L\}$ from the CNN is fed into the LSTM network, where $y_t$ is the local feature at time step $t$. The gating mechanism of the LSTM network, consisting of input gates, forget gates, and output gates, effectively captures the long-term dependencies present in the features. It retains important long-term information while discarding less significant data. A schematic representation of the LSTM cell gating mechanism is shown in Fig. 5.
The forget gate determines how much information from the previous memory cell $c_{t-1}$ is retained, as given by formula (13):

$$f_t = \sigma\!\left(W_f\,[h_{t-1}, y_t] + b_f\right) \quad (13)$$

where $f_t$ is the forget gate, $\sigma$ is the activation function, $W_f$ is the weight of the forget gate, $h_{t-1}$ is the hidden state at the previous time step, $y_t$ is the input at the current time step, and $b_f$ is the bias of the forget gate.
The function of the input gate is given in formula (14):

$$i_t = \sigma\!\left(W_i\,[h_{t-1}, y_t] + b_i\right), \quad \tilde{c}_t = \tanh\!\left(W_c\,[h_{t-1}, y_t] + b_c\right) \quad (14)$$

where $i_t$ is the output of the input gate, which determines how much of the current input is written into the cell state $c_t$; $\tilde{c}_t$ is the candidate cell state; $\tanh$ is the activation function; $W_i$ and $W_c$ are the weight matrices; and $b_i$ and $b_c$ are the biases of the input gate and cell state.
The update of the cell state is given in formula (15):

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \quad (15)$$

The formula of the output gate is given in (16):

$$o_t = \sigma\!\left(W_o\,[h_{t-1}, y_t] + b_o\right) \quad (16)$$

where $h_t$ is the hidden state and the final output of the LSTM, which can be expressed by formula (17):

$$h_t = o_t \odot \tanh(c_t) \quad (17)$$
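The gate computations of formulas (13)-(17) can be sketched in NumPy as a single cell step. Packing the four gate weights into dictionaries, and the small random initialization, are implementation conveniences assumed here, not the paper's notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(y_t, h_prev, c_prev, W, b):
    """One LSTM cell step per formulas (13)-(17).
    z = [h_prev, y_t] is the concatenated input of each gate."""
    z = np.concatenate([h_prev, y_t])
    f = sigmoid(W['f'] @ z + b['f'])           # (13) forget gate
    i = sigmoid(W['i'] @ z + b['i'])           # (14) input gate
    c_tilde = np.tanh(W['c'] @ z + b['c'])     # (14) candidate cell state
    c = f * c_prev + i * c_tilde               # (15) cell state update
    o = sigmoid(W['o'] @ z + b['o'])           # (16) output gate
    h = o * np.tanh(c)                         # (17) hidden state / output
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W = {k: rng.normal(size=(d_h, d_h + d_in)) * 0.1 for k in 'fico'}
b = {k: np.zeros(d_h) for k in 'fico'}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, b)
```

Because the output gate and tanh both saturate in (-1, 1), the hidden state is bounded, which keeps the recurrence numerically stable over long sequences.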
A two-layer LSTM network was constructed for time-series data processing. The ELU activation function was chosen, and each LSTM layer's dimension (d_model) is set to match the model dimension of the multi-head attention mechanism, ensuring that no dimension mismatch occurs. Subsequently, batch normalization and dropout layers are applied to reduce the risk of overfitting. This network can effectively capture more complex temporal features, thereby enhancing the classification accuracy.
3.2. Multi-head attention mechanism
The attention mechanism can dynamically assign weights to different parts of the input, allowing the model to adjust its focus based on the importance of various data features [33]. The attention mechanism is determined by three key parameters: Q (Query), K (Key), and V (Value), where Q represents the part to focus on, K corresponds to the candidates to match against the focus, and V generates the output based on the matching results. The calculation process is given by formulas (18)-(21).
Formula (18) computes the similarity between the candidates and the focused parts, while formula (19) scales the similarity scores. The scaling factor $\sqrt{d_k}$ is used to prevent vanishing or exploding gradients, where $d_k$ is the dimensionality of each candidate. Formula (20) computes the attention scores from the scaled similarities, indicating the importance of each candidate to the current focus. Formula (21) calculates the final output as a weighted sum of the candidates, using their respective attention weights.
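The scaled dot-product attention of formulas (18)-(21) can be sketched as:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Formulas (18)-(21): similarity, scaling by sqrt(d_k),
    softmax weighting, and weighted sum of the values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (18)-(19)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # (20) attention weights
    return weights @ V, weights                      # (21) weighted sum

Q = np.random.randn(5, 16)   # 5 query positions, d_k = 16
K = np.random.randn(7, 16)   # 7 candidate positions
V = np.random.randn(7, 16)
out, w = scaled_dot_product_attention(Q, K, V)
```

The softmax guarantees that each query's weights over the candidates are non-negative and sum to one.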
The multi-head attention mechanism is an extension of the single attention, and its structure is illustrated in Fig. 6. This structure divides the input features into different subspaces based on the number of attention heads, then applies the attention mechanism in parallel across these subspaces. This can enhance the model’s feature extraction capability [34]. Each head has its own independent linear transformation, allowing each subspace to adjust its focus according to varying vibration frequencies and patterns. This enables the model to learn features of the input data from different perspectives, enhancing the recognition of similar vibration characteristics and improving its ability to capture subtle features.
Assuming there are h attention heads, each with independent dimensions for Q, K, and V, the computation process for multi-head attention is given by formulas (22)-(25).
Formula (22) represents the linear transformation for each attention head, where Wi is the weight matrix. Formula (23) denotes the output of each head, and formula (24) represents the concatenation of the outputs of all attention heads. To ensure consistency during concatenation, the dimensionality of the multi-head attention mechanism (d_model) must be divisible by the number of heads h (num_heads), thereby avoiding mismatches that may cause network errors. Finally, formula (25) applies a linear transformation to the concatenated vector, yielding the final output.
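A minimal sketch of formulas (22)-(25), including the divisibility constraint on d_model; the random matrices stand in for the learned per-head projections and are not trained weights.

```python
import numpy as np

def multi_head_attention(X, num_heads, rng):
    """Formulas (22)-(25): per-head projections, parallel attention,
    concatenation, and a final linear map."""
    T, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # (22) independent linear transformations per head (placeholder weights)
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d_k)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        heads.append(w @ V)                    # (23) per-head output
    concat = np.concatenate(heads, axis=-1)    # (24) concatenation of all heads
    Wo = rng.normal(size=(d_model, d_model))
    return concat @ Wo                         # (25) final linear transformation

rng = np.random.default_rng(0)
out = multi_head_attention(rng.normal(size=(10, 48)), num_heads=2, rng=rng)
```

With d_model = 48 and two heads, each subspace has dimension 24 and the concatenated output restores the original feature width.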
3.3. Kepler optimization algorithm
To dynamically adjust the parameters of the deep learning network, this study employs the Kepler optimization algorithm to optimize the network’s primary parameters and the attention mechanism’s parameters. The Kepler optimization algorithm [35] is a metaheuristic method inspired by the motion of planets. It simulates gravitational interactions between celestial bodies to solve global optimization problems. In this algorithm, the planets in the solar system represent the search space, and each planet’s position corresponds to a potential solution during the optimization process. The objective function is analogous to the planet’s mass or gravitational force, used to evaluate the quality of the solution.
The orbital velocity of a planet represents the trend and adjustment process of the solution. It determines the direction and magnitude of adjustments in each iteration. The orbital velocity is determined by the current velocity and the gravitational interactions with other individuals. These gravitational forces simulate collaborative search behavior among the solutions. Planets with greater mass exert stronger gravitational forces, reflecting better objective function values. Hence, the global optimum is typically represented by the Sun. In the KOA, gravity not only influences the update of an individual’s position, but it can also cause the individual to shift toward potential global optima. The implementation process of the KOA is illustrated in Fig. 7.
In the implementation of KOA, initialization is a crucial step. This phase involves generating each individual's starting position, velocity, and mass. The initialization formula is given in (26):

$$x_i(0) = \mathrm{Random}(x_{\min}, x_{\max}) \quad (26)$$

where $x_i(0)$ represents the initial solution position of individual $i$, $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the solution space, respectively, and $\mathrm{Random}(x_{\min}, x_{\max})$ generates a random number in the interval $[x_{\min}, x_{\max}]$.
The initial velocity of each individual is typically set to a small random value. This ensures that individuals start searching at a controlled speed without making abrupt changes in the solution space. After initializing positions and velocities, the mass of each individual is calculated based on the objective function values (fitness). This often requires a small set of training samples to evaluate the fitness of each individual.
The core formulas of the KOA include the gravitational force calculation (27), the velocity update (28), and the position update (29). By iteratively updating positions and velocities, the algorithm gradually approaches the optimal solution.
$$F_{ij} = G\,\frac{m_i\, m_j}{r_{ij}^2} \quad (27)$$

where $F_{ij}$ is the gravitational force between individuals $i$ and $j$, $G$ is the gravitational constant, $m_i$ and $m_j$ are the masses, and $r_{ij}$ is the distance between the individuals.

$$v_i(t+1) = v_i(t) + \frac{\sum_{j \neq i} F_{ij}}{m_i}\,\Delta t \quad (28)$$

where $v_i(t)$ and $v_i(t+1)$ are the velocities of individual $i$ at $t$ and $t+1$, $\sum_{j \neq i} F_{ij}$ is the resultant gravitational force of all other individuals on individual $i$, and $\Delta t$ is the time step.

$$x_i(t+1) = x_i(t) + v_i(t+1)\,\Delta t \quad (29)$$

where $x_i(t+1)$ is the position of individual $i$ at $t+1$.
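A greatly simplified sketch of this search loop, assuming a gravity-like pull toward the current best individual (the "Sun") and omitting the adaptive orbital mechanisms of the full KOA [35]; the damping factor, step size, and bounds below are illustrative choices.

```python
import numpy as np

def toy_koa(objective, bounds, n_planets=20, n_iter=100, seed=0):
    """Toy version of the KOA loop, cf. formulas (26)-(29):
    random init, fitness-derived masses, gravity-like velocity
    updates toward the best solution, then position updates."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_planets, len(lo)))   # (26) positions
    v = rng.uniform(-0.1, 0.1, size=x.shape)             # small initial velocities
    dt, G = 0.1, 1.0
    for _ in range(n_iter):
        fit = np.array([objective(p) for p in x])
        mass = fit.max() - fit + 1e-9        # better (lower) fitness -> larger mass
        mass /= mass.sum()
        sun = x[np.argmin(fit)]              # the Sun = current best solution
        r = np.linalg.norm(x - sun, axis=1, keepdims=True) + 1e-9
        F = G * mass[:, None] * (sun - x) / r**2          # (27) gravity-like pull
        v = 0.9 * v + dt * F                              # (28) damped velocity update
        x = np.clip(x + dt * v, lo, hi)                   # (29) position update
    best = x[np.argmin([objective(p) for p in x])]
    return best

# minimize the sphere function over [-5, 5]^2
best = toy_koa(lambda p: np.sum(p**2),
               (np.array([-5.0, -5.0]), np.array([5.0, 5.0])))
```

In the actual framework, each "position" encodes a hyperparameter tuple (filters, kernel_size, d_model, num_heads, dropout_rate) and the objective is the validation loss of the trained network.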
In this paper, the Kepler optimization algorithm is used to optimize the parameters num_heads, d_model, dropout_rate, filters, and kernel_size in the road surface classification network. A detailed description of the algorithmic procedure is presented in Table 1 as pseudocode.
Table 1
The pseudocode of the proposed method

Algorithm 1 KCLMAnet
Input: Multi-position IMU time-series data
Output: Predicted road surface label ŷ
1  Data acquisition and preprocessing:
2    Collect the IMU data and apply the Euler transformation
3    Segment the input X using a sliding window
4  Define parameter ranges: filters [64, 128], kernel_size [2, 8], d_model [48, 128], num_heads [0, 5], dropout [0.1, 0.3]
5  Ensure d_model % num_heads == 0 for attention compatibility
6  Randomly initialize M candidate planets of the KOA
7  Fitness evaluation:
8  For each planet:
9    Apply the CNN to extract local spatial features
10   Pass the CNN output to an LSTM with hidden size = d_model
11   Compute multi-head attention features; the output is divided into h subspaces
12   Pass the result to the second LSTM and a dense layer to obtain the feature vector used for classification
13   Compute the validation loss as the fitness value
14   Update planet positions and velocities using the gravitational rules
15 Return the final model hyperparameters and prediction ŷ
4. Experiments and results
4.1. Evaluation metrics
In the classification task, commonly used performance metrics include accuracy, precision, recall, and F1-score [30]. Accuracy is the most straightforward evaluation metric, representing the proportion of correctly classified samples out of the total number of samples. It is given by formula (30).
Where, TP (true positive) represents the number of samples correctly predicted as the positive class. TN (true negative) refers to the number of samples correctly predicted as the negative class. In multi-class classification, negative samples generally refer to misclassified categories. FP (false positive) is the number of negative samples incorrectly predicted as positive, and FN (false negative) is the number of positive samples incorrectly predicted as negative.
Precision is used to evaluate the accuracy of the model’s predictions for a specific type of road surface. It is given by formula (31).
Recall reflects the sensitivity to each type of road surface, and it is used to evaluate the proportion of correctly identified samples. It is given by formula (32).
While a high recall means the model correctly identifies more positive samples, optimizing recall alone can increase the number of false positives. Hence, the F1-score is often considered. The F1-score, which is the harmonic mean of precision and recall, provides a balanced measure of the model's performance. It is given by formula (33).
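Formulas (30)-(33) for a single class can be computed directly from the four counts; the counts below are illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1-score, formulas (30)-(33)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)    # (30)
    precision = tp / (tp + fp)                    # (31)
    recall = tp / (tp + fn)                       # (32)
    f1 = 2 * precision * recall / (precision + recall)  # (33)
    return accuracy, precision, recall, f1

# e.g. 80 true positives, 10 false positives, 20 false negatives, 90 true negatives
acc, prec, rec, f1 = classification_metrics(80, 10, 20, 90)
```

Here precision (80/90) exceeds recall (80/100), and the F1-score sits between the two as their harmonic mean.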
In multi-class problems, the AUC-ROC (Area Under Curve - Receiver Operating Characteristic) [36] is a metric used to measure the model's ability to distinguish between different classes. An ROC curve evaluates the performance of a classification model by plotting the true positive rate (TPR) against the false positive rate (FPR). The TPR is given by formula (34), and the FPR is given by formula (35).
Ideally, the ROC curve should be as close as possible to the top-left corner, indicating that the model achieves a high TPR while maintaining a low FPR. The area under the ROC curve (AUC) serves as a key performance metric. When AUC = 1, it means the model perfectly distinguishes all classes, achieving entirely correct classification. When AUC ≤ 0.5, it indicates poor classification performance.
In addition to the metrics mentioned above, a confusion matrix is a valuable tool for evaluating classification performance. By presenting the relationship between true labels and predicted labels, it provides insights into how well the model performs across different categories. This road surface classification task involves five categories, denoted as A, B, C, D, and E. The confusion matrix for this scenario is represented as shown in formula (36).
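Building the confusion matrix of formula (36) from label pairs can be sketched as follows, with integer classes 0-4 standing in for surfaces A-E; the example label lists are illustrative.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=5):
    """Rows index the true class, columns the predicted class,
    as in the five-class matrix of formula (36)."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 2, 3, 4, 4]
y_pred = [0, 1, 1, 2, 3, 4, 0]
cm = confusion_matrix(y_true, y_pred)
```

The diagonal counts correct predictions per class, so the trace divided by the total equals the overall accuracy of formula (30).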
Two datasets were used in the experiments. The first is the publicly available Passive Vehicular Sensors dataset, which was employed to evaluate the model’s applicability under standard conditions. The second is a self-collected dataset comprising five common road surface types: dirt road, asphalt, bluestone, cobblestone, and concrete. This dataset was primarily used to assess the robustness and scalability of the proposed algorithm.
4.2. Comparison experiments on public dataset
To evaluate the classification performance of the proposed algorithm, experiments were conducted on the publicly available Passive Vehicular Sensors Dataset [37]. This dataset includes three types of road surfaces: asphalt, cobblestone, and dirt. These data were collected by different drivers operating various vehicles in the municipality of Anita Garibaldi, in upstate Santa Catarina, Brazil. The dataset is divided into a training set and a test set, with a split ratio of 65% for training and 35% for testing. In the experiments, we evaluated the classification performance of K-means clustering (KMC), K-nearest neighbors (KNN), support vector machine (SVM), LSTM, CNN, and CNN-LSTM [37],[38]. Table 2 shows the test results of all methods on the Passive Vehicular Sensors Dataset.
The results show that KCLMAnet outperforms the other models in accuracy, precision, recall, and F1-score. Traditional models such as KNN and SVM achieved only 75.42% and 75.59% accuracy, respectively. Deep learning models such as CNN and LSTM improved performance, reaching accuracies of 93.20% and 92.72% and F1-scores of 91.89% and 91.49%, respectively, but still fell short of KCLMAnet. This improvement can be attributed to the KOA, which effectively tuned the model's parameters and enhanced overall network performance. The CNN and LSTM components extract road surface features and model the dynamic characteristics of the time-series data, respectively, while the multi-head attention mechanism helps the model focus on critical features, improving both classification accuracy and robustness.
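The precision, recall, and F1 figures reported in Table 2 are class-wise quantities averaged over classes; a minimal sketch of macro-averaging, assuming rows of the confusion matrix index true classes and columns predictions:

```python
def macro_metrics(cm):
    """Macro-averaged (precision, recall, F1) from a square confusion
    matrix whose rows are true classes and columns predicted classes."""
    n = len(cm)
    precs, recs, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[r][k] for r in range(n)) - tp   # predicted k, wrong
        fn = sum(cm[k]) - tp                         # true k, missed
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precs.append(p); recs.append(r); f1s.append(f)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n
```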
Table 2
Test results on the public dataset

| Model | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| KMC | 60.48% | 58.71% | 56.85% | 54.58% |
| KNN | 75.42% | 71.70% | 71.71% | 71.67% |
| SVM | 75.59% | 70.14% | 69.80% | 68.35% |
| CNN | 93.20% | 92.27% | 91.62% | 91.89% |
| LSTM | 92.72% | 91.56% | 91.43% | 91.49% |
| CNN-LSTM | 92.78% | 92.11% | 91.07% | 91.46% |
| KCLMAnet | 95.09% | 94.52% | 93.92% | 94.17% |
4.3. Comparison experiments on self-collected dataset
The dataset constructed in this study includes five types of road surfaces: dirt, asphalt, cobblestone, bluestone, and concrete. Data for each road surface type were collected under real-world driving conditions, ensuring authenticity and representativeness. Example data are shown in Fig. 8.
This study divided the dataset into 85% for training and 15% for testing. The training process of the network is illustrated in Fig. 9. Several hyperparameters were optimized using the KOA: the number of filters was set to 134, the kernel_size to 6×6, the dropout_rate to 0.2396, the num_heads to 2, and the d_model to 48. The maximum number of training epochs was set to 1000; to prevent overfitting and improve training efficiency, an early stopping strategy with a patience of 50 epochs was applied, so training terminated once validation accuracy failed to improve for 50 consecutive epochs.
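The early-stopping rule described above can be sketched as a plain training loop; `evaluate_epoch` is a hypothetical callback that trains one epoch and returns the resulting validation accuracy:

```python
def train_with_early_stopping(evaluate_epoch, max_epochs=1000, patience=50):
    """Run up to max_epochs, stopping once validation accuracy has not
    improved for `patience` consecutive epochs."""
    best_acc, best_epoch = -1.0, 0
    for epoch in range(1, max_epochs + 1):
        acc = evaluate_epoch(epoch)
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break  # patience exhausted
    return best_epoch, best_acc
```

With a validation curve that plateaus at epoch 10, the loop stops 50 epochs later instead of running the full 1000.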
To comprehensively evaluate the classification performance of different models across various road surface types, this study conducts analysis from four perspectives: overall metrics, class-wise distribution, prediction confusion, and feature separability.
As shown in the statistical results of Table 3 and Fig. 10, although traditional machine learning methods such as KNN achieve relatively high classification accuracy on certain road types, their performance drops significantly on dirt and cobblestone surfaces. This indicates limited adaptability to complex or unstructured terrains. Moreover, the F1-score of KNN fluctuates considerably across different surface classes, suggesting poor overall stability.
CNN exhibits certain advantages in spatial feature extraction. As shown in Fig. 10 and Fig. 11(a), it achieves relatively high recall on asphalt and bluestone surfaces. However, its performance on dirt and cobblestone roads remains suboptimal, and its overall F1-score trails that of the sequence models, indicating limited discriminative capability when dealing with highly similar surface types.
LSTM networks, known for their robustness in handling time-series data, show more balanced accuracy and recall across different road surfaces, as illustrated in Fig. 10 and Fig. 11(b). Nevertheless, the classification boundaries remain unclear for surfaces with significant feature overlap, reflecting challenges in distinguishing such classes.
Table 3
Test results on the self-collected dataset

| Model | Accuracy | Precision | Recall | F1-score | Parameters (M) | Similar roads classification |
|---|---|---|---|---|---|---|
| KNN [28] | 84.79% | 82.37% | 84.70% | 81.91% | - | Poor |
| CNN [39] | 90.33% | 87.85% | 88.68% | 89.98% | 1.1 | Fair |
| LSTM [27] | 91.06% | 89.87% | 90.78% | 90.27% | 1.7 | Fair |
| CNN-LSTM [37] | 91.38% | 88.02% | 90.65% | 88.79% | 2.5 | Fair |
| Attention-based CNN-LSTM [40] | 91.82% | 90.60% | 91.03% | 90.20% | 2.8 | Good |
| DeepSense [24] | 87.23% | 87.59% | 87.23% | 87.24% | 1.2 | Fair |
| Transformer-HAR [26] | 93.01% | 94.41% | 93.01% | 92.96% | 4.1 | Good |
| IChOA-CNN-LSTM [30] | 93.62% | 93.98% | 93.62% | 92.75% | 3.5 | Good |
| KCLMAnet | 97.53% | 97.59% | 97.53% | 97.52% | 2.1 | Excellent |
To evaluate the effectiveness of the multi-head attention mechanism, a comparative analysis is conducted between the CNN-LSTM model and its attention-enhanced variants. As shown in Table 3 and Fig. 10, the attention-based CNN-LSTM consistently outperforms the standard CNN-LSTM across all evaluation metrics.
The improvement is particularly evident in scenarios involving highly similar road surface types. The attention mechanism enables the network to dynamically emphasize informative time segments and sensor channels while suppressing redundant or noisy vibration patterns. This effect is further illustrated in the confusion matrices (Fig. 11(c) and Fig. 11(d)), where the attention-based model exhibits clearer class boundaries and reduced inter-class confusion compared to its non-attention counterpart.
The proposed KCLMAnet achieves the best classification performance across all five road surface types. By comparing the performance of attention-based CNN-LSTM models with and without advanced optimization strategies (e.g., IChOA-CNN-LSTM and the proposed KCLMAnet), the effectiveness of the Kepler Optimization Algorithm (KOA) can be implicitly assessed. As shown in Table 3, KCLMAnet achieves consistent gains in accuracy, precision, recall, and F1-score, while maintaining a moderate model size. Furthermore, feature space visualizations in Fig. 12 demonstrate that KOA-enhanced models yield more compact intra-class clustering and clearer inter-class separation. This indicates that automated global hyperparameter optimization facilitates the learning of more discriminative and robust representations, particularly for road surfaces with overlapping vibration characteristics.
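The KOA itself is not reproduced here; as a stand-in illustrating the global hyperparameter optimization loop, the following sketch runs a generic random search over a hypothetical space mirroring the tuned parameters (filters, kernel_size, dropout_rate, num_heads, d_model). The space boundaries and the objective are illustrative assumptions, not the paper's settings:

```python
import random

# Hypothetical search space mirroring the parameters tuned in the paper.
SPACE = {
    "filters": list(range(32, 257)),
    "kernel_size": [3, 5, 6, 7],
    "dropout_rate": (0.1, 0.5),       # tuple = continuous interval
    "num_heads": [1, 2, 4, 8],
    "d_model": [32, 48, 64, 96],
}

def sample(space, rng):
    """Draw one configuration: tuples are continuous ranges, lists discrete."""
    return {k: rng.uniform(*v) if isinstance(v, tuple) else rng.choice(v)
            for k, v in space.items()}

def search(objective, space, trials=50, seed=0):
    """Keep the best-scoring configuration seen over `trials` random draws;
    a KOA-style optimizer would instead update candidates iteratively."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(trials):
        cfg = sample(space, rng)
        score = objective(cfg)  # e.g. validation accuracy of the trained model
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```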
The scatter plots in Fig. 12 further validate the above conclusions from the feature space perspective. In models such as CNN, LSTM, and DeepSense (Fig. 12(a), (b), and (e)), significant overlap among different classes is observed, with blurred inter-class boundaries and a lack of effective clustering structures. Although CNN-LSTM and Attention-CNN-LSTM (Fig. 12(c), (d)) show some degree of clustering, the large intra-class dispersion and presence of multiple small clusters suggest instability in feature extraction. IChOA-CNN-LSTM and Transformer-HAR exhibit clearer local clusters (Fig. 12(g), (f)), but the class boundaries remain indistinct, showing transitional zones between classes. In contrast, the proposed method demonstrates well-defined cluster boundaries with minimal inter-class overlap, indicating that the extracted features possess strong separability and discriminative power, further confirming the classification performance.
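The qualitative observation above, tighter intra-class clusters that sit further apart, can be quantified with a simple separability ratio. A minimal sketch over hypothetical 2-D embeddings (the paper's actual feature vectors are higher-dimensional):

```python
import math

def separability(points_by_class):
    """Ratio of mean inter-class centroid distance to mean intra-class
    distance-to-centroid: larger values indicate compact clusters with
    clear inter-class separation."""
    def centroid(pts):
        return [sum(c) / len(pts) for c in zip(*pts)]
    cents = {k: centroid(v) for k, v in points_by_class.items()}
    intra = [math.dist(p, cents[k])
             for k, pts in points_by_class.items() for p in pts]
    keys = list(cents)
    inter = [math.dist(cents[a], cents[b])
             for i, a in enumerate(keys) for b in keys[i + 1:]]
    return (sum(inter) / len(inter)) / (sum(intra) / len(intra) + 1e-12)
```

Well-separated classes score high; overlapping ones score near or below 1.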
In summary, KCLMAnet outperforms other models not only in overall performance metrics but also demonstrates superior stability and robustness in distinguishing individual road surface types. This method achieves clearer differentiation among the five road categories, highlighting its strong potential for real-world deployment and practical applications.
To provide a more intuitive representation of the classification performance, Fig. 13 shows the ROC curves for all road surface classification algorithms. The ROC curve for KCLMAnet lies closest to the top-left corner, with an AUC of 0.9993, indicating strong performance in the road surface classification task. In contrast, the ROC curve for KNN lies closer to the diagonal, indicating inferior classification effectiveness compared with the other algorithms.
5. Conclusion
This study presented KCLMAnet, a deep learning framework for road surface type classification in complex and unstructured driving environments. Rather than relying on a single architectural enhancement, the superior performance of KCLMAnet arises from the synergistic integration of physically motivated inertial sensing design, adaptive spatiotemporal feature modeling, and automated hyperparameter optimization. This integrated paradigm enables the framework to effectively exploit multi-band vibration information while maintaining robustness under real-world operational variability.
Experimental results on both public and self-collected datasets demonstrate that KCLMAnet consistently outperforms existing methods in terms of accuracy, precision, recall, and F1-score. In particular, the proposed approach exhibits strong discriminative capability for highly similar road surface types, such as asphalt and concrete, which remain challenging for conventional inertial-based and deep learning models. These results confirm that jointly leveraging spatially distributed inertial perception and attention-guided temporal modeling provides a practical and effective solution for fine-grained road surface recognition.
Overall, the proposed framework offers a robust and scalable perception module that can support downstream tasks in intelligent vehicle systems, including adaptive path planning and suspension or motion control in unstructured environments. Future work will focus on extending the framework toward joint recognition of road surface types and surface-related anomalies, such as speed bumps, potholes, and cracks, enabling more comprehensive environment perception and further enhancing driving safety and decision-making reliability.