1. Introduction
With the advancement of intelligent transportation systems, the operating environments of autonomous vehicles are gradually extending from structured roads to unstructured and diverse complex road scenarios. Variations in road surface types directly affect tire–road interaction, vehicle dynamic response, and driving stability. These factors serve as critical inputs for adaptive path planning and suspension control strategies [1],[2]. Therefore, accurate identification of road surface types enables real-time adjustment of control parameters such as vehicle speed, steering angle, and suspension stiffness during operation, thereby enhancing both driving safety and ride comfort [3].
Recent advances in road surface recognition have increasingly leveraged vision-based and multimodal perception frameworks, integrating RGB cameras, LiDAR, or radar sensors to enhance environmental awareness [4],[5]. These approaches have demonstrated promising performance under favorable sensing conditions and have benefited from rapid progress in deep learning architectures and large-scale visual datasets [6],[7],[8].
However, despite these advances, vision-dominated and multimodal methods still exhibit inherent limitations in real-world deployment. LiDAR-based approaches may suffer from signal attenuation and scattering under adverse environmental conditions, leading to unstable or incomplete surface perception [9]. Similarly, visual sensors are highly sensitive to illumination changes, shadows, weather conditions such as rain, fog, and snow, and occlusions caused by surrounding traffic or road contamination [10],[11],[12]. These limitations significantly constrain the robustness and continuity of road surface recognition in unstructured or degraded driving environments.
In contrast, inertial measurement units (IMUs) provide a fundamentally different sensing modality by capturing high-frequency physical interaction signals between the vehicle and the road surface. The vibration responses recorded by accelerometers and gyroscopes directly reflect tire–road contact dynamics and structural transmission characteristics, and are largely immune to environmental disturbances such as lighting variation or visual occlusion. As a result, inertial sensing offers continuous, reliable, and cost-effective perception capability, making it particularly suitable for road surface classification in complex and dynamically changing environments [13],[14],[15].
Inertial sensor-based road surface classification methods primarily focus on two aspects: road quality classification and road surface type classification. Road quality classification focuses on assessing the condition of the road surface, such as determining the International Roughness Index (IRI) and identifying issues like potholes, cracks, or damage. This is crucial for road maintenance and safe driving of vehicles. In road quality classification research, Aleadelat et al. [16] utilized signal processing and pattern recognition techniques to predict the IRI, categorizing road quality into three levels: good, average, and poor. Guo et al. [17] developed a CatBoost algorithm with sequence model optimization based on vibration acceleration data. This approach addressed the challenge of distinguishing road surface features caused by the highly nonlinear vehicle-coupled vibrations, improving the accuracy of road quality classification. Tiwari et al. [18] proposed a deep learning method called RoadCare, which successfully classifies road quality into three categories: good, average, and poor. Lekshmipathy et al. [19] trained an artificial neural network (ANN) model to identify road conditions such as cracks, potholes, and patches, thereby enabling road quality classification. Carlos et al. [20] established a dataset featuring potholes, cracks, patches, and speed bumps to enhance the accuracy of road quality classification.
Compared to road quality classification, road surface type classification focuses more on identifying the categories of surface materials and structures, such as asphalt, concrete, and dirt. Accurate road surface type information helps intelligent vehicles promptly adjust planning and control strategies, improving driving smoothness and safety [21]. Therefore, this paper focuses on road surface type classification. In road surface type classification research, Bajic et al. [22] developed a machine learning method based on vertical acceleration data, achieving a classification accuracy of 0.67 and a recall rate of 0.76. To enhance the accuracy of road surface type classification, Kim et al. [23] proposed a data preprocessing method that extracts frequency band features from inertial sensor data. They then built an ANN, which successfully classified three types of road surfaces. Wu et al. [24] proposed a DeepSense neural network, which, combined with specific preprocessing and feature extraction methods, was able to distinguish three road types. However, its accuracy still needs to be improved. Varona et al. [25] developed a convolutional neural network (CNN) for classifying road surface types using acceleration data from inertial sensors. While CNNs can effectively capture the static features of data, they overlook the long-term dependencies inherent in time-series data.
Recent architectures such as transformers and graph neural networks have been widely applied to time-series analysis tasks, but their high computational cost poses challenges in scenarios involving small datasets or embedded deployment. In contrast, CNN and LSTM networks offer lightweight structures with effective temporal modeling capabilities, and they demonstrate greater stability when handling vibration variations and long-term dependencies in inertial sensor data. For example, Yin et al. [26] proposed a CNN-LSTM network to improve classification accuracy for samples with long-term temporal dependencies. In addition, Hadj-Attou et al. [27] applied preprocessing techniques such as the Fourier transform and wavelet transform to acceleration data from inertial sensors and used the processed data to train a CNN-LSTM model, leveraging the temporal modeling capability of LSTM to improve classification accuracy.
Although CNN-LSTM networks based on inertial sensors have demonstrated strong performance in road surface classification, existing methods still face several technical limitations. On the one hand, some studies use a single-point IMU mounted on the vehicle body, which fails to capture the diverse vibration characteristics exhibited by components such as the suspension and tires across different frequency bands. This results in incomplete surface perception. On the other hand, current approaches typically adopt fixed network structures without tailored processing for distinct frequency or temporal features, and lack automated hyperparameter tuning mechanisms. Consequently, their generalization ability and classification accuracy remain limited. This is particularly evident when distinguishing between highly similar surfaces such as asphalt and concrete, where the misclassification rate remains high, significantly hindering practical deployment in real-world autonomous driving applications. To address these challenges, this study proposes a road surface type classification framework, which leverages multi-point inertial sensing to enhance surface perception. The main contributions of this work are summarized as follows:
1. A spatially distributed inertial sensing paradigm is proposed to enhance road surface perception beyond conventional single-point IMU configurations. By simultaneously collecting vibration responses from tires, suspension, and vehicle chassis, the proposed framework captures heterogeneous dynamic characteristics across multiple frequency bands. This multi-level perception strategy significantly improves the discriminative representation of highly similar road surfaces, which remains a persistent challenge in existing inertial-based methods.
2. A multi-head attention–enhanced CNN-LSTM architecture is developed to adaptively model the complex spatiotemporal dependencies inherent in inertial vibration signals. Unlike conventional CNN-LSTM networks with uniform feature weighting, the introduced attention mechanism enables the model to dynamically emphasize informative time segments and sensor channels across different frequency responses. This design effectively strengthens feature selection under mixed and perturbed vibration conditions, leading to improved robustness and classification consistency.
3. An automated hyperparameter optimization strategy based on the Kepler Optimization Algorithm (KOA) is incorporated to jointly optimize key architectural parameters of the CNN-LSTM and multi-head attention modules. By replacing empirical and fixed parameter selection with physics-inspired global optimization, this framework enhances model generalization and stability under limited training data. This coordinated optimization scheme provides a systematic solution to performance degradation commonly observed in inertial-based deep learning models when handling similar surface categories.
The remainder of this paper is organized as follows. Section 2 describes the sensor configuration scheme used for road surface dataset collection. Section 3 provides a detailed explanation of the proposed road surface classification method, including the network architecture of each component. Section 4 presents the experimental results on different datasets. Section 5 concludes the paper.
2. Sensor configuration for data acquisition
2.1. Sensor configuration scheme
In this study, vehicle speed was maintained within a controlled operational range during data acquisition to minimize excessive excitation variability. The inertial sensors were installed on the chassis, suspension, and tires to capture vibration features across high, medium, and low frequency bands [28]. Compared to conventional methods that rely on a single IMU mounted on the chassis, the proposed sensing configuration establishes a multi-frequency and multi-level perception architecture. This approach significantly improves the detection of subtle vibration signals and enhances the accuracy of road condition perception.
Based on the model shown in Fig. 1, the vehicle body transmits the characteristic response of the tires through the suspension. The motion equation of the vehicle body is given by formula (1):

$$M_b \ddot{x}_b + C_s(\dot{x}_b - \dot{x}_w) + K_s(x_b - x_w) = 0 \quad (1)$$

where $M_b$ is the mass of the vehicle body, $x_b$ is the vehicle body displacement, $\dot{x}_b$ is the vehicle body velocity, $\ddot{x}_b$ is the vehicle body acceleration, and $C_s$ and $K_s$ are the damping coefficient and spring stiffness of the suspension.
The vibration of the tire is induced by road surface irregularities and transmitted through the vehicle vibration. Its acceleration dynamics are given by formula (2):

$$M_w \ddot{x}_w + C_s(\dot{x}_w - \dot{x}_b) + K_s(x_w - x_b) + K_t(x_w - q) = 0 \quad (2)$$

where $M_w$ is the mass of the tire, $x_w$ is the tire displacement, $\dot{x}_w$ is the tire velocity, $\ddot{x}_w$ is the tire acceleration, $K_t$ is the spring stiffness of the tire, and $q$ is the external excitation caused by road surface unevenness. The model reveals the vibration propagation path from the road surface to the tire and subsequently to the vehicle body, providing a theoretical basis for the placement of inertial sensors.
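To make the propagation path concrete, the quarter-car dynamics of Eqs. (1)-(2) can be simulated numerically. The sketch below is an illustration only: the masses, stiffnesses, damping coefficient, time step, and road step height are assumed values, not parameters from this study.

```python
import numpy as np

def quarter_car_step(state, q, dt, Mb=300.0, Mw=40.0,
                     Cs=1500.0, Ks=16000.0, Kt=160000.0):
    """Advance the two-mass model of Eqs. (1)-(2) by one explicit Euler step.
    state = (xb, vb, xw, vw); q is the road profile input.
    All parameter values are illustrative assumptions."""
    xb, vb, xw, vw = state
    # Eq. (1): body acceleration from suspension spring and damper forces
    ab = (-Cs * (vb - vw) - Ks * (xb - xw)) / Mb
    # Eq. (2): tire acceleration adds the tire-spring force from road input q
    aw = (Cs * (vb - vw) + Ks * (xb - xw) - Kt * (xw - q)) / Mw
    return (xb + vb * dt, vb + ab * dt, xw + vw * dt, vw + aw * dt)

# Simulate a 5 mm step bump and record the body velocity response
state = (0.0, 0.0, 0.0, 0.0)
body_vel = []
for k in range(2000):
    q = 0.005 if k > 100 else 0.0   # road step after 10 ms
    state = quarter_car_step(state, q, dt=1e-4)
    body_vel.append(state[1])
```

Running such a simulation shows the tire responding quickly to the road input while the body follows with a slower, attenuated motion, which is exactly the multi-band behavior the distributed sensor layout is meant to capture.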
It should be emphasized that the vehicle dynamic equations presented in Eqs. (1) and (2) are not directly embedded into the proposed deep learning architecture, nor are they used for explicit parameter estimation or physics-based supervision. Instead, these equations serve as a theoretical foundation for sensor placement and vibration signal interpretation in the proposed framework. The tire is in direct contact with the road surface and most accurately reflects road-induced dynamics. The inertial sensor mounted on the tire (Sensor 1) captures high-frequency, high-energy vibration signals caused by surface irregularities, retaining fine-grained features critical for classification. The suspension connects the tire to the vehicle body and transmits and attenuates vibrations. The sensor mounted on the suspension (Sensor 2) records medium-frequency data that, after being filtered by the spring and damper system, still captures coupling characteristics between mechanical structures and road variations. The vehicle body is sensitive to low-frequency vibrations.
The sensor placed on the chassis (Sensor 3) captures these low-frequency, large-scale variations, providing stable information on gradual road condition changes. This multi-band, multi-level sensor layout effectively expands the perceptual dimension and forms a comprehensive vibration information fusion system. It ensures the completeness and diversity of the acquired data, offering richer and more robust features to support the classification performance of the deep learning network.
The data acquisition platform is shown in Fig. 2. The system consists of six IMUs mounted on the left and right tires, suspension, and chassis of the vehicle. During data collection, a central processing unit synchronously records IMU signals, while a smartphone is used to log vehicle speed and trajectory information. Additionally, a camera and a dashcam capture real-time road images to support subsequent data annotation and validation.
2.2. Coordinate system settings
According to Society of Automotive Engineers (SAE) standards, the vehicle coordinate system is defined as shown in Fig. 1: the X-axis points in the forward direction, the Y-axis points to the right side, and the Z-axis points vertically downward [29]. The coordinate system of the inertial sensor is also shown in Fig. 1. To ensure consistency of the collected data, Euler angle conversion is used to re-align the inertial sensor data. When the vehicle is stationary, the acceleration values can be assumed to be $a_x = 0\,\mathrm{m/s^2}$, $a_y = 0\,\mathrm{m/s^2}$, and $a_z = 9.81\,\mathrm{m/s^2}$. In the reference coordinate system, rotation around the Z-axis is defined as the yaw angle $\psi$, rotation around the X-axis as the roll angle $\phi$, and rotation around the Y-axis as the pitch angle $\theta$. The transformation matrix for Euler angle rotations can be expressed as formula (3):

$$R = R_z(\psi)\, R_y(\theta)\, R_x(\phi) \quad (3)$$
where $R_x(\phi)$ is the rotation around X, $R_y(\theta)$ is the rotation around Y, and $R_z(\psi)$ is the rotation around Z. The rotation matrices are expressed as formulas (4)-(6):

$$R_x(\phi) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\phi & -\sin\phi \\ 0 & \sin\phi & \cos\phi \end{bmatrix} \quad (4)$$

$$R_y(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix} \quad (5)$$

$$R_z(\psi) = \begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad (6)$$
Given the Euler angles $\phi$, $\theta$, $\psi$ and the acceleration $a^s$ measured by the inertial sensor, the global acceleration $a^g$ can be obtained by formula (7):

$$a^g = R\, a^s \quad (7)$$
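As a sketch of formulas (3)-(7), the rotation matrices and the sensor-to-global transformation can be written as follows. The $R_z R_y R_x$ composition order is an assumption consistent with the yaw/pitch/roll definitions above and should be checked against the paper's Eq. (3).

```python
import numpy as np

def rotation_matrix(phi, theta, psi):
    """R = Rz(psi) @ Ry(theta) @ Rx(phi), cf. formulas (3)-(6).
    phi = roll (X), theta = pitch (Y), psi = yaw (Z)."""
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(phi), -np.sin(phi)],
                   [0, np.sin(phi),  np.cos(phi)]])
    Ry = np.array([[ np.cos(theta), 0, np.sin(theta)],
                   [0, 1, 0],
                   [-np.sin(theta), 0, np.cos(theta)]])
    Rz = np.array([[np.cos(psi), -np.sin(psi), 0],
                   [np.sin(psi),  np.cos(psi), 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx

# Formula (7): re-express a sensor-frame acceleration in the global frame
a_sensor = np.array([0.0, 0.0, 9.81])
a_global = rotation_matrix(0.0, 0.0, 0.0) @ a_sensor  # identity rotation
```

With zero Euler angles the transformation is the identity, so a stationary sensor should report the assumed rest acceleration unchanged.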
3. Road surface type classification method
To overcome the limitations of traditional deep learning models in feature extraction and parameter optimization, this study focuses on deep learning network construction, feature extraction, and parameter optimization. As shown in Fig. 3, this paper introduces a deep learning framework named KCLMAnet for road surface type classification. The first component is the data acquisition module, which collects multi-band, multi-level data from inertial sensors installed at different positions on the vehicle and uses a sliding window to segment the data into samples.
The second component is the feature extraction module, which constructs a CNN-LSTM network to capture local features and long-term dependencies from the time-series data. By incorporating a multi-head attention mechanism, the features are divided into multiple subspaces for parallel processing. This allows the model to better focus on the multidimensional characteristics of the input data, enhancing feature extraction and ultimately improving the model's generalization ability. The output of the network is the road surface type. The third component is the parameter optimization module. Leveraging small-sample data, the Kepler optimization algorithm is used to identify the optimal parameter combination for the CNN-LSTM. This process involves fine-tuning the network structure, optimizing the parameters of the multi-head attention mechanism, and reducing overfitting risks to improve the model's classification performance.
3.1. CNN-LSTM model
First, a sliding window approach is applied to segment the time-series data into fixed-length samples of size $T$. If each sensor provides $D$-dimensional features, the resulting input forms a tensor $X \in \mathbb{R}^{N \times T \times D}$, where $N$ denotes the number of sensors. These segmented samples are then fed into the CNN-LSTM network for feature extraction and temporal modeling [30]. In road surface type classification, the CNN extracts local features from time-series data, effectively capturing short-term vibration variations and spatial information within temporal patterns. The LSTM network models the temporal dependencies, capturing long-term relationships within the data. This makes it particularly well suited for classification tasks involving sequential data.
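The segmentation step described above can be sketched as follows. The window length T = 128 and stride of 64 are illustrative choices, not the values used in this study, and the stream shape assumes six 3-axis IMUs flattened into one feature dimension.

```python
import numpy as np

def sliding_window(data, T, stride):
    """Segment a (num_timesteps, features) stream into
    (num_windows, T, features) fixed-length samples."""
    windows = []
    for start in range(0, data.shape[0] - T + 1, stride):
        windows.append(data[start:start + T])
    return np.stack(windows)

# e.g. six IMUs with 3-axis acceleration, flattened to 18 feature channels
stream = np.random.randn(1000, 6 * 3)
samples = sliding_window(stream, T=128, stride=64)
```

An overlapping stride (here 50%) yields more training samples from the same recording, which is a common choice for small inertial datasets.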
The structure is shown in Fig. 4. The CNN component consists of convolutional layers, activation functions, batch normalization layers, spatial dropout and residual connections, and global average pooling layers. When processing time-series data, the core operation of a convolutional layer is to extract local features from the data using filters. A filter is a small weight matrix, and the number of filters determines the variety of features that can be learned. This convolution operation helps reduce computational overhead, ensuring the network maintains high efficiency when processing large-scale data. The size of each filter is determined by the kernel size. The convolution operation for a single filter is expressed by formula (8):

$$y_i = \sum_{j=1}^{k} w_j\, x_{(i-1)s+j} + b \quad (8)$$
where $y_i$ is the output value generated by the filter, $x_{(i-1)s+j}$ is the corresponding element of the input sequence $X$, $w_j$ is the weight of the convolution kernel, and $b$ is the bias.
For each filter with kernel size $k$, stride $s$, and input data length $T$, the length of the output feature map after convolution is $L = \lfloor (T-k)/s \rfloor + 1$. Assuming there are $f$ filters, the output features form a matrix with $f$ columns, where each column represents the convolution output of one filter, so the matrix size is $L \times f$. The feature representation output by the network can be expressed as formula (9).
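The output-length computation can be checked directly; the kernel size and stride values below are illustrative.

```python
def conv_output_length(T, k, s):
    """Length L of the feature map after a valid 1-D convolution:
    L = floor((T - k) / s) + 1."""
    return (T - k) // s + 1

# e.g. a window of T = 128 samples, kernel size k = 6, stride s = 1
L = conv_output_length(128, 6, 1)
# with f filters, the CNN output for one window is an L x f matrix
```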
After convolution, LeakyReLU is applied to extract nonlinear features from the data. In road surface classification tasks, fine-grained features are crucial for determining the surface type. Therefore, instead of directly applying pooling, this study applies batch normalization after convolution. This helps reduce the loss of critical feature information, maintain efficient learning from the data, and improve network stability. Advanced network architectures such as ResNet [31] and DenseNet [32] have demonstrated that employing batch normalization can enhance network performance. The first step of batch normalization is calculating the batch mean $\mu_j$ and variance $\sigma_j^2$, as shown in formula (10). The second step is normalizing the input features, as shown in formula (11). The final step is scaling and shifting to enhance the flexibility of the normalized data, as shown in formula (12).
$$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_{i,j}, \quad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_{i,j}-\mu_j\right)^2 \quad (10)$$

$$\hat{x}_{i,j} = \frac{x_{i,j}-\mu_j}{\sqrt{\sigma_j^2+\epsilon}} \quad (11)$$

$$y_{i,j} = \gamma\,\hat{x}_{i,j} + \beta \quad (12)$$

where $x_{i,j}$ is the value of the $i$-th sample on the $j$-th feature dimension, $\hat{x}_{i,j}$ is the normalized value, $\epsilon$ is a small constant, $y_{i,j}$ is the output, and $\gamma$ and $\beta$ are the learnable parameters.
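A minimal NumPy sketch of the batch normalization steps in formulas (10)-(12):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Per-feature batch normalization, formulas (10)-(12)."""
    mu = x.mean(axis=0)                     # (10) batch mean per feature
    var = x.var(axis=0)                     # (10) batch variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)   # (11) normalization
    return gamma * x_hat + beta             # (12) learnable scale and shift

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
```

With unit scale and zero shift, each output feature has approximately zero mean and unit variance across the batch.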
To address the vanishing gradient problem, spatial dropout and residual connections were incorporated into the CNN network. These ensure that the network can effectively extract features and make accurate predictions even when handling large-scale data. Finally, global average pooling is employed for dimensionality reduction and global feature extraction. This approach not only reduces the model's parameter count but also effectively extracts features from each channel, capturing the overall trend of the entire sequence data and improving the classification accuracy.
The output feature sequence $Y = \{y_1, y_2, \ldots, y_L\}$ from the CNN is fed into the LSTM network, where $y_t$ is the local feature at time step $t$. The gating mechanism of the LSTM network, consisting of input gates, forget gates, and output gates, effectively captures the long-term dependencies present in the features. It retains important long-term information while discarding less significant data. A schematic representation of the LSTM cell gating mechanism is shown in Fig. 5.
The forget gate determines how much information from the previous memory cell $c_{t-1}$ is retained, as given by formula (13):

$$f_t = \sigma\!\left(W_f\,[h_{t-1}, y_t] + b_f\right) \quad (13)$$

where $f_t$ is the forget gate, $\sigma$ is the activation function, $W_f$ is the weight of the forget gate, $h_{t-1}$ is the hidden state at the previous time step, $y_t$ is the input at the current time step, and $b_f$ is the bias of the forget gate.
The function of the input gate is given in formula (14):

$$i_t = \sigma\!\left(W_i\,[h_{t-1}, y_t] + b_i\right), \quad \tilde{c}_t = \tanh\!\left(W_c\,[h_{t-1}, y_t] + b_c\right) \quad (14)$$

where $i_t$ is the output of the input gate, which determines how much of the current input is written into the cell state $c_t$; $\tilde{c}_t$ is the candidate cell state; $\tanh$ is the activation function; $W_i$ and $W_c$ are the weight matrices; and $b_i$ and $b_c$ are the biases of the input gate and cell state.
The update of the cell state is given in formula (15):

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \quad (15)$$

The formula of the output gate is given in (16):

$$o_t = \sigma\!\left(W_o\,[h_{t-1}, y_t] + b_o\right) \quad (16)$$

where $h_t$ is the hidden state and the final output of the LSTM, which can be expressed by formula (17):

$$h_t = o_t \odot \tanh(c_t) \quad (17)$$
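The gate computations of formulas (13)-(17) can be sketched in NumPy as a single cell step. Packing the four gate weights into dictionaries, and the small random initialization, are implementation conveniences assumed here, not the paper's notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(y_t, h_prev, c_prev, W, b):
    """One LSTM cell step per formulas (13)-(17).
    z = [h_prev, y_t] is the concatenated input of each gate."""
    z = np.concatenate([h_prev, y_t])
    f = sigmoid(W['f'] @ z + b['f'])           # (13) forget gate
    i = sigmoid(W['i'] @ z + b['i'])           # (14) input gate
    c_tilde = np.tanh(W['c'] @ z + b['c'])     # (14) candidate cell state
    c = f * c_prev + i * c_tilde               # (15) cell state update
    o = sigmoid(W['o'] @ z + b['o'])           # (16) output gate
    h = o * np.tanh(c)                         # (17) hidden state / output
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W = {k: rng.normal(size=(d_h, d_h + d_in)) * 0.1 for k in 'fico'}
b = {k: np.zeros(d_h) for k in 'fico'}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, b)
```

Because the output gate and tanh both saturate in (-1, 1), the hidden state is bounded, which keeps the recurrence numerically stable over long sequences.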
A two-layer LSTM network was constructed for time-series data processing. The ELU activation function was chosen, and each LSTM layer's dimension (d_model) is set to match the model dimension of the multi-head attention mechanism, ensuring that no dimension mismatch occurs. Subsequently, batch normalization and dropout layers are applied to reduce the risk of overfitting. This network can effectively capture more complex temporal features, thereby enhancing the classification accuracy.
3.2. Multi-head attention mechanism
The attention mechanism can dynamically assign weights to different parts of the input, allowing the model to adjust its focus based on the importance of various data features [33]. The attention mechanism is determined by three key parameters: Q (Query), K (Key), and V (Value), where Q represents the part to focus on, K corresponds to the candidates to match against the focus, and V generates the output based on the matching results. The calculation process is given by formulas (18)-(21).
Formula (18) computes the similarity between the candidates and the focused parts, while formula (19) scales the similarity scores. The scaling factor $\sqrt{d_k}$ is used to prevent vanishing or exploding gradients, where $d_k$ is the dimensionality of each candidate. Formula (20) computes the attention scores from the scaled similarities, indicating the importance of each candidate to the current focus. Formula (21) calculates the final output as a weighted sum of the candidates, using their respective attention weights.
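The scaled dot-product attention of formulas (18)-(21) can be sketched as:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Formulas (18)-(21): similarity, scaling by sqrt(d_k),
    softmax weighting, and weighted sum of the values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (18)-(19)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # (20) attention weights
    return weights @ V, weights                      # (21) weighted sum

Q = np.random.randn(5, 16)   # 5 query positions, d_k = 16
K = np.random.randn(7, 16)   # 7 candidate positions
V = np.random.randn(7, 16)
out, w = scaled_dot_product_attention(Q, K, V)
```

The softmax guarantees that each query's weights over the candidates are non-negative and sum to one.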
The multi-head attention mechanism is an extension of the single attention, and its structure is illustrated in Fig. 6. This structure divides the input features into different subspaces based on the number of attention heads, then applies the attention mechanism in parallel across these subspaces. This can enhance the model’s feature extraction capability [34]. Each head has its own independent linear transformation, allowing each subspace to adjust its focus according to varying vibration frequencies and patterns. This enables the model to learn features of the input data from different perspectives, enhancing the recognition of similar vibration characteristics and improving its ability to capture subtle features.
Assuming there are h attention heads, each with independent dimensions for Q, K, and V, the computation process for multi-head attention is given by formulas (22)-(25).
Formula (22) represents the linear transformation for each attention head, where Wi is the weight matrix. Formula (23) denotes the output of each head, and formula (24) represents the concatenation of the outputs of all attention heads. To ensure consistency during concatenation, the dimensionality of the multi-head attention mechanism (d_model) must be divisible by the number of heads h (num_heads), thereby avoiding mismatches that may cause network errors. Finally, formula (25) applies a linear transformation to the concatenated vector, yielding the final output.
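A minimal sketch of formulas (22)-(25), including the divisibility constraint on d_model; the random matrices stand in for the learned per-head projections and are not trained weights.

```python
import numpy as np

def multi_head_attention(X, num_heads, rng):
    """Formulas (22)-(25): per-head projections, parallel attention,
    concatenation, and a final linear map."""
    T, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # (22) independent linear transformations per head (placeholder weights)
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d_k)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        heads.append(w @ V)                    # (23) per-head output
    concat = np.concatenate(heads, axis=-1)    # (24) concatenation of all heads
    Wo = rng.normal(size=(d_model, d_model))
    return concat @ Wo                         # (25) final linear transformation

rng = np.random.default_rng(0)
out = multi_head_attention(rng.normal(size=(10, 48)), num_heads=2, rng=rng)
```

With d_model = 48 and two heads, each subspace has dimension 24 and the concatenated output restores the original feature width.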
3.3. Kepler optimization algorithm
To dynamically adjust the parameters of the deep learning network, this study employs the Kepler optimization algorithm to optimize the network’s primary parameters and the attention mechanism’s parameters. The Kepler optimization algorithm [35] is a metaheuristic method inspired by the motion of planets. It simulates gravitational interactions between celestial bodies to solve global optimization problems. In this algorithm, the planets in the solar system represent the search space, and each planet’s position corresponds to a potential solution during the optimization process. The objective function is analogous to the planet’s mass or gravitational force, used to evaluate the quality of the solution.
The orbital velocity of a planet represents the trend and adjustment process of the solution. It determines the direction and magnitude of adjustments in each iteration. The orbital velocity is determined by the current velocity and the gravitational interactions with other individuals. These gravitational forces simulate collaborative search behavior among the solutions. Planets with greater mass exert stronger gravitational forces, reflecting better objective function values. Hence, the global optimum is typically represented by the Sun. In the KOA, gravity not only influences the update of an individual’s position, but it can also cause the individual to shift toward potential global optima. The implementation process of the KOA is illustrated in Fig. 7.
In the implementation of KOA, initialization is a crucial step. This phase involves generating each individual's starting position, velocity, and mass. The initialization formula is given in (26):

$$x_i(0) = \mathrm{Random}(x_{\min}, x_{\max}) \quad (26)$$

where $x_i(0)$ represents the initial solution position of individual $i$, $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the solution space, respectively, and $\mathrm{Random}(x_{\min}, x_{\max})$ generates a random number in the interval $[x_{\min}, x_{\max}]$.
The initial velocity of each individual is typically set to a small random value. This ensures that individuals start searching at a controlled speed without making abrupt changes in the solution space. After initializing positions and velocities, the mass of each individual is calculated based on the objective function values (fitness). This often requires a small set of training samples to evaluate the fitness of each individual.
The core formulas of the KOA include the gravitational force calculation (27), the velocity update (28), and the position update (29). By iteratively updating positions and velocities, the algorithm gradually approaches the optimal solution.
$$F_{ij} = G\,\frac{m_i\, m_j}{r_{ij}^2} \quad (27)$$

where $F_{ij}$ is the gravitational force between individuals $i$ and $j$, $G$ is the gravitational constant, $m_i$ and $m_j$ are the masses, and $r_{ij}$ is the distance between the individuals.

$$v_i(t+1) = v_i(t) + \frac{\sum_{j \neq i} F_{ij}}{m_i}\,\Delta t \quad (28)$$

where $v_i(t)$ and $v_i(t+1)$ are the velocities of individual $i$ at $t$ and $t+1$, $\sum_{j \neq i} F_{ij}$ is the resultant gravitational force of all other individuals on individual $i$, and $\Delta t$ is the time step.

$$x_i(t+1) = x_i(t) + v_i(t+1)\,\Delta t \quad (29)$$

where $x_i(t+1)$ is the position of individual $i$ at $t+1$.
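A greatly simplified sketch of this search loop, assuming a gravity-like pull toward the current best individual (the "Sun") and omitting the adaptive orbital mechanisms of the full KOA [35]; the damping factor, step size, and bounds below are illustrative choices.

```python
import numpy as np

def toy_koa(objective, bounds, n_planets=20, n_iter=100, seed=0):
    """Toy version of the KOA loop, cf. formulas (26)-(29):
    random init, fitness-derived masses, gravity-like velocity
    updates toward the best solution, then position updates."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_planets, len(lo)))   # (26) positions
    v = rng.uniform(-0.1, 0.1, size=x.shape)             # small initial velocities
    dt, G = 0.1, 1.0
    for _ in range(n_iter):
        fit = np.array([objective(p) for p in x])
        mass = fit.max() - fit + 1e-9        # better (lower) fitness -> larger mass
        mass /= mass.sum()
        sun = x[np.argmin(fit)]              # the Sun = current best solution
        r = np.linalg.norm(x - sun, axis=1, keepdims=True) + 1e-9
        F = G * mass[:, None] * (sun - x) / r**2          # (27) gravity-like pull
        v = 0.9 * v + dt * F                              # (28) damped velocity update
        x = np.clip(x + dt * v, lo, hi)                   # (29) position update
    best = x[np.argmin([objective(p) for p in x])]
    return best

# minimize the sphere function over [-5, 5]^2
best = toy_koa(lambda p: np.sum(p**2),
               (np.array([-5.0, -5.0]), np.array([5.0, 5.0])))
```

In the actual framework, each "position" encodes a hyperparameter tuple (filters, kernel_size, d_model, num_heads, dropout_rate) and the objective is the validation loss of the trained network.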
In this paper, the Kepler optimization algorithm is used to optimize the parameters num_heads, d_model, dropout_rate, filters, and kernel_size in the road surface classification network. A detailed description of the algorithmic procedure is presented in Table 1 as pseudocode.
Table 1
The pseudocode of the proposed method

Algorithm 1 KCLMAnet
Input: Multi-position IMU time-series data
Output: Predicted road surface label ŷ
1  Data acquisition and preprocessing:
2    Collect the IMU data and apply the Euler transformation
3    Segment the input X using a sliding window
4  Define parameter ranges: filters [64, 128], kernel_size [2, 8], d_model [48, 128], num_heads [0, 5], dropout [0.1, 0.3]
5  Ensure d_model % num_heads == 0 for attention compatibility
6  Randomly initialize M candidate planets of the KOA
7  Fitness evaluation:
8  For each planet:
9    Apply the CNN to extract local spatial features
10   Pass the CNN output to an LSTM with hidden size = d_model
11   Compute multi-head attention features; the output is divided into h subspaces
12   Pass the result to the second LSTM and a dense layer to obtain the feature vector used for classification
13   Compute the validation loss as the fitness value
14   Update planet positions and velocities using the gravitational rules
15 Return the final model hyperparameters and prediction ŷ
4. Experiments and results
4.1. Evaluation metrics
In the classification task, commonly used performance metrics include accuracy, precision, recall, and F1-score [30]. Accuracy is the most straightforward evaluation metric, representing the proportion of correctly classified samples out of the total number of samples. It is given by formula (30).
Where, TP (true positive) represents the number of samples correctly predicted as the positive class. TN (true negative) refers to the number of samples correctly predicted as the negative class. In multi-class classification, negative samples generally refer to misclassified categories. FP (false positive) is the number of negative samples incorrectly predicted as positive, and FN (false negative) is the number of positive samples incorrectly predicted as negative.
Precision is used to evaluate the accuracy of the model’s predictions for a specific type of road surface. It is given by formula (31).
Recall reflects the sensitivity to each type of road surface, and it is used to evaluate the proportion of correctly identified samples. It is given by formula (32).
While a high recall means the model correctly identifies more positive samples, optimizing recall alone can increase the number of false positives. Hence, the F1-score is often considered. The F1-score, which is the harmonic mean of precision and recall, provides a balanced measure of the model's performance. It is given by formula (33).
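Formulas (30)-(33) for a single class can be computed directly from the four counts; the counts below are illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1-score, formulas (30)-(33)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)    # (30)
    precision = tp / (tp + fp)                    # (31)
    recall = tp / (tp + fn)                       # (32)
    f1 = 2 * precision * recall / (precision + recall)  # (33)
    return accuracy, precision, recall, f1

# e.g. 80 true positives, 10 false positives, 20 false negatives, 90 true negatives
acc, prec, rec, f1 = classification_metrics(80, 10, 20, 90)
```

Here precision (80/90) exceeds recall (80/100), and the F1-score sits between the two as their harmonic mean.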
In multi-class problems, the AUC-ROC (Area Under Curve - Receiver Operating Characteristic) [36] is a metric used to measure the model's ability to distinguish between different classes. An ROC curve evaluates the performance of a classification model by plotting the true positive rate (TPR) against the false positive rate (FPR). The TPR is given by formula (34), and the FPR is given by formula (35).
Ideally, the ROC curve should be as close as possible to the top-left corner, indicating that the model achieves a high TPR while maintaining a low FPR. The area under the ROC curve (AUC) serves as a key performance metric. When AUC = 1, it means the model perfectly distinguishes all classes, achieving entirely correct classification. When AUC ≤ 0.5, it indicates poor classification performance.
In addition to the metrics mentioned above, a confusion matrix is a valuable tool for evaluating classification performance. By presenting the relationship between true labels and predicted labels, it provides insights into how well the model performs across different categories. This road surface classification task involves five categories, denoted as A, B, C, D, and E. The confusion matrix for this scenario is represented as shown in formula (36).
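Building the confusion matrix of formula (36) from label pairs can be sketched as follows, with integer classes 0-4 standing in for surfaces A-E; the example label lists are illustrative.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=5):
    """Rows index the true class, columns the predicted class,
    as in the five-class matrix of formula (36)."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 2, 3, 4, 4]
y_pred = [0, 1, 1, 2, 3, 4, 0]
cm = confusion_matrix(y_true, y_pred)
```

The diagonal counts correct predictions per class, so the trace divided by the total equals the overall accuracy of formula (30).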
Two datasets were used in the experiments. The first is the publicly available Passive Vehicular Sensors dataset, which was employed to evaluate the model’s applicability under standard conditions. The second is a self-collected dataset comprising five common road surface types: dirt road, asphalt, bluestone, cobblestone, and concrete. This dataset was primarily used to assess the robustness and scalability of the proposed algorithm.
4.2. Comparison experiments on public dataset
To evaluate the classification performance of the proposed algorithm, experiments were conducted on the publicly available Passive Vehicular Sensors Dataset [37]. This dataset includes three types of road surfaces: asphalt, cobblestone, and dirt. These data were collected by different drivers operating various vehicles in the municipality of Anita Garibaldi, in upstate Santa Catarina, Brazil. The dataset is divided into a training set and a test set, with a split ratio of 65% for training and 35% for testing. In the experiments, we evaluated the classification performance of K-means clustering (KMC), K-nearest neighbors (KNN), support vector machine (SVM), LSTM, CNN, and CNN-LSTM [37],[38]. Table 2 shows the test results of all methods on the Passive Vehicular Sensors Dataset.
The results show that KCLMAnet outperforms the other models in accuracy, precision, recall, and F1-score. Traditional models such as KNN and SVM achieved only 75.42% and 75.59% accuracy, respectively. Deep learning models such as CNN and LSTM improved performance, reaching accuracies of 93.20% and 92.72% and F1-scores of 91.89% and 91.49%, respectively, but still fell short of KCLMAnet. This improvement can be attributed to the KOA, which effectively tuned the model's parameters and enhanced overall network performance. The CNN and LSTM components extract road surface features and model the dynamic characteristics of the time-series data, respectively, while the multi-head attention mechanism helps the model focus on critical features, improving both classification accuracy and robustness.
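The precision, recall, and F1 figures reported in Table 2 are class-wise quantities averaged over classes; a minimal sketch of macro-averaging, assuming rows of the confusion matrix index true classes and columns predictions:

```python
def macro_metrics(cm):
    """Macro-averaged (precision, recall, F1) from a square confusion
    matrix whose rows are true classes and columns predicted classes."""
    n = len(cm)
    precs, recs, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[r][k] for r in range(n)) - tp   # predicted k, wrong
        fn = sum(cm[k]) - tp                         # true k, missed
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precs.append(p); recs.append(r); f1s.append(f)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n
```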
Table 2
Test results on the public dataset

| Model | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| KMC | 60.48% | 58.71% | 56.85% | 54.58% |
| KNN | 75.42% | 71.70% | 71.71% | 71.67% |
| SVM | 75.59% | 70.14% | 69.80% | 68.35% |
| CNN | 93.20% | 92.27% | 91.62% | 91.89% |
| LSTM | 92.72% | 91.56% | 91.43% | 91.49% |
| CNN-LSTM | 92.78% | 92.11% | 91.07% | 91.46% |
| KCLMAnet | 95.09% | 94.52% | 93.92% | 94.17% |
4.3. Comparison experiments on self-collected dataset
The dataset constructed in this study includes five types of road surfaces: dirt, asphalt, cobblestone, bluestone, and concrete. Data for each road surface type were collected under real-world driving conditions, ensuring authenticity and representativeness. Example data are shown in Fig. 8.
This study divided the dataset into 85% for training and 15% for testing. The training process of the network is illustrated in Fig. 9. Several hyperparameters were optimized using the KOA: the number of filters was set to 134, the kernel_size to 6×6, the dropout_rate to 0.2396, the num_heads to 2, and the d_model to 48. The maximum number of training epochs was set to 1000; to prevent overfitting and improve training efficiency, an early stopping strategy with a patience of 50 epochs was applied, so training terminated once validation accuracy failed to improve for 50 consecutive epochs.
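The early-stopping rule described above can be sketched as a plain training loop; `evaluate_epoch` is a hypothetical callback that trains one epoch and returns the resulting validation accuracy:

```python
def train_with_early_stopping(evaluate_epoch, max_epochs=1000, patience=50):
    """Run up to max_epochs, stopping once validation accuracy has not
    improved for `patience` consecutive epochs."""
    best_acc, best_epoch = -1.0, 0
    for epoch in range(1, max_epochs + 1):
        acc = evaluate_epoch(epoch)
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break  # patience exhausted
    return best_epoch, best_acc
```

With a validation curve that plateaus at epoch 10, the loop stops 50 epochs later instead of running the full 1000.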
To comprehensively evaluate the classification performance of different models across various road surface types, this study conducts analysis from four perspectives: overall metrics, class-wise distribution, prediction confusion, and feature separability.
As shown in the statistical results of Table 3 and Fig. 10, although traditional machine learning methods such as KNN achieve relatively high classification accuracy on certain road types, their performance drops significantly on dirt and cobblestone surfaces. This indicates limited adaptability to complex or unstructured terrains. Moreover, the F1-score of KNN fluctuates considerably across different surface classes, suggesting poor overall stability.
CNN exhibits certain advantages in spatial feature extraction. As shown in Fig. 10 and Fig. 11(a), it achieves relatively high recall on asphalt and bluestone surfaces. However, its performance on dirt and cobblestone roads remains suboptimal, and its overall F1-score trails that of the sequence models, indicating limited discriminative capability when dealing with highly similar surface types.
LSTM networks, known for their robustness in handling time-series data, show more balanced accuracy and recall across different road surfaces, as illustrated in Fig. 10 and Fig. 11(b). Nevertheless, the classification boundaries remain unclear for surfaces with significant feature overlap, reflecting challenges in distinguishing such classes.
Table 3
Test results on the self-collected dataset

| Model | Accuracy | Precision | Recall | F1-score | Parameters (M) | Similar roads classification |
|---|---|---|---|---|---|---|
| KNN [28] | 84.79% | 82.37% | 84.70% | 81.91% | - | Poor |
| CNN [39] | 90.33% | 87.85% | 88.68% | 89.98% | 1.1 | Fair |
| LSTM [27] | 91.06% | 89.87% | 90.78% | 90.27% | 1.7 | Fair |
| CNN-LSTM [37] | 91.38% | 88.02% | 90.65% | 88.79% | 2.5 | Fair |
| Attention-based CNN-LSTM [40] | 91.82% | 90.60% | 91.03% | 90.20% | 2.8 | Good |
| DeepSense [24] | 87.23% | 87.59% | 87.23% | 87.24% | 1.2 | Fair |
| Transformer-HAR [26] | 93.01% | 94.41% | 93.01% | 92.96% | 4.1 | Good |
| IChOA-CNN-LSTM [30] | 93.62% | 93.98% | 93.62% | 92.75% | 3.5 | Good |
| KCLMAnet | 97.53% | 97.59% | 97.53% | 97.52% | 2.1 | Excellent |
To evaluate the effectiveness of the multi-head attention mechanism, a comparative analysis is conducted between the CNN-LSTM model and its attention-enhanced variants. As shown in Table 3 and Fig. 10, the attention-based CNN-LSTM consistently outperforms the standard CNN-LSTM across all evaluation metrics.
The improvement is particularly evident in scenarios involving highly similar road surface types. The attention mechanism enables the network to dynamically emphasize informative time segments and sensor channels while suppressing redundant or noisy vibration patterns. This effect is further illustrated in the confusion matrices (Fig. 11(c) and Fig. 11(d)), where the attention-based model exhibits clearer class boundaries and reduced inter-class confusion compared to its non-attention counterpart.
The proposed KCLMAnet achieves the best classification performance across all five road surface types. By comparing the performance of attention-based CNN-LSTM models with and without advanced optimization strategies (e.g., IChOA-CNN-LSTM and the proposed KCLMAnet), the effectiveness of the Kepler Optimization Algorithm (KOA) can be implicitly assessed. As shown in Table 3, KCLMAnet achieves consistent gains in accuracy, precision, recall, and F1-score, while maintaining a moderate model size. Furthermore, feature space visualizations in Fig. 12 demonstrate that KOA-enhanced models yield more compact intra-class clustering and clearer inter-class separation. This indicates that automated global hyperparameter optimization facilitates the learning of more discriminative and robust representations, particularly for road surfaces with overlapping vibration characteristics.
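The KOA itself is not reproduced here; as a stand-in illustrating the global hyperparameter optimization loop, the following sketch runs a generic random search over a hypothetical space mirroring the tuned parameters (filters, kernel_size, dropout_rate, num_heads, d_model). The space boundaries and the objective are illustrative assumptions, not the paper's settings:

```python
import random

# Hypothetical search space mirroring the parameters tuned in the paper.
SPACE = {
    "filters": list(range(32, 257)),
    "kernel_size": [3, 5, 6, 7],
    "dropout_rate": (0.1, 0.5),       # tuple = continuous interval
    "num_heads": [1, 2, 4, 8],
    "d_model": [32, 48, 64, 96],
}

def sample(space, rng):
    """Draw one configuration: tuples are continuous ranges, lists discrete."""
    return {k: rng.uniform(*v) if isinstance(v, tuple) else rng.choice(v)
            for k, v in space.items()}

def search(objective, space, trials=50, seed=0):
    """Keep the best-scoring configuration seen over `trials` random draws;
    a KOA-style optimizer would instead update candidates iteratively."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(trials):
        cfg = sample(space, rng)
        score = objective(cfg)  # e.g. validation accuracy of the trained model
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```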
The scatter plots in Fig. 12 further validate the above conclusions from the feature space perspective. In models such as CNN, LSTM, and DeepSense (Fig. 12(a), (b), and (e)), significant overlap among different classes is observed, with blurred inter-class boundaries and a lack of effective clustering structures. Although CNN-LSTM and Attention-CNN-LSTM (Fig. 12(c), (d)) show some degree of clustering, the large intra-class dispersion and presence of multiple small clusters suggest instability in feature extraction. IChOA-CNN-LSTM and Transformer-HAR exhibit clearer local clusters (Fig. 12(g), (f)), but the class boundaries remain indistinct, showing transitional zones between classes. In contrast, the proposed method demonstrates well-defined cluster boundaries with minimal inter-class overlap, indicating that the extracted features possess strong separability and discriminative power, further confirming the classification performance.
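The qualitative observation above, tighter intra-class clusters that sit further apart, can be quantified with a simple separability ratio. A minimal sketch over hypothetical 2-D embeddings (the paper's actual feature vectors are higher-dimensional):

```python
import math

def separability(points_by_class):
    """Ratio of mean inter-class centroid distance to mean intra-class
    distance-to-centroid: larger values indicate compact clusters with
    clear inter-class separation."""
    def centroid(pts):
        return [sum(c) / len(pts) for c in zip(*pts)]
    cents = {k: centroid(v) for k, v in points_by_class.items()}
    intra = [math.dist(p, cents[k])
             for k, pts in points_by_class.items() for p in pts]
    keys = list(cents)
    inter = [math.dist(cents[a], cents[b])
             for i, a in enumerate(keys) for b in keys[i + 1:]]
    return (sum(inter) / len(inter)) / (sum(intra) / len(intra) + 1e-12)
```

Well-separated classes score high; overlapping ones score near or below 1.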
In summary, KCLMAnet outperforms other models not only in overall performance metrics but also demonstrates superior stability and robustness in distinguishing individual road surface types. This method achieves clearer differentiation among the five road categories, highlighting its strong potential for real-world deployment and practical applications.
To provide a more intuitive representation of the classification performance, Fig. 13 shows the ROC curves for all road surface classification algorithms. The ROC curve for KCLMAnet lies closest to the top-left corner, with an AUC of 0.9993, indicating strong performance in the road surface classification task. In contrast, the ROC curve for KNN lies closer to the diagonal, indicating inferior classification effectiveness compared with the other algorithms.
5. Conclusion
This study presented KCLMAnet, a deep learning framework for road surface type classification in complex and unstructured driving environments. Rather than relying on a single architectural enhancement, the superior performance of KCLMAnet arises from the synergistic integration of physically motivated inertial sensing design, adaptive spatiotemporal feature modeling, and automated hyperparameter optimization. This integrated paradigm enables the framework to effectively exploit multi-band vibration information while maintaining robustness under real-world operational variability.
Experimental results on both public and self-collected datasets demonstrate that KCLMAnet consistently outperforms existing methods in terms of accuracy, precision, recall, and F1-score. In particular, the proposed approach exhibits strong discriminative capability for highly similar road surface types, such as asphalt and concrete, which remain challenging for conventional inertial-based and deep learning models. These results confirm that jointly leveraging spatially distributed inertial perception and attention-guided temporal modeling provides a practical and effective solution for fine-grained road surface recognition.
Overall, the proposed framework offers a robust and scalable perception module that can support downstream tasks in intelligent vehicle systems, including adaptive path planning and suspension or motion control in unstructured environments. Future work will focus on extending the framework toward joint recognition of road surface types and surface-related anomalies, such as speed bumps, potholes, and cracks, enabling more comprehensive environment perception and further enhancing driving safety and decision-making reliability.