Prediction of distant metastasis in renal cell carcinoma using machine learning algorithms on the basis of the SEER database and a Chinese population

Yajian Li^3,4#, Chuanzhen Cao^7#, Moxuan Wang^5, Cancan Chen^6*, Jianzhong Shou^3,4, Li Wen^3,4*, Tiandong Han^1,2*

Author E-mails

Yajian Li, liyajian05@163.com

Chuanzhen Cao, chuanzhenc@yeah.net

Moxuan Wang, moxuang0901@163.com

Cancan Chen, chencc@jou.edu.cn

Jianzhong Shou, shoujianzhong@yeah.net

Li Wen, wenli_cicams@163.com

Tiandong Han, cruiser412@163.com

Author affiliations

¹Department of Urology, Beijing Friendship Hospital, Capital Medical University, Beijing, China

²Institute of Urology, Beijing Municipal Health Commission, Beijing, China

³Department of Urology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China

⁴Beijing Key Laboratory of Urologic Cancer Cell and Gene Therapy, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China

⁵Qingdao Economic and Technological Development Zone No.4 Junior Middle School, Qingdao, China

⁶School of Computer Engineering, Jiangsu Ocean University, Lianyungang, China

⁷Department of Urology, China-Japan Friendship Hospital, Beijing, China

* Correspondence to: Tiandong Han, Cancan Chen and Li Wen

Tiandong Han

Department of Urology, Beijing Friendship Hospital, Capital Medical University, Beijing, Yangan Road 95#, Xicheng District, 100050, PR China

E-mail: cruiser412@163.com

Li Wen

National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, Panjiayuan Nanli 17#, Chaoyang District, 100021, PR China

E-mail: wenli_cicams@163.com

Yajian Li and Chuanzhen Cao contributed equally to this work.

Abstract

Objectives

Few machine learning (ML) studies have investigated the prediction of distant metastasis in patients with renal cell carcinoma (RCC). This study aimed to develop and validate ML-based models for predicting distant metastasis in RCC patients.

Methods

We extracted RCC data from the SEER database between 2004 and 2015 (n = 192,912) and the Chinese National Cancer Center (CNCC) database between 2010 and 2020 (n = 3034). Seven different algorithms were applied to predict distant metastasis in RCC. Fivefold cross-validation was employed for model construction. The data were analyzed using Python on the basis of incomplete data, complete data, upsampling data and downsampling data.

Results

After data cleaning and screening, 121,741 cases from the SEER dataset and 2803 cases from the CNCC external test set were retained. For the incomplete data, the neural network model [area under the curve (AUC) 95% confidence interval (CI) of the external data: 0.7467 ± 0.0573] achieved the highest accuracy. For the complete data, the support vector machine (SVM) model achieved the highest accuracy, with an AUC 95% CI of 0.8221 ± 0.0485. The disparity between positive and negative samples varied significantly across different datasets. Upsampling and downsampling analyses were also conducted. For the upsampling data, the extreme gradient boosting (XGBoost) model had the highest accuracy, with an AUC 95% CI for the external data of 0.8162 ± 0.0558. For the downsampling data, the SVM model achieved the highest accuracy, with an AUC 95% CI of 0.8274 ± 0.0546 for the external data.

Conclusions

Our study shows that ML algorithms can effectively predict distant metastasis in patients with RCC. ML models have favorable application prospects in clinical practice.

Keywords:

machine learning

renal cell carcinoma

prediction

distant metastasis

model

Introduction

Renal cell carcinoma (RCC) is one of the most common malignant tumors affecting the urinary system. The incidence of renal cancer in China has shown a continuous upward trend in recent years, with approximately 73,700 new cases in 2022 and a standardized incidence rate of approximately 6.7 per 100,000 [1]. The incidence rate is significantly higher in males than in females (2–3 times). Risk factors for RCC include smoking, obesity, hypertension, occupational exposure, and genetic factors. RCC primarily comprises three histopathological types: clear cell RCC, papillary RCC, and chromophobe RCC, with clear cell RCC accounting for 70%-80% of cases, papillary RCC accounting for 10%-15%, and chromophobe RCC accounting for 5% [2]. The survival of RCC patients has improved owing to surgical management and the use of targeted and immune drugs [3, 4]. However, a large proportion of RCC patients present with high-stage disease, with a poor prognosis because of distant metastases [5–7]. Thus, tumor metastasis is a critical factor in determining the prognosis of RCC patients.

Seventy percent of patients are diagnosed with stage I RCC, and 11% of patients are diagnosed with stage IV RCC [8]. The lungs are the most common affected site of metastasis, followed by regional lymph nodes, brain, bone, and soft tissue. Lee CH et al. [9] analyzed the Korean Renal Cancer Study Group metastatic RCC (mRCC) database. The results revealed the most common sites of metastasis (> 5%), and the median cancer-specific survival (CSS) ranged from 13.9 (liver) to 29.1 months (lung). Through continuous efforts in RCC research, for advanced or mRCC, combinations of immune checkpoint inhibitors or combinations of immune checkpoint inhibitors with tyrosine kinase inhibitors are associated with a tumor response of 42–71%, with a median overall survival of 46 to 56 months [8]. These results indicate that the prognosis of mRCC patients is still poor, regardless of the presence of lymph node metastasis or other organ metastasis [9, 10]. Early detection of distant metastasis is the top priority of clinical work.

Identifying the clinicopathological risk factors that promote RCC distant metastasis is vital. In recent years, several clinical diagnostic tools have been built using different prognostic and prediction models. Although predictive factors and models for mRCC have been developed, individualized accurate prediction is commonly restricted by limited data and the use of older tools. The Surveillance, Epidemiology, and End Results (SEER) database represents 28% of the U.S. population today; it covers a large, multi-institutional patient population and could be an essential source for mRCC research, providing greater statistical power. Additionally, machine learning (ML), a major subbranch of artificial intelligence, is a promising statistical method and can handle large amounts of heterogeneous data. However, few studies have used ML techniques, and some existing models lack the interpretability needed to capture the complex relationships among variables and to provide clinically actionable explanations. Previously, we used ML algorithms as auxiliary tools to predict the overall survival in RCC patients [11]. This technique could be used to quantify the possibility of recurrence in patients and help with more individualized postoperative clinical management.

In our study, we aimed to develop and validate an explainable ML model for the prediction of RCC metastasis. We collected a large amount of data from the SEER database and the Chinese National Cancer Center (CNCC) dataset. The RCC cases were subjected to data standardization and included multiple clinical parameters, such as tumor grade, side, pathology type, T stage, and N stage. The RCC M0/M1 metastasis prediction algorithm was explored and visualized.

Materials and methods

Data collection

The data used in this study were collected from the SEER database using SEER*Stat software (version 8.3.6) and from the CNCC. A data agreement form was signed and submitted to the SEER administration. Our study was approved by the Institutional Review Committee of the CNCC (Institutional Review Board number: 21/405–3076). The variables included race, sex, age, marital status, tumor location, tumor size, histological type, tumor grade, tumor-node-metastasis (TNM) stage information and surgical treatment. The SEER dataset was utilized as the internal dataset for model construction and performance evaluation, whereas the dataset from the CNCC served as an external independent dataset to test the models' generalizability.

Data preparation

We extracted RCC data from the SEER database between 2004 and 2015 (n = 192,912) and from the CNCC database between 2010 and 2020 (n = 3034). After data cleaning and screening, 121,741 cases remained in the SEER dataset, and 2803 cases remained in the CNCC external test set. The flowchart is shown in Fig. 1. Given the substantial volume of the SEER data and the presence of incomplete data for some features, the dataset was divided into training and testing sets at a 9:1 ratio (109,567 and 12,174 cases, respectively). The 12,174 testing cases were randomly selected from the cases with complete data, and the remaining 109,567 cases were allocated to the training set. Among the 109,567 training cases, 80,119 had complete data, and 29,448 had incomplete data.

Fig. 1

Flow chart for SEER and CNCC data screening.

Data preprocessing

To facilitate extraction and analysis, the categorical clinicopathologic data of RCC patients were first encoded numerically by assigning a distinct number to each category.

Processing of incomplete data

A random forest imputation method was used to fill in missing values. First, a random forest regression model was trained on the available complete data. The average values were subsequently used to provisionally fill all missing features. Finally, the random forest regression model was used to predict each missing feature one by one, and the predicted values replaced the provisional average values.
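The imputation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `rf_impute`, the number of trees, and the choice to mark missing entries as NaN are all hypothetical, and one regressor is fitted per feature using only the rows that have no missing values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def rf_impute(X, n_estimators=50, random_state=0):
    """Fill NaN entries column by column with random forest predictions.

    Hypothetical helper: mean-fill first, then overwrite each missing
    entry with a prediction from a model trained on complete rows only.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    complete = ~missing.any(axis=1)          # rows with no missing feature
    if not complete.any():
        raise ValueError("at least one complete row is required")

    # Provisional fill: replace every NaN with its column mean.
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    for j in range(X.shape[1]):
        rows = missing[:, j]
        if not rows.any():
            continue                          # nothing to impute here
        others = np.delete(filled, j, axis=1)  # predictors: all other features
        model = RandomForestRegressor(
            n_estimators=n_estimators, random_state=random_state
        )
        model.fit(others[complete], X[complete, j])
        filled[rows, j] = model.predict(others[rows])
    return filled
```

Observed values are left untouched; only entries that were missing in the input are overwritten.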

Feature discretization

To improve the generalizability of the model, continuous features such as age and tumor size were discretized and binarized. This study employs a feature discretization method grounded in information entropy, which is a supervised approach. Information entropy serves as a metric for quantifying the uncertainty of an event, and higher entropy values indicate greater uncertainty. The formula for calculating information entropy is as follows:

$IE(X) = -\sum_{i=1}^{k} p(x_i) \log p(x_i)$

where X represents all possible categories of the sample, k denotes the total number of categories, and p(x_i) signifies the probability of the sample belonging to category i. The discretization method finds the threshold that maximizes the decrease in the entropy of category information in the training set. On the basis of this threshold, each continuous feature is discretized into a binary value.
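The threshold search can be illustrated with a short sketch. It assumes a base-2 logarithm (the base is not stated in the text), candidate cut points at midpoints between consecutive distinct values, and hypothetical function names; the toy tumor sizes are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Information entropy of a list of class labels (log base 2 assumed)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (cut, gain): the cut point with the largest entropy decrease."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy(labels)
    best_cut, best_gain = None, 0.0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary between identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) \
                    - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_cut, best_gain = (lo + hi) / 2, gain
    return best_cut, best_gain


def binarize(values, cut):
    """Map each value to 1 if it exceeds the cut point, else 0."""
    return [int(v > cut) for v in values]


# Toy example: tumor size (cm) against M stage (0 = M0, 1 = M1).
sizes = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
m_stage = [0, 0, 0, 1, 1, 1]
cut, gain = best_threshold(sizes, m_stage)
print(cut, gain, binarize(sizes, cut))
```

In this toy case the two classes are perfectly separated, so the selected cut removes all of the label entropy.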

ML models were constructed using fivefold cross-validation on two training sets: the 80,119 cases with complete data and the full 109,567 cases, in which missing values had been imputed by the random forest regressor. Additionally, the internal test set included 12,174 cases, and the external test set included 2,803 cases. The specific division results are shown in Table 1 and Table 2.

Table 1
Fivefold partition of the training set with complete data (80,119 cases)
Set	M0 num		M1 num		Total
Set	train	validation	train	validation	Total
Fold1	59363	14827	4732	1197	80119
Fold2	59347	14843	4748	1181	80119
Fold3	59391	14799	4704	1225	80119
Fold4	59305	14885	4790	1139	80119
Fold5	59354	14836	4742	1187	80119
Test data	11296		878		12174
External data	2718		85		2803

Table 2
Fivefold partition of the training set including incomplete data (109,567 cases)
Set	M0 num		M1 num		Total
Set	train	validation	train	validation	Total
Fold1	76969	19190	10684	2724	109567
Fold2	76897	19262	10756	2652	109567
Fold3	76924	19235	10730	2678	109567
Fold4	76911	19248	10743	2665	109567
Fold5	76935	19224	10719	2689	109567
Test data	11296		878		12174
External data	2718		85		2803
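The fivefold partition summarized in Tables 1 and 2 can be sketched as below. The helper names are hypothetical, a plain shuffled split is assumed, and averaging the five fold models' predicted probabilities is only one plausible reading of the "integrated model"; the text does not specify how the fold models were combined.

```python
import random


def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for a shuffled k-fold split."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val


def ensemble_average(fold_predictions):
    """Average per-sample probabilities from the k fold models (assumed rule)."""
    k = len(fold_predictions)
    return [sum(p) / k for p in zip(*fold_predictions)]
```

Each case appears in exactly one validation fold, so every fold model is validated on data it did not see during training.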

Statistical analysis

We trained all the ML models using Python version 3.7.1, NumPy version 1.20.1, scikit-learn version 1.0.2, SciPy version 1.7.3, and XGBoost version 1.6.2. The ML models were developed and evaluated using fivefold cross-validation. The receiver operating characteristic (ROC) curve, decision curve analysis (DCA), calibration curve and DeLong test were used to evaluate model performance, and SHapley Additive exPlanations (SHAP) plots were used to interpret the contribution of each feature.
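For readers unfamiliar with the "AUC ± half-width" notation used in the results, the sketch below computes an AUC from pairwise comparisons and an approximate 95% half-width with the Hanley-McNeil formula. This is an assumption for illustration: the text does not state how the reported intervals were obtained, so the sketch is not expected to reproduce the reported values exactly. Function names are hypothetical.

```python
import math


def auc_rank(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def hanley_mcneil_halfwidth(auc, n_pos, n_neg, z=1.96):
    """Approximate half-width of the 95% CI for an AUC (Hanley-McNeil)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (
        auc * (1 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return z * math.sqrt(var)


# Toy scores: two M0 cases followed by two M1 cases.
toy_auc = auc_rank([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
# Half-width for an AUC of 0.7467 with 85 positives and 2718 negatives,
# the class counts of the external set in Table 1.
half_width = hanley_mcneil_halfwidth(0.7467, 85, 2718)
```

Because the external set contains only 85 M1 cases, the interval is much wider there than on the internal test set, which matches the pattern in the reported results.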

Results

Clinical characteristics of patients

A total of 192,912 cases from the SEER dataset and 3,034 cases from the CNCC dataset were included in this study. The clinicopathological features of RCC patients from the SEER database were described in our previous study [11]. The median age of the patients in the CNCC database was 58 years [interquartile range (IQR), 51–65 years]. There were 2056 (67.8%) males and 978 (32.2%) females. Most of the RCC patients were married, accounting for 96.6% (n = 2,929) of cases. The population was mostly Asian (99.7%, n = 3028). The mean tumor size was 4.2 ± 1.7 cm. At the initial diagnosis, 1.8% (n = 53) and 3.0% (n = 92) of patients had lymph node and organ metastasis, respectively. Histologically, the common pathological types were clear cell RCC (87.1%, n = 2642), papillary RCC (3.3%, n = 101), chromophobe RCC (4.2%, n = 127) and others (1.1%, n = 32). The clinicopathological features of the RCC patients from the CNCC cohort are shown in Table 3.

Table 3
Patient characteristics at baseline of the Chinese National Cancer Center dataset.
Characteristics	n (%)
Sex
Male	2,056 (67.8)
Female	978 (32.2)
Median age (interquartile range, year)	58 (51–65)
Marital status
Married	2,929 (96.6)
Unmarried	71 (2.3)
Unknown	34 (1.1)
Race
Asian	3,028 (99.7)
Black	2 (0.1)
White	2 (0.1)
Unknown	2 (0.1)
Tumor location
Left	1,491 (49.1)
Right	1,535 (50.6)
Bilateral	8 (0.3)
Tumor size (mean ± SD, cm)	4.2 ± 1.7
Histological types
Clear cell	2,642 (87.1)
Papillary	101 (3.3)
Chromophobe	127 (4.2)
Others	32 (1.1)
Unknown	132 (4.3)
Tumor grade
G1/2	2,494 (82.2)
G3/4	310 (10.2)
Unknown	230 (7.6)
T stage
T1/2	2,649 (87.3)
T3/4	385 (12.7)
N stage
N0	2,968 (97.8)
N1	53 (1.8)
Unknown	13 (0.4)
M stage
M0	2,942 (97.0)
M1	92 (3.0)

Model construction

Given the data partitioning described above, we first compared the performance of models constructed with complete data and with incomplete data. Additionally, as shown in Table 1 and Table 2, there is a significant disparity between the number of positive and negative samples within each dataset. Consequently, the performance of models was also compared using upsampling and downsampling methods. The research framework was divided into the following two sections.

Incomplete data (n = 109,567) vs. complete data (n = 80,119)

We used a training set with incomplete data, which underwent missing value processing. Fivefold cross-validation was employed for model construction. The integrated model prediction results are shown in Table 4. According to the results of the test and external datasets, the SVM [area under the curve (AUC) 95% confidence interval (CI) of the test data: 0.8328 ± 0.0165; AUC 95% CI of the external data: 0.5874 ± 0.0884] was not effective for predicting the metastasis of RCC patients. The Bayes (AUC 95% CI of the test data: 0.869 ± 0.0123; AUC 95% CI of the external data: 0.7399 ± 0.0623), decision tree (AUC 95% CI of the test data: 0.8639 ± 0.0132; AUC 95% CI of the external data: 0.7398 ± 0.0593), logistic regression (AUC 95% CI of the test data: 0.8755 ± 0.0126; AUC 95% CI of the external data: 0.739 ± 0.068), neural network (AUC 95% CI of the test data: 0.8655 ± 0.0129; AUC 95% CI of the external data: 0.7467 ± 0.0573), random forest (AUC 95% CI of the test data: 0.864 ± 0.0131; AUC 95% CI of the external data: 0.7425 ± 0.06), and XGBoost (AUC 95% CI of the test data: 0.8641 ± 0.0129; AUC 95% CI of the external data: 0.7409 ± 0.059) models performed relatively well. The ROC curve is shown in Fig. 2A.

Table 4
The integrated model prediction results after five-fold cross-validation in the incomplete data
Model	Set	Auc.	Acc.	Sens.	Spec.
Bayes	Train data	0.8478 ± 0.0035	0.7659 ± 0.0025	0.8241 ± 0.0066	0.7576 ± 0.0026
	Test data	0.869 ± 0.0123	0.8212 ± 0.0065	0.7935 ± 0.0271	0.8233 ± 0.007
	External data	0.7399 ± 0.0623	0.8232 ± 0.0141	0.6669 ± 0.1032	0.8278 ± 0.0143
Decision Tree	Train data	0.8814 ± 0.0034	0.8613 ± 0.002	0.7451 ± 0.0076	0.8776 ± 0.0021
	Test data	0.8639 ± 0.0132	0.8392 ± 0.0064	0.7894 ± 0.0271	0.8429 ± 0.0065
	External data	0.7398 ± 0.0593	0.7548 ± 0.0151	0.6895 ± 0.1006	0.7565 ± 0.0161
Logistic	Train data	0.873 ± 0.0034	0.8481 ± 0.002	0.755 ± 0.0071	0.861 ± 0.0021
	Test data	0.8755 ± 0.0126	0.8234 ± 0.0065	0.7967 ± 0.0272	0.8255 ± 0.0068
	External data	0.739 ± 0.068	0.8462 ± 0.0132	0.655 ± 0.1064	0.8521 ± 0.0131
Neural network	Train data	0.8813 ± 0.0034	0.8613 ± 0.002	0.7451 ± 0.0077	0.8776 ± 0.0021
	Test data	0.8655 ± 0.0129	0.8423 ± 0.0063	0.7821 ± 0.027	0.8469 ± 0.0066
	External data	0.7467 ± 0.0573	0.758 ± 0.0149	0.6895 ± 0.1006	0.7601 ± 0.0153
Random Forest	Train data	0.8814 ± 0.0034	0.8613 ± 0.002	0.7454 ± 0.0077	0.8774 ± 0.002
	Test data	0.864 ± 0.0131	0.8342 ± 0.0064	0.7971 ± 0.0267	0.8371 ± 0.0066
	External data	0.7425 ± 0.06	0.7592 ± 0.015	0.6895 ± 0.1006	0.7606 ± 0.0156
SVM	Train data	0.8335 ± 0.0042	0.8472 ± 0.0022	0.7603 ± 0.0076	0.8591 ± 0.0022
	Test data	0.8328 ± 0.0165	0.8473 ± 0.0063	0.7678 ± 0.0276	0.8537 ± 0.0065
	External data	0.5874 ± 0.0884	0.7405 ± 0.0166	0.5547 ± 0.1119	0.7462 ± 0.0166
XGBoost	Train data	0.8816 ± 0.0034	0.8614 ± 0.002	0.7456 ± 0.0077	0.8774 ± 0.002
	Test data	0.8641 ± 0.0129	0.8393 ± 0.0065	0.7887 ± 0.0275	0.8432 ± 0.0066
	External data	0.7409 ± 0.059	0.7548 ± 0.0151	0.6895 ± 0.1006	0.7565 ± 0.0161

Fig. 2

ROC curve analysis of different models based on the incomplete dataset (A) and complete dataset (B) for predicting metastasis in patients with RCC.

After fivefold cross-validation based on the complete data as a training set, the integrated model prediction results are shown in Table 5. The Bayes (AUC 95% CI of test data: 0.8562 ± 0.013; AUC 95% CI of external data: 0.7989 ± 0.0466) and logistic regression (AUC 95% CI of test data: 0.8815 ± 0.0118; AUC 95% CI of external data: 0.7983 ± 0.061) models were not more effective than the decision tree (AUC 95% CI of test data: 0.8828 ± 0.0117; AUC 95% CI of external data: 0.814 ± 0.0559), neural network (AUC 95% CI of test data: 0.8826 ± 0.0116; AUC 95% CI of external data: 0.8083 ± 0.0575), random forest (AUC 95% CI of test data: 0.8824 ± 0.0118; AUC 95% CI of external data: 0.8197 ± 0.0529), SVM (AUC 95% CI of test data: 0.8368 ± 0.0142; AUC 95% CI of external data: 0.8221 ± 0.0485) and XGBoost (AUC 95% CI of test data: 0.8823 ± 0.0118; AUC 95% CI of external data: 0.8135 ± 0.0561) models. The ROC curve is shown in Fig. 2B.

Table 5
The integrated model prediction results after five-fold cross-validation in the complete data
Model	Set	Auc.	Acc.	Sens.	Spec.
Bayes	Train data	0.8484 ± 0.0054	0.7021 ± 0.0032	0.877 ± 0.0087	0.6882 ± 0.0033
	Test data	0.8562 ± 0.013	0.8298 ± 0.0064	0.7282 ± 0.0306	0.8377 ± 0.0066
	External data	0.7989 ± 0.0466	0.8174 ± 0.0142	0.6443 ± 0.1008	0.8227 ± 0.0144
Decision Tree	Train data	0.8701 ± 0.0053	0.8493 ± 0.0025	0.7405 ± 0.011	0.8579 ± 0.0025
	Test data	0.8828 ± 0.0117	0.8399 ± 0.0064	0.7913 ± 0.0261	0.8438 ± 0.0065
	External data	0.814 ± 0.0559	0.8423 ± 0.0132	0.7012 ± 0.1036	0.8467 ± 0.0137
Logistic	Train data	0.86 ± 0.0051	0.8576 ± 0.0024	0.7261 ± 0.0111	0.8681 ± 0.0024
	Test data	0.8815 ± 0.0118	0.8366 ± 0.0066	0.7921 ± 0.0258	0.8401 ± 0.0067
	External data	0.7983 ± 0.061	0.8336 ± 0.0141	0.7012 ± 0.1036	0.8374 ± 0.0144
Neural network	Train data	0.87 ± 0.0052	0.8475 ± 0.0025	0.7427 ± 0.0109	0.8558 ± 0.0025
	Test data	0.8826 ± 0.0116	0.8398 ± 0.0063	0.7897 ± 0.0259	0.8438 ± 0.0065
	External data	0.8083 ± 0.0575	0.835 ± 0.0141	0.7012 ± 0.1036	0.8387 ± 0.0144
Random Forest	Train data	0.8698 ± 0.0054	0.8526 ± 0.0025	0.7368 ± 0.0112	0.8619 ± 0.0024
	Test data	0.8824 ± 0.0118	0.8375 ± 0.0064	0.7948 ± 0.026	0.841 ± 0.0067
	External data	0.8197 ± 0.0529	0.8414 ± 0.0134	0.7012 ± 0.1036	0.8456 ± 0.0141
SVM	Train data	0.8329 ± 0.0059	0.8482 ± 0.0024	0.7366 ± 0.0114	0.8571 ± 0.0025
	Test data	0.8368 ± 0.0142	0.8371 ± 0.0065	0.7836 ± 0.0272	0.8414 ± 0.0067
	External data	0.8221 ± 0.0485	0.9181 ± 0.0105	0.4468 ± 0.1062	0.9321 ± 0.0097
XGBoost	Train data	0.8708 ± 0.0053	0.8507 ± 0.0025	0.7397 ± 0.0109	0.8596 ± 0.0025
	Test data	0.8823 ± 0.0118	0.84 ± 0.0063	0.7891 ± 0.0261	0.8438 ± 0.0066
	External data	0.8135 ± 0.0561	0.8414 ± 0.0134	0.7012 ± 0.1036	0.8456 ± 0.0141

The models constructed with the complete training set outperformed the models constructed with the incomplete training set. Furthermore, the average prediction results of these two sets of models on the external test set were compared using the DeLong test. The resulting p value was less than 0.0001, which demonstrated a significant difference in model performance.
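The DeLong comparison of two correlated AUCs can be sketched in plain Python as follows. This is an illustrative implementation of the standard structural-components formulation for two score vectors on the same cases, not the authors' code; the function names and toy data are hypothetical, and the O(m·n) pairwise loop is only practical for modest sample sizes.

```python
import math
from statistics import variance


def _psi(x, y):
    """Pairwise kernel: 1 if the positive case scores higher, 0.5 on ties."""
    return 1.0 if x > y else 0.5 if x == y else 0.0


def _components(pos, neg):
    """Structural components V10 (per positive) and V01 (per negative)."""
    v10 = [sum(_psi(x, y) for y in neg) / len(neg) for x in pos]
    v01 = [sum(_psi(x, y) for x in pos) / len(pos) for y in neg]
    return v10, v01


def delong_test(y_true, scores_a, scores_b):
    """Return (auc_a, auc_b, z, two-sided p) for two models on the same cases."""
    pos = [i for i, y in enumerate(y_true) if y == 1]
    neg = [i for i, y in enumerate(y_true) if y == 0]
    v10_a, v01_a = _components([scores_a[i] for i in pos],
                               [scores_a[i] for i in neg])
    v10_b, v01_b = _components([scores_b[i] for i in pos],
                               [scores_b[i] for i in neg])
    auc_a = sum(v10_a) / len(pos)
    auc_b = sum(v10_b) / len(pos)
    # Variance of the AUC difference from the paired component differences.
    d10 = [a - b for a, b in zip(v10_a, v10_b)]
    d01 = [a - b for a, b in zip(v01_a, v01_b)]
    var = variance(d10) / len(pos) + variance(d01) / len(neg)
    if var == 0:
        return auc_a, auc_b, float("nan"), float("nan")
    z = (auc_a - auc_b) / math.sqrt(var)
    return auc_a, auc_b, z, math.erfc(abs(z) / math.sqrt(2))


# Toy example: four M0 cases followed by four M1 cases, two score vectors.
y = [0, 0, 0, 0, 1, 1, 1, 1]
model_a = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
model_b = [0.5, 0.2, 0.7, 0.6, 0.4, 0.3, 0.8, 0.9]
auc_a, auc_b, z, p = delong_test(y, model_a, model_b)
```

The test accounts for the correlation between the two AUCs that arises because both models are scored on the same patients, which is why it is preferred over comparing two independent confidence intervals.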

Upsampling vs. downsampling analysis

The analysis of positive and negative samples showed that the disparity between the two classes varies significantly across the datasets. Therefore, we investigated and compared the performance of models constructed without sampling (the above results), with upsampling, and with downsampling. We employed the fivefold cross-validation method to construct the models. First, downsampling was performed on the negative samples in the training set, and the model predictions after integrating the fivefold models were obtained, with the results shown in Supplementary Table 1 and Fig. 3A. The SVM model achieved the highest accuracy, with an AUC 95% CI of 0.8685 ± 0.0125 for the test data and 0.8274 ± 0.0546 for the external data. Then, upsampling was performed on the positive samples in the training set, and the model predictions after integrating the fivefold models were obtained, with the results shown in Supplementary Table 2 and Fig. 3B. The XGBoost model had the highest accuracy, with an AUC 95% CI of 0.8819 ± 0.0117 for the test data and 0.8162 ± 0.0558 for the external data. Both downsampling and upsampling led to relatively high AUCs and accuracy compared with those of the above non-sampling models.

Fig. 3

ROC curves based on the downsampling (A) and upsampling (B) analyses.
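The two resampling strategies can be sketched as below. The sketch assumes random sampling to a 1:1 class ratio and that the positive class (M1) is the minority; the target ratio and the sampling scheme are not stated in the text, and the function name is hypothetical.

```python
import random


def resample(X, y, mode, seed=0):
    """Balance classes by downsampling negatives or upsampling positives.

    Assumes label 1 (M1) is the minority class and targets a 1:1 ratio.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    if mode == "down":
        # Keep a random subset of negatives, as many as there are positives.
        neg = rng.sample(neg, len(pos))
    elif mode == "up":
        # Draw positives with replacement until both classes are equal.
        pos = pos + rng.choices(pos, k=len(neg) - len(pos))
    else:
        raise ValueError("mode must be 'up' or 'down'")
    idx = pos + neg
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]
```

Resampling is applied to the training folds only; the internal and external test sets keep their original class distribution so that the reported metrics reflect real prevalence.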

To compare the performance of the three methods for constructing models, a DeLong test was also conducted on the average prediction results of these three models based on the external test set. The p values for the comparisons between the non-sampling model and the upsampling or downsampling model on the external test set were less than 0.0001, indicating that the resampled models performed better than the non-sampling model did. Thus, the upsampling or downsampling method can significantly improve model performance. There was no significant difference in performance between the two models constructed by the up- and downsampling methods (p = 0.8734).

On the basis of the results obtained, we identified the model constructed using the upsampling method as the optimal model for this study. Furthermore, a visual analysis of the results is presented in Fig. 4 (DCA curve) and Fig. 5 (calibration curve). We also visualized each algorithm with a SHAP chart, as shown in Supplementary Fig. 1. The SHAP charts show that tumor size, T stage, N stage, and tumor grade have high feature importance in most models. Therefore, these four clinical characteristics can be considered the main predictors of metastasis in RCC patients.

Fig. 4

DCA curves based on the upsampling training set (A), test set (B) and external test set (C).

Fig. 5

Calibration curves based on the upsampling training set (A), test set (B) and external test set (C).
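The quantity plotted in a decision curve such as Fig. 4 is the net benefit at each threshold probability. The sketch below uses the standard net-benefit definition; the function names and toy values are hypothetical and are not taken from the study data.

```python
def net_benefit(y_true, probs, threshold):
    """Net benefit of treating cases whose predicted risk >= threshold.

    threshold must lie strictly between 0 and 1.
    """
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference curve: net benefit if every patient is treated as positive."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy example with four patients and a threshold probability of 0.5.
labels = [1, 0, 1, 0]
risks = [0.9, 0.2, 0.6, 0.7]
nb_model = net_benefit(labels, risks, 0.5)
nb_all = net_benefit_treat_all(labels, 0.5)
```

A model is clinically useful at a given threshold when its net benefit exceeds both the treat-all reference and zero, which corresponds to treating no one.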

Discussion

With respect to model development, although nomograms are currently the most commonly used prediction models, ML models are favored by an increasing number of medical workers because of their practicality, innovation and accuracy [12–14]. In clinical practice, tumor-related models can accurately predict prognosis by combining multiple factors, such as tumor pathological subtype, tumor stage, tumor diameter, and molecular marker expression. However, few researchers have attempted to use ML methods to explore the prediction of metastasis in RCC patients. In this study, we used a variety of ML algorithms to predict RCC metastasis using a large number of samples from two databases, and the models showed good sensitivity and specificity. These algorithms can be applied to predict whether RCC patients have metastasis, providing assistance for the clinical judgment of metastasis.

Various statistical methods facilitate the understanding and interpretation of data. However, when processing big data, conventional approaches are limited in efficiency, computing power and accuracy. In recent years, ML has included algorithmic methods that enable machines to solve problems without specific computer programming, leading the way in predictive modeling tasks [15]. The integration of big data with ML algorithms is becoming a clinical necessity. In RCC studies, ML models can be applied to analyze the risk factors associated with specific diseases on the basis of patient information. Yin et al. [16] integrated convolutional neural network models with Cox regression to identify potential prognostic biomarkers for overall survival. Terrematte P et al. [17] created a novel ML 13-gene signature, improving risk analysis and survival prediction for RCC patients. Chen S et al. [18] developed and validated an ML-based prognosis prediction model, which could contribute to clinical decision-making for patients with RCC. Similarly, in our previous study, we investigated and demonstrated that ML algorithms could be used as auxiliary tools to predict the overall survival of patients with RCC using the SEER data [11].

Once distant metastasis occurs in RCC, patient prognosis becomes very poor. Therefore, prediction of distant metastasis in RCC patients is very important. Many clinical studies have used clinicopathological factors to establish models to predict metastatic risk in RCC. Fan Z et al. [19] established nomograms using the SEER database to predict the risk of bone metastasis in patients with RCC. The calibration curve, ROC curve, and DCA confirmed good performance using diagnostic and prognostic nomograms. Wang J et al. [20] developed and validated a nomogram to predict distant metastasis in elderly patients with RCC. The AUCs of the training and validation cohorts indicated excellent predictive ability. DCA indicated that the nomogram had better clinical application value than traditional TN staging did. Some scholars have also explored the use of ML algorithm models to make predictions. Xu C et al. [21] used data from 40,355 RCC patients in the SEER database to build an ML model to predict the risk of bone metastasis in RCC patients. Among the prediction models established by the six ML algorithms, the XGBoost model achieved the best prediction performance (AUC = 0.891). Dong J and colleagues also used the SEER database to predict distant metastasis in RCC patients on the basis of interpretable ML models [22]. The calibration curve indicated that the predicted values are highly consistent with the actual observed values.

The above studies indicate that ML models for predicting RCC metastasis are feasible and highly accurate. However, most of these studies are based on SEER data and lack external validation. In contrast, in this study, we collected RCC data from the SEER database (training set) and the CNCC database (external test set). First, all the data were preprocessed. Preprocessing of feature data consisted of two steps. In step 1, clinicopathologic data including sex, tumor grade, side, pathology type, N stage, T stage, surgery and marital status were encoded numerically by assigning a distinct number to each category. In step 2, the two continuous features, age and tumor size, were discretized separately in the training set and in the internal and external test sets. Second, we compared the performance of models constructed by different ML algorithms for incomplete data and complete data. The accuracy of the ML models with complete datasets was significantly higher than that with incomplete datasets. Finally, to overcome the problem of large differences between positive and negative samples in the dataset, we also used up- and downsampling methods to reconstruct the model and test its performance. The results showed that the accuracy of the model was further improved. To our knowledge, our study is the first to combine the SEER database with an external dataset to predict the distant metastasis of RCC. The results showed that after validation with the external dataset, our ML model achieved high accuracy, which provides guiding value for clinical decision-making.

This study has several limitations. First, the data from the external validation cohort and the SEER database were selected retrospectively, which may introduce some inherent bias. In addition, to validate our prediction model in the general population, prospective clinical studies with larger sample sizes are necessary. Second, we were unable to obtain biomarkers or blood test results from the SEER database. We expect that the addition of data from external data validation will result in a more sophisticated and effective adoption of ML models as supplementary tools for prediction research.

Conclusion

In summary, through data from the SEER database and CNCC dataset, this study explored the factors related to the prediction of RCC metastasis through multialgorithm ML models. Relevant algorithms were established to predict the possibility of distant metastasis in RCC patients. The findings show that ML algorithms can effectively predict distant metastasis in patients with RCC and play a positive role in clinical applications.

Electronic Supplementary Material

Below is the link to the electronic supplementary material

Supplementary Material 1

Supplementary Material 2

Supplementary Material 3

References

1. National Cancer Center; Renal Cancer Expert Committee of National Cancer Quality Control Center. Zhonghua Zhong Liu Za Zhi. 2022;44:1256–61.

2. Motzer RJ, Jonasch E, Agarwal N, et al. NCCN Guidelines® Insights: Kidney Cancer, Version 2.2024. J Natl Compr Canc Netw. 2024;22:4–16.

3. Tannir NM, Albigès L, McDermott DF, et al. Nivolumab plus ipilimumab versus sunitinib for first-line treatment of advanced renal cell carcinoma: extended 8-year follow-up results of efficacy and safety from the phase III CheckMate 214 trial. Ann Oncol. 2024;35:1026–38.

4. Motzer RJ, Porta C, Eto M, et al. Lenvatinib Plus Pembrolizumab Versus Sunitinib in First-Line Treatment of Advanced Renal Cell Carcinoma: Final Prespecified Overall Survival Analysis of CLEAR, a Phase III Study. J Clin Oncol. 2024;42:1222–8.

5. Powles T, Albiges L, Bex A, et al. Renal cell carcinoma: ESMO Clinical Practice Guideline for diagnosis, treatment and follow-up. Ann Oncol. 2024;35:692–706.

6. Bex A, Ghanem YA, Albiges L, et al. European Association of Urology Guidelines on Renal Cell Carcinoma: The 2025 Update. Eur Urol. 2025;87:683–96.

7. Rathmell WK, Rumble RB, Van Veldhuizen PJ, et al. Management of Metastatic Clear Cell Renal Cell Carcinoma: ASCO Guideline. J Clin Oncol. 2022;40:2957–95.

8. Rose TL, Kim WY. Renal Cell Carcinoma: A Review. JAMA. 2024;332:1001–10.

9. Lee CH, Kang M, Kwak C, et al. Sites of Metastasis and Survival in Metastatic Renal Cell Carcinoma: Results From the Korean Renal Cancer Study Group Database. J Korean Med Sci. 2024;39:e293.

10. Dogan I, Iribas A, Paksoy N, Vatansever S, Basaran M. Outcomes and prognostic factors in metastatic renal cell carcinoma patients with brain metastases. J Cancer Res Ther. 2023;19:S587–91.

11. Jiang W, Chen Z, Chen C, Wang L, Han T, Wen L. Machine learning algorithms being an auxiliary tool to predict the overall survival of patients with renal cell carcinoma using the SEER database. Transl Androl Urol. 2024;13:53–63.

12. Wang X, Zhao J, Marostica E, et al. A pathology foundation model for cancer diagnosis and prognosis prediction. Nature. 2024;634:970–8.

13. Feng Y, Long Y, Wang H, et al. Benchmarking machine learning methods for synthetic lethality prediction in cancer. Nat Commun. 2024;15:9058.

14. Collins GS, Moons KGM, Dhiman P, et al. TRIPOD + AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. 2024;385:e078378.

15. MacEachern SJ, Forkert ND. Machine learning for precision medicine. Genome. 2021;64:416–25.

16. Yin Q, Chen W, Zhang C, et al. A convolutional neural network model for survival prediction based on prognosis-related cascaded Wx feature selection. Lab Investig. 2022;102:1064–74.

17. Terrematte P, Andrade DS, Justino J, et al. A Novel Machine Learning 13-Gene Signature: Improving Risk Analysis and Survival Prediction for Clear Cell Renal Cell Carcinoma Patients. Cancers (Basel). 2022;14:2111.

18. Chen S, Guo T, Zhang E, et al. Machine learning-based prognosis signature for survival prediction of patients with clear cell renal cell carcinoma. Heliyon. 2022;8:e10578.

19. Fan Z, Huang Z, Huang X. Bone Metastasis in Renal Cell Carcinoma Patients: Risk and Prognostic Factors and Nomograms. J Oncol. 2021;2021:5575295.

20. Wang J, Zhanghuang C, Tan X, et al. Development and Validation of a Nomogram to Predict Distant Metastasis in Elderly Patients With Renal Cell Carcinoma. Front Public Health. 2022;9:831940.

21. Xu C, Liu W, Yin C, et al. Establishment and Validation of a Machine Learning Prediction Model Based on Big Data for Predicting the Risk of Bone Metastasis in Renal Cell Carcinoma Patients. Comput Math Methods Med. 2022;2022:5676570.

22. Dong J, Duan M, Liu X, et al. Prediction of Distant Metastasis of Renal Cell Carcinoma Based on Interpretable Machine Learning: A Multicenter Retrospective Study. J Multidiscip Healthc. 2025;18:195–207.

Acknowledgement

None

Funding

This research was supported by the Elite Medical Professionals Project of China-Japan Friendship Hospital (No. ZRJY2023-QM01) and National High Level Hospital Clinical Research Funding.

Financial disclosure

None.

Ethics statement

The authors state that they have followed the principles outlined in the Declaration of Helsinki for all human or animal experimental investigations. Our study was approved by the Institutional Review Committee of the National Cancer Center/Cancer Hospital, Chinese Academy of Medical Sciences (NCC/CHCAMS) (Institutional Review Board number: 21/405–3076).

Patient consent for publication

Patient consent was not required owing to the retrospective nature of the study.

Data Availability

The datasets used and/or analyzed in the current study are available from the corresponding author on reasonable request.

Competing interests

The authors declare no conflicts of interest.

Author Contribution

All authors listed in this manuscript contributed significantly to the study. Yajian Li and Chuanzhen Cao contributed to writing the manuscript. Moxuan Wang contributed to layout and beautification of the figures. Cancan Chen contributed to data analysis. Jianzhong Shou contributed to supervision. Tiandong Han and Li Wen contributed to reviewing the manuscript for critical revisions. All authors read and approved the final manuscript.

Supplementary Fig. 1. Visualization images of each algorithm model: (A) random forest SHAP chart; (B) logistic regression SHAP chart; (C) XGBoost SHAP chart; (D) Bayesian SHAP chart; (E) decision tree SHAP chart; (F) neural network SHAP chart; and (G) SVM SHAP chart.
