Practical Machine Learning Framework for Designing and Predicting C-Amidated Antimicrobial Peptides

Dang-Huy Le¹, Wenyi Li², Andrew Hung³, Shadi Houshyar¹ & Tu C. Le¹,*

Affiliations

¹School of Engineering, STEM College, RMIT University, Melbourne, Victoria, 3000, Australia

Dang-Huy Le, Shadi Houshyar & Tu C. Le

²Department of Biochemistry and Chemistry, La Trobe Institute for Molecular Science, La Trobe University, Bundoora, Victoria, 3086, Australia

Wenyi Li

³School of Science, STEM College, RMIT University, Melbourne, Victoria, 3000, Australia

Andrew Hung

Contributions

D.H.L. curated the data, developed the machine learning models, and wrote the original draft. W.L., A.H. and S.H. provided supervision and reviewed the manuscript. T.C.L. contributed to the formal analysis, provided supervision and resources, and reviewed the manuscript.

Corresponding author

Correspondence to: Tu.Le@rmit.edu.au

ORCID: 0000-0003-3552-8211

Abstract

Antimicrobial peptides (AMPs) offer promising alternatives to conventional antibiotics, yet most predictive models fail to account for chemical modifications that influence real-world efficacy. Among these, C-terminal amidation is a widely adopted and effective strategy that improves structural stability, membrane interaction, and protease resistance. To address the limitations of existing models, we developed an integrated framework for the design and prediction of C-terminal amidated AMPs targeting Escherichia coli. Our approach combines a design-oriented model, based on an interpretable Explainable Boosting Machine (EBM) that extracts actionable sequence-level design rules, with a high-accuracy deployment model built on a fine-tuned ESM2 deep learning architecture. The resulting tool, CAmidPred, enables both predictive classification and amino acid pattern analysis, with outputs validated against published alanine-scanning experiments. This dual-model approach bridges computational modeling and wet-lab discovery, offering a robust and practical pipeline for the rational design of targeted AMPs.

Introduction

Antimicrobial peptides (AMPs) are short, naturally occurring proteins that are a crucial component of innate immunity and offer promising alternatives to conventional antibiotics, especially in the face of rising antimicrobial resistance.^1–3 Beyond therapeutic applications, AMPs are also used in biomedical materials as surface coatings for implants and catheters, as well as in non-medical fields such as food preservation and agriculture.^4–8

AMPs are typically composed of 10 to 50 amino acids, possess a net positive charge (typically ranging from +2 to +13) and exhibit amphiphilic properties that enable selective binding to negatively charged microbial surfaces, followed by membrane disruption or cellular penetration.⁹ To further improve their efficacy, various chemical modifications have been explored.^10–13 Among these, C-terminal amidation is one of the most widely adopted. This modification improves structural stability, membrane interaction and protease resistance, but its effect can vary depending on the peptide context.¹⁴ C-terminal amidation can stabilize the α-helical conformation, enhance the peptide's interaction with membranes and protect the peptide from enzymatic degradation by reducing its susceptibility to proteases, thereby affecting the antimicrobial activity.^15–17 However, C-terminal amidation does not always provide these advantages. For instance, amidated aurein 1.2 shows significantly reduced membrane-binding ability compared to its non-amidated counterpart.¹⁸ Therefore, it is important to study the activity of C-terminally amidated peptides more thoroughly.

The rise of machine learning (ML), together with the growing availability of public peptide databases, has accelerated AMP discovery by improving prediction efficiency and reducing experimental costs. Traditional ML methods, such as Support Vector Machines (SVM),^19–21 Random Forest,^22–24 and XGBoost,^25,26 are valuable for their interpretability and effectiveness on small datasets, as they allow researchers to link peptide activity to specific features like amino acid composition or physicochemical properties, though they require careful feature engineering. In contrast, deep learning approaches, including Convolutional Neural Networks (CNN),^27–29 Recurrent Neural Networks (RNN),^30,31 and hybrid models,^32–34 excel at capturing complex sequence patterns and often achieve higher predictive accuracy but are harder to interpret. The integration of natural language processing (NLP) has further transformed AMP prediction by treating amino acids as words, enabling pretrained protein language models like ESM2³⁵ and ProtTrans³⁶ to learn transferable sequence representations from massive datasets.^37–39 These models capture both local and global sequence contexts, enhancing the detection of antimicrobial activity, though they also inherit deep learning’s “black box” limitations.

Most existing AMP prediction models are limited to peptides made of natural amino acids, overlooking chemical modifications that can significantly influence activity and lead to inaccurate predictions in practice. Furthermore, many studies evaluate antimicrobial activity in a general sense, without considering organism-specific mechanisms, which reduces accuracy since modes of action differ across targets. One regression model aimed at predicting MIC (minimum inhibitory concentration) generalized across all targets, encoding peptide sequences with one-hot vectors and amidation with a binary label, a simplistic strategy that does not capture detailed sequence information.⁴⁰ It also introduced weak assumptions, such as labeling peptides without antimicrobial records as inactive and assigning them an arbitrary logMIC of 4, which risks distorting biological relevance. Huang et al. (2023) also examined amidated peptides but only for Staphylococcus aureus.²⁶ Our work addresses these gaps by focusing on C-terminal amidated peptides targeting Escherichia coli, a WHO-designated highest-priority Gram-negative pathogen due to its antibiotic resistance, widespread infections, and lack of effective treatments.⁴¹ We consider only sequences with experimentally determined MIC values against E. coli and develop two complementary classification models: a traditional ML model for interpretability and a deep learning model for predictive performance (Fig. 1). For the traditional ML approach, we use the Explainable Boosting Machine (EBM), a glass-box algorithm that offers strong accuracy while preserving interpretability.⁴² For deep learning, we mitigate data scarcity by applying transfer learning with the pretrained ESM2 protein language model and use its multi-head self-attention to weight amino acids by importance and highlight sequence regions most relevant to antimicrobial activity. This combination enables both actionable sequence recommendations and residue-level analysis, bridging computational modelling with experimental validation.

Fig. 1

Overview of the modeling framework. The pipeline integrates two complementary branches: a design-oriented model (top) based on handcrafted sequence features and EBM for interpretable rule extraction, and a deployment model (bottom), CAmidPred, which leverages pretrained ESM2 embeddings and a 5-fold ensemble classifier to predict AMPs versus non-AMPs.

Results

During data curation, two MIC thresholds (10 µM and 15 µM) were evaluated for classifying peptides as active or inactive. The design-oriented model showed better performance with the 15 µM threshold, which was therefore selected as the primary cutoff. For completeness, results of all trials from both thresholds are provided in Tables S1 and S2 (Supporting Information). To maintain consistency across the pipeline, the deployment model was also developed using the 15 µM threshold.

Interpretable Design-Oriented Model for Sequence Suggestion

To objectively compare model quality across seven trials, each corresponding to a different combination of the three feature groups (composition, composition-transition-distribution (CTD), and global descriptors), we defined three composite evaluation metrics: performance score, generalization score, and ranking score. These metrics are designed to capture both the predictive strength of the model and its ability to generalize without overfitting.

The performance score reflects a model’s predictive power and is calculated as the average of five evaluation metrics: cross-validation F1 score, test balanced accuracy, test F1 score, test AUC-ROC, and test AUC-PR. These metrics are well-suited for imbalanced classification tasks and collectively capture different aspects of model behavior. Note that only F1 score is available from the cross-validation set because the Sequential Feature Selection (SFS) process optimizes specifically for F1, so other metrics are not computed during that step. We intentionally excluded precision, recall, and MCC from this average: precision and recall are already implicitly represented by F1 (which balances both), and MCC has a different scale (ranging from −1 to 1), which would distort the averaging process. A higher performance score indicates better predictive ability.

The generalization score evaluates how consistently the model performs across training, validation, and test sets. It is based on the absolute differences (gaps) between various pairs of metrics: the F1 gap between training and cross-validation, training and test, and cross-validation and test sets. Additionally, it includes train-test gaps for balanced accuracy, AUC-ROC, and AUC-PR. The reason for using these specific metrics in the generalization score is the same as in the performance score mentioned above. The smaller the average gap, the better the model’s generalization ability, as it indicates less overfitting and more stable performance across datasets.

Finally, the ranking score combines the two previous metrics into a single value to fairly compare models. The performance score is first normalized to the [0, 1] range using MinMax scaling. Meanwhile, the generalization score is also normalized, but with an inverted MinMax scale so that models with smaller gaps (better generalization) receive higher scores. The final ranking score is the average of these two scaled scores, offering a balanced and interpretable metric that highlights models which are not only accurate but also generalize well to unseen data.
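The three composite scores described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the dictionary keys (such as `cv_f1`) are hypothetical names, and the metric values for the three trials are made-up placeholders, not results from Tables S1 or S2.

```python
def minmax(values):
    """Scale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all trials identical: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def performance_score(m):
    """Mean of CV F1 and the four test metrics (higher is better)."""
    keys = ["cv_f1", "test_bal_acc", "test_f1", "test_auc_roc", "test_auc_pr"]
    return sum(m[k] for k in keys) / len(keys)


def generalization_score(m):
    """Mean absolute gap between data splits (lower is better)."""
    gaps = [
        abs(m["train_f1"] - m["cv_f1"]),
        abs(m["train_f1"] - m["test_f1"]),
        abs(m["cv_f1"] - m["test_f1"]),
        abs(m["train_bal_acc"] - m["test_bal_acc"]),
        abs(m["train_auc_roc"] - m["test_auc_roc"]),
        abs(m["train_auc_pr"] - m["test_auc_pr"]),
    ]
    return sum(gaps) / len(gaps)


def ranking_scores(trials):
    """Average of scaled performance and inverted, scaled generalization."""
    perf = minmax([performance_score(t) for t in trials])
    gen = [1.0 - g for g in minmax([generalization_score(t) for t in trials])]
    return [(p + g) / 2 for p, g in zip(perf, gen)]


# Illustrative placeholder metrics for three hypothetical trials.
trials = [
    dict(train_f1=0.86, cv_f1=0.82, test_f1=0.80, train_bal_acc=0.83,
         test_bal_acc=0.76, train_auc_roc=0.92, test_auc_roc=0.84,
         train_auc_pr=0.94, test_auc_pr=0.86),
    dict(train_f1=0.95, cv_f1=0.78, test_f1=0.74, train_bal_acc=0.93,
         test_bal_acc=0.70, train_auc_roc=0.98, test_auc_roc=0.78,
         train_auc_pr=0.98, test_auc_pr=0.80),
    dict(train_f1=0.90, cv_f1=0.80, test_f1=0.77, train_bal_acc=0.88,
         test_bal_acc=0.73, train_auc_roc=0.95, test_auc_roc=0.81,
         train_auc_pr=0.96, test_auc_pr=0.83),
]
scores = ranking_scores(trials)
print(scores)
```

Because both components are rescaled across the set of trials, the ranking score is relative: adding or removing a trial can change the scores of all the others.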

Figure 2A presents the scores for all seven trials. The best-performing trial, based on the highest ranking score, is the EBM model that combines all three feature types: composition, CTD, and global descriptors. This model not only achieves the highest performance score but also shows a reasonable generalization score, indicating strong predictive capability with minimal overfitting. Detailed evaluation metrics for this model are provided in Table 1. As shown in Table 1, the test set metrics are generally close to those from the training set, and their values are sufficiently high to confirm that the model performs well on unseen data. This indicates that the model is both accurate and generalizable, making it suitable as a design-oriented model. A complete summary of evaluation metrics for all trials is available in Table S2 (Supporting Information).

Fig. 2

(A) Performance vs. generalization plot of seven feature combinations. Each point represents an individual trial, with the colour ranging from blue to red, where warmer colours indicate better outcomes. (B) Feature importance plot from the selected EBM model. The plot includes both univariate and pairwise interaction features, shown in abbreviated form, with full annotations provided in the legend box.

Table 1
Evaluation metrics for the best EBM model.
Metric	Training	Test
Balanced Accuracy	0.8332	0.7565
F1	0.8672	0.8025
Precision	0.8379	0.7789
Recall	0.8987	0.8277
AUC ROC	0.9214	0.8355
AUC PR	0.9386	0.8563
MCC	0.6771	0.5199

To interpret the model and extract design insights for AMPs, we analyzed how features contribute to the EBM model’s predictions, as illustrated in Fig. 2B. EBM provides two types of scores: univariate scores for individual features and interaction scores for feature pairs. In the design-oriented model, 9 univariate features and 9 interaction features were identified as influential. A score above zero indicates a tendency toward an active prediction, while a score below zero suggests inactivity. Feature importance in this figure is represented by the mean absolute score, which reflects the strength of a feature's contribution but not its direction (positive or negative). To capture both importance and direction, Figs. 3A–E present the univariate score curves for the features with measurable impact, ranked by their contributions. Figures 4A–I illustrate all detected interaction effects.

Fig. 3

Univariate feature score plots from the selected EBM model, showing the features with measurable impact. The blue line shows the EBM score across value bins, where each bin represents a range of feature values. Orange bars indicate the number of samples in each bin. Gray shaded regions mark areas with limited or no data, where the model extrapolates. Larger gray regions indicate lower reliability of the score in those ranges.

Fig. 4

Interaction score heatmaps from the selected EBM model. Only interaction scores greater than or equal to zero are visualized, with color intensity representing the strength of the contribution (brighter colors indicate stronger positive influence).

Deployment Model for High-Confidence Prediction

We used 5-fold cross-validation to generate five different train-validation splits and built a separate model for each fold. Since we aimed to analyze results at a later stage using prediction probabilities, we selected the best model in each fold based on the lowest validation loss. Additionally, other evaluation metrics were calculated on the validation sets. Detailed performance metrics from all hyperparameter trials across the five models obtained through 5-fold cross-validation are provided in Table S3 (Supporting Information). The hyperparameters selected for the best-performing model in each fold (based on validation loss) are shown in Table 4. The evaluation metrics for the five models are presented in Fig. 5A. The performance metrics across all folds are similar and consistently high, indicating that our modeling strategy performs reliably across different data splits. To leverage the strengths of models trained on different subsets of the data, we created an ensemble model by averaging the logits from all five models. This ensemble model was used to make final predictions. When tested on unseen data, the ensemble model achieved the following results: balanced accuracy of 0.8239, F1 score of 0.8535, precision of 0.8426, recall of 0.8646, AUC-ROC of 0.9087, AUC-PR of 0.9317, and MCC of 0.6510. Compared to the test set metrics of our design-oriented model shown in Table 1, the ensemble fine-tuned ESM2 model demonstrates superior performance, offering more accurate and reliable predictions. These results suggest that the ensemble model is well-suited for practical deployment in future applications.

Table 4
Selected hyperparameters corresponding to the lowest validation loss per fold.
Fold	Dropout rate	Learning rate	Loss
1	0.1	1e-5	0.4321
2	0.1	1e-5	0.4673
3	0.1	1e-5	0.4583
4	0.3	1e-5	0.4468
5	0.1	5e-6	0.4405
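The logit-averaging step of the ensemble can be illustrated as follows. This is a sketch under stated assumptions: it assumes each fold model emits two logits per peptide (inactive, active) that are converted to a probability with a softmax after averaging, which the text does not specify. The function name and the example logits are hypothetical.

```python
import math


def ensemble_active_probability(fold_logits):
    """Average per-fold logits, then apply a softmax.

    fold_logits: list of [logit_inactive, logit_active], one pair per fold.
    Returns the ensemble probability of the active class.
    """
    n_folds = len(fold_logits)
    mean_logits = [sum(pair[c] for pair in fold_logits) / n_folds for c in range(2)]
    # Numerically stable softmax over the two averaged logits.
    shift = max(mean_logits)
    exps = [math.exp(x - shift) for x in mean_logits]
    return exps[1] / sum(exps)


# Hypothetical logits from five fold models for one peptide.
example_logits = [[-0.4, 0.6], [-0.1, 0.2], [0.3, -0.1], [-0.8, 0.9], [-0.2, 0.5]]
p_active = ensemble_active_probability(example_logits)
print(f"active probability: {p_active:.3f}",
      "-> active" if p_active >= 0.5 else "-> inactive")
```

Averaging logits rather than probabilities lets a fold that is very confident carry more weight than it would after the softmax has squashed its output.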

Fig. 5

(A) Five-fold radar plot of fine-tuned ESM2 metrics. (B–E) Alanine scanning of Indolicidin (B) and SB056 variants (C–E). Bars show the change in predicted activity upon alanine substitution; the dashed line marks the original probability. Pink indicates decreased and blue increased predicted activity. (F–H) Circos plot attention maps of SB056 variants corresponding to (C–E). Arcs denote residues; links indicate interactions above the third quartile of attention weights, with color and thickness reflecting interaction strength and direction. Residues are color-coded: orange, hydrophobic; blue, positively charged; green, uncharged polar.

Discussion

Our design-oriented model identifies features that can be translated into actionable design rules for antimicrobial peptides. The most dominant feature identified by the model is Charge (Fig. 3A), which is calculated as the sum of electrical charges of all ionizable groups (side chains and termini) at physiological pH 7.4. For unmodified peptides composed of natural amino acids, the N-terminus is considered positively charged and the C-terminus negatively charged; however, in this study, we focus on C-amidated peptides, where the C-terminal charge is neutralized. As shown in Fig. 3A, peptides with higher net positive charge tend to have a greater likelihood of being classified as antimicrobial. Specifically, the EBM score becomes positive when the net charge is 4.051 or higher, suggesting that a charge above 4 is indicative of potential antimicrobial activity. This aligns with prior findings that positively charged AMPs interact favorably with the negatively charged bacterial membrane, especially in Gram-negative bacteria like E. coli, which possess an outer membrane rich in negatively charged lipopolysaccharides. Thus, charge is a critical determinant of bioactivity. However, due to limited sample availability with net charges above 11.526 in our dataset, model predictions in this range are less reliable. Therefore, for the most confident classification, peptides with charges between 4.051 and 11.526 are most representative. The charge contributions of amino acid residues at pH 7.4 were computed using modlamp⁴³ and are summarized in Table 2.

Table 2
Charge contribution values for amino acids and termini in C-amidated peptides at pH 7.4.
Components	Charge contribution
N-terminal	0.99
K	0.999
R	1
H	0.042
D	-1
E	-0.999
C	-0.154
Y	-0.002
C-amidated and other amino acids	0
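As a worked example, the net charge of a C-amidated peptide can be approximated by summing the contributions in Table 2. The sketch below is a simplified stand-in for the modlamp calculation, not a call into that package; the function and constant names are hypothetical.

```python
# Per-residue charge contributions at pH 7.4, copied from Table 2.
CHARGE_PH74 = {
    "K": 0.999, "R": 1.0, "H": 0.042,
    "D": -1.0, "E": -0.999, "C": -0.154, "Y": -0.002,
}
N_TERMINAL_CHARGE = 0.99  # free N-terminus; the amidated C-terminus contributes 0


def net_charge_c_amidated(sequence):
    """Sum Table 2 contributions for a C-amidated peptide (one-letter codes)."""
    return N_TERMINAL_CHARGE + sum(CHARGE_PH74.get(aa, 0.0) for aa in sequence)


charge = net_charge_c_amidated("KWKIRVRLSA")
print(round(charge, 3))
```

For KWKIRVRLSA-NH₂, two lysines, two arginines and the N-terminus give 0.999 × 2 + 1 × 2 + 0.99 = 4.988, which falls inside the 4.051 to 11.526 range identified by the model.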

Unlike Charge, which shows a clear and continuous trend, most features in the design-oriented EBM model contribute to predictions only within specific value ranges. For example, CTDD_hydrophobicity_CASG920101.3.residue0 (a distribution descriptor from the CTD group) has two regions where the model assigns a positive score (Fig. 3B). However, the second region, approximately from 83.5 to 100, contains very few samples and exhibits high score variability, making predictions in this range less reliable. A more consistent and dependable range is between 4.211 and 11.882, indicating that highly hydrophobic amino acids such as F, I, W, and C should appear early in the sequence (within the first 4.21% to 11.88% of its length) to enhance antimicrobial activity. Another feature, CTDC_hydrophobicity_PONP930101.G3 (a composition descriptor from the CTD group), shows a preferred value range between 0.455 and 0.7, suggesting that 45.5% to 70% of the amino acids in the sequence should belong to the hydrophobic YMFWLCVI group (Fig. 3C). These observations together highlight the importance of hydrophobicity in AMP function. After initial electrostatic interaction with the negatively charged bacterial membrane, hydrophobic residues help the peptide insert into or disrupt the hydrophobic inner membrane. If the peptide lacks sufficient hydrophobicity, it may fail to engage the membrane. On the other hand, excessive hydrophobicity may cause the peptide to aggregate and lose solubility, ultimately reducing its antimicrobial activity. The feature CTDD_hydrophobicity_ZIMJ680101.3.residue50 provides additional support for this pattern (Fig. 3D). It favors sequences where 50% of LPFYI residues (which are also highly hydrophobic) are located between 35.5% and 59% of the sequence length, suggesting that both composition and spatial distribution of hydrophobic residues influence activity.
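The composition and distribution descriptors discussed above can be checked for a candidate sequence with a short sketch. It assumes the usual CTD convention, in which a distribution value is the 1-based position of a residue expressed as a percentage of sequence length, with the 0% marker meaning the first occurrence of the group. The function names are hypothetical and the code is not taken from iFeature.

```python
import math


def group_fraction(sequence, group):
    """CTD-style composition: fraction of residues that belong to `group`."""
    return sum(aa in group for aa in sequence) / len(sequence)


def distribution_position(sequence, group, fraction):
    """CTD-style distribution: percent of the sequence length at which
    `fraction` of the group's residues has been reached.
    fraction=0.0 returns the position of the first occurrence."""
    positions = [i + 1 for i, aa in enumerate(sequence) if aa in group]
    if not positions:
        return 0.0
    cutoff = max(1, math.floor(fraction * len(positions)))
    return 100.0 * positions[cutoff - 1] / len(sequence)


seq = "KWKIRVRLSA"
print(group_fraction(seq, "YMFWLCVI"))           # hydrophobic composition
print(distribution_position(seq, "FIWC", 0.0))   # first highly hydrophobic residue
print(distribution_position(seq, "LPFYI", 0.5))  # 50% marker of the LPFYI group
```

For a short peptide these percentages are coarse: in a 10-residue sequence every position corresponds to a step of 10%, so the recommended ranges are best read as guidance on relative placement rather than as exact targets.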

Another informative feature, CKSAAGP_uncharger.uncharger.gap3 (Composition of k-Spaced Amino Acid Group Pairs), represents the normalized frequency of uncharged amino acid pairs (S, T, C, P, N, Q) separated by three residues (Fig. 3E). This feature contributes positively when such uncharged amino acid pairs occur in 2.12% to 6.78% of all possible gap-3 positions (sequence length minus 4, following the $L-k-1$ normalization with $k=3$). This suggests a subtle structural preference associated with antimicrobial activity.
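A gap-3 group-pair frequency of this kind can be computed as sketched below, using the $L-k-1$ normalization given in the Methods. The function name and the toy sequence are hypothetical and serve only to show the counting.

```python
UNCHARGED = "STCPNQ"  # uncharged polar group as listed in the text


def group_pair_frequency(sequence, group_a, group_b, gap):
    """Fraction of residue pairs (i, i + gap + 1) whose first member is in
    group_a and whose second member is in group_b."""
    n_pairs = len(sequence) - gap - 1
    if n_pairs <= 0:
        return 0.0
    hits = sum(
        sequence[i] in group_a and sequence[i + gap + 1] in group_b
        for i in range(n_pairs)
    )
    return hits / n_pairs


# Toy sequence: S at position 1 and T at position 5 are separated by three residues.
toy = "SKKKTKKKKQ"
freq = group_pair_frequency(toy, UNCHARGED, UNCHARGED, gap=3)
print(f"{freq:.4f}")
```

In the toy sequence only one of the six possible gap-3 pairs is uncharged-uncharged, so the frequency is 1/6.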

In contrast, four remaining univariate features show limited relevance (Figures S1A-D). Their EBM scores stay nearly constant across the observed value ranges, and the apparent increases above zero occur only in regions with very few samples, suggesting extrapolation rather than reliable effects. These features also exhibit low overall importance and minimal interaction with others. Collectively, these features contribute little to guiding AMP design.

Analysis of the interaction heatmaps in Fig. 4, together with the recommended value ranges from the univariate score plots, reveals a clear alignment: features that are favorable individually also tend to interact positively in combination. This suggests that these properties can co-occur and jointly enhance antimicrobial activity. Furthermore, several studies have shown that AMPs with excessively high amphipathicity may experience reduced efficacy.⁴⁴ This emphasizes the importance of considering multiple features simultaneously, each within its appropriate range, to design peptides with optimal antimicrobial potential.

While the design-oriented model provides interpretable design rules, our deployment model demonstrates higher predictive performance and shows reliable consistency when analyzing alanine scanning data. We first analyze predictions on Indolicidin (ILPWKWPWWPWRR-NH₂), an AMP isolated from the cytoplasmic granules of bovine neutrophils.⁴⁵ This peptide is notable for its unusually high tryptophan content, with five W residues making up about 38% of its 13 amino acids. Multiple studies agree that W plays a key role in antimicrobial activity.^46–48 W not only helps localize the peptide to the membrane interface but also promotes membrane insertion and disruption. In addition, the two arginine residues at the C-terminus are essential for antibacterial activity.⁴⁸ These positively charged residues disrupt bacterial membranes through electrostatic interactions. Indolicidin analogs that lack R residues show a complete loss of activity. Our model’s alanine scanning results align with experimental findings, identifying all five W residues and the two C-terminal R residues as the most critical for activity (Fig. 5B). Replacing any of these residues with alanine causes the predicted activity probability to drop significantly, from 0.36 to 0.5, resulting in the peptide being classified as inactive when the active probability falls below 0.5.

In another study, proline residues at positions 3, 7, and 10 were substituted with alanine to evaluate their roles in antimicrobial activity.⁴⁹ When P3 was replaced with A, the MIC against E. coli increased significantly from 4 µM to 35 µM, showing a major drop in potency. In contrast, substitutions at P7 and P10 led to only moderate changes, with MIC values of 10 µM and 15 µM respectively. These experimental findings align well with our model's predictions, as shown in Fig. 5B. When P3 is replaced with alanine, the predicted probability of the peptide being active decreases from 0.75 to 0.45. Since our model uses a threshold of 15 µM to classify activity, this change places the peptide in the inactive category, corresponding to a predicted MIC above 15 µM. In comparison, replacing P7 or P10 causes only minor changes in the predicted probability, and the peptide remains classified as active, with a predicted MIC of 15 µM or less. It is also important to highlight that the variant with P10 replaced by alanine was not included in the training dataset. Nevertheless, the model was still able to generate an accurate and interpretable prediction, demonstrating its ability to generalize to novel or unseen sequences.
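The in silico alanine scan used in these comparisons follows a simple loop: substitute each residue with alanine in turn, re-score the mutant, and record the change relative to the parent sequence. The sketch below shows that loop only. `toy_predictor` is a deliberately crude, hypothetical stand-in (the fraction of W and R residues) so that the example runs; it is not CAmidPred and its numbers carry no biological meaning.

```python
def alanine_scan(sequence, predict_active_probability):
    """Return the baseline probability and, for each non-alanine position,
    (position, original residue, mutant sequence, change in probability)."""
    baseline = predict_active_probability(sequence)
    results = []
    for i, residue in enumerate(sequence):
        if residue == "A":
            continue  # substituting alanine with alanine is uninformative
        mutant = sequence[:i] + "A" + sequence[i + 1:]
        delta = predict_active_probability(mutant) - baseline
        results.append((i + 1, residue, mutant, delta))
    return baseline, results


def toy_predictor(sequence):
    """Hypothetical placeholder scorer: fraction of W and R residues."""
    return sum(aa in "WR" for aa in sequence) / len(sequence)


baseline, scan = alanine_scan("ILPWKWPWWPWRR", toy_predictor)
for position, residue, mutant, delta in scan:
    print(f"{residue}{position}A  {mutant}  {delta:+.3f}")
```

In practice `predict_active_probability` would wrap the trained deployment model, and a mutant would be flagged as losing activity when its probability falls below the 0.5 decision threshold.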

We also examined a case study involving SB056. Manzo et al. reported that swapping the first two residues (W and K) significantly improved antimicrobial activity, reducing the MIC from 64 µg/mL to 4 µg/mL.⁵⁰ Our model supports this finding: the predicted probability of activity increased from 0.16 to 0.51 after the swap (Figs. 5C and 5D). Since 0.5 is the probability threshold used to classify peptides as active or inactive, this change corresponds to a transition from the inactive to the active class. Notably, the modified sequence KWKIRVRLSA-NH₂ was not present in the training data, yet the model correctly predicted its increased activity, demonstrating strong generalization capability.

The experimental explanation centers on structural amphipathicity.⁵⁰ Swapping the first two residues creates a more regular alternating pattern of hydrophobic and hydrophilic amino acids, promoting a stable β-strand structure on the membrane. In this structure, hydrophobic residues embed into the membrane while hydrophilic ones are directed towards the lipid-water interface. The original sequence had an alternating pattern, except for the first two residues. After swapping them, the entire sequence forms a nearly perfect alternating profile while maintaining the same net positive charge (+5). Our model reflects the same structural explanation observed experimentally. As shown in Figs. 5C and 5D, swapping the first two residues increases the influence of most amino acids on the class decision. In the original sequence, the irregular alternating pattern had little effect on the model's prediction. After the swap, the more regular alternation of hydrophobic and hydrophilic residues creates a stable pattern that the model recognizes as important. In this context, substituting any residue with alanine disrupts the ideal structure, leading to a significant drop in the predicted probability of activity. Consequently, the peptide is no longer predicted as active.

Despite the improved amphipathic profile of KWKIRVRLSA-NH₂, the serine at position 9 is a polar but uncharged residue, which gives it the weakest hydrophobic character among the hydrophobic residues in the sequence. According to our model, this position may hinder activity; substituting S9 with alanine increases the predicted probability of activity by 0.1 (Fig. 5D). This result suggests that replacing S9 with a positively charged residue, such as lysine, could further enhance antimicrobial activity by reinforcing the alternating hydrophobic–hydrophilic pattern. Consistent with this hypothesis, the modified sequence KWKIRVRLKA-NH₂ is predicted to have a high probability of activity (0.76) (Fig. 5E). This variant has not been experimentally reported and could be a novel and promising AMP candidate.

To investigate how the model attends to different residues during prediction, we visualized attention patterns using Circos plots.⁵¹ Specifically, we highlighted attention links exceeding the third quartile of the distribution, focusing on the most influential interactions. As shown in Figs. 5F and 5G, both sequences WKKIRVRLSA and KWKIRVRLSA reveal consistent attention toward residues forming an alternating hydrophilic–hydrophobic motif. This is reflected in the dense web of inbound and outbound connections across these positions, suggesting their importance in the model’s decision-making. However, S9 appears to play a limited role, as evidenced by the sparse attention links associated with it. Upon substituting S9 with lysine (Fig. 5H), the attention network becomes more interconnected, incorporating the new K9 residue into the model’s focus. This increased engagement suggests that the modified sequence enhances the alternating polarity pattern, leading to greater model confidence in its prediction.
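The third-quartile filter applied to the attention links can be sketched as follows. The example assumes a single residue-by-residue attention matrix is already available; how heads and layers were aggregated is not specified in the text, and the 3 × 3 matrix below is a made-up toy, not output from the fine-tuned model.

```python
import statistics


def strong_attention_links(attention, labels):
    """Keep only links whose weight exceeds the third quartile of all weights.

    attention[i][j] is the weight from residue i (query) to residue j (key).
    Returns the quartile threshold and a list of (source, target, weight).
    """
    weights = [w for row in attention for w in row]
    q3 = statistics.quantiles(weights, n=4, method="inclusive")[2]
    links = [
        (labels[i], labels[j], w)
        for i, row in enumerate(attention)
        for j, w in enumerate(row)
        if w > q3
    ]
    return q3, links


# Toy attention matrix for the first three residues of a peptide.
labels = ["K1", "W2", "K3"]
attention = [
    [0.1, 0.2, 0.7],
    [0.3, 0.4, 0.3],
    [0.6, 0.2, 0.2],
]
q3, links = strong_attention_links(attention, labels)
print(q3, links)
```

Only the two strongest links in the toy matrix (K1 to K3 and K3 to K1) survive the filter. Residues with few or no surviving links are the ones that appear sparsely connected in a Circos plot.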

Overall, we present a framework for the discovery and design of C-amidated AMPs, supported by a design-oriented model that provides practical guidance for generating candidate sequences with a higher likelihood of antimicrobial activity. Promising peptides can then be evaluated using our deployment model, CAmidPred, which predicts activity, confidence scores, and residue-level contributions via alanine scanning. While C-amidation can enhance antimicrobial activity, it may also compromise selectivity, which can lead to increased toxicity, and further exacerbate issues of solubility, aggregation, and protease stability.^15,52 These factors are not yet addressed by our model. Future work will extend CAmidPred to incorporate them, offering a more holistic assessment of therapeutic potential. In combination with molecular dynamics simulations and experimental validation, our approach provides a practical and interpretable framework for advancing novel AMP candidates from initial design toward clinical application.

Methods

Design-oriented model

Feature representation

To support the goal of recommending design rules, we selected feature representations that are easy to interpret and biologically meaningful. Composition and Composition-Transition-Distribution (CTD) features were extracted using the iFeature⁴⁶ toolkit, while global descriptors were calculated with the modlamp⁴⁷ package. Using these features allows the ML model to capture essential information about amino acid content, sequence organization, and overall molecular properties.

Composition

The most fundamental representation is Amino Acid Composition (AAC), which measures the frequency of each of the 20 standard amino acids in a sequence. This feature captures the overall composition of residues and is defined as:

$\text{AAC}_{a_i}=\frac{N_{a_i}}{L}$

where $N_{a_i}$ is the number of occurrences of amino acid $a_i$, and $L$ is the total length of the sequence. For example, in the peptide GLFDIIKKIAESF (length $L=13$), the amino acid ‘K’ appears 2 times, resulting in $\text{AAC}_{K}=2/13\approx 0.1538$. AAC generates 20 features, corresponding to the frequencies of the 20 natural amino acids.

Dipeptide Composition (DPC) extends AAC by computing the frequency of every pair of adjacent amino acids. This allows the model to capture short-range sequence patterns. It is calculated as:

$\text{DPC}_{(a_i,a_j)}=\frac{N_{(a_i,a_j)}}{L-1}$

where $N_{(a_i,a_j)}$ is the number of times the dipeptide $a_i a_j$ appears in the sequence. The denominator $L-1$ corresponds to the number of possible overlapping dipeptides. For instance, the dipeptide ‘KK’ appears once in GLFDIIKKIAESF, yielding $\text{DPC}_{(K,K)}=1/12\approx 0.0833$. DPC creates 400 features ($20\times 20$), corresponding to the occurrences of all possible dipeptides formed by the 20 natural amino acids.

Tripeptide Composition (TPC) further generalizes this concept to every sequence of three consecutive amino acids. The TPC feature captures richer local patterns and is calculated by:

$\text{TPC}_{(a_i,a_j,a_k)} = \frac{N_{(a_i,a_j,a_k)}}{L-2}$

where $N_{(a_i,a_j,a_k)}$ is the count of tripeptide $a_i a_j a_k$, and $L-2$ is the number of overlapping tripeptides. For example, in the sequence GLFDIIKKIAESF (length 13), there are $L-2=11$ tripeptides: ‘GLF’, ‘LFD’, ‘FDI’, ‘DII’, ‘IIK’, ‘IKK’, ‘KKI’, ‘KIA’, ‘IAE’, ‘AES’, and ‘ESF’. The tripeptide ‘DII’ appears once, so its frequency is $\text{TPC}_{(D,I,I)} = 1/11 \approx 0.0909$. In total, TPC generates 8000 features representing all possible combinations of three consecutive amino acids.
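DPC and TPC are both overlapping k-mer compositions with k = 2 and k = 3, so a single helper can illustrate them. The sketch below is for illustration only; `kmer_composition` is a hypothetical name, and the function assumes the input contains only the 20 standard residues.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(sequence: str, k: int) -> dict[str, float]:
    """Frequency of every possible k-mer over the overlapping windows."""
    windows = len(sequence) - k + 1  # L-1 for dipeptides, L-2 for tripeptides
    features = {"".join(p): 0.0 for p in product(AMINO_ACIDS, repeat=k)}
    for start in range(windows):
        features[sequence[start:start + k]] += 1.0 / windows
    return features


dpc = kmer_composition("GLFDIIKKIAESF", 2)
tpc = kmer_composition("GLFDIIKKIAESF", 3)
print(len(dpc), len(tpc))  # 400 8000
print(round(dpc["KK"], 4), round(tpc["DII"], 4))  # 0.0833 0.0909
```

Most of the 8000 tripeptide entries are zero for a short peptide, which is one reason such feature sets are usually filtered before modeling.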

Instead of representing each amino acid individually, Grouped Amino Acid Composition (GAAC) clusters amino acids based on physicochemical properties and computes the frequency of each group within a sequence. Although various grouping strategies exist, this study adopts the 5-group scheme implemented in iFeature, which categorizes amino acids as follows: aliphatic (G, A, V, L, M, I), aromatic (F, Y, W), positively charged (K, R, H), negatively charged (D, E), and uncharged polar (S, T, C, P, N, Q). The frequency of each group is calculated as:

$\text{GAAC}_{i} = \frac{f_i}{L}$

where $f_i$ is the number of residues belonging to the $i$-th group, and $L$ is the total sequence length. For example, in the peptide sequence GLFDIIKKIAESF, the aliphatic group includes G, L, A, and three occurrences of I, resulting in a total of six aliphatic residues out of 13. The GAAC value for the aliphatic group is given by $\text{GAAC}_{\text{aliphatic}} = 6/13 \approx 0.4615$. This same calculation is applied to the other groups, producing 5 features in total for GAAC.
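As an illustration of the grouped composition, the sketch below tallies each residue of the example peptide into the five groups listed above. The dictionary keys and the function name `gaac` are illustrative and do not reproduce iFeature's internal naming.

```python
# Five-group scheme as described in the text (key names are illustrative).
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}


def gaac(sequence: str) -> dict[str, float]:
    """Fraction of residues falling into each physicochemical group."""
    length = len(sequence)
    return {
        name: sum(sequence.count(aa) for aa in members) / length
        for name, members in GROUPS.items()
    }


group_freqs = gaac("GLFDIIKKIAESF")
for name, value in group_freqs.items():
    print(f"{name}: {value:.4f}")
```

Because every standard residue belongs to exactly one group, the five values sum to 1 for any canonical peptide.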

Composition of K-Spaced Amino Acid Group Pairs (CKSAAGP) also uses amino acid groupings, but instead of computing their frequencies individually, it focuses on pairwise group relationships separated by a fixed number of residues. Each amino acid is first assigned to one of several predefined groups. Because CKSAAGP was also computed using iFeature, it uses the same five amino acid group definitions as GAAC. For each group pair, the frequency of occurrences with a specific residue gap is calculated as:

$\text{CKSAAGP}_{(g_i,g_j)}^{k} = \frac{N_{(g_i,g_j)}^{k}}{L-k-1}$

where $N_{(g_i,g_j)}^{k}$ is the number of times the group pair $(g_i,g_j)$ appears in the sequence with exactly $k$ amino acids in between, and $L$ is the length of the sequence. In this study, we considered gap sizes from 0 to 5, which results in a total of $25 \times 6 = 150$ features per sequence, given that there are $5 \times 5 = 25$ possible group pairs for each gap size.

With the peptide GLFDIIKKIAESF, after converting each amino acid to its corresponding group, the sequence becomes:

aliphatic – aliphatic – aromatic – negatively charged – aliphatic – aliphatic – positively charged – positively charged – aliphatic – aliphatic – negatively charged – uncharged polar – aromatic.

To calculate the frequency of the group pair (aliphatic, aromatic) with gap $k=1$, we search for occurrences where two residues belonging to these groups appear with one intervening residue between them. For example, the residues at positions 1 and 3 form a valid (aliphatic, aromatic) pair. This occurs once in the sequence. Since there are $L-k-1=11$ valid positions for gap $k=1$, the frequency is $\text{CKSAAGP}_{(\text{aliphatic},\,\text{aromatic})}^{1} = 1/11 \approx 0.0909$.
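The gap-based counting can be sketched as follows. This is an illustrative reimplementation under the five-group scheme described above, not iFeature's code; the feature keys `(group_1, group_2, k)` are a hypothetical naming choice, and group pairs are treated as ordered.

```python
from itertools import product

GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}
RESIDUE_TO_GROUP = {aa: name for name, members in GROUPS.items() for aa in members}


def cksaagp(sequence: str, max_gap: int = 5) -> dict[tuple[str, str, int], float]:
    """Frequency of each ordered group pair separated by k residues, k = 0..max_gap."""
    labels = [RESIDUE_TO_GROUP[aa] for aa in sequence]
    features = {}
    for k in range(max_gap + 1):
        n_pairs = len(labels) - k - 1  # number of valid start positions for this gap
        counts = {pair: 0 for pair in product(GROUPS, repeat=2)}
        for i in range(n_pairs):
            counts[(labels[i], labels[i + k + 1])] += 1
        for (g1, g2), count in counts.items():
            features[(g1, g2, k)] = count / n_pairs if n_pairs > 0 else 0.0
    return features


pair_features = cksaagp("GLFDIIKKIAESF")
print(len(pair_features))  # 150
print(round(pair_features[("aliphatic", "aromatic", 1)], 4))  # 0.0909
```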

Composition-Transition-Distribution (CTD)

The Composition-Transition-Distribution (CTD) features capture not only the presence of specific amino acid types but also how these types are organized and distributed along the sequence in relation to physicochemical properties.⁴⁸ CTD consists of three components: Composition (CTDC), Transition (CTDT), and Distribution (CTDD). To compute CTD features, amino acids are first grouped into categories based on a selected physicochemical attribute. The way amino acids are divided into groups depends on the specific attribute and the scale used to define it.⁴⁹ For example, in the case of hydrophobicity, many scales exist, each offering different groupings of amino acids. Some representative hydrophobicity-based groupings are summarized in Table 3.

Table 3
Amino acid attributes and groupings used for CTD features.
Attribute	Group 1 (polar)	Group 2 (neutral)	Group 3 (hydrophobic)
hydrophobicity_CASG920101	KDEQPSRNTG	AHYMLV	FIWC
hydrophobicity_PONP930101	KPDESNQT	GRHA	YMFWLCVI
hydrophobicity_ZIMJ680101	QNGSWTDERA	HMCKV	LPFYI
hydrophobicity_FASG890101	KERSQD	NTPG	AYHWVMFLIC

The Composition (CTDC) feature measures the fraction of residues in the sequence that fall into each group. The formula for CTDC is:

$\text{CTDC}_{i} = \frac{N_i}{L}$

where $N_i$ is the number of residues in group $i$, and $L$ is the total sequence length. For example, with the peptide GLFDIIKKIAESF (length = 13) using the CASG920101 hydrophobicity grouping, there are 6 amino acids classified into group 1 (as shown in Table 3). Therefore, the composition for group 1 is calculated as $\text{CTDC}_{1} = 6/13 \approx 0.4615$.

The Transition (CTDT) quantifies the frequency of changes between groups along adjacent residues in the sequence. The formula is:

$\text{CTDT}_{i-j} = \frac{T_{i-j}}{L-1}$

where $T_{i-j}$ is the number of transitions between groups $i$ and $j$, and $L-1$ is the number of adjacent residue pairs. For the sequence GLFDIIKKIAESF, the group sequence becomes: 1–2–3–1–3–3–1–1–3–2–1–1–3. We then count the transitions between adjacent pairs. For example, for group 1 ↔ 2, transitions occur at positions (1–2) and (10–11), resulting in 2 transitions. Thus, $\text{CTDT}_{1-2} = 2/12 \approx 0.1667$.
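The composition and transition calculations can be sketched together, since both operate on the group-encoded sequence. The code below is illustrative only; it uses the CASG920101 grouping from Table 3, and the helper names `encode`, `ctdc`, and `ctdt` are hypothetical.

```python
CASG920101 = {"1": "KDEQPSRNTG", "2": "AHYMLV", "3": "FIWC"}


def encode(sequence: str, grouping: dict[str, str]) -> str:
    """Replace each residue by the label of its group."""
    lookup = {aa: label for label, members in grouping.items() for aa in members}
    return "".join(lookup[aa] for aa in sequence)


def ctdc(encoded: str) -> dict[str, float]:
    """Fraction of residues in each group."""
    return {label: encoded.count(label) / len(encoded) for label in "123"}


def ctdt(encoded: str) -> dict[str, float]:
    """Fraction of adjacent pairs that switch between two groups (either direction)."""
    pairs = list(zip(encoded, encoded[1:]))
    result = {}
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        transitions = sum(1 for x, y in pairs if {x, y} == {a, b})
        result[f"{a}-{b}"] = transitions / len(pairs)
    return result


encoded = encode("GLFDIIKKIAESF", CASG920101)
print(encoded)                           # 1231331132113
print(round(ctdc(encoded)["1"], 4))      # 0.4615
print(round(ctdt(encoded)["1-2"], 4))    # 0.1667
```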

The Distribution (CTDD) feature describes where in the sequence amino acids from a group are found, reporting the relative positions at which the first, 25%, 50%, 75%, and 100% of the group’s members appear. As an illustration, take the PONP930101 grouping in Table 3, under which L belongs to group 3. In the peptide GLFDIIKKIAESF, the residues from group 3 are then found at positions 2 (L), 3 (F), 5 (I), 6 (I), 9 (I), and 13 (F), giving a total of 6 residues in this group. To determine the 25% occurrence level, we calculate 25% of 6, which is 1.5. This value is rounded up to the second occurrence. The second group 3 residue is at position 3 (F). Therefore, the CTDD value at the 25% level is calculated as $\text{CTDD}_{\text{group 3, 25\%}} = 3/13 \times 100 \approx 23.08$. This means that the 25% occurrence level of the hydrophobic residues is reached approximately 23.08% of the way through the sequence.
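The distribution descriptor can be sketched as below. The residue positions listed in the example match group 3 of the PONP930101 grouping in Table 3 (YMFWLCVI), so that grouping is used here. The rounding rule (ceiling, with a minimum rank of 1) follows the worked example in the text and is an assumption about the general case; other implementations may round differently. The function name `ctdd` is hypothetical.

```python
import math

PONP930101_GROUP3 = "YMFWLCVI"


def ctdd(sequence: str, members: str,
         fractions: tuple[float, ...] = (0.25, 0.5, 0.75, 1.0)) -> list[float]:
    """Relative position (%) of the first, 25%, 50%, 75% and 100% occurrence."""
    positions = [i for i, aa in enumerate(sequence, start=1) if aa in members]
    if not positions:
        return [0.0] * (len(fractions) + 1)
    # Rank of the occurrence to report for each level (rounded up, at least 1).
    ranks = [1] + [max(1, math.ceil(f * len(positions))) for f in fractions]
    return [positions[rank - 1] / len(sequence) * 100 for rank in ranks]


levels = ctdd("GLFDIIKKIAESF", PONP930101_GROUP3)
print([round(v, 2) for v in levels])  # [15.38, 23.08, 38.46, 69.23, 100.0]
```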

Global descriptors

Global descriptors in modlamp⁴⁷ provide sequence-level features that summarize the overall physicochemical properties of peptides, independent of amino acid position or amino acid scale encoding. These descriptors capture essential characteristics such as sequence length, molecular weight, molecular formula, net charge, charge density, isoelectric point, instability index, aromaticity, aliphatic index, Boman index, and hydrophobic ratio. In this study, all these descriptors were computed using modlamp's default settings, with the pH set to 7.4, and C-terminal amidation explicitly enabled (amide = True) to reflect the modified peptide structure.

Feature selection

We conducted feature selection across seven trials, each based on different combinations of three primary feature groups: composition, CTD, and global descriptors. These groups were combined in all possible non-empty ways, resulting in seven distinct feature sets. For each set, the same feature selection process was applied.

First, semi-constant features were removed. These are features where the most frequent value appears in 80 percent or more of the samples, which indicates low variance and limited discriminative power. Next, highly correlated features were eliminated by computing the Pearson correlation matrix and removing one feature from any pair with a correlation coefficient greater than 0.8. This step helps reduce redundancy and prevent multicollinearity.
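The two cleaning steps can be sketched with NumPy as follows. This is an illustrative sketch, not the study's code: the function names are hypothetical, the correlation filter keeps the first feature of each correlated pair, and it compares the absolute Pearson coefficient against 0.8, which is an assumption since the text does not state how negative correlations were handled.

```python
import numpy as np


def drop_quasi_constant(X: np.ndarray, threshold: float = 0.8) -> list[int]:
    """Keep columns whose most frequent value covers less than `threshold` of samples."""
    keep = []
    for j in range(X.shape[1]):
        _, counts = np.unique(X[:, j], return_counts=True)
        if counts.max() / X.shape[0] < threshold:
            keep.append(j)
    return keep


def drop_correlated(X: np.ndarray, columns: list[int], threshold: float = 0.8) -> list[int]:
    """Greedily keep a column only if |r| <= threshold with every column kept so far."""
    corr = np.abs(np.atleast_2d(np.corrcoef(X[:, columns], rowvar=False)))
    kept_positions: list[int] = []
    for pos in range(len(columns)):
        if all(corr[pos, other] <= threshold for other in kept_positions):
            kept_positions.append(pos)
    return [columns[pos] for pos in kept_positions]


# Toy matrix: column 0 is quasi-constant, column 2 duplicates column 1 linearly.
x = np.arange(10, dtype=float)
X = np.column_stack([np.r_[np.zeros(9), 1.0], x, 2 * x + 1, np.tile([1.0, 0.0], 5)])
varied = drop_quasi_constant(X)
selected = drop_correlated(X, varied)
print(varied, selected)  # [1, 2, 3] [1, 3]
```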

After this cleaning, the data was split into training and testing sets in an 80:20 ratio, stratified by the target label to maintain class balance. On the training set, we performed forward Sequential Feature Selection (SFS) using the Explainable Boosting Machine (EBM) as the base estimator and F1 score as the selection metric. SFS is a wrapper-based method that begins with no features and incrementally adds the one that provides the greatest improvement in model performance, stopping when further improvements fall below a tolerance threshold of 0.001. During selection, a 5-fold stratified cross-validation was used to ensure balanced class representation across folds. The selected features from each trial were then used to train a final EBM model.
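The greedy logic of forward selection with a tolerance-based stopping rule can be sketched in plain Python. In the sketch, `score` is a hypothetical placeholder for the cross-validated F1 score of an EBM trained on a candidate feature subset; the toy scoring function and the feature names are invented purely to make the example runnable.

```python
from typing import Callable, Sequence


def forward_selection(candidates: Sequence[str],
                      score: Callable[[list[str]], float],
                      tol: float = 0.001) -> list[str]:
    """Add the best remaining feature until the score gain falls below `tol`."""
    selected: list[str] = []
    remaining = list(candidates)
    best = float("-inf")
    while remaining:
        trial_scores = {f: score(selected + [f]) for f in remaining}
        feature = max(trial_scores, key=trial_scores.get)
        if selected and trial_scores[feature] - best < tol:
            break  # improvement too small: stop
        selected.append(feature)
        remaining.remove(feature)
        best = trial_scores[feature]
    return selected


# Hypothetical additive "gains" standing in for a cross-validated F1 score.
toy_gain = {"AAC_K": 0.60, "CTDC_1": 0.10, "charge": 0.05, "noise": 0.0005}


def toy_score(features: list[str]) -> float:
    return sum(toy_gain[f] for f in features)


chosen = forward_selection(list(toy_gain), toy_score)
print(chosen)  # ['AAC_K', 'CTDC_1', 'charge']
```

In practice each call to `score` involves fitting the model several times, so the cost grows quickly with the number of candidate features, which is one motivation for the filtering steps applied beforehand.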

Explainable Boosting Machine

The Explainable Boosting Machine (EBM) is an interpretable, high-performance machine learning model based on the framework of Generalized Additive Models (GAMs), with enhancements that incorporate ensemble learning and interaction detection.⁴² Formally, EBM models the prediction as an additive combination of learned shape functions over individual features and selected feature pairs:

$\hat{y}_i = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij}) + \sum_{m=1}^{M} f_m(x_{i m_1}, x_{i m_2})$

where $\hat{y}_i$ is the predicted response for observation $i$; $\beta_0$ denotes the intercept term; $f_j(x_{ij})$ are univariate shape functions that capture the main effect of each individual feature $x_j$; and $f_m(x_{i m_1}, x_{i m_2})$ are bivariate shape functions that model the interaction effects between selected pairs of features $(x_{m_1}, x_{m_2})$. The total number of individual features is denoted by $p$, while $M$ represents the number of pairwise interactions included in the model. These shape functions are learned through an additive boosting process using ensembles of shallow decision trees, which enables EBM to capture complex, non-linear relationships while preserving interpretability. As a result, the contribution of each individual feature and selected interaction to the final prediction can be directly visualized and understood, offering clear and transparent insights into the model’s decision-making process.
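The additive structure can be made concrete with a toy model. The sketch below is not a trained EBM: the intercept, bin edges, bin values, feature names, and the interaction term are all invented for illustration. It only shows how a prediction decomposes into an intercept, per-feature shape-function lookups, and a pairwise term.

```python
from bisect import bisect_right


def step_function(edges: list[float], values: list[float]):
    """Piecewise-constant shape function: values[i] applies to the i-th bin."""
    return lambda x: values[bisect_right(edges, x)]


# All numbers below are hypothetical, chosen only to illustrate the decomposition.
intercept = -0.4
main_effects = {
    "AAC_K": step_function([0.05, 0.15], [-0.8, 0.1, 0.9]),
    "charge": step_function([0.0, 3.0], [-1.0, 0.2, 0.7]),
}


def interaction(aac_k: float, charge: float) -> float:
    """Toy bivariate shape function for the pair (AAC_K, charge)."""
    return 0.3 if (aac_k > 0.15 and charge > 3.0) else 0.0


def additive_score(x: dict[str, float]) -> tuple[float, dict[str, float]]:
    """Return the additive score and the contribution of each term."""
    contributions = {name: f(x[name]) for name, f in main_effects.items()}
    contributions["AAC_K x charge"] = interaction(x["AAC_K"], x["charge"])
    return intercept + sum(contributions.values()), contributions


score, parts = additive_score({"AAC_K": 0.1538, "charge": 2.0})
print(round(score, 3), parts)
```

Because each term depends on at most two features, the per-term contributions returned here are exactly what an interpretability plot would display for a single prediction.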

Deployment model

ESM2 model architecture

In this paper, we use transfer learning from ESM2-t6-8M-UR50D, the most lightweight model in the ESM2 family.³⁵ This model was selected due to the relatively short lengths of peptides in our dataset and its suitability for building an efficient, practical inference tool. The ESM2-t6-8M-UR50D model is a transformer-based protein language model developed by Meta FAIR for learning contextual representations of protein and peptide sequences through self-supervised masked language modeling. It is the smallest variant in the ESM2 family, with approximately 8 million trainable parameters.

The model consists of six sequential transformer encoder layers. Each layer includes a multi-head self-attention mechanism with 20 attention heads, which enables the model to capture long-range dependencies by simultaneously attending to different parts of the sequence.⁵⁰ The self-attention computation for input embeddings $X$ involves generating queries $Q = XW_Q$, keys $K = XW_K$, and values $V = XW_V$ via learned projection matrices, followed by scaled dot-product attention:

$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^{\mathsf{T}}}{\sqrt{d_k}}\right)V$

where $d_k$ is the dimension of each head. The outputs from all heads are concatenated and projected back to the hidden size of 320.
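A single attention head can be sketched in NumPy to make the formula concrete. The weights below are random placeholders rather than ESM2 parameters; the head dimension of 16 is simply the hidden size of 320 divided by the 20 heads mentioned above, and the sequence length of 13 is arbitrary.

```python
import numpy as np


def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """One attention head: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights


rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 13, 320, 20
d_k = d_model // n_heads  # 16

X = rng.normal(size=(seq_len, d_model))          # placeholder token embeddings
W_q, W_k, W_v = (rng.normal(scale=0.05, size=(d_model, d_k)) for _ in range(3))

head_output, attn = scaled_dot_product_attention(X, W_q, W_k, W_v)
print(head_output.shape, attn.shape)  # (13, 16) (13, 13)
```

Each row of `attn` sums to 1 and describes how strongly one position attends to every other position in the sequence.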

Each transformer layer also contains a position-wise feed-forward network with two linear transformations separated by a non-linear activation (typically GELU), and uses residual connections and layer normalization before each sublayer for stable optimization.

The ESM2-t6-8M-UR50D model uses a learned embedding matrix with 33 unique tokens representing the 20 standard amino acids (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V) plus special tokens such as <cls>, <pad>, <eos>, <unk>, <mask>, ambiguous amino acid tokens (B, U, Z, O, X), the period symbol (“.”), and the gap symbol (“-”). Each token is mapped to a 320-dimensional vector, resulting in an embedded input sequence matrix of shape (sequence length, 320).

Variable-length sequences are padded using the <pad> token to ensure that all sequences in a batch have the same length. During attention computations, ESM2 masks out these padded positions so they do not affect the learned contextual representations of real amino acid tokens.

A special <cls> token is prepended to every input sequence, and after the transformer stack, the final hidden state of the <cls> token, a vector of size 320, is typically used as a fixed-size summary representation of the entire sequence. Although ESM2-t6-8M-UR50D also includes a contact prediction head for inter-residue contact prediction in structural biology tasks, this component is not required for sequence-level applications such as classification tasks.

Tokenization of peptide sequences

Peptide sequences were tokenized using the ESM2 model’s native tokenization system, which converts sequences into discrete tokens corresponding to individual amino acids. Tokenization and dynamic padding were performed using the BatchConverter utility provided by the ESM2 library, which handles batching multiple sequences together while automatically aligning their lengths.³⁵ In this study, we used tokens corresponding only to the 20 natural amino acids, as our dataset contained only canonical residues without ambiguous or modified amino acids.

The final hidden state of the special <cls> token was used as the sequence-level representation output from the pretrained model for classification tasks. During batching, BatchConverter automatically padded sequences within each batch to the length of the longest sequence in that batch, ensuring consistent input tensor dimensions. This dynamic padding approach allowed efficient processing without requiring a fixed maximum sequence length for the entire dataset. The final embedded representation of each token in the input sequence had a dimensionality of 320, matching the embedding size defined by the ESM2-t6-8M-UR50D model.
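The idea of dynamic padding can be illustrated without the ESM2 library. The sketch below is a deliberate simplification and does not reproduce BatchConverter: it keeps tokens as strings rather than integer indices and omits any special tokens other than the prepended classification token and the padding token. The function name `pad_batch` is hypothetical.

```python
CLS, PAD = "<cls>", "<pad>"


def pad_batch(sequences: list[str]) -> tuple[list[list[str]], list[list[bool]]]:
    """Prepend <cls>, pad to the longest sequence in the batch, and build a mask."""
    tokenized = [[CLS] + list(seq) for seq in sequences]
    width = max(len(tokens) for tokens in tokenized)  # batch-specific length
    padded = [tokens + [PAD] * (width - len(tokens)) for tokens in tokenized]
    mask = [[tok != PAD for tok in row] for row in padded]  # True = real token
    return padded, mask


padded, mask = pad_batch(["GLFDIIKKIAESF", "GLFDII"])
print(len(padded[0]), len(padded[1]))  # 14 14
print(padded[1])
```

The padded width is determined per batch, so a batch of short peptides is never padded out to the length of the longest peptide in the whole dataset.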

Fine-tuning for AMP classification

We developed a peptide sequence classification framework based on the pre-trained ESM2-t6-8M-UR50D. Our classification architecture combined the ESM2 encoder with a three-layer feedforward head, which sequentially projected the <cls> token embedding through linear layers with hidden dimensions of 256 and 64 before producing binary class logits. Each linear layer was followed by a ReLU activation and dropout for improved regularization and non-linearity. The final output logits were converted to probabilities using a sigmoid activation function, and a threshold of 0.5 was applied during inference: predicted probabilities greater than or equal to 0.5 were classified as active, while probabilities below 0.5 were classified as inactive. The loss function used for training was the cross-entropy loss, implemented as CrossEntropyLoss, which measures the difference between the predicted logits and the true class labels. Models were fine-tuned using the AdamW optimizer,⁵¹ with per-device batch sizes of 8 and gradient accumulation steps of 8, resulting in an effective batch size of 64 for each optimization step. Mixed precision training was enabled using 16-bit floating point (FP16) to accelerate training and reduce GPU memory consumption.

The same dataset used for the design-oriented model was reused here. It was divided into training-validation and external test sets with a 9:1 ratio. Stratified five-fold cross-validation was performed on the training-validation set to ensure balanced class distribution across folds and to enable reliable performance estimation. During each fold, models were trained with early stopping, using a patience of five epochs without improvement in validation loss to prevent overfitting. Hyperparameter tuning was conducted using a grid search strategy, testing combinations of learning rates (1e-5 and 5e-6), weight decay values (0 and 0.01), label smoothing factors (0 and 0.1), and dropout rates (0.1, 0.3, and 0.5) applied to the fully connected layers. This resulted in a total of 24 hyperparameter trials designed to identify the combination that achieved the best performance across all folds. The training workflow used the Hugging Face Transformers library⁵² for model configuration, training loops, and evaluation utilities. The best-performing model for each fold was retained, resulting in an ensemble of five models. For evaluation on the external test set, predictions were generated using this ensemble approach by averaging outputs from the five best models obtained across the cross-validation folds. Final predictions were compared to ground truth labels to assess the model’s performance on unseen data. This procedure is also used in our published tool to predict new input sequences. Models were implemented and trained using PyTorch.⁵³
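The size of the hyperparameter grid and the ensemble decision rule can be sketched as follows. The grid values are those listed above; the fold probabilities are invented for illustration, and averaging predicted probabilities (rather than logits) is an assumption, since the text only states that model outputs were averaged.

```python
from itertools import product
from statistics import mean

grid = {
    "learning_rate": [1e-5, 5e-6],
    "weight_decay": [0.0, 0.01],
    "label_smoothing": [0.0, 0.1],
    "dropout": [0.1, 0.3, 0.5],
}
# Every combination of the listed values is one trial.
trials = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(trials))  # 24


def ensemble_predict(fold_probabilities: list[float], threshold: float = 0.5):
    """Average the per-fold probabilities and apply the decision threshold."""
    average = mean(fold_probabilities)
    return average, ("active" if average >= threshold else "inactive")


# Hypothetical outputs of the five fold models for one peptide.
average, label = ensemble_predict([0.62, 0.48, 0.71, 0.55, 0.66])
print(round(average, 3), label)  # 0.604 active
```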

Evaluation metrics

To evaluate classification performance, we calculated a set of standard metrics, including balanced accuracy, precision, recall, F1 score, Matthews correlation coefficient (MCC), area under the receiver operating characteristic curve (AUC-ROC), and average precision (AUC-PR). Because the dataset was imbalanced, we used balanced accuracy instead of overall accuracy. These metrics were computed for both the design-oriented and deployment models. Specifically, the metrics were defined as follows, using true positive (TP), false positive (FP), true negative (TN), and false negative (FN):

$\text{Balanced Accuracy} = \frac{1}{2}\left(\frac{\text{TP}}{\text{TP}+\text{FN}} + \frac{\text{TN}}{\text{TN}+\text{FP}}\right)$

$\text{Precision} = \frac{\text{TP}}{\text{TP}+\text{FP}}$

$\text{Recall} = \frac{\text{TP}}{\text{TP}+\text{FN}}$

$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision}+\text{Recall}}$

$\text{MCC} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}}$

$\text{AUC-ROC} = \int_{0}^{1} \text{TPR}\, d(\text{FPR})$

where TPR (True Positive Rate) = TP / (TP + FN) and FPR (False Positive Rate) = FP / (FP + TN).

$\text{AUC-PR} = \int_{0}^{1} \text{Precision}\, d(\text{Recall})$

These metrics were calculated based on the predicted probabilities using standard sklearn implementations.
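The threshold-based metrics can be computed directly from the four confusion-matrix counts. The sketch below mirrors the formulas above; the counts are invented for illustration, the function name is hypothetical, and no guard against empty classes (zero denominators) is included.

```python
import math


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Balanced accuracy, precision, recall, F1 and MCC from confusion counts."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "balanced_accuracy": 0.5 * (recall + specificity),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts for an imbalanced test set (60 active, 140 inactive).
metrics = classification_metrics(tp=40, fp=10, tn=130, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, plain accuracy would be 0.85 while balanced accuracy is lower, which illustrates why the latter is preferred when the classes are imbalanced.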

Alanine scanning

Alanine scanning is an experimental mutagenesis technique used to investigate the role of individual amino acids in proteins or peptides.⁵⁴ By systematically substituting each residue with alanine, a small and non-reactive amino acid, researchers can identify side chains essential for biological activity, structural stability, or molecular interactions. Because alanine removes side-chain atoms beyond the β-carbon without introducing steric or electrostatic disruptions, it serves as a clean probe for evaluating functional importance.

This approach is especially relevant for AMPs, whose activity depends on their amino acid sequence, structural features, and the distribution of hydrophobic and charged residues. In this work, we simulate alanine scanning using our deployment model to evaluate how the predicted probability of a peptide being active changes when each residue is replaced. By “shutting down” individual residues through alanine substitution, we can estimate their contribution to the overall activity and provide insights for optimizing AMP design.
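The in silico scan can be sketched as a loop over single-residue substitutions. In the sketch, `predict_proba` stands for any function returning the predicted probability of activity; the `toy_predictor` used here is a hypothetical stand-in that simply returns the fraction of lysine and arginine residues, so that the example runs without the trained model.

```python
from typing import Callable


def alanine_scan(sequence: str, predict_proba: Callable[[str], float]):
    """Return (position, residue, mutant, change in predicted probability)."""
    baseline = predict_proba(sequence)
    results = []
    for i, residue in enumerate(sequence):
        if residue == "A":
            continue  # position is already alanine
        mutant = sequence[:i] + "A" + sequence[i + 1:]
        results.append((i + 1, residue, mutant, predict_proba(mutant) - baseline))
    return results


def toy_predictor(sequence: str) -> float:
    """Hypothetical stand-in for the model: fraction of K and R residues."""
    return (sequence.count("K") + sequence.count("R")) / len(sequence)


scan = alanine_scan("GLFDIIKKIAESF", toy_predictor)
for position, residue, mutant, delta in scan:
    print(position, residue, mutant, round(delta, 4))
```

A large negative change at a position suggests that the original residue contributes to the predicted activity, whereas changes near zero suggest the position tolerates substitution.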

Here, we apply this scanning process to peptides with known MICs against E. coli, focusing on Indolicidin, SB056, and their analogs.⁵⁵⁻⁵⁹ Since some analogs are not included in the training set, this evaluation allows us to check the model’s ability to generalize beyond memorized sequences.

Data Availability

AMP sequences were collected from the DBAASP database,⁴³⁻⁴⁵ which compiles experimentally validated data from multiple publications, focusing on MIC values against E. coli. Only peptides composed of natural amino acids were included, and any sequences with modifications other than C-terminal amidation, such as disulfide bridges, cyclization, or N-terminal modifications, were excluded. All MIC values were converted to micromolar (µM), and sequences were labeled as active if MIC was less than or equal to the threshold or inactive otherwise. For sequences with multiple reported MIC values, the label was assigned by majority vote; if no clear majority was found, the sequence was removed. Two MIC thresholds, 10 µM and 15 µM, were tested to identify the boundary yielding the best prediction performance, and the lists of sequences corresponding to each threshold are provided in the Supporting Information.

Code Availability

The tool is freely available at https://huggingface.co/spaces/danghuyle/CAmidPred.

References

1. Huan Y, Kong Q, Mou H, Yi H (2020) Antimicrobial Peptides: Classification, Design, Application and Research Progress in Multiple Fields. Front Microbiol 11

2. Ho CS et al (2025) Antimicrobial resistance: a concise update. Lancet Microbe 6

3. Li W, Separovic F, O’Brien-Simpson NM, Wade JD (2021) Chemically modified and conjugated antimicrobial peptides against superbugs. Chem Soc Rev 50:4932–4973

4. Monteiro C, Costa F, Pirttilä AM, Tejesvi MV, Martins MCL (2019) Prevention of urinary catheter-associated infections by coating antimicrobial peptides from crowberry endophytes. Sci Rep 9:10753

5. Liu Y, Sameen DE, Ahmed S, Dai J, Qin W (2021) Antimicrobial peptides and their application in food packaging. Trends Food Sci Technol 112:471–483

6. Sun Z et al (2023) The overview of antimicrobial peptide-coated implants against oral bacterial infections. Aggregate 4:e309

7. Lin B et al (2021) Cationic Antimicrobial Peptides Are Leading the Way to Combat Oropathogenic Infections. ACS Infect Dis 7:2959–2970

8. Chen P et al (2024) Embracing the era of antimicrobial peptides with marine organisms. Nat Prod Rep 41:331–346

9. Chen N, Jiang C (2023) Antimicrobial peptides: Structure, mechanism, and modification. Eur J Med Chem 255:115377

10. Yang R et al (2025) Advances in antimicrobial peptides: From mechanistic insights to chemical modifications. Biotechnol Adv 81:108570

11. Lin B et al (2023) The effect of tailing lipidation on the bioactivity of antimicrobial peptides and their aggregation tendency. Aggregate 4:e329

12. Li W et al (2022) Enhancing proline-rich antimicrobial peptide action by homodimerization: influence of bifunctional linker. Chem Sci 13:2226–2237

13. Li W et al (2017) The Effect of Selective D- or Nα-Methyl Arginine Substitution on the Activity of the Proline-Rich Antimicrobial Peptide, Chex1-Arg20. Front Chem 5

14. Hollmann A, Martinez M, Maturana P, Semorile LC, Maffia PC (2018) Antimicrobial Peptides: Interaction With Model and Biological Membranes and Synergism With Chemical Antibiotics. Front Chem 6

15. Zhu S, Li W, O’Brien-Simpson N, Separovic F, Sani M-A (2021) C-terminus amidation influences biological activity and membrane interaction of maculatin 1.1. Amino Acids 53:769–777

16. Dennison SR et al (2015) The role of C-terminal amidation in the membrane interactions of the anionic antimicrobial peptide, maximin H5. Biochim et Biophys Acta (BBA) - Biomembr 1848:1111–1118

17. Strömstedt AA, Pasupuleti M, Schmidtchen A, Malmsten M (2009) Evaluation of Strategies for Improving Proteolytic Resistance of Antimicrobial Peptides by Using Variants of EFK17, an Internal Segment of LL-37. Antimicrob Agents Chemother 53:593–602

18. Shahmiri M, Mechler A (2020) The role of C-terminal amidation in the mechanism of action of the antimicrobial peptide aurein 1.2. EuroBiotech J 4:25–31

19. Meher PK, Sahu TK, Saini V, Rao AR (2017) Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou’s general PseAAC. Sci Rep 7:42362

20. Chowdhury AS, Reehl SM, Kehn-Hall K, Bishop B, Webb-Robertson B-JM (2020) Better understanding and prediction of antiviral peptides through primary and secondary structure feature importance. Sci Rep 10:19260

21. Zhang J et al (2022) Large-Scale Screening of Antifungal Peptides Based on Quantitative Structure–Activity Relationship. ACS Med Chem Lett 13:99–104

22. Bhadra P, Yan J, Li J, Fong S, Siu SWI (2018) AmPEP: Sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Sci Rep 8:1697

23. Pinacho-Castellanos SA, García-Jacas CR, Gilson MK, Brizuela CA (2021) Alignment-Free Antimicrobial Peptide Predictors: Improving Performance by a Thorough Analysis of the Largest Available Data Set. J Chem Inf Model 61:3141–3157

24. Chung C-R, Liou J-T, Wu L-C, Horng J-T, Lee T-Y (2023) Multi-label classification and features investigation of antimicrobial peptides with various functional classes. iScience 26:108250

25. Kavousi K et al (2020) IAMPE: NMR-Assisted Computational Prediction of Antimicrobial Peptides. J Chem Inf Model 60:4691–4701

26. Huang J et al (2023) Identification of potent antimicrobial peptides via a machine-learning pipeline that mines the entire space of peptide sequences. Nat Biomed Eng 7:797–810

27. Yan J et al (2020) Deep-AmPEP30: Improve Short Antimicrobial Peptides Prediction with Deep Learning. Mol Therapy - Nucleic Acids 20:882–894

28. Hussain W (2022) sAMP-PFPDeep: Improving accuracy of short antimicrobial peptides prediction using three different sequence encodings and deep neural networks. Brief Bioinform 23:bbab487

29. Xu J et al (2023) iAMPCN: a deep-learning approach for identifying antimicrobial peptides and their functional activities. Brief Bioinform 24:bbad240

30. Tucs A et al (2020) Generating Ampicillin-Level Antimicrobial Peptides with Activity-Aware Generative Adversarial Networks. ACS Omega 5:22847–22851

31. Szymczak P et al (2023) Discovering highly potent antimicrobial peptides with deep generative model HydrAMP. Nat Commun 14:1453

32. Li J, Pu Y, Tang J, Zou Q, Guo F (2020) DeepAVP: A Dual-Channel Deep Neural Network for Identifying Variable-Length Antiviral Peptides. IEEE J Biomedical Health Inf 24:3012–3019

33. Lin T-T et al (2021) AI4AMP: an Antimicrobial Peptide Predictor Using Physicochemical Property-Based Encoding Method and Deep Learning. mSystems 6:e00299-21

34. Cao R et al (2023) FFMAVP: a new classifier based on feature fusion and multitask learning for identifying antiviral peptides and their subclasses. Brief Bioinform 24:bbad353

35. Lin Z et al (2023) Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379:1123–1130

36. Elnaggar A et al (2022) ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans Pattern Anal Mach Intell 44:7112–7127

37. Lee H, Lee S, Lee I, Nam H (2023) AMP-BERT: Prediction of antimicrobial peptide function based on a BERT model. Protein Sci 32:e4529

38. Yao L et al (2023) DeepAFP: An effective computational framework for identifying antifungal peptides based on deep learning. Protein Sci 32:e4758

39. Cordoves-Delgado G, García-Jacas CR (2024) Predicting Antimicrobial Peptides Using ESMFold-Predicted Structures and ESM-2-Based Amino Acid Features with Graph Deep Learning. J Chem Inf Model 64:4310–4321

40. Witten J, Witten Z (2019) Deep learning regression model for antimicrobial peptide design. bioRxiv 692681. https://doi.org/10.1101/692681

41. Tacconelli E et al (2018) Discovery, research, and development of new antibiotics: the WHO priority list of antibiotic-resistant bacteria and tuberculosis. Lancet Infect Dis 18:318–327

42. Nori H, Jenkins S, Koch P, Caruana R (2019) InterpretML: A Unified Framework for Machine Learning Interpretability. Preprint at https://doi.org/10.48550/arXiv.1909.09223

43. Müller AT, Gabernet G, Hiss JA, Schneider G (2017) modlAMP: Python for antimicrobial peptides. Bioinformatics 33:2753–2755

44. Zhang Q et al (2021) Antimicrobial peptides: mechanism of action, activity and clinical potential. Mil Med Res 8

45. Selsted ME et al (1992) Indolicidin, a novel bactericidal tridecapeptide amide from neutrophils. J Biol Chem 267:4292–4295

46. Rozek A, Friedrich CL, Hancock REW (2000) Structure of the Bovine Antimicrobial Peptide Indolicidin Bound to Dodecylphosphocholine and Sodium Dodecyl Sulfate Micelles. Biochemistry 39:15765–15774

47. Joshi S et al (2010) Interaction studies of novel cell selective antimicrobial peptides with model membranes and E. coli ATCC 11775. Biochim et Biophys Acta (BBA) - Biomembr 1798:1864–1875

48. Ando S et al (2010) Structure-activity relationship of indolicidin, a Trp-rich antibacterial peptide. J Pept Sci 16:171–177

49. Bera S et al (2015) Probing the role of Proline in the antimicrobial activity and lipopolysaccharide binding of indolicidin. J Colloid Interface Sci 452:148–159

50. Manzo G et al (2015) Enhanced Amphiphilic Profile of a Short β-Stranded Peptide Improves Its Antimicrobial Activity. PLoS ONE 10:e0116379

51. Shimoyama Y (2022) pyCirclize: Circular visualization in Python

52. Carratalá JV, Serna N, Villaverde A, Vázquez E, Ferrer-Miralles N (2020) Nanostructured antimicrobial peptides: The last push towards clinics. Biotechnol Adv 44:107603

Supplementary Information

The online version contains supplementary material available at…, including a tutorial video demonstrating the use of CAmidPred for designing new C-terminal amidated AMPs.
