Introduction

Anticancer peptides (ACPs) are small peptides that are selectively toxic toward cancer cells. Owing to their inherently high penetration, high selectivity and ease of modification, synthetic peptide-based drugs and vaccines1,2,3 represent a promising class of therapeutic agents. Designed ACPs can offer improved affinity, selectivity and stability, thereby enhancing cancer cell elimination. The anticancer activity of ACPs depends on the cationic, hydrophobic and amphiphilic properties of their amino acid residues, together with a helical structure that drives cell permeability. In particular, cationic amino acid residues (i.e., lysine, arginine, and histidine) can disrupt and penetrate the cancer cell membrane to induce cytotoxicity, whereas anionic amino acids (i.e., glutamic and aspartic acids) afford antiproliferative activity against cancer cells. Furthermore, hydrophobic amino acid residues (i.e., phenylalanine, tryptophan, and tyrosine) contribute to the cytotoxic activity against cancer cells1,4,5. Moreover, the secondary structure of ACPs, which is formed by cationic and hydrophobic amino acids, plays a crucial role in the peptide-cancer cell membrane interaction that leads to cancer cell disruption and death1,6. Therefore, it is desirable to develop a simple, interpretable and efficient predictor for achieving accurate ACP identification as well as facilitating the rational design of new anticancer peptides with promising clinical applications.

In the past few years, most existing methods for discriminating ACPs from non-ACPs have been developed by applying machine learning (ML) and statistical methods to peptide sequence information7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23. More details of these methods are summarized in two comprehensive review papers2,3. Amongst the various types of ML approaches, both support vector machine (SVM) (i.e. AntiCP8, Hajisharifi et al.’s method9, ACPP24, iACP10, Li and Wang’s method11, iACP-GAEnsC12, TargetACP14 and ACPred19) and ensemble approaches (i.e. MLACP13, ACPred19, PTPD21, ACP-DL22, PEPred-Suite20, ACPred-FL15, ACPred-Fuse18, PPTPP23 and AntiCP_2.025) have been widely used to develop ACP predictors. As summarized in a recent review2, TargetACP was developed by integrating the split amino acid composition and pseudo position-specific scoring matrix descriptors14, and was shown to outperform other SVM-based predictors8,9,10,11,12,19,24. Meanwhile, the state-of-the-art ensemble methods PEPred-Suite20 and ACPred-Fuse18 provided the highest prediction accuracies as evaluated on the dataset collected by Rao et al.18. ACPred-Fuse was developed using a random forest (RF) model in conjunction with 114 feature descriptors: a total of 114 RF models were trained to generate class information and probabilistic information, which were then used to develop a final model. Most recently, Agrawal et al. proposed an updated version of AntiCP called AntiCP_2.0 and also provided two high-quality benchmark datasets (i.e. main and alternative datasets) having the largest number of peptides. AntiCP_2.0 was developed using the extremely randomized trees (ETree) algorithm with amino acid composition (AAC) and dipeptide composition (DPC). On the basis of the independent test results reported in the original work on AntiCP_2.0, AntiCP_2.0 was superior to other existing ACP predictors (e.g. AntiCP8, iACP10, ACPred19, ACPred-FL15, ACPred-Fuse18, PEPred-Suite20). All in all, much progress has been achieved by existing methods. Nevertheless, two potential drawbacks of existing ACP predictors motivated us to develop a new ACP predictor in this study. First, their mechanisms are not easily understood or implemented from the viewpoint of biologists and biochemists. Existing ACP models do not provide a straightforward explanation of the mechanism underlying the biological activity of ACPs, whereas a simple and easily interpretable model is more useful for further analysis of the characteristics of the anticancer activities of peptides. Second, their accuracy and generalizability still require improvement.

In consideration of these problems, we propose herein a novel ML-based predictor called iACP-FSCM for further improving the prediction accuracy as well as shedding light on the characteristics governing the anticancer activities of peptides. The conceptual framework of the iACP-FSCM approach for predicting and analyzing ACPs is summarized in Fig. 1. The major contributions of iACP-FSCM for predicting and characterizing ACPs can be summarized as follows. Firstly, we propose a novel, flexible scoring card method (FSCM) for effective and simple prediction and characterization of peptides affording anticancer activity using only sequence information. The FSCM method is an updated version of the SCM method developed by Huang et al.26 and Charoenkwan et al.27 that makes use of propensity scores of both local and global sequential information. Secondly, unlike the rather complex classification mechanisms of state-of-the-art ensemble approaches15,18,20, the iACP-FSCM method identifies ACPs using only weighted-sum scores between the composition and propensity scores, which is easily understood and implemented by biologists and biochemists. Thirdly, the FSCM-derived propensity scores can be adopted to identify informative physicochemical properties (PCPs) that may provide crucial information pertaining to local and global properties of ACPs. Finally, comparative results revealed that iACP-FSCM outperformed state-of-the-art ACP predictors for ACP identification and characterization. The iACP-FSCM webserver presented herein has been demonstrated to be robust as deduced from its superior prediction accuracy, interpretability and public availability, which is instrumental in helping biologists identify ACPs with potential bioactivities. Furthermore, the proposed FSCM method has great potential for estimating the propensity scores of amino acids and dipeptides that can be used to predict and analyze various bioactivities of peptides such as haemolytic peptides28, antihypertensive peptides29 and antiviral peptides20,23.

Figure 1

System flowchart of the proposed iACP-FSCM. Five main steps are involved in the development of the proposed iACP-FSCM: (i) preparing the training and independent datasets, (ii) calculating the initial propensity score (init-PS) using a statistical approach, (iii) estimating the optimized propensity score (opti-PS) using a genetic algorithm (GA), (iv) evaluating the prediction ability of iACP-FSCM, and (v) characterizing ACPs using the propensity scores and a docking approach.

Materials and methods

Benchmark datasets

In order to make a fair comparison with existing methods, the most recent and high-quality benchmark datasets (i.e. main and alternative datasets) collected from the work of AntiCP_2.025 were used in the development and validation of the iACP-FSCM model proposed herein. Both datasets can be downloaded from https://webs.iiitd.edu.in/raghava/anticp2/download.php. The main dataset consists of 861 experimentally validated ACPs and 861 antimicrobial peptides (AMPs), while the alternative dataset consists of 970 experimentally validated ACPs and 970 random peptides from proteins in SwissProt. All peptides in the main and alternative datasets were unique. To avoid overestimating the performance of the prediction model, the main and alternative datasets were each randomly divided into training (named MAIN-TR and ALTER-TR) and independent (named MAIN-TS and ALTER-TS) sets using an 80:20 ratio. Further details regarding the construction of the main and alternative datasets are provided in the original work of AntiCP_2.025.
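As a minimal sketch of the 80:20 random division described above, the following Python function shuffles a list of peptides and cuts it at 80%. The function name, the fixed seed and the rounding rule are assumptions made for illustration; the text does not state how the random split was seeded or whether it was applied per class.

```python
import random

def split_80_20(peptides, seed=0):
    """Randomly split a list of peptides into training (80%) and independent (20%) sets.

    The seed and the rounding of the cut point are illustrative assumptions.
    """
    shuffled = list(peptides)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    cut = int(round(0.8 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

Applying such a function separately to the positive and negative peptides would keep the two classes balanced in both subsets, although that is a design choice of this sketch rather than a documented detail.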

Protein feature representation

In this study, we employed 11 feature classes generated from 3 different feature encodings using AAC, DPC and terminus compositions for representing peptide sequences as feature vectors with fixed length. Herein, we briefly describe each feature encoding definition in forthcoming subsections.

Amino acid composition

AAC is the proportion of each amino acid in a given peptide P. The AAC descriptor is formulated as:

$$ {\text{AAC}}\left( {\mathbf{P}} \right) = ({\text{aac}}_{1} ,{\text{aac}}_{2} , \ldots , {\text{aac}}_{20} ) $$
(1)

where \({\text{aac}}_{i}\) is the normalized composition of the ith amino acid (\({\text{aa}}_{i}\)). The dimension of the AAC descriptor is 20.
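Equation (1) can be illustrated with a short Python sketch. The residue ordering (alphabetical by one-letter code) and the decision to ignore non-standard characters are assumptions of this example, not details given in the text.

```python
from collections import Counter

# Assumed ordering: the 20 standard residues, alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(peptide):
    """Return the 20-dimensional amino acid composition of a peptide."""
    counts = Counter(peptide)
    # Normalize over standard residues only, so the vector sums to 1.
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]
```

For the peptide KAKLF, for instance, the Lys entry is 2/5 because two of its five residues are Lys.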

Dipeptide composition

DPC is the proportion of each pair of adjacent amino acids (\({\text{aa}}_{i}\), \({\text{aa}}_{j}\)) in a given peptide P. The DPC descriptor is formulated as:

$$ {\text{DPC}}\left( {\mathbf{P}} \right) = ({\text{dpc}}_{1} ,{\text{dpc}}_{2} , \ldots , {\text{dpc}}_{400} ) $$
(2)

where \({\text{dpc}}_{i}\) is the normalized composition of the ith dipeptide (\({\text{dp}}_{i}\)). The dimension of the DPC descriptor is 400.
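Equation (2) can be sketched in the same way. The ordering of the 400 dipeptides and the handling of non-standard residues are assumptions of this example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs, in an assumed lexicographic order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

def dpc(peptide):
    """Return the 400-dimensional dipeptide composition of a peptide."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    # Slide a window of two residues along the sequence.
    for i in range(len(peptide) - 1):
        pair = peptide[i:i + 2]
        if pair in counts:  # skip pairs with non-standard residues
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]
```

A peptide of length n contributes n − 1 overlapping dipeptides, so KKLK yields KK, KL and LK with a composition of 1/3 each.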

Composition on terminal region

Because the N- and C-terminal regions are important for the biological activity of peptides7,8,19,30,31,32,33, we calculated the DPC using the first 5, 10 and 15 residues from the N terminus (i.e. N5, N10 and N15, respectively) and from the C terminus (i.e. C5, C10 and C15, respectively). In addition, we joined these terminal sequences and calculated their DPC, giving N5C5, N10C10 and N15C15. The dimension of each terminal-region DPC descriptor is 400.
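The terminal regions can be extracted as in the following sketch, after which the DPC is computed on the returned segment. The region-name parser and the treatment of peptides shorter than the requested window (the whole peptide is returned) are assumptions of this example; the text does not say how short peptides or the junction of joined segments were handled.

```python
import re

def terminal_segment(peptide, region):
    """Return the residues named by a region label such as 'N5', 'C10' or 'N15C15'.

    Joined regions are formed by simple concatenation (an assumption), which
    introduces one artificial dipeptide at the junction.
    """
    match = re.fullmatch(r"(?:N(\d+))?(?:C(\d+))?", region)
    if not region or match is None:
        raise ValueError("unrecognized region label: %r" % region)
    n_len, c_len = match.groups()
    n_part = peptide[:int(n_len)] if n_len else ""
    c_part = peptide[-int(c_len):] if c_len and int(c_len) > 0 else ""
    return n_part + c_part
```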

Flexible scoring card method

The original SCM method uses only the global sequential information (i.e. the propensity scores of 20 amino acids (APS) and 400 dipeptides (DPS)) for the prediction and analysis of proteins26,27. Inspired by this method, we developed and implemented a novel flexible SCM-based method called FSCM to further improve the prediction accuracy and interpretability by utilizing both local and global sequential information of peptides. DPS was used to provide local sequence information because it was found to yield better prediction performance and provide more information than APS. Specifically, the FSCM method estimates the propensity scores of 400 dipeptides on the N terminus (N5PS, N10PS and N15PS) and the C terminus (C5PS, C10PS and C15PS) as well as on their joint terminal sequences (N5C5PS, N10C10PS and N15C15PS). In the proposed iACP-FSCM, we built 11 FSCM models for each of the main and alternative datasets, using 11 different propensity scores of amino acids, dipeptides and dipeptides on the N and C termini. Below, we briefly describe the basic concepts and the optimization procedure of C15PS on the main dataset, since the other types of propensity scores can be estimated using the same procedure without significant modifications.

Phase 1: Preparing the training (MAIN-TR) and independent (MAIN-TS) datasets for the development and evaluation of the proposed model as described above.

Phase 2: Calculating the initial propensity score of 400 dipeptides on the first 15 residues from the C terminus (init-C15PS). According to Charoenkwan et al.34,35,36,37, the init-C15PS is estimated, as follows:

Step 1: Computing the frequency of all 400 dipeptides found in the ACP and non-ACP classes. For example, suppose the KK dipeptide occurs 280 and 40 times in the ACP and non-ACP classes, respectively.

Step 2: Dividing the frequency of each dipeptide by the total number of dipeptides in the ACP and non-ACP classes. For example, if the total numbers of dipeptides in the ACP and non-ACP classes are 450 and 200, respectively, the normalized compositions of the KK dipeptide in the ACP and non-ACP classes (called NPS+ and NPS-, respectively) are 0.622 and 0.2, respectively.

Step 3: Computing the score of each dipeptide by subtracting NPS- from NPS+. For example, the score of the KK dipeptide is 0.422 (0.622 − 0.2).

Step 4: Normalizing the score of each dipeptide into the range of 0–1000.
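Steps 1 to 4 can be sketched as follows. The function names are hypothetical, and min-max scaling is assumed for Step 4 because the text only states that scores are normalized into the range 0–1000. For C15PS, the input peptides would first be truncated to their C-terminal 15 residues.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

def class_dipeptide_fractions(peptides):
    """Steps 1-2: pooled dipeptide counts of one class divided by the class total."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for pep in peptides:
        for i in range(len(pep) - 1):
            pair = pep[i:i + 2]
            if pair in counts:
                counts[pair] += 1
    total = sum(counts.values())
    return {d: (counts[d] / total if total else 0.0) for d in DIPEPTIDES}

def initial_propensity_scores(acps, non_acps):
    """Steps 3-4: NPS+ minus NPS-, then min-max scaling to 0-1000 (assumed)."""
    pos = class_dipeptide_fractions(acps)
    neg = class_dipeptide_fractions(non_acps)
    raw = {d: pos[d] - neg[d] for d in DIPEPTIDES}
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        return {d: 0.0 for d in DIPEPTIDES}
    return {d: 1000.0 * (raw[d] - lo) / (hi - lo) for d in DIPEPTIDES}
```

Under this scaling, the dipeptide most enriched in ACPs receives 1000 and the one most enriched in non-ACPs receives 0.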

Phase 3: Estimating the optimized propensity score of 400 dipeptides (opti-C15PS) and the threshold value using the GA37. More details of the GA used in this study can be found in the Supplementary information S1. To obtain the best opti-C15PS, each candidate set of scores and its corresponding threshold value are evaluated by the fitness function26,27, in which the prediction performance in terms of the AUC (weighted by \(W_{1}\)) and the Pearson’s correlation coefficient R between init-C15PS and opti-C15PS (weighted by \(W_{2}\)) are linearly combined and assessed by a tenfold cross-validation procedure:

$$ {\text{F}}\left( {{\text{opti}} - {\text{C}}15{\text{PS}}} \right) = W_{1} \times {\text{AUC}} + { }W_{2} \times {\text{R}} $$
(3)

where values of \(W_{1}\) and \(W_{2}\) are 0.9 and 0.1, respectively. Furthermore, weights for \(W_{1}\) and \(W_{2}\) were set based on our previous studies27,34,35,36,37.
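The fitness function in Eq. (3) can be written out as a short sketch. The helper names are hypothetical, the AUC is computed here by direct pairwise comparison on a single set of scores, and the tenfold cross-validation loop and the GA itself are omitted.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fitness(auc_value, init_ps, opti_ps, w1=0.9, w2=0.1):
    """Eq. (3): W1 * AUC + W2 * R, with R between the initial and optimized scores."""
    return w1 * auc_value + w2 * pearson_r(init_ps, opti_ps)
```

With \(W_{1}\) = 0.9 and \(W_{2}\) = 0.1, the AUC term dominates, while the correlation term favors optimized scores that stay close to the statistically derived initial scores.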

Phase 4: Computing the propensity scores of 20 amino acids using the opti-C15PS from Phase 3. Taking Lys as an example, the propensity score for Lys is calculated by averaging the propensity scores of 40 dipeptides containing Lys.
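Phase 4 can be sketched as below. To arrive at 40 dipeptides per residue, this example assumes that the 20 dipeptides with the residue in the first position and the 20 with it in the second position are both counted, so the homodipeptide (e.g. KK) enters the average twice; the text does not spell out this detail.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_propensity(dipeptide_scores, residue):
    """Average the scores of the 40 dipeptides that contain the given residue.

    dipeptide_scores maps each of the 400 dipeptides to its propensity score.
    """
    first = [dipeptide_scores[residue + other] for other in AMINO_ACIDS]
    second = [dipeptide_scores[other + residue] for other in AMINO_ACIDS]
    return (sum(first) + sum(second)) / (len(first) + len(second))
```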

Phase 5: Predicting an unknown peptide (P) by using the scoring function S(P) and the opti-C15PS from Phase 3. A query peptide P is predicted to be ACP if S(P) is greater than the threshold value, otherwise P is predicted to be a non-ACP.

$$ S\left( P \right) = \mathop \sum \limits_{i = 1}^{400} DP_{i} PS_{i} $$
(4)

where \(DP_{i}\) and \(PS_{i}\) represent the occurrence frequency and propensity score of the ith dipeptide from the opti-C15PS, respectively, where i = 1, 2, 3, …, 400.
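The scoring function in Eq. (4) and the threshold rule of Phase 5 can be sketched as follows. The function names are hypothetical, \(DP_{i}\) is assumed to be the normalized dipeptide composition (so that S(P) is the composition-weighted average of the propensity scores), and the propensity values in the usage note are toy numbers rather than the published opti-C15PS.

```python
def scm_score(segment, propensity):
    """Eq. (4): sum over dipeptides of (normalized frequency) x (propensity score)."""
    pairs = [segment[i:i + 2] for i in range(len(segment) - 1)]
    pairs = [p for p in pairs if p in propensity]  # skip unknown dipeptides
    if not pairs:
        return 0.0
    # Summing one score per occurrence and dividing by the number of
    # occurrences equals the weighted sum with normalized frequencies.
    return sum(propensity[p] for p in pairs) / len(pairs)

def predict_acp(peptide, propensity, threshold, c_len=15):
    """Score the C-terminal window (C15 by default) and compare with the threshold."""
    segment = peptide[-c_len:]
    return scm_score(segment, propensity) > threshold
```

With a toy table `{"KK": 900, "KL": 600, "LK": 300, "GG": 100}` and a threshold of 311, the peptide KKLK scores 600 and would be called an ACP, whereas GGG scores 100 and would not.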

Phase 6: Evaluating the prediction ability of the model by using four widely used metrics for binary classification problems, namely accuracy (Ac), sensitivity (Sn), specificity (Sp) and the Matthews correlation coefficient (MCC)38,39. Receiver operating characteristic (ROC) curves were plotted to further investigate the prediction performance of the proposed model using threshold-independent parameters. Further details on the definition of these metrics can be found in the Supplementary data S1.
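The four threshold-dependent metrics can be computed from a confusion matrix as in the sketch below. The standard textbook definitions are used here on the assumption that they match those in the Supplementary data, which is not reproduced in this section.

```python
from math import sqrt

def binary_metrics(tp, tn, fp, fn):
    """Return Ac, Sn, Sp and MCC from true/false positive/negative counts."""
    total = tp + tn + fp + fn
    ac = (tp + tn) / total if total else 0.0
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall on ACPs)
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity (recall on non-ACPs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Ac": ac, "Sn": sn, "Sp": sp, "MCC": mcc}
```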

Characterization of anticancer activities of peptides

The propensity scores of the 20 amino acids were used to identify informative PCPs, which were employed to provide an in-depth understanding of the basis and important factors governing the anticancer activity. In particular, the propensity score of each amino acid reflects its influence on the biological, functional and structural properties of peptides. It is well-known that PCPs are among the most intuitive feature descriptors associated with biophysical and biochemical reactions. Informative PCPs were determined from the iACP-FSCM method in three main steps. Firstly, PCPs having not applicable (NA) as their amino acid indices were excluded, resulting in a total of 531 PCPs40 that were further used in this study. Secondly, the Pearson’s correlation coefficient (R) between the propensity scores of amino acids and each of the 531 PCPs was calculated. Finally, PCPs with an absolute R value greater than 0.5 were selected as candidate PCPs for further analysis.
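The correlation-based selection of PCPs can be sketched as follows. The function names and the dictionary layout of the PCP table are hypothetical; in practice each PCP would be a 20-value amino acid index aligned with the 20 propensity scores.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def select_pcps(propensity, pcp_table, cutoff=0.5):
    """Keep PCPs whose |R| with the propensity scores exceeds the cutoff.

    pcp_table maps a PCP identifier to its list of per-residue values,
    ordered the same way as the propensity list.
    """
    selected = []
    for name, values in pcp_table.items():
        r = pearson_r(propensity, values)
        if abs(r) > cutoff:
            selected.append((name, r))
    # Strongest correlations (positive or negative) first.
    return sorted(selected, key=lambda item: abs(item[1]), reverse=True)
```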

Reproducible research

To ensure the repeatability and reproducibility of proposed models, all codes and the benchmark datasets (i.e. main and alternative datasets) are available on GitHub at https://github.com/Shoombuatong/Dataset-Code/tree/master/iACP-FSCM.

Results and discussion

Performance evaluation on main dataset

In this study, we employed 11 feature classes generated from 3 different feature encodings using AAC, DPC and terminus compositions (i.e. N5, C5, N5C5, N10, C10, N10C10, N15, C15 and N15C15). This led to the generation of 11 types of propensity scores (i.e. APS, DPS, N5PS, C5PS, N5C5PS, N10PS, C10PS, N10C10PS, N15PS, C15PS and N15C15PS). To examine which types of propensity scores are beneficial for distinguishing ACPs from non-ACPs, we compared the performance of the different types of propensity scores via tenfold cross-validation and independent tests on the main dataset. For each type of propensity score, 10 sets of propensity scores were generated by the GA and then used in the development of 10 different FSCM classifiers. Tables 1 and 2 list the best prediction results as derived from the optimal sets for each type of propensity score via tenfold cross-validation and independent tests, respectively (Fig. 2).

Table 1 Cross-validation results of FSCM models with various types of sequence features as evaluated on the main dataset.
Table 2 Independent test results of FSCM models with various types of sequence features as evaluated on the main dataset.
Figure 2

Heatmap of amino acids propensity scores obtained from the proposed iACP-FSCM.

As can be seen from Table 1 and Supplementary Table S1, the best Ac of 0.754 with an MCC of 0.496 and AUC of 0.762 was achieved by using C15PS (Fig. 3A). Meanwhile, N5C5PS and N5PS also performed well, with the second and third highest Ac/MCC of 0.750/0.508 and 0.750/0.504, respectively. As noticed in Table 1, the performance of the widely used DPS (affording an Ac of 0.726 and AUC of 0.754) was comparable to that of C15PS with regard to all five evaluation indices. In the case of the independent test results, Table 2 shows that C15PS also achieved better performance than the other types of propensity scores, providing an Ac of 0.825 and an MCC of 0.646 (Fig. 3B). Meanwhile, N15C15PS and N15PS performed well, with the second and third highest independent test Ac of 0.796 and 0.783, respectively. Hence, we selected the FSCM-based classifier in conjunction with the propensity scores of 400 dipeptides on the C15 terminus (C15PS) as the optimal classifier for ACP identification using the main dataset. These results imply that the local sequential information plays a more crucial role in distinguishing ACPs from non-ACPs than the global sequential information.

Figure 3

ROC curves of the top-five types of propensity scores over tenfold cross-validation (A,C) and independent test (B,D) on the main (A,B) and alternative (C,D) datasets.

Performance evaluation on alternative dataset

In this section, the same experimental setting as that used for the main dataset (from the original work from which it was taken) was utilized to determine which types of propensity scores were the most effective for distinguishing ACPs from random peptides in the alternative dataset. A series of performance comparison experiments using various types of propensity scores was carried out and the results were compared via tenfold cross-validation and independent tests, as summarized in Tables 3 and 4.

Table 3 Cross-validation results of FSCM models with various types of sequence features as evaluated on the alternative dataset.
Table 4 Independent test results of FSCM models with various types of sequence features as evaluated on the alternative dataset.

From Table 3, it can be seen that the highest Ac of 0.884, with a corresponding MCC of 0.770 and an AUC of 0.924, was achieved using APS (Fig. 3C), while the second and third highest Ac values of 0.872 and 0.867 were obtained using DPS and N15C15PS, respectively. As for the results from the independent test (Table 4), APS and DPS were the two top-ranked classifiers with the highest prediction results. Furthermore, DPS achieved slightly better performance than APS (0.910 vs 0.889 for Ac and 0.820 vs 0.779 for MCC). Meanwhile, APS was comparable to DPS as deduced from the AUC value (Fig. 3D). Hence, we selected the FSCM-based classifier in conjunction with the propensity scores of 20 amino acids from the whole sequence (APS) as the optimal classifier for ACP identification on the alternative dataset. For convenience, the FSCM method in conjunction with the selected propensity scores (C15PS and APS for the main and alternative datasets, respectively) will be referred to as iACP-FSCM. Based on the observations described above, iACP-FSCM provided satisfactory results for both the main and alternative datasets, because the composition of ACPs influences their interaction with the cancer cell membrane, their penetration of the cell membrane, and subsequently cancer cell cytotoxicity via their physicochemical properties (e.g. amphipathicity, hydrophobicity, and secondary structures)1.

Comparisons of iACP-FSCM with existing methods

To further assess the predictive efficiency and effectiveness of the proposed iACP-FSCM, we compared its performances against existing methods on the same benchmark dataset. Table 5 lists performance comparisons of iACP-FSCM with existing methods on main and alternative datasets over independent test. The prediction results of existing methods (i.e. AntiCP8, iACP10, ACPred19, PEPred-Suite20, ACPred-FL15, ACPred-Fuse18 and AntiCP_2.025) recorded in Table 5 come directly from the work25.

Table 5 Independent test results of the proposed method iACP-FSCM with state-of-the-art methods as evaluated on main and alternative datasets.

By observing the results listed in Table 5, it is clear that the performance of iACP-FSCM on the main dataset is superior to that of existing methods, with the highest Ac (0.825), Sp (0.903) and MCC (0.646). Improvements of 7%, 17% and 14% for Ac, Sp and MCC on the main dataset, respectively, were observed when compared with the state-of-the-art method AntiCP_2.0. In addition, iACP-FSCM achieved a greater than 14% increase in Ac compared with the existing ensemble methods PEPred-Suite, ACPred-FL and ACPred-Fuse. Although AntiCP and ACPred had higher Sn values than the proposed iACP-FSCM, their corresponding Sp and MCC were significantly lower. In the case of the comparative results on the alternative dataset, AntiCP_2.0 provided the highest accuracy of 0.920 with an MCC of 0.840 (Table S3). Meanwhile, the second- and third-best ACP predictors (Ac, MCC) were AntiCP (0.900, 0.800) and iACP-FSCM (0.889, 0.779), respectively. Although AntiCP_2.0 obtained better prediction results than our proposed iACP-FSCM, AntiCP_2.0 is limited in terms of interpretability and practical utility for biologists and biochemists. On the other hand, iACP-FSCM provides propensity scores that might provide crucial information relating to local and global properties of ACPs, and that are easily understood and implemented. Furthermore, combining interpretability with impressive prediction performance makes the proposed iACP-FSCM a more useful and practical approach. Taken together, these results revealed that iACP-FSCM provided more impressive prediction performances on both main and alternative datasets in terms of simplicity, interpretability and generalizability.

Characterization of anticancer activities of peptides using propensity scores

Unlike black-box modeling methods such as SVM and ensemble methods, the advantage of iACP-FSCM is that the estimated propensity scores of amino acids and dipeptides derived from the FSCM method can easily identify informative PCPs for gaining a more in-depth understanding of the characteristics of the anticancer activities of peptides. The propensity scores of 20 amino acids to be ACPs derived from the DPS (Fig. 2) are recorded in Table 6, which were calculated using Matlab (R2020a). The five amino acids with the highest propensity scores were Tyr, Trp, His, Met and Lys (355.55, 328.60, 317.03, 311.58 and 296.78, respectively), whereas the five amino acids with the lowest propensity scores were Gln, Val, Gly, Cys and Arg (198.45, 212.55, 225.08, 226.38 and 229.63, respectively). In the case of the propensity scores of 400 dipeptides to be ACPs, Fig. 2 shows that the five dipeptides with the highest propensity scores were KK, LW, GH, HI and MY, whereas the five dipeptides with the lowest propensity scores were KG, LD, LV, CR and TT.

Table 6 Important physicochemical properties (PCPs) as derived from the iACP-FSCM.

In biological processes, cancer cell development is mostly caused by free-radical damage to cells, especially DNA damage, via an ionizing radiation mechanism5. Meanwhile, reactive oxygen species can promote cancer, growth arrest, cytotoxicity and irreversible damage. The amino acids composing ACPs can act as antioxidants and as a dietary source for cells4. Interestingly, the five amino acids with the highest propensity scores have been reported as important factors for antioxidant activity, because the electron-rich aromatic rings in the side chains of Tyr and Trp, the sulfur atoms with two lone electron pairs in the side chain of Met, and the nitrogen atoms with one lone electron in the side chain of His are easily oxidized41. Among the anti-oxidative amino acids, Trp has a low abundance in natural peptides, but it plays a crucial role in biomolecule activity and is easy to modify chemically42. Although His is one of the five top-ranked amino acids, His-containing dipeptides such as GH and HI had no anticancer activity in an in vitro study. Furthermore, AH and LH showed antiangiogenic activity without great anticancer potential in a zebrafish embryo model43.

It is well recognized that studies of cancer metabolism have focused on glycolysis and the tricarboxylic acid (TCA) cycle. Many cancer cells are highly dependent on Gln and Ser uptake for proliferation, and these two amino acids are the most highly consumed nutrients44. Choi and Coloff proposed that Gln serves as an anaplerotic metabolite and plays a crucial role in the TCA cycle to maintain mitochondrial ATP production45. Meanwhile, evolving tumors utilize Gln as an alternative fuel to optimize nutrient utilization. Similarly, Val, which is a branched-chain amino acid, can fuel the TCA cycle46. Gln and Gly, which provide essential carbon and nitrogen sources for nucleobase synthesis, are beneficial in the energy-consuming process of DNA/RNA synthesis in cells47. Although Gly is one of the five amino acids having the lowest propensity scores, dipeptides containing Gly or Pro showed good cytotoxicity in vitro against human tumor cell lines such as the A549 lung cancer cell line48. Taken together, the analysis of the FSCM-derived propensity scores suggests that amino acids having high propensity scores could be important in exhibiting anticancer activity via the oxidation protection process, while amino acids having low propensity scores could be important in serving as a dietary source for cancer cells and could thus have a contradictory effect on anticancer activity.

Characterization of anticancer activities using informative physicochemical properties

In this section, the iACP-FSCM method was utilized to provide a more in-depth understanding of the basis and important factors for the anticancer activity. Previous studies have shown that the physicochemical properties of peptides (i.e. amino acid sequence, length, net charge, secondary structure, amphipathicity, and hydrophobicity) play a crucial role in their hemolytic activity, penetration ability and anticancer/antitumor activity1,19,49,50,51,52. The three most important selected PCPs derived from iACP-FSCM, namely MITS020101 (Amphiphilicity index), QIAN88011 (Weights for alpha-helix at the window position of 6) and JOND750101 (Hydrophobicity), are shown in Table 6. In addition, Supplementary Table S4 presents further details of the top-twenty informative PCPs.

It is well-known that Trp, with a propensity score of 328.20, is a common amino acid associated with amphiphilicity, alpha-helix, and hydrophobicity. Lee et al. investigated the relationship between the anticancer activities of Pep27 analogues and their hemolytic activity and hydrophobicity. They found that substituting Trp into Pep27 analogue peptides increased hydrophobicity, based on the RP-HPLC retention time. Pep27anal2, carrying the substitutions (11Ser → Trp) and (13Gln → Trp), had the greatest hydrophobicity with an RP-HPLC retention time of 22.50 min and exhibited the highest anticancer activity, with IC50 (10–28 μM) and IC90 (35–55 μM) values in five cancer cell lines41. This observation is consistent with previous work53,54, which implied that end-capping and cyclization of the hexameric peptide sequences RRWQWR and RRWWRF, or end-tagging of the short peptides KNK10 (KNKGKKNGKH) and GKH17 (GKHKNKGKKNGKHNGWK) with hydrophobic Trp or Phe stretches, could enhance the stability of ACPs against proteolytic degradation.

Table 6 shows that Lys, His and Arg (i.e. the cationic amino acids) provide acceptable propensity scores for both the amphiphilicity index (MITS020101) and alpha-helix (QIAN88011) properties. These three amino acids are described by the amphipathic alpha-helical structure transformation that segregates Lys on one face and Ile on the opposite side to interact with the negatively-charged membrane, which consequently gives rise to high anticancer activity53,55. Furthermore, the octahistidine-octaarginine (H8R8) peptide is a common cationic cell penetrating peptide with endosomal escape capabilities. Modified H8R8 in the form of a lipid-modified cationic peptide (i.e. stearyl-H8R8 and vitamin E succinate-H8R8), which is amphiphilic and biodegradable and has a lipid structure, can increase reactive oxygen species production, reduce cell bioenergetics and drug efflux, and trigger apoptosis, G1 cell cycle arrest and mitochondria depolarization, thereby leading to cancer cell toxicity and death56. The indole side chain of Trp exhibits a preference to interact with the interfacial region of lipid bilayers, while the Lys and Arg side chains of peptides provide positive charges and hydrogen bonding capabilities to attract the negatively-charged phospholipid headgroups of cell membranes54,57,58. Furthermore, in the side chains of aromatic residues (i.e. Trp and Phe), one side of the backbone ring forms a hydrophobic face that engages in interaction with the micelle6. Such interactions between ACPs containing Trp, Phe, Lys, His, or Arg and cancer cell membranes are often found in situations of cancer cell eradication. The aforementioned results as obtained from iACP-FSCM are in accordance with previous studies6,53,54,55,56,57,58,59 in which the physicochemical properties of ACPs (i.e. amphiphilicity, helical structure and hydrophobicity) pertain to the interaction between ACPs and the cell surface. This interaction causes ACPs to transform into a helical structure that confers the spatial arrangement of aliphatic side chains needed for membrane insertion. The turn stabilization of the helical conformation promotes intra-chain hydrogen bonding and mediates the backbone hydrophobicity, thereby causing a deeper insertion of peptides into the lipid bilayer59.

Case study

A key advantage of iACP-FSCM is its interpretability for biologists: mechanistic insights into the origin of the anticancer activity of investigated ACPs can be deduced from the scoring function S(P), including for ACPs that have not yet been experimentally verified26,27,37. The top 20 peptides with the highest and lowest scores are reported in Supplementary Tables S5 and S6, respectively. We noticed that the scores (S(P)) of the top 20 ACPs with the highest ACP scores were in the range of 636.59–700.64, whereas the threshold value was 311 (Table 1). Interestingly, the peptide KAKLF, which has an ACP score of 645, was among the top 9 peptides reported in a previous study60 as having a high docking score of −29.75 kJ/mol towards the hypoxia inducible factor 1α (HIF-1α).

Inspired by this study60, the top 20 ACPs (ID: 1–20) derived from iACP-FSCM were docked with the predicted binding sites of HIF-1α in order to estimate their interaction energies (kcal/mol) and thereby find a new potential peptide-based drug for HIF-1α. In order to make a fair comparison, the same experimental setup was used for estimating the interaction energies of the top 9 ACPs proposed by the previous study60. In this study, HIF-1α was prepared for docking using the protein preparation features in the Chimera software, following the default protocol for PDB2PQR and Dock Prep. Protonation states were assigned using PROPKA at a pH of 7.0 and Gasteiger charges were assigned to the protein61. Protein-peptide similarity-based docking was performed using the GalaxyPepDock web server (http://galaxy.seoklab.org/pepdock), which searches a database of experimentally determined structures for suitable templates and builds models using an energy-based optimization method that allows for structural flexibility. The calculation of the protein-peptide binding and interaction energy was performed using the NOVA force field62, while the visualization of the structures was carried out using YASARA (Yet Another Scientific Artificial Reality Application; http://www.yasara.org/index.html).

The three-dimensional complex structures of the top 5 potential ACPs are provided in Fig. 4 and the interaction energies are listed in Table 7. For these five peptides (ID: 10, 7, 20, 9 and 16), the values ranged from −9.39 kcal/mol to −6.53 kcal/mol. Specifically, the peptide sequence, ACP score and interaction energy (kcal/mol) of peptides ID: 10, 7, 20, 9 and 16 are (FAKKLAKKLKKLAKKLAKKWKL, 655.29, −9.39), (FAKKLKKLAKLAKKL, 663.93, −8.71), (FALAAKALKKLAKKLKKLAKKAL, 636.59, −7.21), (FAKKLAKKLKKLAKLALAK, 657.22, −6.73) and (FAKKLAKKLKKLAKKLAKLALAL, 646.64, −6.53), respectively. A visualization of the molecular surface of peptide ID: 10 (FAKKLAKKLKKLAKKLAKKWKL), which exhibited the strongest interaction energy of −9.39 kcal/mol with the HIF-1α receptor (residues within a 3 Å distance), is depicted in Fig. 5. As seen from Table 7, the interaction energies of ACPs ID: 21–29 range from −4.81 kcal/mol to 11.98 kcal/mol. Amongst the 9 ACPs reported by the previous study60, peptide ID: 25 (KAKLF) displayed the strongest interaction energy of −4.81 kcal/mol with the HIF-1α receptor. These results indicate that peptide ID: 10 derived from this study is a promising ACP with potential against breast cancer when compared with peptide ID: 25 proposed by the previous study60. However, additional in vitro and in vivo studies will be needed for the further development of novel ACPs against breast cancer. It is anticipated that iACP-FSCM can serve as an important tool for the rapid screening of promising ACPs against breast cancer as well as other types of cancer cells prior to their synthesis.
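The comparison above amounts to ranking peptides by interaction energy, where a more negative value indicates a stronger predicted interaction. The following minimal sketch reproduces that ranking from the values quoted in the text; the record layout and function name are illustrative assumptions, and the ACP score of 645 for KAKLF is the rounded value given earlier.

```python
from typing import List, NamedTuple


class Peptide(NamedTuple):
    pid: int          # peptide ID as used in Table 7
    sequence: str
    acp_score: float  # S(P)
    energy: float     # interaction energy with HIF-1α, kcal/mol


# Values quoted in the text for the top 5 ACPs and for KAKLF (ID: 25).
REPORTED = [
    Peptide(10, "FAKKLAKKLKKLAKKLAKKWKL", 655.29, -9.39),
    Peptide(7, "FAKKLKKLAKLAKKL", 663.93, -8.71),
    Peptide(20, "FALAAKALKKLAKKLKKLAKKAL", 636.59, -7.21),
    Peptide(9, "FAKKLAKKLKKLAKLALAK", 657.22, -6.73),
    Peptide(16, "FAKKLAKKLKKLAKKLAKLALAL", 646.64, -6.53),
    Peptide(25, "KAKLF", 645.0, -4.81),
]


def rank_by_energy(peptides: List[Peptide]) -> List[Peptide]:
    """Sort so that the most negative (strongest) interaction energy comes first."""
    return sorted(peptides, key=lambda p: p.energy)


ranked = rank_by_energy(REPORTED)
best = ranked[0]
reference = next(p for p in REPORTED if p.pid == 25)
# Energy difference between the best peptide and the previous study's best peptide.
energy_gap = best.energy - reference.energy

if __name__ == "__main__":
    for p in ranked:
        print(f"ID {p.pid:>2}  S(P) = {p.acp_score:7.2f}  E = {p.energy:6.2f} kcal/mol")
```

Note that the ordering by interaction energy differs from the ordering by ACP score (peptide ID: 7 has the highest S(P) of the five), so the two quantities are best reported side by side, as in Table 7.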

Figure 4

Three-dimensional complex structures of the top 5 ACPs having maximum interaction energies. The binding pocket is colored according to residue type using the YASARA coloring scheme, where grey, green, blue, red and cyan represent non-polar, amidic, basic, acidic and hydroxylic amino acids, respectively.

Table 7 Top 20 ACPs with the highest scores derived from iACP-FSCM and top 9 ACPs with maximum docking scores derived from the previous work53, along with their ACP scores and interaction energies.
Figure 5

Molecular surface of the docking complex between the HIF-1α receptor (left) and peptide ID: 10 (right), shown as a stick model with amino acids within the 3 Å binding distance, where Phe, Ala, Leu and Trp are non-polar residues (grey) and Lys is a basic residue (blue).

Conclusions

In this study, we have proposed, for the first time, a computational model called iACP-FSCM for ACP identification and characterization that uses propensity scores of local and global sequential information obtained with the novel FSCM method. We demonstrated that iACP-FSCM can identify ACPs using only a weighted-sum score and a single threshold value. This simple model was compared with complex ensemble classifiers developed using a large number of ML classifiers and feature descriptor schemes. Furthermore, the FSCM-derived propensity scores can be used to identify informative physicochemical properties that might provide crucial information on the local and global properties of ACPs. Results from the benchmark comparison validated the effectiveness and robustness of the proposed iACP-FSCM approach. We further applied iACP-FSCM to identify potential peptide-based drugs against HIF-1α and obtained a list of candidate peptides. With these promising results, it is anticipated that iACP-FSCM can serve as an important tool for the rapid screening of promising ACPs against various types of cancer cells prior to their synthesis. To provide a convenient bioinformatics tool, the proposed model has been deployed as a web server that is publicly available at http://camt.pythonanywhere.com/iACP-FSCM. Owing to the high potential of the FSCM method proposed in this study, it could be readily applied, without any major modifications, to the prediction and characterization of other therapeutic peptides, such as cell-penetrating peptides63, antiviral peptides20,23, antihypertensive peptides20,23 and hemolytic peptides31.