To address the blind parameter selection and the imbalance between positive and negative samples that affect the safe semi-supervised support vector machine (S4VM), a protein sequence identification method, TIGA-S4VM, is proposed. It combines an improved term frequency-inverse document frequency (TF-IDF) algorithm, a genetic algorithm (GA), and S4VM. The improved TF-IDF algorithm, LBTF-IDF, extracts feature terms from the protein sequences; the frequency with which each feature term occurs in a sequence is then normalized and used as a feature value for the identification model. GA and S4VM are combined with LBTF-IDF to identify the protein sequences. Experimental results show that TIGA-S4VM outperforms five other identification methods and identifies protein sequences effectively even when the proportion of training samples is low.
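The feature-extraction step can be illustrated with a minimal sketch. The abstract does not specify how LBTF-IDF modifies TF-IDF, so the code below uses standard smoothed TF-IDF as a stand-in. It also assumes, for illustration only, that feature terms are overlapping k-mers, that the top-scoring terms are kept, and that normalization means dividing each count by the total number of k-mers in the sequence. The names `kmers`, `select_terms`, and `feature_vector` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def kmers(seq, k=2):
    """Split a protein sequence into overlapping k-mers (assumed 'terms')."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def select_terms(sequences, k=2, top_m=5):
    """Rank k-mers by summed standard TF-IDF score and keep the top_m.

    Stand-in for LBTF-IDF, whose exact form is not given in the abstract.
    """
    docs = [Counter(kmers(s, k)) for s in sequences]
    n_docs = len(docs)

    # Document frequency: number of sequences containing each term.
    df = Counter()
    for doc in docs:
        df.update(doc.keys())

    score = Counter()
    for doc in docs:
        total = sum(doc.values())
        for term, count in doc.items():
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0  # smoothed IDF
            score[term] += (count / total) * idf

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(score, key=lambda t: (-score[t], t))
    return ranked[:top_m]


def feature_vector(seq, terms, k=2):
    """Relative frequency of each selected term in one sequence."""
    counts = Counter(kmers(seq, k))
    total = sum(counts.values())
    if total == 0:  # sequence shorter than k
        return [0.0] * len(terms)
    return [counts[t] / total for t in terms]


if __name__ == "__main__":
    toy_sequences = ["MKVLA", "MKVAA", "GGGGG"]  # toy data, not from the paper
    terms = select_terms(toy_sequences, k=2, top_m=3)
    for s in toy_sequences:
        print(s, feature_vector(s, terms))
```

The resulting vectors would be the input to the classifier. The GA and S4VM stages are omitted because the abstract gives no detail on their configuration.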