Ab initio prediction of plant resistance genes (R-genes) can be formulated as a classification problem in machine learning. Training a classifier normally requires correctly labeled positive and negative samples. In R-gene recognition, however, the only available information is a small number of manually curated R-genes, and the genes that lack resistance function are not clearly defined. To reduce the impact of scarce positive samples and mislabeled negative samples on recognition, a novel sample selection method is proposed, based on the distance between the curated R-genes and other genes in the protein-protein interaction network. The proposed method and the conventional sample selection method were each evaluated with 10-fold cross-validation on four classifiers. The results show that the proposed method increases sensitivity (SN) by 6.9% and specificity (SP) by 13.1% on average. In terms of both sensitivity and specificity, the proposed method therefore yields more effective and more reliable classification results than the conventional method.
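As an illustration only, the distance-based selection of negative samples might be sketched as follows. The abstract does not state the exact distance measure or cutoff, so this sketch assumes that distance means shortest-path hop count in the interaction network, that genes at least `min_distance` hops from every curated R-gene are taken as negatives, and that genes with no path to any R-gene are left unlabeled. The function names, the `min_distance` parameter, and the toy network are hypothetical. SN and SP are computed with their standard definitions, TP/(TP+FN) and TN/(TN+FP).

```python
from collections import deque


def distances_to_known_r_genes(ppi, r_genes):
    """Multi-source BFS: hop count from each reachable gene to its nearest curated R-gene.

    `ppi` is an adjacency dict {gene: [neighbors]} with edges listed in both directions.
    """
    dist = {g: 0 for g in r_genes if g in ppi}
    queue = deque(dist)
    while queue:
        gene = queue.popleft()
        for neighbor in ppi.get(gene, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[gene] + 1
                queue.append(neighbor)
    return dist


def select_negatives(ppi, r_genes, min_distance=3):
    """Assumed rule: genes at least `min_distance` hops from every R-gene are negatives.

    Genes unreachable from all R-genes are excluded (an assumption of this sketch).
    """
    dist = distances_to_known_r_genes(ppi, r_genes)
    return sorted(g for g, d in dist.items() if d >= min_distance)


def sensitivity_specificity(y_true, y_pred):
    """Return (SN, SP) for binary labels, where 1 = R-gene and 0 = non-R-gene."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Toy network (hypothetical): a chain R1 - A - B - C - D plus an isolated pair X - Y.
toy_ppi = {
    "R1": ["A"],
    "A": ["R1", "B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
    "X": ["Y"],
    "Y": ["X"],
}
negatives = select_negatives(toy_ppi, {"R1"}, min_distance=3)
print(negatives)
```

In the toy network, only the genes three or more hops from `R1` are selected. The disconnected pair `X`, `Y` is skipped because the network gives no evidence about its relation to the curated R-genes.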