With the rapid growth of biomedical literature, a large amount of information about diseases, symptoms, and therapeutic substances has accumulated in published articles, and this information is of great value for disease treatment and drug development. To extract relations between diseases and therapeutic substances, this paper trains two models: a disease-symptom model and a symptom-therapeutic substance model. The disease-symptom model judges whether a disease is accompanied by, or causes, a physiological phenomenon. The symptom-therapeutic substance model judges whether a substance changes a person's physiological phenomena or physiological processes. Tri-training, a semi-supervised learning algorithm, is applied so that a large amount of unlabeled data can assist a small amount of labeled data during training and improve classification performance. Experimental results show that exploiting unlabeled data with Tri-training improves the results, and that integrating the three classifiers through ensemble learning during training also improves learning performance.
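To make the Tri-training idea concrete, the following is a minimal sketch and not the implementation used in this paper. It assumes numeric feature vectors, uses a hypothetical nearest-centroid base classifier (the abstract does not name the base learner), runs a fixed number of rounds, and omits the error-rate conditions that the full Tri-training algorithm uses to decide when pseudo-labeled examples may be added. The names `NearestCentroid`, `tri_train`, and `predict` are illustrative only.

```python
import random
from collections import Counter


class NearestCentroid:
    """Stand-in base classifier (an assumption; any classifier could be used)."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for j, v in enumerate(x):
                acc[j] += v
        # One mean vector per class label.
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict_one(self, x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[c]))
        return min(self.centroids, key=sq_dist)


def tri_train(X_lab, y_lab, X_unlab, rounds=5, seed=0):
    """Simplified Tri-training: no error-rate check, fixed number of rounds."""
    rng = random.Random(seed)
    n = len(X_lab)
    # Step 1: train three classifiers on bootstrap samples of the labeled data.
    models = []
    for _ in range(3):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(NearestCentroid().fit([X_lab[i] for i in idx],
                                            [y_lab[i] for i in idx]))
    # Step 2: each classifier is retrained on the labeled data plus the
    # unlabeled examples on which the other two classifiers agree.
    for _ in range(rounds):
        preds = [[m.predict_one(x) for x in X_unlab] for m in models]
        new_models = []
        for i in range(3):
            j, k = [t for t in range(3) if t != i]
            extra = [(x, preds[j][u]) for u, x in enumerate(X_unlab)
                     if preds[j][u] == preds[k][u]]
            X_i = list(X_lab) + [x for x, _ in extra]
            y_i = list(y_lab) + [label for _, label in extra]
            new_models.append(NearestCentroid().fit(X_i, y_i))
        models = new_models
    return models


def predict(models, x):
    """Ensemble step: majority vote over the three classifiers."""
    votes = Counter(m.predict_one(x) for m in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data (hypothetical): label 1 = relation holds, 0 = no relation.
    X_lab = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
    y_lab = [0, 0, 1, 1]
    X_unlab = [[0.5, 0.5], [5.5, 5.5]]
    models = tri_train(X_lab, y_lab, X_unlab)
    print(predict(models, [0.2, 0.3]), predict(models, [5.1, 5.2]))
```

In this sketch the unlabeled examples influence training only through the agreement of two classifiers, and the final decision is a majority vote, which mirrors the combination of semi-supervised and ensemble learning described above. One such ensemble would be trained for each of the two relation models.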