Named entity recognition in biomedical text is important for building and mining large clinical databases that support clinical decision-making, and a foundational task within it is the recognition of disease names. Medical text contains many compound disease names, from which the individual entities are difficult to separate and extract. To address this problem, this paper proposes a multi-label conditional random field (CRF) algorithm. First, the data are annotated with multiple layers of labels, each layer targeting a different disease within a compound disease name. The layers are then merged into a single final label, which is used to train the model. Finally, the labels predicted by the model are split back into their layers to identify the individual diseases. The method can recognize compound disease names that the traditional CRF algorithm cannot, and the experimental results verify the effectiveness of the proposed algorithm.
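The merge and split steps around the CRF can be illustrated with a minimal sketch. The abstract does not specify the tag scheme or how layers are combined, so everything here is an assumption for illustration: B/I/O tags, a `|` separator, the function names `merge_layers`, `split_layers`, and `layer_mention`, and the English example phrase. The CRF itself is omitted; any sequence labeler trained on the merged tags would sit between the two steps.

```python
from typing import List

SEP = "|"  # assumed separator between layers; not taken from the paper


def merge_layers(layers: List[List[str]]) -> List[str]:
    """Combine per-layer tag sequences into one composite tag per token."""
    if len({len(layer) for layer in layers}) != 1:
        raise ValueError("all layers must tag the same number of tokens")
    return [SEP.join(tags) for tags in zip(*layers)]


def split_layers(merged: List[str]) -> List[List[str]]:
    """Recover the per-layer tag sequences from composite tags."""
    return [list(layer) for layer in zip(*(tag.split(SEP) for tag in merged))]


def layer_mention(tokens: List[str], tags: List[str]) -> str:
    """Join the tokens a layer marks as part of its disease.

    Simplification: assumes each layer holds at most one disease.
    """
    return " ".join(tok for tok, tag in zip(tokens, tags) if tag != "O")


# Hypothetical compound disease name with two diseases sharing a head word.
tokens = ["breast", "and", "ovarian", "cancer"]
layer_1 = ["B", "O", "O", "I"]  # marks "breast ... cancer"
layer_2 = ["O", "O", "B", "I"]  # marks "ovarian cancer"

# Training side: the model would see one composite tag per token.
merged = merge_layers([layer_1, layer_2])

# Prediction side: composite tags (here the gold ones stand in for model
# output) are split back into layers, and each layer yields one disease.
for tags in split_layers(merged):
    print(layer_mention(tokens, tags))
```

Under these assumptions, each composite tag such as `B|O` is treated by the model as a single atomic label, so a standard linear-chain CRF can be used unchanged; the cost is a larger label set.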