Learning tasks with structured outputs (structured learning) are widespread in natural language processing. Recent research has shown theoretically that label noise affects structured learning more severely than it affects classification, which creates a need for denoising algorithms suited to structured learning tasks. Inspired by recent progress in representation learning, this paper proposes a sample denoising algorithm for structured learning based on low-dimensional representations of natural language sub-structures. For each node in a sequence labeling problem, the algorithm finds neighbors using the representation of the n-gram associated with that node, and then denoises the node's label according to its consistency with the labels of those neighbors. The method is evaluated on cross-lingual projection for named entity recognition and part-of-speech (POS) tagging, and the results demonstrate its effectiveness.
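The neighbor-consistency idea can be illustrated with a minimal Python sketch. The abstract does not state how the n-gram representations are built, how neighbors are ranked, how many neighbors are used, or what is done with an inconsistent label, so everything below is an assumption made for illustration: n-gram vectors are taken as the mean of word embeddings in a window centered on the node, neighbors are ranked by cosine similarity, and a node is only flagged (not relabeled) when too few of its neighbors share its label. The names `ngram_vectors`, `flag_noisy_labels`, `k`, and `min_agreement` are hypothetical and do not come from the paper.

```python
import numpy as np


def ngram_vectors(sentences, embed, n=3):
    """Return one vector per token (node).

    Assumption: the n-gram representation of a node is the mean embedding
    of the window of up to n tokens centered on it, truncated at sentence
    boundaries. `embed` maps each word to a 1-D NumPy array.
    """
    half = n // 2
    vectors = []
    for sentence in sentences:
        for i in range(len(sentence)):
            window = sentence[max(0, i - half): i + half + 1]
            vectors.append(np.mean([embed[w] for w in window], axis=0))
    return np.vstack(vectors)


def flag_noisy_labels(vectors, labels, k=3, min_agreement=0.5):
    """Flag each node whose label disagrees with its nearest neighbors.

    Assumption: neighbors are the k nodes with the highest cosine
    similarity, and a label is treated as noisy when fewer than
    `min_agreement` of those neighbors carry the same label.
    """
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.maximum(norms, 1e-12)  # guard against zero vectors
    similarity = unit @ unit.T
    np.fill_diagonal(similarity, -np.inf)  # a node is not its own neighbor

    noisy = []
    for i, label in enumerate(labels):
        neighbors = np.argsort(-similarity[i])[:k]
        agreement = sum(labels[j] == label for j in neighbors) / k
        noisy.append(agreement < min_agreement)
    return noisy
```

In a cross-lingual projection setting, `labels` would hold the projected tags of all nodes in the target-language corpus, and the flagged nodes could then be dropped or relabeled before training. Which of these the proposed algorithm does is not specified in the abstract. The pairwise similarity matrix is quadratic in the number of nodes, so a corpus of realistic size would need an approximate nearest-neighbor index in its place.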