This paper questions how the high-frequency word co-occurrence matrix is constructed in traditional co-word analysis. The questions concern the reliability of extracting high-frequency keywords as the objects of analysis, the extent to which the high-frequency word matrix retains the important co-occurrence relations of a field, the semantic type characteristics of keywords, and the possible impact of missing keywords. Using an empirical dataset of scientific publications, we reveal the distribution of keyword frequency, co-occurrence relations, and semantic types, and analyze how they affect co-word analysis. First, keyword-based co-word analysis can only capture popular knowledge nodes (research hotspots) and the relations among them. Second, the co-word network essentially rests on unstable links that occur only once. Third, nearly half of the important co-occurrence relations are lost when the matrix is generated from high-frequency keywords alone. These problems are determined by the semantic type characteristics of keywords, which are an important entry point for differentiating keywords and, further, for processing them semantically. In addition, a comparison of the co-word matrices before and after keyword supplementation shows that adding keywords, for example from publication titles, does not materially improve how well the high-frequency word matrix represents the field under analysis. In conclusion, we propose two directions worth trying: one is to select the objects of analysis by combining keyword frequency with the strength of co-occurrence relations; the other is to construct a multi-dimensional co-occurrence matrix with keyword semantic type as the dimension, so as to better mine multiple kinds of semantic association.
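The retention question raised above can be illustrated with a minimal Python sketch. It is not the procedure used in the paper. The toy keyword lists, the function names, and the rule that a co-occurrence relation counts as "important" when a pair appears together at least `min_pair_count` times are all assumptions made for illustration; the abstract does not state how importance was defined.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(papers):
    """Count keyword frequencies and unordered keyword-pair co-occurrences."""
    freq = Counter()
    pairs = Counter()
    for keywords in papers:
        unique = sorted(set(keywords))  # ignore duplicates within one paper
        freq.update(unique)
        pairs.update(combinations(unique, 2))
    return freq, pairs


def retention_rate(papers, top_k, min_pair_count=2):
    """Share of 'important' pairs whose two keywords are both in the top_k.

    ASSUMPTION: a pair is 'important' if it co-occurs in at least
    min_pair_count papers. This threshold is hypothetical.
    """
    freq, pairs = cooccurrence_counts(papers)
    high_freq = {word for word, _ in freq.most_common(top_k)}
    important = [p for p, c in pairs.items() if c >= min_pair_count]
    if not important:
        return 0.0
    kept = [p for p in important if p[0] in high_freq and p[1] in high_freq]
    return len(kept) / len(important)


# Toy data (hypothetical): each inner list is the keyword list of one paper.
papers = [
    ["co-word analysis", "bibliometrics"],
    ["co-word analysis", "bibliometrics"],
    ["co-word analysis", "bibliometrics", "keyword"],
    ["co-word analysis", "keyword"],
    ["ontology", "semantic type"],
    ["ontology", "semantic type"],
    ["co-word analysis", "clustering"],
]

for k in (2, 5):
    print(k, retention_rate(papers, top_k=k))
```

In this toy data the pair "ontology" and "semantic type" co-occurs repeatedly, yet neither keyword is frequent enough to enter a small high-frequency set, so the relation drops out of the matrix. This is the kind of loss the abstract describes, and it is the motivation for the first proposed direction of selecting keywords by both frequency and co-occurrence strength.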