To address the bag-of-words model's neglect of semantic relationships and conceptual structure among terms, a source code summarization technique based on syntactic analysis is proposed. First, part-of-speech tagging is used to identify the keywords most likely to characterize the code. Second, chunk parsing is used to correct errors that may be introduced during part-of-speech tagging. Third, the identified keywords are denoised to reduce the adverse influence of text noise. Finally, the keywords with the highest weights are selected to compose the code summary. Experimental results show that, compared with source code summarization techniques based on TF-IDF (Term Frequency-Inverse Document Frequency) and on extended TF-IDF, the overlap between the summaries generated by the proposed technique and the reference summaries increases by at least 9% and 6%, respectively, indicating that the proposed technique generates more accurate code summaries.
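The keyword selection and the overlap measure can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the terms arrive already tagged with parts of speech, it omits the chunk-parsing correction step, and the tag set `KEYWORD_TAGS`, the stop list `NOISE_WORDS`, the frequency-based weighting, and the function names are all hypothetical. The abstract does not define the overlap measure, so the sketch assumes the set-based overlap coefficient, |A ∩ B| / min(|A|, |B|).

```python
from collections import Counter

# Hypothetical choices, not taken from the paper: which POS tags mark a
# term as a likely keyword, and which terms are treated as text noise.
KEYWORD_TAGS = {"NN", "NNS", "VB"}
NOISE_WORDS = {"get", "set", "tmp", "temp", "data", "value"}


def summarize(tagged_terms, k=3):
    """Return the k highest-weighted keywords from (term, tag) pairs."""
    # Keep only terms whose POS tag suggests they describe the code.
    keywords = [term.lower() for term, tag in tagged_terms
                if tag in KEYWORD_TAGS]
    # Noise reduction: drop terms on the (assumed) stop list.
    keywords = [term for term in keywords if term not in NOISE_WORDS]
    # Weighting: raw term frequency, an assumption made for this sketch.
    counts = Counter(keywords)
    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    return ranked[:k]


def overlap_coefficient(summary, reference):
    """Set-based overlap: |A & B| / min(|A|, |B|); 0.0 if either is empty."""
    a, b = set(summary), set(reference)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))
```

For example, calling `summarize` with `k=2` on terms tagged as verbs and nouns, then passing the result and a reference summary to `overlap_coefficient`, gives the fraction of the shorter term set that the two summaries share.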