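Defect reports produced during software development and maintenance often include a large number of duplicates. Detecting duplicate defect reports automatically and accurately would save considerable development and maintenance cost in defect triage, fixing, and retesting. Building on the traditional vector space model detection method, this paper proposes a new method for detecting duplicate defect reports based on the N-gram model; Section 2 describes the method in detail. Experiments on a small data set show that the method performs well when the parameter N is set to 3, 4, or 5 and when the full-sentence approach is used to compute similarity over only the summary information of the defect reports. Finally, the method was validated on a data set of 4,503 Firefox defect reports. The experiments show that, compared with the vector space model method, the N-gram model method improves the recall rate of duplicate defect detection by 25% to 55%.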
Aim. The introduction of the full paper points out what we believe to be the shortcomings of existing papers in the open literature. To address them, we propose a new method. Subsection 1.2 briefly introduces the N-gram model. Section 2 explains our method of detecting duplicate defect reports using the N-gram model. The titles of subsections 2.1, 2.2, 2.3, 2.4, 2.5 and 2.7 are, respectively, tokenization, word stemming, synonym replacement, stop word removal, N-gram similarity calculation, and duplicate defect report detection accuracy measurement; in particular, Formula (6) in subsection 2.7 is central to calculating the recall rate of our method. In Section 3, we use a small subset of the Firefox defect repository to select the parameter N, the complete-sentence approach, and the summary information of the defect report as the input for similarity calculation; we then evaluate our method on a larger subset of the Firefox defect repository containing 4,503 defect reports. The experimental results, presented in Figs. 2 and 3, show preliminarily that the recall rate of our method is 25% to 55% higher than that of the traditional Vector Space Model method in detecting duplicate defect reports.
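To make the pipeline summarized above more concrete, the following is a minimal sketch and not the paper's implementation. It assumes character-level N-grams over a lightly normalized summary string, a Dice coefficient as the similarity measure, and a top-k notion of recall; none of these choices is confirmed by the abstract, and Formula (6) is not reproduced here. The names `STOP_WORDS`, `normalize`, `dice_similarity`, `top_k` and `recall_rate` are hypothetical, and stemming and synonym replacement are omitted for brevity.

```python
from collections import Counter

# Illustrative stop-word list (an assumption; the paper's list is not given).
STOP_WORDS = {"the", "a", "an", "on", "in", "to", "of", "when", "is"}


def normalize(summary: str) -> str:
    """Lowercase, split on whitespace, and drop stop words."""
    tokens = [t for t in summary.lower().split() if t not in STOP_WORDS]
    return " ".join(tokens)


def char_ngrams(text: str, n: int) -> Counter:
    """Count every length-n character window in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def dice_similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over N-gram multisets (assumed measure)."""
    ga = char_ngrams(normalize(a), n)
    gb = char_ngrams(normalize(b), n)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 0.0
    shared = sum((ga & gb).values())  # multiset intersection
    return 2.0 * shared / total


def top_k(query: str, candidates: dict, k: int, n: int = 3) -> list:
    """Return the ids of the k candidate summaries most similar to the query."""
    ranked = sorted(
        candidates,
        key=lambda rid: dice_similarity(query, candidates[rid], n),
        reverse=True,
    )
    return ranked[:k]


def recall_rate(duplicate_to_master: dict, summaries: dict, k: int, n: int = 3) -> float:
    """Fraction of known duplicates whose master report appears in the top-k list."""
    if not duplicate_to_master:
        return 0.0
    hits = 0
    for dup_id, master_id in duplicate_to_master.items():
        candidates = {rid: s for rid, s in summaries.items() if rid != dup_id}
        if master_id in top_k(summaries[dup_id], candidates, k, n):
            hits += 1
    return hits / len(duplicate_to_master)
```

In this sketch, `summaries` maps a report id to its summary text and `duplicate_to_master` maps each known duplicate to the report it duplicates; varying `n` over 3, 4 and 5 mirrors the parameter range the abstract reports as effective.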