To improve the accuracy of information retrieval (IR) based program comprehension, a two-stage method is proposed that combines IR with probabilistic finite-state automata (PFA). The PFAs address the uncertainty of IR results when IR is applied directly to program comprehension. At the same time, IR makes it possible to construct many simple PFAs rather than a single complex one, which improves the scalability of PFA analysis. In the training stage, source code is first clustered by latent semantic analysis (LSA), and PFAs are then learned from the resulting clusters. In the recognition stage, a source code segment is lexically processed and used as an IR query against the program template library; the n most relevant results are taken as candidate templates (plans). The PFAs corresponding to these candidates are then evaluated, and the one with the maximum probability is chosen. Finally, the code segment is annotated with the semantics of the selected PFA.
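The recognition stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: the token vocabulary, the template library, the PFA structures, and all probabilities below are hypothetical. Two simplifications are made for brevity: retrieval uses plain bag-of-words cosine similarity as a stand-in for LSA, and each PFA is deterministic (one successor per state and token).

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class PFA:
    """Deterministic PFA: transitions[(state, token)] = (next_state, prob)."""

    def __init__(self, start, transitions, final):
        self.start = start
        self.transitions = transitions
        self.final = final  # final[state] = probability of stopping there

    def probability(self, tokens):
        """Probability that this automaton generates the token sequence."""
        state, p = self.start, 1.0
        for tok in tokens:
            if (state, tok) not in self.transitions:
                return 0.0  # sequence is not accepted
            state, q = self.transitions[(state, tok)]
            p *= q
        return p * self.final.get(state, 0.0)


def recognize(tokens, templates, pfas, n=2):
    """Stage 1: retrieve the n most similar templates.
    Stage 2: return the candidate whose PFA gives the highest probability."""
    query = Counter(tokens)
    ranked = sorted(templates,
                    key=lambda name: cosine(query, Counter(templates[name])),
                    reverse=True)
    scored = [(pfas[name].probability(tokens), name) for name in ranked[:n]]
    best_p, best = max(scored)
    return (best, best_p) if best_p > 0 else (None, 0.0)


# Hypothetical template library: one lexically processed token list per plan.
templates = {
    "accumulate": ["for", "id", "assign", "id", "plus", "id"],
    "swap": ["id", "assign", "id"] * 3,
    "search": ["for", "if", "id", "eq", "id", "return"],
}

# Hypothetical PFAs, one small automaton per plan.
pfas = {
    "accumulate": PFA(0, {
        (0, "for"): (1, 1.0), (1, "id"): (2, 1.0), (2, "assign"): (3, 1.0),
        (3, "id"): (4, 1.0), (4, "plus"): (5, 1.0), (5, "id"): (6, 1.0),
        (6, "id"): (2, 0.2),  # another accumulation statement in the loop
    }, {6: 0.8}),
    "swap": PFA(0, {
        (0, "id"): (1, 1.0), (1, "assign"): (2, 1.0), (2, "id"): (3, 1.0),
        (3, "id"): (1, 0.7),
    }, {3: 0.3}),
    "search": PFA(0, {
        (0, "for"): (1, 1.0), (1, "if"): (2, 1.0), (2, "id"): (3, 1.0),
        (3, "eq"): (4, 1.0), (4, "id"): (5, 1.0), (5, "return"): (6, 1.0),
    }, {6: 1.0}),
}

segment = ["for", "id", "assign", "id", "plus", "id"]
label, prob = recognize(segment, templates, pfas, n=2)
print(label, prob)
```

Because retrieval narrows the search to n candidates, only n small automata are evaluated per code segment, which is the scalability argument made above. A segment that none of the candidate PFAs accepts is left unlabeled in this sketch.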