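Internet-scale software repositories are fundamentally changing the traditional model of software development, and efficient hierarchical categorization of the massive software in these repositories is of great importance to software development based on Internet resources. Traditional software categorization approaches perform coarse-grained, flat categorization based on source code or byte code, and have been validated only on small datasets. This paper proposes a hierarchical categorization approach based on the aggregation of software online attributes and designs a hierarchical categorization framework that categorizes massive software efficiently and hierarchically through the weighted aggregation of software online descriptions and tags across repositories. Cross-validation on more than 18,000 open source software projects shows that the proposed weighted aggregation of online attributes significantly improves categorization performance. Under coarse-grained flat categorization, the approach achieves performance close to that of categorization based on source code or byte code; moreover, compared with related work, it achieves hierarchical categorization covering 123 finer-grained categories and categorizes massive software more effectively.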
Internet-scale software repositories are fundamentally changing the paradigms of software development. Efficient categorization of the massive software projects in these repositories is of vital importance for Internet-based software development. Traditional classification approaches perform coarse-grained, flat categorization by analyzing source code or byte code, and most of them have been verified only on relatively small collections of software projects. In this paper, we propose an efficient hierarchical categorization approach based on the aggregation of software online attributes and design a hierarchical categorization framework. Based on the weighted aggregation of software descriptions and tags across multiple repositories, we categorize the massive software hierarchically. Extensive experiments are carried out on more than 18,000 software projects. The results show that significant improvement can be achieved by using weighted aggregation of different online attributes. Compared to previous work, our approach achieves competitive performance with 123 hierarchical and finer-grained categories, for which classification is much harder. In contrast to approaches that use source code or byte code, our approach is more effective for large-scale categorization.