Electronic supplementary material: The online version of this article (doi:10.1007/s11390-008-9125-z) contains supplementary material, which is available to authorized users.
Document subjectivity analysis has become an important aspect of web text content mining. The problem is similar to traditional text categorization, so many related classification techniques can be adapted to it. There is one significant difference, however: more linguistic or semantic information is required to better estimate the subjectivity of a document. In this paper we therefore focus on two aspects: how to extract useful and meaningful language features, and how to construct appropriate language models efficiently for this special task. For the first issue, we apply a Global-Filtering and Local-Weighting strategy to select and evaluate language features among n-grams of different orders and within various distance windows. For the second issue, we adopt Maximum Entropy (MaxEnt) modeling to construct our language model framework. Besides the classical MaxEnt models, we also construct two kinds of improved models, with Gaussian and exponential priors respectively. Detailed experiments reported in this paper show that, with well-selected and well-weighted language features, MaxEnt models with exponential priors are significantly better suited to the text subjectivity analysis task.
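The idea of collecting n-gram features of different orders within a distance window can be sketched as follows. This is a minimal illustration only: the function name and the exact window semantics (the leading token plus any `order - 1` companions drawn from the next `window - 1` positions) are assumptions, not the paper's Global-Filtering and Local-Weighting procedure itself.

```python
from collections import Counter
from itertools import combinations

def window_ngrams(tokens, order, window):
    """Collect n-grams of the given order whose member tokens all lie
    within a distance window of the leading token; non-contiguous
    (skip-gram-style) combinations are allowed inside the window."""
    feats = Counter()
    n = len(tokens)
    for i in range(n):
        if order == 1:
            feats[(tokens[i],)] += 1
            continue
        # candidate companion positions inside the window after i
        span = range(i + 1, min(n, i + window))
        for combo in combinations(span, order - 1):
            feats[(tokens[i],) + tuple(tokens[j] for j in combo)] += 1
    return feats
```

With `window=2` this reduces to ordinary contiguous bigrams; widening the window adds longer-distance pairs, which is the kind of trade-off the feature-selection stage would then filter and weight.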
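For the modeling side, a binary MaxEnt classifier is equivalent to logistic regression, and the two priors correspond to familiar penalties: a Gaussian prior yields an L2 penalty, while an exponential prior yields an L1-style penalty. The sketch below shows this correspondence under those standard assumptions; it is not the paper's implementation, and the hyperparameters are illustrative.

```python
import numpy as np

def maxent_nll_grad(w, X, y, prior="gaussian", alpha=0.1):
    """Negative log-likelihood and gradient of a binary MaxEnt
    (logistic-regression) model, with a penalty derived from the prior:
    Gaussian prior -> L2 penalty; exponential prior -> L1-style penalty."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))          # model probabilities
    nll = -np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = X.T @ (p - y)
    if prior == "gaussian":
        nll += 0.5 * alpha * np.sum(w ** 2)
        grad += alpha * w
    elif prior == "exponential":
        nll += alpha * np.sum(np.abs(w))
        grad += alpha * np.sign(w)
    return nll, grad

def fit(X, y, prior="gaussian", alpha=0.1, lr=0.1, steps=500):
    """Plain gradient descent on the penalized objective."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        _, g = maxent_nll_grad(w, X, y, prior, alpha)
        w -= lr * g
    return w
```

The exponential-prior penalty drives uninformative feature weights toward zero more aggressively than the Gaussian one, which is one plausible reason such models could fit a sparse subjectivity feature space better.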