Using linguistic information as features for text categorization
We report on experiments using linguistic information as additional features in a classical Vector Space Model [10]. Information extracted from every word, such as its part of speech, stem, and lexical root, has been combined in different ways to test for a possible improvement in classification performance across several algorithms, including SVM [3], BBR [] and PLAUM [6].

Automatic Text Classification, also known as Automatic Text Categorization, assigns documents to a predefined set of classes. Extensive research has been carried out on this subject [11], and a wide range of techniques are applicable to the task: feature extraction [5], feature weighting, dimensionality reduction [4], machine learning algorithms, and more. Moreover, the classification task can be binary (one of two possible classes to select), multi-class (one class out of a larger set), or multi-label (a subset of classes from a larger set of potential candidates). In most cases, the latter two can be reduced to binary decisions [1], as the algorithm used in our experiments does [8].

To verify the contribution of the new features, we combined them into the vector space model by preprocessing the Reuters-21578 collection, a data set well known to the research community devoted to text categorization problems [2].
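As a rough illustration of how such linguistic features can be folded into a bag-of-words representation, the sketch below combines each token with its part-of-speech tag or stem before counting term frequencies. The combination modes, feature names, and `(word, pos, stem)` input format are assumptions for illustration, not the paper's actual scheme.

```python
from collections import Counter

def combine_features(tokens, mode="word+pos"):
    """Build combined lexical/linguistic features for one document.

    `tokens` is a list of (word, pos, stem) triples, e.g. produced by a
    tagger and a stemmer; the modes here are illustrative only.
    """
    feats = []
    for word, pos, stem in tokens:
        if mode == "word":            # plain bag of words
            feats.append(word.lower())
        elif mode == "stem":          # conflate inflected forms
            feats.append(stem.lower())
        elif mode == "word+pos":      # disambiguate by POS tag
            feats.append(f"{word.lower()}|{pos}")
        elif mode == "stem+pos":      # stem and POS combined
            feats.append(f"{stem.lower()}|{pos}")
    # Term-frequency vector as a sparse mapping from feature to count;
    # a real pipeline would then apply weighting (e.g. tf-idf).
    return Counter(feats)

doc = [("running", "VBG", "run"), ("dogs", "NNS", "dog"), ("run", "VBP", "run")]
print(combine_features(doc, "stem"))
```

With the `stem` mode, "running" and "run" collapse into a single feature, which is one way stemming can reduce dimensionality before training a classifier such as an SVM.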