
I'd like to offer this jar containing 19 multi-class (1-of-n) text 
datasets, whose word count feature vectors have already been extracted. 
I thought it would be good to post them at the UCI repository and the 
WEKA datasets site, if you are interested. The jar is 14MB compressed.

The problems come from LA Times, TREC, OHSUMED, etc., and the data were 
originally converted to word counts by:

Han, E. and Karypis, G.   Centroid-Based Document Classification: 
Analysis & Experimental Results. In Proc. of the 4th European Conf. on 
the Principles of Data Mining and Knowledge Discovery (PKDD): 424-431, 
2000.

I have found them quite useful for studies, e.g.

G. Forman & I. Cohen.  Learning from Little: Comparison of Classifiers 
Given Little Training.  ECML'04.  Hewlett-Packard Labs TR HPL-2004-19R1.

G. Forman.  A Pitfall and Solution in Multi-Class Feature Selection for 
Text Classification.  ICML'04.  HPL-2004-86.

G. Forman.  An Extensive Empirical Study of Feature Selection Metrics 
for Text Classification. Special Issue on Variable and Feature 
Selection, Journal of Machine Learning Research, 3(Mar):1289-1305, 2003. 
HPL-2002-147R1.  (Their web-site has a subset of these datasets, but it 
only includes binary features: whether a word occurs 1 or more times.)
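
To make the distinction between word counts and binary features concrete, 
here is a minimal sketch in Python. The function name and the example 
vector are hypothetical illustrations, not taken from the datasets or 
from any of the cited papers.

```python
def binarize(counts):
    """Map each word count to 1 if the word occurs at least once, else 0."""
    return [1 if c >= 1 else 0 for c in counts]


# Hypothetical word-count vector for one document (illustrative values only).
example_counts = [0, 3, 1, 0, 7]

# Binary version: records only whether each word occurs, not how often.
example_binary = binarize(example_counts)
print(example_binary)
```

Binary features can always be derived from word counts in this way, but 
the counts cannot be recovered from the binary form.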

  George Forman
  
