Improving text categorization using hyperlinks in
|Authors||R. Rajendra Prasath and Sudeshna Sarkar|
|Institution||Norwegian University of Science and Technology|
|Publication||Norwegian Artificial Intelligence Symposium (NAIS)|
|Keywords||Machine Learning, Text Categorization, Feature|
|Editors||Anders Kofod-Petersen, Helge Langseth and Odd Eirik Gundersen|
|Publisher||Tapir Akademisk Forlag|
|Publisher address||Nardoveien 12, 7005 Trondheim|
Abstract
Traditional text categorization systems use the Bag of Words (BoW)
approach, which is unable to achieve high categorization accuracy on text
documents because categorization depends merely on the occurrence of
keywords. To understand the actual context behind these keywords,
it is essential to induce additional features from external knowledge
sources. In this work, we attempt to enrich the terms in documents
using hyperlink structures present in vast repositories of human
knowledge, in this case Wikipedia. First, hyperlink text vectors are
extracted for each featured topic in Wikipedia. Then the input text
documents are analyzed, and selected features from them are mapped
onto hyperlink text vectors, which generate features related to the
context of the text fragments in a high-dimensional space. The
re-represented text documents are then classified with the generated
hyperlink features. The simulation results show that computing associated
word relations in a given text fragment, using the extracted hyperlink
text vectors without an explicit semantic analyzer, improves
categorization accuracy. The classifiers used in our experiments are
Naive Bayes and k-Nearest Neighbor. Categorization accuracy is measured
on the standard Reuters-21578 dataset.
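
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `HYPERLINK_VECTORS` table, the `WIKI:` feature prefix, the overlap-count weighting, the `min_overlap` threshold, and the toy training set are all assumptions made for illustration. A real system would extract the hyperlink text vectors from Wikipedia itself and evaluate on Reuters-21578. The sketch enriches each document's bag of words with topic features whenever its terms overlap a topic's hyperlink terms, then classifies with a cosine-similarity k-nearest-neighbor vote.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for hyperlink text vectors: each topic maps to
# terms assumed to appear as hyperlink text on that topic's page.
HYPERLINK_VECTORS = {
    "Petroleum": {"crude", "oil", "opec", "barrel", "refinery"},
    "Wheat": {"grain", "harvest", "cereal", "crop", "wheat"},
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def enrich(text, min_overlap=1):
    """Return bag-of-words counts plus generated topic features.

    A topic feature is added when the document shares at least
    `min_overlap` terms with that topic's hyperlink terms; its weight
    is the overlap size (an illustrative choice, not the paper's).
    """
    features = Counter(tokenize(text))
    terms = set(features)
    for topic, link_terms in HYPERLINK_VECTORS.items():
        overlap = len(terms & link_terms)
        if overlap >= min_overlap:
            features["WIKI:" + topic] += overlap
    return features


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, text, k=1):
    """Label `text` by majority vote among its k most similar training docs."""
    query = enrich(text)
    ranked = sorted(
        train, key=lambda ex: cosine(query, enrich(ex[0])), reverse=True
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set (hypothetical examples, not Reuters-21578 documents).
TRAIN = [
    ("OPEC raises crude output", "crude"),
    ("Grain harvest falls short", "grain"),
]

if __name__ == "__main__":
    print(knn_predict(TRAIN, "refinery shuts after barrel price drop"))
```

In this toy setup the query shares no surface words with either training document, so plain BoW similarity is zero for both. The generated `WIKI:Petroleum` feature is what links the query to the first training document, which is the effect the abstract attributes to hyperlink-derived features.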