NIK2009 - Combining Latent Semantic Indexing and Clustering to Retrieve and Cluster Biomedical Infor
| Authors | Jon Rune Paulsen, Heri Ramampiaro |
| Institution | NTNU |
| Publication | Norsk informatikkonferanse (NIK) |
| Publication year | 2009 |
| Page range | 131-142 |
| General link | http://www.nik.no |
| ISBN/ISBN2 | 9788251924917/ |
| ISSN/ISSN2 | 1892-0713 (print) / 1892-0721 (online)/ |
| Genre | Scientific publication |
| Category | Informatics |
| Editor | Trond Aalberg |
| Publisher | Tapir Akademisk Forlag |
| Publisher address | Nardoveien 12 7005 Trondheim |
| Language | English |
Abstract
This paper presents a document retrieval approach based on a combination of latent semantic indexing (LSI) and two different clustering algorithms. The idea
is to first retrieve papers and create initial clusters based on LSI. Then, we
use a flat clustering method to further group similar documents into clusters.
The paper also presents a new algorithm for k-means clustering that aims
to address the fact that the standard k-means algorithm is too greedy.
Our experiments show that in many cases the two-step algorithm performs
better than standard k-means. The main advantage of our method is that it
forces the centroid vector towards the extremities, and consequently gets a
completely different starting point compared to the standard algorithm. This
also makes the algorithm less greedy than the standard one. We believe
our method can be used to retrieve relevant documents from a document
collection. Our experiments have revealed that it performs well in most cases,
but also fails in some cases.
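The remark about greediness and starting points can be illustrated with a minimal sketch. This is not the paper's two-step algorithm, which the abstract does not specify. It is plain Lloyd-style k-means run twice on four toy 2-D points, which stand in for LSI document vectors (an assumption). The names `kmeans`, `sse`, `points`, `stuck` and `spread` are hypothetical. Started from two centroids in the middle of the data, the greedy updates stop at a poor local optimum. Started from two extreme points, the same procedure finds the compact clusters.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd-style k-means from the given starting centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(xs) / len(members) for xs in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # no centroid moved: converged, possibly to a local optimum
        centroids = new_centroids
    return labels, centroids


def sse(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


# Toy data: two tight pairs that are far apart along the x-axis.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Start in the middle of the data: the greedy updates never leave this split.
stuck = kmeans(points, [(2, 0), (2, 1)])
# Start from two extreme points: the same updates find the compact clusters.
spread = kmeans(points, [(0, 0), (4, 1)])

sse_stuck = sse(points, *stuck)
sse_spread = sse(points, *spread)
print(sse_stuck, sse_spread)
```

The two runs differ only in their initial centroids, so the gap in squared error comes entirely from the starting point.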
References
[1] M. W. Berry, S. T. Dumais, and G. W. O’Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573–595, 1995.
[2] A. Dasgupta, R. Kumar, P. Raghavan, and A. Tomkins. Variable latent semantic
indexing. In ACM SIGKDD 2005, 2005.
[3] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas,
and Richard A. Harshman. Indexing by latent semantic analysis. Journal of the
American Society for Information Science, 41(6):391–407, 1990.
[4] C. H. Q. Ding. A probabilistic model for latent semantic indexing. Journal of the
American Society for Information Science and Technology, 56(6):597–608, 2005.
[5] Miles Efron. Eigenvalue-based model selection during latent semantic indexing:
Research articles. J. Am. Soc. Inf. Sci. Technol., 56(9):969–988, 2005.
[6] K. M. Faraoun and A. Boukelif. Neural networks learning improvement using the
k-means clustering algorithm to detect network intrusions. IJCI, 3(2):161–168, 2006.
[7] C. Fraley and A. E. Raftery. How many clusters? Which clustering method? Answers
via model-based cluster analysis. Computer Journal, 41(8):578–588, 1998.
[8] David Gleich. SVD subspace projections for term suggestion ranking and clustering.
Technical report, Yahoo! Research Labs, 2004.
[9] J. B. Hagen. The origin of bioinformatics. Nature Reviews: Genetics, (1):231–236,
2000.
[10] Ada Hamosh, Alan F. Scott, Joanna Amberger, Carol Bocchini, David Valle,
and Victor A. McKusick. Online Mendelian Inheritance in Man (OMIM), a
knowledgebase of human genes and genetic disorders. Nucleic Acids Research,
30(1):52–55, 2002.
[11] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Kluwer
Academic Publisher, 2000.
[12] Erik Hatcher and Otis Gospodnetic. Lucene in Action. Manning Publications Co.,
209 Bruce Park Ave., Greenwich, CT 06830, 2005.
[13] B. Hendrickson. Latent semantic analysis and Fiedler embeddings. In Proceedings
of SIAM Workshop on Text Mining, 2006.
[14] L. Jing, M. K. Ng, X. Yang, and J. Z. Huang. A text clustering system based on
k-means type subspace clustering and ontology. International Journal of Intelligent
Technology, 1(2):91–103, 2006.
[15] G. Lu. Multimedia Database Management Systems. Artech House, 1999.
[16] J. B. MacQueen. Some methods for classification and analysis of multivariate
observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics
and Probability, pages 281–297. University of California Press, 1967.