Text mining is the automatic discovery of new information from collections of written material. It goes beyond familiar information retrieval applications (e.g., web search engines like Google) and can allow users to find knowledge that is not explicitly contained in any one document or even known to the documents' authors. In this seminar, we will survey the state of the art in text mining technology, with a focus on practical applications such as the analysis of affective language and the construction of social networks.
Prerequisite: LING 571 or permission of instructor. (No previous knowledge of statistics is required!)
The goals of this course are for us to gain experience in:
Throughout the term, participants (including auditors!) will present and discuss articles from the reading list, which cover a number of aspects of text and web mining.
In addition to leading and participating in discussions, students taking the class for a grade will also prepare a final project. Projects should involve text mining and NLP in some way but need not be restricted to the methods we cover in class. Ideally, the final project should be something that could be submitted to one of the many computational linguistics or data mining conferences.
The final grade will be based on class participation and on a project that applies text mining technology to a useful and interesting problem:
Working in groups (of 2 or 3) is strongly encouraged!
There is one required textbook for this course:
and one optional textbook:
Both of these books are available in the campus bookstore.
Beineke, Philip, Trevor Hastie and Shivakumar Vaithyanathan. 2004. "The sentimental factor: Improving review classification via human-provided information." In Proceedings of the ACL 2004. [pdf]
Boykin, P. Oscar and Vwani Roychowdhury. 2004. "Personal email networks: An effective anti-spam tool." Manuscript, UCSD. [pdf]
Caraballo, Sharon A. 1999. "Automatic acquisition of a hypernym-labeled noun hierarchy from text." In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 120–126. [pdf]
Farahat, Ayman and Francine Chen. 2004. "The generalized web surfer." In Proceedings of RIAO, pages 366–379. [pdf]
Gruhl, Daniel, R. Guha, David Liben-Nowell, and Andrew Tomkins. 2004. "Information diffusion through Blogspace." In Proceedings of WWW 2004, pages 491–501. [pdf]
Hatzivassiloglou, Vasileios and Kathleen R. McKeown. 1997. "Predicting the semantic orientation of adjectives." In Proceedings of the 35th Annual Meeting of the ACL, pages 174–181. [pdf]
Lin, Dekang. 1998. "Automatic retrieval and clustering of similar words." In Proceedings of COLING/ACL '98, pages 768–774, Montreal, Canada, August. [pdf]
Maedche, Alexander and Steffen Staab. 2000. "Discovering conceptual relations from text." In W. Horn (ed.), ECAI 2000: Proceedings of the 14th European Conference on Artificial Intelligence, Berlin, August 21–25. [pdf]
Mullen, Tony and Nigel Collier. 2004. "Sentiment analysis using support vector machines with diverse information sources." In Proceedings of EMNLP 2004. [pdf]
Nakov, Preslav I., Ariel S. Schwartz, and Marti A. Hearst. 2004. "Citances: Citation sentences for semantic analysis of bioscience text." In Proceedings of the SIGIR'04 Workshop on Search and Discovery in Bioinformatics. [pdf]
Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. 2002. "Thumbs up? Sentiment classification using machine learning techniques." In Proceedings of EMNLP 2002, pages 79–86. [pdf]
Pang, Bo and Lillian Lee. 2004. "A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts." In Proceedings of the 42nd ACL, pages 271–278. [pdf]
Parekh, Viral, Jack P. Gwo, and Tim Finin. 2004. "Mining domain specific texts and glossaries to evaluate and enrich domain ontologies." In Proceedings of the 2004 International Conference on Information and Knowledge Engineering, Las Vegas, June 21–24. [pdf]
Richardson, Matthew and Pedro Domingos. 2002. "Mining knowledge-sharing sites for viral marketing." In Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining, pages 61–70. [pdf]
Riloff, Ellen and Janyce Wiebe. 2003. "Learning extraction patterns for subjective expressions." In Proceedings of EMNLP 2003, pages 105–112. [pdf]
Rosario, Barbara and Marti A. Hearst. 2004. "Classifying semantic relations in bioscience text." In Proceedings of the ACL 2004. [pdf]
Rosario, Barbara, Marti A. Hearst, and Charles Fillmore. 2002. "The descent of hierarchy, and selection in relational semantics." In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02). [pdf]
Turney, Peter D. 2002. "Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews." In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02), pages 417–424. [pdf]
Wilson, Theresa, Janyce Wiebe and Rebecca Hwa. 2004. "Just how mad are you? Finding strong and weak opinion clauses." In Proceedings of the 19th National Conference on Artificial Intelligence (AAAI-2004). [pdf]
Week 2 Information retrieval and extraction
Week 3 Clustering and vector space models
Week 5 Machine learning
Week 12 Social networks
Week 13 Thanksgiving
Week 16 Presentations