San Diego State University
Seminar in linguistics: Text mining

Linguistics 795


Fall 2004
W 7:00–9:40
Room COM-206

Text mining is the automatic discovery of new information from collections of written material. Text mining goes beyond familiar information retrieval applications (e.g., web search engines like Google), and can allow users to find knowledge which is not explicitly contained in any one document or even known to the documents' authors. In this seminar, we will survey the state of the art for text mining technology, with a focus on practical applications such as analysis of affective language and construction of social networks.

Prerequisite: LING 571 or permission of instructor. (No previous knowledge of statistics is required!)


Rob Malouf
Office: BA 310A
Office Hours: Mondays 3:30–5:00, or by appointment
Phone: (619) 594-7111


The goals of this course are for us to gain experience in:

  • exploring the state of the art of linguistically motivated text mining methods,
  • reading and evaluating the primary literature,
  • presenting and discussing research material with peers,
  • identifying open research questions,
  • and designing and carrying out our own experiments.

Throughout the term, participants (including auditors!) will present and discuss articles from the reading list, which cover a number of aspects of text and web mining.

In addition to leading and participating in discussions, students taking the class for a grade will also prepare a final project. Projects should somehow involve text mining and NLP, but need not be restricted to the methods we cover in class. Ideally, the final project should be something that could be submitted to one of the many computational linguistics or data mining conferences.

The final grade will be based on class participation and on a project that applies text mining technology to a useful and interesting problem:

Component                     Due      Weight
Project proposal (<1 page)    Sep 29   10%
Annotated bibliography        Oct 13   10%
Data set                      Dec 1    30%
Final project                 Dec 17   35%
Class participation                    15%
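
As a rough illustration of how the weights listed above combine, the sketch below computes a weighted final grade. It reads the data set row as due Dec 1 and worth 30%, the only reading under which the weights total 100%. The function name `final_grade` and the example scores are hypothetical and not part of the course policy.

```python
# Weights from the grading table (percent of the final grade).
# Assumption: the data set counts for 30%, so that the weights sum to 100.
WEIGHTS = {
    "Project proposal": 10,
    "Annotated bibliography": 10,
    "Data set": 30,
    "Final project": 35,
    "Class participation": 15,
}


def final_grade(scores):
    """Combine per-component scores (0-100) into a weighted final grade."""
    if sum(WEIGHTS.values()) != 100:
        raise ValueError("weights must sum to 100")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example_scores = {
    "Project proposal": 90,
    "Annotated bibliography": 80,
    "Data set": 85,
    "Final project": 88,
    "Class participation": 100,
}

print(final_grade(example_scores))
```

The data set and the final project together account for 65% of the grade, so most of the weight falls on the project work due in December.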

Working in groups (of 2 or 3) is strongly encouraged!


There is one required textbook for this course:

Soumen Chakrabarti. 2003. Mining the Web. Elsevier.

and one optional textbook:

Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.

Both of these books are available in the campus bookstore.

Additional readings:

Beineke, Philip, Trevor Hastie, and Shivakumar Vaithyanathan. 2004. "The sentimental factor: Improving review classification via human-provided information." In Proceedings of the ACL 2004.

Boykin, P. Oscar and Vwani Roychowdhury. 2004. "Personal email networks: An effective anti-spam tool." Manuscript, UCSD.

Caraballo, Sharon A. 1999. "Automatic acquisition of a hypernym-labeled noun hierarchy from text." In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 120–126.

Farahat, Ayman and Francine Chen. 2004. "The generalized web surfer." In Proceedings of RIAO, pages 366–379.

Gruhl, Daniel, R. Guha, David Liben-Nowell, and Andrew Tomkins. 2004. "Information diffusion through Blogspace." In WWW 2004, pages 491–501.

Hatzivassiloglou, Vasileios and Kathleen R. McKeown. 1997. "Predicting the semantic orientation of adjectives." In Proceedings of the 35th Annual Meeting of the ACL, pages 174–181.

Lin, Dekang. 1998. "Automatic retrieval and clustering of similar words." In Proceedings of COLING/ACL '98, pages 768–774, Montreal, Canada, August.

Maedche, Alexander and Steffen Staab. 2000. "Discovering conceptual relations from text." In W. Horn (ed.), ECAI 2000: Proceedings of the 14th European Conference on Artificial Intelligence, Berlin, August 21–25.

Mullen, Tony and Nigel Collier. 2004. "Sentiment analysis using support vector machines with diverse information sources." In Proceedings of EMNLP 2004.

Nakov, Preslav I., Ariel S. Schwartz, and Marti A. Hearst. 2004. "Citances: Citation sentences for semantic analysis of bioscience text." In Proceedings of the SIGIR'04 Workshop on Search and Discovery in Bioinformatics.

Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. 2002. "Thumbs up? Sentiment classification using machine learning techniques." In Proceedings of EMNLP 2002, pages 79–86.

Pang, Bo and Lillian Lee. 2004. "A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts." In Proceedings of the 42nd ACL, pages 271–278.

Parekh, Viral, Jack P. Gwo, and Tim Finin. 2004. "Mining domain specific texts and glossaries to evaluate and enrich domain ontologies." In Proceedings of the 2004 International Conference on Information and Knowledge Engineering, Las Vegas, June 21–24.

Richardson, Matthew and Pedro Domingos. 2002. "Mining knowledge-sharing sites for viral marketing." In Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining, pages 61–70.

Riloff, Ellen and Janyce Wiebe. 2003. "Learning extraction patterns for subjective expressions." In Proceedings of EMNLP 2003, pages 105–112.

Rosario, Barbara and Marti A. Hearst. 2004. "Classifying semantic relations in bioscience text." In Proceedings of the ACL 2004.

Rosario, Barbara, Marti A. Hearst, and Charles Fillmore. 2002. "The descent of hierarchy, and selection in relational semantics." In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02).

Turney, Peter D. 2002. "Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews." In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02), pages 417–424.

Wilson, Theresa, Janyce Wiebe, and Rebecca Hwa. 2004. "Just how mad are you? Finding strong and weak opinion clauses." In Proceedings of the 19th National Conference on Artificial Intelligence (AAAI-2004).


Week 1 Introduction
Chakrabarti Ch. 1, Hearst (1999) [Rob]
slides, handout

Week 2 Information retrieval and extraction
Ch. 3 [Henry]

Week 3 Clustering and vector space models
GATE/ANNIE [Rob], Ch. 4 [Hannah]

Week 4 Clustering and vector space models (cont.)
infomap [Lucien], cluto [Rebecca], Ch. 5 [Emily]

Week 5 Machine learning
Ch. 5 [Emily], rainbow [Carol]

Week 6 Machine learning (cont.)
pafi [Erin], weka [Mei], Ch. 6 [Melissa]

Week 7 Data collection
Ch. 2 [Yoon Hee], HarvestMan [Lucien/Rebecca], web scraping [Rob]

Week 8 Data collection (cont.), Ontologies
Google API [Rob], Lin (1998), Caraballo (1999) [Erin], Maedche and Staab (2000), Parekh et al. (2004) [Rebecca]

Week 9 Sentiment analysis
Hatzivassiloglou and McKeown (1997) [Emily], Turney (2002) [Carol], Pang, et al. (2002) [Hannah]

Week 10 Sentiment analysis (cont.)
Riloff and Wiebe (2003) [Mei], Pang and Lee (2004) [Lucien], Beineke, et al. (2004) [Henry]

Week 11 Sentiment analysis (cont.)
Mullen and Collier (2004) [Yoon Hee], Wilson, et al. (2004) [Melissa]

Week 12 Social networks
Chakrabarti Ch. 7 [Erin]

Week 13 Thanksgiving
Enjoy your turkey (or suitable alternative)!

Week 14 Social networks (cont.)
Farahat and Chen (2004) [Carol], Boykin and Roychowdhury (2004) [Mei], Richardson and Domingos (2002) [Emily], Gruhl, et al. (2004) [Hannah]

Week 15 Biotext applications
Rosario, Hearst, and Fillmore (2002) [Henry], Nakov, Schwartz, and Hearst (2004) [Lucien], Rosario and Hearst (2004) [Yoon Hee]

Week 16 Presentations

  • Melissa Daggett
    Emily Wilson

    Our project takes military emails as input and attempts to classify them by message type (CASREP, GENADMIN, VOI OPTELINT, USCGC, etc.) using Rainbow. Latent semantic indexing will also be applied to the data, using InfoMap. We hope that a visualization graph will reveal previously unknown relationships between the institutions involved in this email process and suggest a better way to format and route these types of messages. Data sparseness is an issue in this study; future work will involve replicating it and building on the lessons learned during this pilot.
  • Website Bias Estimation with Combined Maximum Entropy Modeling and Network Analysis
    Lucien Carroll
    Erin Stevenson
    Rebecca Sinclair Colavin

    The most popular search engines answer queries using a combination of relevance and prestige rankings. However, these two measures share a fundamental flaw: they do not specifically distinguish between sites that attempt to give a balanced viewpoint and those that are making an informational choice for the reader. We describe a two-pronged approach to classifying websites according to their bias. The network analysis system uses a selection of in- and out-links from websites to construct a network where websites are clustered according to their bias. Because some websites are insufficiently linked to the network, a maximum entropy lexical model is used to grade the subjectivity of each site.
Last modified: Sat Nov 8 15:53:41 PST 2008