Ph.D. Dissertation Defense
Mining Commonsense Knowledge from the Web:
Towards Inducing Script-like Structures From Large-scale Text Sources
Niels Kasch
10:00am Friday, March 9th, 2012, ITE 325B
Knowing the sequences of events in situations such as eating at a restaurant is an example of commonsense knowledge needed for a broad range of cognitive tasks (e.g., language understanding). This thesis outlines an approach to mine information about sequential, everyday situations in a topic-driven fashion to produce declarative, script-like representations (cf. Schank's scripts). Given a topic such as eating at a restaurant, we produce graphs of temporally ordered events involved in the activity referenced by the topic. Our work utilizes large-scale data sources (e.g., the Web) to avoid the data sparseness issues of narrow corpora.
We describe steps that address the scale and noisiness of the Web to make it accessible for script extraction. Boilerplate elements (e.g., navigation bars and advertising) on web pages skew distributional statistics of words and obstruct information retrieval tasks. To make the Web usable as a corpus, we introduce a machine learning technique to separate boilerplate elements from content in arbitrary web pages.
A key element of commonsense knowledge extraction is the generation of a topic-specific corpus that facilitates script extraction in a topic-driven manner. We introduce Concept Modeling for Scripts as an efficient method to induce concepts containing script elements (e.g., events, people, and objects) from topic-specific corpora. Our experiments and user studies conducted on the 2011 ICWSM Spinn3r dataset show that our method outperforms state-of-the-art topic-modeling approaches such as Latent Dirichlet Allocation (LDA) on this task when applied to unbalanced (topic-specific) corpora.
Concept Modeling serves as a starting point for automated methods to discover events relevant to a script. We demonstrate event detection methods in topic-specific corpora based on (1) learned dependency paths indicative of individual event structures, (2) semantic cohesiveness of event pairs, and (3) surface structures indicative of golden sentences containing sequential information. Events extracted for a given topic can be arranged in a graph. The detection methods use graph analysis to identify strongly connected components and prune the event set so that related and central events predominate in the structure. User studies demonstrate that (1) the Web is suitable for mining script-like knowledge and (2) the resulting graph structures portray events strongly related to a given topic.
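The pruning step described above can be illustrated with a minimal sketch. It assumes the event graph is a plain dictionary from an event label to the set of events it points to, and that components with fewer than `min_size` events are discarded. The function names, the `min_size` threshold, and the choice of Kosaraju's algorithm are assumptions made for this illustration and are not taken from the thesis.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm on a dict mapping node -> iterable of successors."""
    nodes = set(graph)
    for succs in graph.values():
        nodes.update(succs)

    # First pass: record nodes in order of depth-first completion.
    visited, order = set(), []

    def visit(n):
        visited.add(n)
        for m in graph.get(n, ()):
            if m not in visited:
                visit(m)
        order.append(n)

    for n in sorted(nodes):
        if n not in visited:
            visit(n)

    # Build the reversed graph.
    reverse = {n: [] for n in nodes}
    for n, succs in graph.items():
        for m in succs:
            reverse[m].append(n)

    # Second pass: explore the reversed graph in decreasing finish order.
    assigned, components = set(), []
    for n in reversed(order):
        if n in assigned:
            continue
        stack, comp = [n], set()
        assigned.add(n)
        while stack:
            cur = stack.pop()
            comp.add(cur)
            for m in reverse[cur]:
                if m not in assigned:
                    assigned.add(m)
                    stack.append(m)
        components.append(comp)
    return components


def prune_events(graph, min_size=2):
    """Keep only events in components of at least min_size (assumed threshold)."""
    keep = set()
    for comp in strongly_connected_components(graph):
        if len(comp) >= min_size:
            keep |= comp
    return {n: {m for m in graph.get(n, ()) if m in keep} for n in keep}


# Hypothetical event graph for the topic "eating at a restaurant".
events = {
    "order": {"eat"},
    "eat": {"pay", "order"},
    "pay": {"eat"},
    "sneeze": {"eat"},
}
pruned = prune_events(events)
```

In this toy graph the events "order", "eat", and "pay" are mutually reachable and survive, while "sneeze" forms a component of size one and is dropped. The recursive first pass is adequate for small graphs; a large event graph would call for an iterative traversal.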
Script-like structures, by definition, impose temporal ordering on the events contained within the structure. This work also presents a novel method to induce ordering information from topic-specific corpora based on a counting framework to judge the presence and strength of a temporal happens-before relation. The framework is extensible to several counting methods, where a counting method provides co-occurrence and ordering statistics. We present, among others, a novel naive counting method that uses a simple sentence position assumption for temporal order. Comparisons to existing temporal resources show that our naive method, in conjunction with connected components analysis, induces temporal relationships with accuracy similar to that of more sophisticated methods, yet with a smaller computational footprint.
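One way to picture a naive, position-based counting method is the sketch below. It assumes each document is a list of sentences, each sentence is a list of event labels, and an event mentioned in an earlier sentence is counted as happening before an event mentioned in a later one. The strength score (the share of ordered co-occurrences in one direction) and the `min_count` and `min_ratio` thresholds are assumptions for illustration, not the statistics used in the thesis.

```python
from collections import Counter
from itertools import combinations


def count_order(documents):
    """Count how often event a is mentioned in an earlier sentence than event b."""
    before = Counter()
    for doc in documents:
        # Flatten to (sentence_index, event) pairs in document order.
        mentions = [(i, ev) for i, sent in enumerate(doc) for ev in sent]
        for (i, a), (j, b) in combinations(mentions, 2):
            # Skip identical events and events sharing a sentence,
            # where sentence position gives no ordering evidence.
            if a == b or i == j:
                continue
            before[(a, b)] += 1
    return before


def happens_before(before, a, b, min_count=2, min_ratio=0.6):
    """Return the strength of 'a happens before b', or None if unsupported."""
    ab, ba = before[(a, b)], before[(b, a)]
    total = ab + ba
    if total < min_count:
        return None  # too few co-occurrences to judge presence
    ratio = ab / total
    return ratio if ratio >= min_ratio else None


# Hypothetical topic-specific corpus of three short documents.
corpus = [
    [["enter"], ["order"], ["eat"], ["pay"]],
    [["order"], ["eat"], ["pay"]],
    [["eat"], ["order"]],
]
counts = count_order(corpus)
strength = happens_before(counts, "order", "eat")
```

Here "order" precedes "eat" in two documents and follows it in one, so the pair is accepted with a strength of two thirds, while the reverse direction is rejected. Other counting methods could be substituted by supplying a different `count_order` that returns the same kind of pair counts.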
Committee
- Dr. Tim Oates (chair)
- Dr. Ronnie W. Smith
- Dr. Matt Schmill
- Dr. Tim Finin
- Dr. Charles Nicholas