(Nominated for Best Student Paper)
Textual data, ranging from corpora of digitized historic documents to large collections of news feeds, provide a rich source of temporal and geographic information. Such information has recently gained a lot of interest in support of different search and exploration tasks, e.g., organizing news along a timeline or placing the origin of documents on a map. However, temporal and geographic information embedded in documents is often considered in isolation for these tasks. We claim that combining such information into (chronologically ordered) event-like features makes interesting and meaningful search and exploration tasks possible.
In this paper, we present a framework for the extraction, exploration, and visualization of event information in document collections. For this, one has to identify and combine temporal and geographic expressions from documents, thus enriching a document collection with a set of normalized events. Traditional search queries can then be enriched by conditions on the events relevant to the search subject. These conditions can be efficiently evaluated using index structures for the temporal and geographic components. Most important for our event-centric approach is that a search result consists of a sequence of events relevant to the search terms and not just a document hit-list. Such events can originate from different documents and can be further explored; in particular, events relevant to a search query can be ordered chronologically. We demonstrate the utility of our framework by different (multilingual) search and exploration scenarios using a Wikipedia corpus.
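The event-centric search described in this abstract can be illustrated with a minimal sketch. The abstract does not specify a data model or index structure, so everything here is an assumption made for illustration: the `Event` record, the `search_events` function, the bounding-box condition, and the sample data are all hypothetical, and a linear scan stands in for the temporal and geographic indexes.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    """Hypothetical normalized event: a date paired with a place from one document."""
    when: date
    place: str
    lat: float
    lon: float
    doc_id: str


def search_events(events, start, end, bbox):
    """Return events inside a date range and a bounding box, in chronological order.

    bbox is (min_lat, min_lon, max_lat, max_lon). A real system would use
    temporal and spatial indexes; this linear scan only illustrates the idea.
    """
    min_lat, min_lon, max_lat, max_lon = bbox
    hits = [
        e for e in events
        if start <= e.when <= end
        and min_lat <= e.lat <= max_lat
        and min_lon <= e.lon <= max_lon
    ]
    return sorted(hits, key=lambda e: e.when)


# Invented sample events taken from different (hypothetical) documents.
SAMPLE_EVENTS = [
    Event(date(1989, 11, 9), "Berlin", 52.52, 13.40, "doc-a"),
    Event(date(1961, 8, 13), "Berlin", 52.52, 13.40, "doc-b"),
    Event(date(1969, 7, 20), "Houston", 29.76, -95.37, "doc-c"),
]

# Events in a box around central Europe between 1950 and 2000.
europe = (45.0, 5.0, 55.0, 20.0)
result = search_events(SAMPLE_EVENTS, date(1950, 1, 1), date(2000, 1, 1), europe)
```

The result is a chronologically ordered sequence of events that may come from different documents, rather than a ranked list of documents.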
A Large-Scale Comparison of Methods for Determining Topical Similarity for Scientific Documents
Mario Lipinski; Wolf-Tilo Balke; Bela Gipp
Bibliometrics is an important topic for establishing document similarity in building up digital libraries and retrieving documents from large digital collections. However, there is a wide variety of techniques to measure document similarity, and an actual comparison of different methods proves difficult: since many methods are rather similar, the choice of the respective evaluation corpora and the ground truth used may strongly influence each method's performance. Since there is no established gold standard to evaluate competing methods in terms of their respective precision and recall, comparisons are mostly performed over small collections only. In this paper we conduct a large-scale comparison of text-based and citation-based similarity measures for a broad class of scientific documents from the area of bio-medical research taken from the PubMed corpus. Using the respective MeSH classifications as ground truth, our results show that although text-based similarity measures show superior performance over more heterogeneous collections, citation-based methods are still valuable for corpora with certain characteristics.
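To make the distinction between the two families of measures concrete, here is a small sketch. The abstract does not name the specific measures that were compared, so the two below are illustrative stand-ins chosen as assumptions: cosine similarity over raw term counts for the text-based side, and a Jaccard-normalized bibliographic coupling score (overlap of reference lists) for the citation-based side. The function names and the sample identifiers are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Text-based stand-in: cosine of the angle between raw term-count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def coupling_similarity(refs_a, refs_b):
    """Citation-based stand-in: shared references divided by all references."""
    a, b = set(refs_a), set(refs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Invented examples: two short texts and two reference lists.
text_score = cosine_similarity("protein folding in yeast", "protein folding in yeast")
cite_score = coupling_similarity(["ref1", "ref2", "ref3"], ["ref2", "ref3", "ref4"])
```

An evaluation along the lines of the abstract would then check, for each measure, how well high-scoring document pairs agree with a ground truth such as shared MeSH classifications.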
Concept Chaining utilizing Meronyms in Text Characterization
Lori Watrous-Deversterre; Chong Wang; Min Song
For most people, the web is the first source consulted to answer a question prompted by curiosity, need, or research. This phenomenon is due to the internet's ubiquitous access, ease of use, and extensive, ever-expanding content. The problem is no longer the need to acquire content to encourage use, but to provide organizational tools that support content categorization and thereby facilitate improved access methods. This paper presents the results of a new text characterization algorithm that combines semantic and linguistic techniques utilizing domain-based ontology background knowledge. It explores the combination of meronym, synonym, and hypernym linguistic relationships to create a set of concept chains used to represent concepts found in a document. The experiments show improved accuracy over bag-of-words based term weighting methods and reveal characteristics of the meronym contribution to document representation.
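The idea of linking document terms through meronym, synonym, and hypernym relations can be sketched as follows. The abstract does not describe the actual chaining algorithm or the ontology used, so this is only one plausible reading under stated assumptions: the hand-written `RELATIONS` table is a hypothetical stand-in for ontology lookups, and a chain is taken to be a group of document terms connected by any of the three relation types.

```python
from collections import defaultdict

# Hypothetical toy relations standing in for ontology lookups:
# (term, related term, relation type).
RELATIONS = [
    ("car", "automobile", "synonym"),
    ("wheel", "car", "meronym"),      # a wheel is part of a car
    ("car", "vehicle", "hypernym"),   # a vehicle is a broader term for car
    ("petal", "flower", "meronym"),   # a petal is part of a flower
]


def build_chains(terms, relations):
    """Group the document's terms into chains of directly or indirectly related terms."""
    present = set(terms)
    graph = defaultdict(set)
    for a, b, _kind in relations:
        # Only link terms that both occur in the document.
        if a in present and b in present:
            graph[a].add(b)
            graph[b].add(a)

    chains, seen = [], set()
    for term in terms:
        if term in seen:
            continue
        # Depth-first walk collecting every term reachable from this one.
        stack, chain = [term], set()
        while stack:
            current = stack.pop()
            if current in chain:
                continue
            chain.add(current)
            stack.extend(graph[current] - chain)
        seen |= chain
        chains.append(sorted(chain))
    return chains


doc_terms = ["wheel", "car", "vehicle", "petal", "flower", "tax"]
chains = build_chains(doc_terms, RELATIONS)
```

In this toy example, "wheel" joins the "car" chain only through a meronym link, which hints at how part-whole relations can pull in terms that synonym and hypernym links alone would miss.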