IR+NLP:
My main line of research lies at the intersection of information retrieval, semantic annotations, knowledge graphs, and machine learning. Much of my research, and that of the Ph.D. students in the “TREMA” lab, revolves around the vision of generating comprehensive articles for user-provided topics. This vision requires solving open research questions such as: How to identify which concepts (called entities) to mention? How to find supporting text in a large collection? How to identify whether a text that mentions a relevant concept is noteworthy for the user-provided topic? How to predict whether relations between concepts (as extracted from text or obtained from a knowledge graph) are relevant in the context? How to arrange relevant text passages into meaningful subtopics?
This line of work was honored with an NSF CAREER Award on “Utilizing Fine-grained Knowledge Annotations in Text Understanding and Retrieval” (January 2019 – December 2023).
Ongoing research efforts have led to a full paper and three short papers at the IR flagship conference ACM SIGIR (Dietz 2019; Chatterjee and Dietz 2021; Litschko et al. 2019; Kadry and Dietz 2017), papers at prime venues such as CIKM (Ramsdell and Dietz 2020), ICTIR (Chatterjee and Dietz 2019; Weiland et al. 2016), and ECIR (Dalton et al. 2019), and journal articles (Dietz and Dalton 2020; Nanni, Ponzetto, and Dietz 2020; Weiland et al. 2018, 2017). Our work won a best-paper award at JCDL (an A* conference) in 2018 (Nanni, Ponzetto, and Dietz 2018). Our ideas were discussed at several conference workshops (Oza and Dietz 2021; Kashyapi and Dietz 2021; Magnusson and Dietz 2019; Basu, Dietz, and Fellbaum 2018).
I presented this work in several conference keynotes (ECIR, AKBC, SPIRE) and invited talks at renowned universities. I presented conference tutorials at ICTIR 2016, WSDM 2017, and SIGIR 2018, and organized the workshop series “KG4IR”, which was held twice at the IR flagship conference ACM SIGIR, as well as the NAACL workshop on Extracting Structured Knowledge from Scientific Publications (ESSP). I also edited a special issue of the Journal for Information Retrieval (IRJ).
The testbed for this vision, with benchmarks, evaluation protocols, and strong reference methods, has been developed within the TREC Complex Answer Retrieval challenge, which I coordinated from 2017 to 2019 with advice from members of the National Institute of Standards and Technology (NIST). It is a great honor that my track was selected by the TREC evaluation venue, since empirical system evaluation is central to my research field and TREC is a highly selective venue.
Watershed Data Science:
I have been developing a parallel research initiative on data science for studying storm events in watersheds with Adam Wymore and other faculty from the department of Natural Resources and the Environment (NRESS). The work of MS/PhD student Sepideh Koohfar and two undergrad capstone projects have led to a rigorous data processing pipeline with automatic storm event detection and a method for forecasting the solute concentration response to expected storm events.
The work has been awarded a seed grant from the NSF-funded Northeast Big Data Innovation Hub. Joint work is under submission to the AGU Fall Meeting.
Other Interests:
In general, I am interested in developing machine learning methods for analyzing, classifying, predicting, and tagging sequential data. The underlying technology impacts both my work on IR+NLP and my work on Watershed Data Science. It is also why I like to work with students who are interested in various data domains such as music or social media.
Because of my expertise in algorithms and empirical system evaluation, I am consulted by members of the open-source community that supports GHC, the compiler for the functional programming language Haskell. This has led to some research publications on non-moving garbage collectors for the Haskell runtime system.
Together with my students, I am working on methods to automatically, and in a query-driven manner, retrieve materials from the Web and compose Wikipedia-like articles. Especially for information needs about which the user has very little prior expert knowledge, the web search paradigm of 10 blue hyperlinks is not sufficient. Instead, we envision providing a synthesis of the Web materials that strives to mimic the comprehensiveness of Wikipedia articles. We limit ourselves to a content-only setting where query-log, click, or session information is not available. Consequently, we aim to maximize the utility of information retrieval models in combination with methods from natural language processing. A particular emphasis is on utilizing information from structured knowledge resources such as Wikipedia, Freebase, or DBpedia together with text-based reasoning on general document and Web corpora.
An early feasibility study was presented at AKBC 2014, and a later demo was presented at the ESAIR workshop at CIKM 2015 (demo). The method paper for the demo is under submission (information available on request).
Closely related work on reranking entities for web queries was presented at CIKM 2015 (appendix), and work on using relation extraction in information retrieval was presented at ECIR 2016 (supervised relations) and SIGIR 2017 (OpenIE).
The project was awarded an Amazon AWS in Education research grant and a stipend from the Eliteprogramm for Postdoktorandinnen und Postdoktoranden of the Baden-Württemberg Stiftung.
With Federico Nanni, I am working on building document collections for events. We found that entity links are too unspecific, as the same entity can be mentioned in different contexts (we call them entity aspects). In our JCDL 2018 paper on entity aspect linking, we demonstrated that such aspects can be harvested from the section headings of the entity’s Wikipedia article. To post-process entity links, we propose a method for entity-aspect linking that refines the entity link with aspect information. When applied to retrieval problems, aspect linking improved the accuracy of rankings and classifications. This work received a best paper award at JCDL 2018.
We provide a large benchmark for training and evaluation of entity aspect linking (Ramsdell and Dietz 2020). In our latest SIGIR paper, we demonstrate the added benefits of using entity aspects for entity-oriented search tasks (Chatterjee and Dietz 2021).
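To make the idea of entity aspect linking concrete, the following is a minimal sketch and not the method from the JCDL 2018 paper. It assumes each aspect of an entity is represented by a section heading and its section text, and it picks the aspect whose text shares the most tokens with the context of the mention. The function names (`tokenize`, `link_aspect`) and the example data are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def link_aspect(context, aspects):
    """Return the aspect heading whose section text shares the most tokens
    with the mention context (a simple token-overlap baseline; ties go to
    the aspect listed first)."""
    context_counts = Counter(tokenize(context))

    def overlap(item):
        _, section_text = item
        section_counts = Counter(tokenize(section_text))
        # Multiset intersection: shared tokens, counted with multiplicity.
        return sum((context_counts & section_counts).values())

    best_heading, _ = max(aspects.items(), key=overlap)
    return best_heading


# Hypothetical aspects of a single entity, keyed by section heading.
aspects = {
    "Cultivation": "farmers grow the crop in orchards and harvest the fruit in autumn",
    "Nutrition": "the fruit contains fiber vitamins and sugar per serving",
}

context = "Local farmers harvest the orchards every autumn."
print(link_aspect(context, aspects))
```

A stronger linker would likely replace the token overlap with learned similarity features, but the interface stays the same: a mention context goes in, and one aspect of the already linked entity comes out.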
For many years, I have been interested in unsupervised algorithms for identifying shared aspects and quantifying influence in social networks. Work on symmetric networks was published at ICWSM 2012 (Code & Supplement) and work on asymmetric networks at ICML 2007 (talk – Supplement).
In my work at SIGIR 2019 (Dietz 2019), I propose a method for incorporating entity, neighbor, and text information into an entity ranking task. The underlying framework represents neighbor and text information to predict edge weights in an entity-relation graph, optimizing for a list-wise learning-to-rank criterion. Paper – appendix – video
My PhD thesis focused on topic models and other generative models for data with link structure.
From 2017 to 2019, I coordinated the Complex Answer Retrieval track at the Text Retrieval Conference (TREC). It is an international evaluation track on how to retrieve the best passages and entities for topics in popular science and society. For more information about the data, task, and evaluation, please see the official TREC Complex Answer Retrieval site.
Track overview papers:
L. Dietz, M. Verma, F. Radlinski, N. Craswell (2017). TREC Complex Answer Retrieval Overview. In TREC. (Year 1)
L. Dietz, B. Gamari, J. Dalton, N. Craswell (2018). TREC Complex Answer Retrieval Overview. In TREC. (Year 2)
L. Dietz, B. Gamari, J. Foley (2019). TREC CAR Y3: Complex Answer Retrieval Overview. In TREC. (Year 3)