CS 953 Adv Top / Data Science for Knowledge Graphs and Text

Implementation-intensive Graduate-level Seminar

This course counts towards your implementation-intensive requirements.

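This means that your grade will be split as follows: 70% for the quality of the implemented prototype and 30% for class participation (see Grading Policy).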

As a running example of a data science application, we will focus on a shared task called Complex Answer Retrieval (CAR) [2], hosted by the Text Retrieval Conference (TREC) [1].

The purpose of complex answer retrieval is to automatically compose Wikipedia articles by understanding the meaning of text and drawing connections between meanings through the use of machine learning, graph analysis and natural language processing tools.

Heads up: there will be programming homework in the first week.

You need to be comfortable writing large-scale programs on your own in order to take this course. If you can’t program, you need to learn programming before taking this course.

[1] trec.nist.gov

[2] trec-car.cs.unh.edu

Overview

This course covers basic and advanced algorithms and techniques for data science with knowledge graph and text data.

During this course you will learn about a wide range of algorithms for graph processing, text processing, and information retrieval, with a focus on knowledge graphs such as Wikipedia, DBpedia, Freebase, and Yago, and on text from knowledge articles such as Wikipedia and the World Wide Web.

You will be selecting some of these methods to solve the task given by TREC CAR. You will be implementing some of these algorithms yourself or you will be using implementations of those algorithms in your own code to produce a fully-automatic prototype for complex answer retrieval. You will use your prototype to make a submission to the shared task (competing with researchers world-wide). Forming teams of up to three people is highly encouraged.

Before the submission, you will be implementing an evaluation framework for assessing which of these approaches work best. Evaluation data will be provided by the TREC CAR organizers, but you will need to develop a test framework that can evaluate not just your own methods, but also the methods of competing teams. This further includes statistical analysis of experimental evaluation, which is the bread and butter of all data-centric research and a skill in high demand in industry.
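As a rough illustration of what such an evaluation framework might contain, the following minimal Python sketch computes mean average precision (MAP) for a ranking and a paired t statistic for comparing two methods query by query. All names and data layouts here (average_precision, mean_average_precision, paired_t_statistic, and the dictionaries run and qrels) are hypothetical and chosen for illustration only; the measures and file formats actually used by TREC CAR may differ.

```python
from math import sqrt
from statistics import mean, stdev


def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(run, qrels):
    """MAP over all queries in qrels; queries missing from the run score 0.

    run:   dict mapping query id -> ranked list of ids (assumed layout)
    qrels: dict mapping query id -> set of relevant ids (assumed layout)
    """
    return mean(average_precision(run.get(q, []), rel) for q, rel in qrels.items())


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-query score differences between two methods."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all per-query differences are identical")
    return mean(diffs) / (spread / sqrt(len(diffs)))


if __name__ == "__main__":
    qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
    run = {"q1": ["d1", "d2", "d3"], "q2": ["d1", "d2"]}
    print("MAP:", mean_average_precision(run, qrels))
    print("t:", paired_t_statistic([0.5, 0.7, 0.9], [0.4, 0.5, 0.6]))
```

The t statistic would still need to be compared against a t distribution with n - 1 degrees of freedom (n being the number of queries) to obtain a p-value; that step is omitted here to keep the sketch free of external dependencies.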

We will be using tools for software development in a team, as well as publication and distribution of software artifacts in a research setting.

During the course we will be discussing introductory and advanced research papers on various topics of natural language processing, knowledge graph inference, semantic web, and information retrieval. These include entity linking and relation extraction, graph walk algorithms, graph clustering, text-based similarity measures, information retrieval models, text clustering methods and topic models as well as other machine learning methods.

We discuss different methods and how they make use of data and training signals, how they integrate with each other and how they contribute to an approach for the example application of TREC CAR. We discuss how to obtain required training signals automatically from data or through manual annotations by human judges.

Prerequisites: CS 853 Topics/Information Retrieval or permission of instructor. Knowledge of data structures and basic algorithms (such as CS 515). Ability to independently write programs in a language of your choice.

Schedule

See class calendar on mycourses.

Every week we will discuss one topic from the Reading list. Alternative topics can be proposed by students.

Implementation-level issues are discussed during prototype clinic sessions. All students are expected to be present during class sessions and make fruitful contributions.

Cancelled classes will be made up in the form of “Hackathon” classes on select dates from 5:10 to 8:00 pm.

Important Dates

First class: Jan 22

No class on:
- Feb 12, Feb 14
- March 12, March 14 (spring break)

Hackathon classes (5:10 - 8:00 pm; bring computers and food!): to be determined

Final project presentations: to be determined

Grading Policy

Your grade will be based on the quality of the implemented prototype (70%) and class participation (30%). You need to obtain a passing grade in both to pass this course.

Prototype: The prototype will be implemented in teams, in a programming language of your choice. There will be three submissions of the prototype (one every month). The project needs to be presented in class and will be graded based on:
- Performance on the given task
- Correctness of the implemented methods
- Code quality, legibility, documentation, and use of software-development tools (version control, dependency management, documentation)
- Organization of the team and team spirit
- Understandability of the final report

Reading: Research methods will be studied in the form of a journal club. Every week all students read the assigned papers. Reading notes are submitted before 8 am on the day of the class. Each paper will be discussed by several students in the roles of “author”, “opponent”, “inquisitor”, and “scribe”. The participation grade will be based on:
- Quality of reading notes
- Activity in the discussion (in class as well as on Piazza)

Expert Topics: Each student will select two fields of expertise from the topic list. The intention is to let students dive deep into these topics and implement some of the approaches for the team project. A one-page literature survey on each topic will be submitted and graded. Furthermore, “Expert” students are expected to complement the paper discussion with the knowledge they have gained on the topic.

Excellent contributions will be rewarded with an upgrade of the final grade.

Late homework and project report submissions will generally not be accepted. Any activity missed due to a medical or family emergency requires supporting documentation from the dean of students.

Mutual Expectations

Students are expected to:

The instructor is expected to:

Note that it is not sufficient to just be present in class and submit reading notes. If you are stuck or lost, please see the instructor immediately.

Academic Integrity

The instructor is strongly committed to upholding the standards of academic integrity. These standards, at a minimum, require that students never present the work of others as their own. Any dishonest behavior, once discovered, will be penalized according to the University’s Student Code of Conduct.

Textbooks

The lecture is not based on a book. The following books are recommended for further study and background reading.

Book on Information Retrieval (both available in the library)

C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008 (available at http://nlp.stanford.edu/IR-book).

Book on Text Data Management:

C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM and Morgan & Claypool Publishers, 2016 (obtain through http://www.morganclaypoolpublishers.com/catalog_Orig/product_info.php?products_id=944).

More online textbooks:

Eisenstein. Natural Language Processing: https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf

Zhang et al. Dive into Deep Learning - an interactive deep learning book with examples: http://d2l.ai/