CS 953 Adv Top / Data Science for Knowledge Graphs and Text

Implementation-intensive Graduate-level Seminar

This course counts towards your implementation-intensive requirements.

This means that your grade will be split as follows:

70% of the grade will be determined by a prototype you implement, evaluate, and submit.
30% of the grade is determined by reading notes, literature surveys, and presentations.

As a running example of a data science application, we will focus on a single shared task throughout the semester. Options:

Complex Answer Retrieval 2 hosted by the Text Retrieval Conference (TREC) 1.
- The purpose of complex answer retrieval is to automatically compose Wikipedia articles by understanding the meaning of text and drawing connections between meanings through the use of machine learning, graph analysis and natural language processing tools.
Game of Thrones Dataset 5 / Westeros site 4 / Fandom Wiki 3 / Fandom wiki in trec-car page format 6
- The purpose is to extract information from the text and sites to predict the chance of survival of characters.

Heads up: there will be programming homework in the first week.

You need to be comfortable writing large-scale programs on your own in order to take this course. If you can’t program, you need to learn programming before taking this course.

Overview

This course covers basic and advanced algorithms and techniques for data science with knowledge graph and text data.

During this course you will learn about a wide range of algorithms for graph processing, text processing, and information retrieval, with a focus on knowledge graphs such as Wikipedia, DBpedia, Freebase, and Yago and on text from knowledge articles such as Wikipedia and the World Wide Web.

You will select some of these methods to solve the chosen shared task. You will either implement these algorithms yourself or use existing implementations in your own code to produce a fully automatic prototype for the shared task. You are encouraged to use your prototype to make a submission to the shared task (competing with researchers world-wide).

Before the submission, you will implement an evaluation framework for assessing which of these approaches works best. Evaluation data will be provided with the dataset, but you will need to develop a test framework that can evaluate not just your own methods but also those of competing teams. This includes statistical analysis of the experimental evaluation, which is the bread and butter of all data-centric research and a skill in high demand in industry.
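To make the idea of an evaluation framework with statistical analysis concrete, here is a minimal sketch in Python. It is not the official evaluation tooling of either shared task. The toy relevance judgments (`qrels`), the two runs, and all function names are made up for illustration. The sketch computes average precision per query, then reports the mean with its standard error and a paired t statistic for the difference between two methods.

```python
import math
from statistics import mean, stdev


def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_and_stderr(scores):
    """Mean and standard error of per-query scores (needs >= 2 queries)."""
    return mean(scores), stdev(scores) / math.sqrt(len(scores))


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-query differences of two methods."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = stdev(diffs)
    if spread == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))


# Hypothetical toy data: relevant ids per query and two ranked runs.
qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}, "q3": {"d1", "d4"}}
run_a = {"q1": ["d1", "d3", "d2"], "q2": ["d2", "d1"], "q3": ["d4", "d2", "d1"]}
run_b = {"q1": ["d2", "d1", "d3"], "q2": ["d1", "d2"], "q3": ["d2", "d3", "d4"]}

queries = sorted(qrels)
ap_a = [average_precision(run_a[q], qrels[q]) for q in queries]
ap_b = [average_precision(run_b[q], qrels[q]) for q in queries]

map_a, se_a = mean_and_stderr(ap_a)
map_b, se_b = mean_and_stderr(ap_b)
t_stat = paired_t_statistic(ap_a, ap_b)

print(f"MAP A = {map_a:.3f} +/- {se_a:.3f}")
print(f"MAP B = {map_b:.3f} +/- {se_b:.3f}")
print(f"paired t = {t_stat:.2f} with {len(queries) - 1} degrees of freedom")
```

Because both methods are scored on the same queries, the paired test looks at per-query differences rather than at the two means in isolation. With real data you would use many more queries and compare the t statistic against a t distribution to obtain a p-value.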

We will be using tools for software development in a team, as well as publication and distribution of software artifacts in a research setting.

During the course we will be discussing introductory and advanced research papers on various topics of natural language processing, knowledge graph inference, semantic web, and information retrieval. These include entity linking and relation extraction, graph walk algorithms, graph clustering, text-based similarity measures, information retrieval models, text clustering methods and topic models as well as other machine learning methods.
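As a flavor of what a graph walk algorithm looks like, the following is a minimal sketch of a random walk with restart (often called personalized PageRank) computed by power iteration. The tiny graph, the node names, and the parameter defaults are assumptions chosen for illustration, not material from the reading list.

```python
def personalized_pagerank(graph, seed, restart=0.15, iterations=50):
    """Random walk with restart on a directed graph.

    graph maps each node to a list of its out-neighbors; every neighbor
    must also appear as a key. Returns a score per node that sums to 1.
    """
    scores = {node: 1.0 if node == seed else 0.0 for node in graph}
    for _ in range(iterations):
        # With probability `restart` the walker jumps back to the seed.
        updated = {node: (restart if node == seed else 0.0) for node in graph}
        for node, neighbors in graph.items():
            mass = (1.0 - restart) * scores[node]
            if not neighbors:
                # Dangling node: return its mass to the seed.
                updated[seed] += mass
                continue
            share = mass / len(neighbors)
            for neighbor in neighbors:
                updated[neighbor] += share
        scores = updated
    return scores


# Hypothetical toy knowledge graph: entities and their outgoing links.
toy_graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

ranking = personalized_pagerank(toy_graph, seed="A")
for node, score in sorted(ranking.items(), key=lambda item: -item[1]):
    print(f"{node}: {score:.3f}")
```

Nodes that are easy to reach from the seed receive high scores, which is one way to rank entities in a knowledge graph by their relatedness to a query entity.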

We discuss different methods and how they make use of data and training signals, how they integrate with each other and how they contribute to an approach for the example application. We discuss how to obtain required training signals automatically from data or through manual annotations by human judges.

Prerequisites: CS 853 Topics/Information Retrieval or permission of instructor. Knowledge of data structures and basic algorithms (such as CS 515). Ability to independently write programs in a language of your choice.

Teams

Forming teams of three people is highly encouraged. You may work in a team of any size (or by yourself), but every team will be graded by the same standards. You can change teams, or split or merge teams, after any submission.

Team members who do not make a significant contribution to the team’s submission will receive a reduced grade (in extreme cases, F).

Every team member must make a unique contribution to the code base. It is not sufficient to “just help out”. Since this is an implementation-intensive class, every student has to contribute a significant amount of source code.

Schedule

See class calendar on mycourses.

Every week we will discuss one topic from the Reading list. Alternative topics can be proposed by students.

Implementation-level issues are discussed during prototype clinic sessions. All students are expected to be present during class sessions and make fruitful contributions.

Cancelled classes will be made up in the form of “Hackathon” classes on select dates from 5:10 to 8:00 pm.

Important Dates

See course catalog CS953 for times/locations

See MyCourses for dates and homework submissions.

We might have some Hackathon classes (5:10 - 8:00 pm) - bring computers and food! - to be determined

Grading Policy

Your grade will be based on the quality of the implemented prototype (70%) and class participation (30%). You need to obtain a passing grade in both to pass this course.

Prototype: The prototype will be implemented in teams, in a programming language of your choice. There will be three submissions of the prototype (one per month). The project needs to be presented in class and will be graded based on:

Performance on the given task
Correctness of the implemented methods
Code quality, legibility, documentation, and use of software-development tools (version control, dependency-management, documentation)
Individual contribution through the implementation of methods (expected 3000 lines of new code per submission; code copied from other sources does not count).
Organization of the team and team spirit
Understandability of the final report

Reading: Research methods will be studied as a journal club. Every week all students read the assigned papers. Reading notes are submitted before 8 am on the day of the class. Each paper will be discussed by several students in the roles of “narrator”, “author”, “opponent”, “inquisitor”, and “scribe”. The participation grade will be based on:

Quality of reading notes
Activity in the discussion (in class as well as on Piazza)

Expert Topics: Each student will select two expertise fields from the topic list. The intention is to let students dive deep into topics, and implement some of these approaches for the team project. A one-page literature survey over the topic will be submitted and graded. Furthermore, “Expert” students are expected to complement the paper discussion with their gained knowledge on this topic.

Excellent contributions will be rewarded with an upgrade of the final grade.

Late homework and project report submissions will generally be excluded from grading. Any activity missed due to a medical or family emergency requires supporting documentation from the dean of students.

Mutual Expectations

Students are expected to:

be present in class (physically and mentally),
ask at least one question every session,
present papers and report on progress,
do their own work and contribute significantly in team activities,
study and review necessary class materials independently,
ask the instructor for help when material is not understood.

The instructor is expected to:

make teaching materials available before the class,
provide feedback on reading notes and literature surveys,
be available for questions regarding class material during class, online, and if necessary by appointment,
notify students who are in danger of not meeting the class goals early on.

Note that it is not sufficient to just be present in class and submit reading notes. If stuck or lost, please see the instructor immediately.

Academic Integrity

The instructor is strongly committed to upholding the standards of academic integrity. These standards, at a minimum, require that students never present the work of others as their own. Any dishonest behavior, once discovered, will be penalized according to the University’s Student Code of Conduct.

Textbooks

The lecture is not based on a book. The following books are recommended for further study and background reading.

Book on Information Retrieval (both available in the library)

C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008 (available at http://nlp.stanford.edu/IR-book).

Book on Text Data Management:

C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM and Morgan & Claypool Publishers, 2016 (obtain through http://www.morganclaypoolpublishers.com/catalog_Orig/product_info.php?products_id=944).

More online textbooks

Eisenstein. Natural Language Processing: https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf

Zhang et al. Dive into Deep Learning, an interactive deep learning book with examples: http://d2l.ai/

Laura Dietz