CS 753/853 Topics / Information Retrieval and Generation Systems

Note

This course will be managed on MyCourses. It qualifies as an Implementation-Intensive Elective and also counts toward the upper-level Writing-Intensive requirement.

Prereq: CS 515 for undergrads. None for grad students.

(This course serves as a prereq for CS 781/881, a project-based research seminar taught every Spring.)

Overview

Fundamental algorithms and techniques for text processing and text-based information retrieval, synthesis, and generation systems. Topics include how to build an end-to-end information access system, such as a Web search engine or a chat agent.

After this course, you will be able to develop your own retrieval-augmented generation system or your own web search engine. Every week we will carefully examine a different component of a web search engine system.

The course focuses on index building, query processing, document ranking, and generation of natural language for those purposes. We will further touch on text-based machine learning methods, such as classification and clustering, as well as crawling and link-based algorithms such as Google’s PageRank.

The course will cover several algorithms and data structures with application to web search, thereby building on CS 515 “Data Structures”. Theoretical analyses of run-time performance, hands-on programming assignments, and a class project are all part of the course.

Information retrieval methods are an essential component of any text-based data analytics system, with applications ranging from text mining and machine learning to natural language processing and knowledge management.

Prerequisites: Data Structures (CS 515) or permission of instructor. Ability to independently write programs in Java, Python, Haskell, or Scala.

Grading Policy

The project will be carried out in teams of up to four people and implemented in a programming language of your choice. Projects must be presented at the end of the semester and will be graded based on a final report. The individual contribution of each team member must be documented.

Bi-weekly homework assignments for programming and theory.

The same policy applies to students taking the course as either CS 753 or CS 853. Expectations are higher, however, for students taking the course for graduate credit under CS 853.

Late homework and project report submissions will generally be excluded from grading. Any activity missed due to a medical or family emergency requires supporting documentation.

Academic Integrity

The instructor is strongly committed to upholding the standards of academic integrity. These standards, at the minimum, require that students never present the work of others as their own. Any dishonest behavior, once discovered, will be penalized according to the University’s Student Code of Conduct.

Mutual Expectations

Students are expected to:

The instructor is expected to:

Note that it is not sufficient to just be present in class and submit homework. Obtaining an A requires that you study and review materials from lecture notes, assignments, and discussions with the help of the book. If you get stuck, please see the instructor.

Textbooks

The lectures are based on “Introduction to Information Retrieval”. The other books are recommendations for further reading.

C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008 (available at http://nlp.stanford.edu/IR-book).

B. Croft, D. Metzler, T. Strohman, Search Engines: Information Retrieval in Practice, Addison-Wesley, 2009 (available at http://ciir.cs.umass.edu/irbook/ ).

C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM and Morgan & Claypool Publishers, 2016 (obtain through http://www.morganclaypoolpublishers.com/catalog_Orig/product_info.php?products_id=944 ).

R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, 2nd Edition, Addison-Wesley, 2011.

Schedule

Note that this schedule is preliminary and may change as the course progresses.