Reading List for Graph Walk

Graph walks are the first topic we will dive into. Graphs are an abstract way to represent many data sources:

A knowledge graph is a graph where entities (think Wikipedia entries) are nodes, and relations such as “married-to” or “born-in” are edges (e.g., Freebase, DBpedia, Yago, WikiData).
A web graph is a graph where web pages are nodes, and hyperlinks from one page to another are edges.
A Wikipedia hypertext graph is a graph where each Wikipedia page is a node, and two pages have an edge when one links to the other.
An entity link graph is a graph where both texts and Wikipedia entities are nodes, and an entity link from a text to a Wikipedia entity is represented by an edge. (Entity linking tools include TagMe!, Dexter, etc.; more on the project resource page.)
A social friend network (e.g., Facebook) is a graph where users are nodes, and there is an edge between two users if they are friends.
A social follower network (e.g., Twitter) is a graph where users are nodes, and there is an edge when a user follows another user (i.e., subscribes to their content).
A citation graph is a graph where each piece of scientific literature is a node, and a citation is an edge in the graph.
A word co-occurrence graph is a graph where words are nodes, and words that occur in the same text have an edge between them.
A WordNet graph is a graph where words are nodes, and they have an edge if they are in a semantic relationship (e.g., synonym, hyponym, homonym, …).
A word2vec graph is a graph where words are nodes, and every pair of nodes is connected by an edge. However, the strength of each edge is the similarity between the two words’ word vectors.

In the programming homework you will work with a CAR hypertext graph, which is constructed as follows:

Every CAR page is a node, and every paragraph represents an edge between all CAR pages it links to.
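The construction described above can be sketched in a few lines of Python. This is only a minimal illustration under assumptions: the input format (a dictionary from paragraph id to the list of page ids it links to), the function name `build_graph`, and the example pages are all hypothetical and are not the actual homework data format. The sketch treats each paragraph as an undirected edge between every pair of pages it links to.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(paragraph_links):
    """Build an undirected page graph as an adjacency map.

    paragraph_links (hypothetical format): maps a paragraph id to the
    list of page ids that the paragraph links to.
    """
    graph = defaultdict(set)
    for pages in paragraph_links.values():
        unique_pages = set(pages)
        # Make sure every linked page shows up as a node, even if the
        # paragraph links to only one page and so creates no edge.
        for page in unique_pages:
            graph[page]
        # The paragraph connects every pair of pages it links to.
        for a, b in combinations(sorted(unique_pages), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


# Hypothetical toy data, for illustration only.
paragraphs = {
    "p1": ["Chocolate", "Cocoa bean", "Sugar"],
    "p2": ["Chocolate", "Switzerland"],
    "p3": ["Sugar"],
}
graph = build_graph(paragraphs)
```

Here `graph["Chocolate"]` contains the three pages that share a paragraph with it. Whether edges should be weighted by the number of shared paragraphs is a design choice this sketch leaves out.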

Graph walk algorithms are based on a thought experiment: Suppose a random surfer picks a node at random and then hops from node to node along edges for all eternity. Which node would the surfer visit most often?

Graph walk algorithms can be used to rank nodes by placing the most-visited node first, the second-most-visited node second, and so on. This is the rough idea behind the PageRank algorithm, which made the Google search engine famous in the late 1990s.
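To make the random surfer idea concrete, here is a minimal sketch of a PageRank-style power iteration. It rests on several assumptions that go beyond the thought experiment above: the surfer teleports to a uniformly random node with probability `1 - damping` (the value 0.85 is a commonly used default, not something prescribed here), nodes without outgoing edges spread their score uniformly, and the toy graph and the function name `pagerank` are hypothetical. The papers on this reading list give the precise definitions and variants.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Approximate how often a random surfer visits each node.

    graph: maps each node to a list of the nodes it links to; every
    linked node is assumed to also appear as a key.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start uniformly at random
    for _ in range(iterations):
        # Teleportation: with probability 1 - damping, jump anywhere.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            neighbors = graph[node]
            if neighbors:
                # Otherwise follow one outgoing edge chosen uniformly.
                share = damping * rank[node] / len(neighbors)
                for neighbor in neighbors:
                    new_rank[neighbor] += share
            else:
                # Dead end: assume the surfer jumps to a random node.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy graph, for illustration only.
toy_graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
scores = pagerank(toy_graph)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph, node C receives links from three other nodes and ends up first in `ranking`, while node D, which nobody links to, ends up last.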

The purpose of this reading assignment is for you to become familiar with different graph walk algorithms and their variations, their underlying theory, and their applications.

Mandatory reading

(everyone has to read this)

Haveliwala, Taher H. “Topic-sensitive pagerank.” Proceedings of the 11th international conference on World Wide Web. ACM, 2002. http://ilpubs.stanford.edu:8090/573/1/2002-6.pdf

Additional papers

(mandatory if you choose this as your expertise topic)

Lempel, Ronny, and Shlomo Moran. “SALSA: the stochastic approach for link-structure analysis.” ACM Transactions on Information Systems (TOIS) 19.2 (2001): 131-160. http://delab.csd.auth.gr/~manolopo/oikonomiko/salsa.pdf
Erkan, Günes, and Dragomir R. Radev. “Lexrank: Graph-based lexical centrality as salience in text summarization.” Journal of Artificial Intelligence Research 22 (2004): 457-479. http://www.jair.org/media/1523/live-1523-2354-jair.pdf
Richardson, Matthew, and Pedro Domingos. “The intelligent surfer: Probabilistic combination of link and content information in pagerank.” Advances in neural information processing systems. 2002. http://papers.nips.cc/paper/2047-the-intelligent-surfer-probabilistic-combination-of-link-and-content-information-in-pagerank.pdf
Berkhin, Pavel. “A survey on pagerank computing.” Internet Mathematics 2.1 (2005): 73-120. https://projecteuclid.org/download/pdf_1/euclid.im/1128530802 – This is a very long paper. I don’t expect you to read all of it, but choose a few sections to dive into more deeply.
Flake, Gary William, Steve Lawrence, and C. Lee Giles. “Efficient identification of web communities.” Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining. 2000.

Introductory Reading

(if you are completely lost, start here!)

Page, Lawrence, et al. The PageRank citation ranking: Bringing order to the web. Stanford InfoLab, 1999. http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
Kleinberg, Jon M. “Authoritative sources in a hyperlinked environment.” Journal of the ACM (JACM) 46.5 (1999): 604-632. https://www.cs.bgu.ac.il/~snean151/wiki.files/7-AuthoritativeSourcesinaHyperlinkedEnvironment.pdf
Chapter 21 in Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to information retrieval. Vol. 1. No. 1. Cambridge: Cambridge university press, 2008. https://nlp.stanford.edu/IR-book/pdf/21link.pdf

Laura Dietz