1. What is Data Science?
2. What do I need to know to be a Data Scientist?
3. Task, Evaluation, System, Methods and how to read papers
4. Task for prototype: TREC CAR (Complex Answer Retrieval track at the Text Retrieval Conference)
Definitions from:
- Wikipedia: wiki-data-science.pdf
- Berkeley: berkeley-what-is-datascience.pdf
- NYU: nyu-what-is-datascience.pdf
- Several people on Quora: quora-what-is-data-science.pdf
Related terms (from Wikipedia)
- Data Mining: wiki-data-mining.pdf
- Data Journalism: wiki-data-journalism.pdf
Many online courses focus on programming in Python and R, on particular machine learning toolkits, statistical methods, and visualization.
Someone came up with a roadmap on different topics associated with Data Science:
It is impossible to discuss all these topics within a single course. This course emphasizes methods for data science on textual data and knowledge graph data: the orange branch in the map, and beyond. Our journey through this roadmap will also include fundamentals (blue), machine learning (yellow), and toolboxes (brown). By implementing your prototype you will automatically learn about topics in data munging (pink), data ingestion (green), and programming (yellowish green). Topics of quantitative evaluation (statistics, light blue) and presentation (visualization, red) will be used to assess the performance of your prototype.
slides-week1-task-evaluation-system-methods-papers.pdf
See the website of the Complex Answer Retrieval track that is hosted at TREC this year for a detailed task and data description.
Everyone must read both mandatory papers and a third one from the list below.
Everyone must submit reading notes by 8am on the discussion day through mycourses "Assignments". Use the reading-notes-template.mkd.
We go through the first half of the TREC CAR task presentation given at the planning session at TREC in November 2016. trec-car-planning.svg
1. 10-minute introduction to the topic
2. Discussion of reading notes
3. Questions and "not understood" parts
4. Paper discussion (Section-by-section)
5. Final research paper deconstruct
Introduction: The presenter should give a 10-minute introduction to the topic. Roughly: What is it about? What are the critical definitions? How is this area evaluated?
Reading notes: The presenter will talk about her/his submitted reading notes, and other members of the audience are asked to talk about their reading notes as well.
Questions: At this point, any questions or parts that are not understood need to be listed by the presenter and the audience. (You had better ask the question before I ask you.)
Paper discussion: This is followed by a section-by-section paper discussion, facilitated by the presenter, to which everyone is expected to contribute. We walk through some of the papers, section by section, and recap the most important points. This is another opportunity for the presenter and the audience to ask questions and point out connections to other papers.
Research paper deconstruct: One outcome of this discussion is a better "research paper deconstruct" (cf. my last lecture). The reading notes, which are due before class, are already one attempt at a paper deconstruct, but a second attempt is often better than the first.
Haveliwala, Taher H. "Topic-sensitive pagerank." In Proceedings of the 11th international conference on World Wide Web, pp. 517-526. ACM, 2002.
http://ilpubs.stanford.edu:8090/573/1/2002-6.pdf
Navigli, Roberto, and Mirella Lapata. "Graph Connectivity Measures for Unsupervised Word Sense Disambiguation." In IJCAI, pp. 1683-1688. 2007.
http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-272.pdf
Farahat, Ayman, et al. "Authority rankings from HITS, PageRank, and SALSA: Existence, uniqueness, and effect of initialization." SIAM Journal on Scientific Computing 27.4 (2006): 1181-1201.
https://s3.amazonaws.com/academia.edu.documents/44429954/Authority_Rankings_from_HITS_PageRank_an20160405-30697-jbtv0b.pdf?AWSAccessKeyId=AKIAIWOWYYGZ2Y53UL3A&Expires=1516668455&Signature=%2BvGT2mn0qR%2FvGj3kG5OnP6cgYOM%3D&response-content-disposition=inline%3B%20filename%3DAuthority_Rankings_from_HITS_PageRank_an.pdf
Hulpus, Ioana, Conor Hayes, Marcel Karnstedt, and Derek Greene. "Unsupervised graph-based topic labelling using dbpedia." In Proceedings of the sixth ACM international conference on Web search and data mining, pp. 465-474. ACM, 2013.
Chakrabarti, Soumen. "Dynamic personalized pagerank in entity-relation graphs." In Proceedings of the 16th international conference on World Wide Web, pp. 571-580. ACM, 2007.
https://www.cse.iitb.ac.in/~soumen/doc/www2007/www324-chakrabarti.pdf
Yeh, Eric, Daniel Ramage, Christopher D. Manning, Eneko Agirre, and Aitor Soroa. "WikiWalk: random walks on Wikipedia for semantic relatedness." In Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, pp. 41-49. Association for Computational Linguistics, 2009.
http://www.anthology.aclweb.org/W/W09/W09-32.pdf#page=53
Baluja, Shumeet, Rohan Seth, D. Sivakumar, Yushi Jing, Jay Yagnik, Shankar Kumar, Deepak Ravichandran, and Mohamed Aly. "Video suggestion and discovery for youtube: taking random walks through the view graph." In Proceedings of the 17th international conference on World Wide Web, pp. 895-904. ACM, 2008.
http://www.esprockets.com/papers/adsorption-yt.pdf
Agirre, Eneko, Oier López de Lacalle, and Aitor Soroa. "Random Walks for Knowledge-Based Word Sense Disambiguation." Computational Linguistics 40, no. 1, 2014, pp 57-84.
http://anthology.aclweb.org/J/J14/J14-1003.pdf
Book "Text Data Management and Analysis" Chapter 10.3
Page, Lawrence, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Stanford InfoLab, 1999.
http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
Mihalcea, Rada, and Dragomir Radev. Graph-based natural language processing and information retrieval. Cambridge University Press, 2011. ISBN:0521896134 9780521896139
Eppstein, David. "Finding the k shortest paths." SIAM Journal on computing 28, no. 2 (1998): 652-673.
http://www.ics.uci.edu/~eppstein/pubs/Epp-SJC-98.pdf
Backstrom, Lars, and Jure Leskovec. "Supervised random walks: predicting and recommending links in social networks." In Proceedings of the fourth ACM international conference on Web search and data mining, pp. 635-644. ACM, 2011.
http://www-cs-faculty.stanford.edu/people/jure/pubs/linkpred-wsdm11.pdf
Bahmani, Bahman, Kaushik Chakrabarti, and Dong Xin. "Fast personalized pagerank on mapreduce." In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pp. 973-984. ACM, 2011.
Talukdar, P. P., Reisinger, J., Paşca, M., Ravichandran, D., Bhagat, R., & Pereira, F. (2008, October). Weakly-supervised acquisition of labeled class instances using graph random walks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 582-590). Association for Computational Linguistics.
http://www.anthology.aclweb.org/D/D08/D08-1.pdf#page=612
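As a rough orientation for the graph walk readings above, here is a minimal sketch of a random walk with restart in the spirit of personalized (topic-sensitive) PageRank. It is not taken from any of the listed papers: the function name `personalized_pagerank`, the toy graph, the damping value 0.85, and the fixed iteration count are all assumptions chosen for illustration.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration for a random walk that restarts at the seed nodes.

    graph: dict mapping each node to a list of successor nodes
           (assumption: every successor also appears as a key).
    seeds: non-empty set of nodes the walker teleports back to.
    """
    nodes = list(graph)
    # Teleport distribution: uniform over the seeds, zero elsewhere.
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps to a seed.
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            successors = graph[n]
            if successors:
                # Otherwise it follows an outgoing edge chosen uniformly.
                share = damping * rank[n] / len(successors)
                for m in successors:
                    nxt[m] += share
            else:
                # Dangling node: hand its mass back to the seeds.
                for m in nodes:
                    nxt[m] += damping * rank[n] * teleport[m]
        rank = nxt
    return rank


# Hypothetical toy graph: a three-node cycle plus one node pointing into it.
toy_graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ranks = personalized_pagerank(toy_graph, seeds={"a"})
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 3))
```

Changing `seeds` changes which part of the graph the scores concentrate on, which is the idea to look for when reading the topic-sensitive and personalized PageRank papers.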
Agenda: Text Clustering
1. Evaluation measures (slide deck: dskgt-eval.pdf)
2. Text Clustering: Introduction and discussion
3. Discussion of Graph Walk papers
Mandatory Reading Assignments
Everyone must read (and summarize) these:
Chapter 14 in Text Data Management and Analysis
Make yourself familiar with Scikit-learn's clustering package
Further reading, everyone must read (and summarize) one from the following list:
Navigli, Roberto, and Giuseppe Crisafulli. "Inducing word senses to improve web search result clustering." In Proceedings of the 2010 conference on empirical methods in natural language processing, pp. 116-126. Association for Computational Linguistics, 2010. http://clair.eecs.umich.edu/aan/paper.php?paper_id=D10-1012#pdf
Rosenberg, Andrew, and Julia Hirschberg. "V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure." In EMNLP-CoNLL, vol. 7, pp. 410-420. 2007. http://clair.eecs.umich.edu/aan/paper.php?paper_id=D07-1043#pdf
McCreadie, Richard, Craig Macdonald, Iadh Ounis, Miles Osborne, and Sasa Petrovic. "Scalable distributed event detection for twitter." In Big Data, 2013 IEEE International Conference on, pp. 543-549. IEEE, 2013. http://eprints.gla.ac.uk/89118/7/89118.pdf
Haghighi, Aria, and Dan Klein. "Simple coreference resolution with rich syntactic and semantic features." In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3, pp. 1152-1161. Association for Computational Linguistics, 2009.
McCallum, Andrew, Kamal Nigam, and Lyle H. Ungar. "Efficient clustering of high-dimensional data sets with application to reference matching." In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 169-178. ACM, 2000.
Baker, L. Douglas, and Andrew Kachites McCallum. "Distributional clustering of words for text classification." In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 96-103. ACM, 1998.
Background Reading (optional, introductory)
Berkhin, Pavel. "A survey of clustering data mining techniques." In Grouping multidimensional data, pp. 25-71. Springer Berlin Heidelberg, 2006.
Advanced Reading (Continue here if this was too easy)
Basu, Sugato, Mikhail Bilenko, and Raymond J. Mooney. "A probabilistic framework for semi-supervised clustering." In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 59-68. ACM, 2004.
Bekkerman, Ron, and Koby Crammer. "One-class clustering in the text domain." In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 41-50. Association for Computational Linguistics, 2008. http://management.haifa.ac.il/images/info_people/ron_bekkerman_files/emnlp08.pdf
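The agenda and reading list for this week name cluster evaluation measures without spelling one out. As one simple illustration, the sketch below computes purity, an external measure that compares a clustering against gold labels. Choosing purity is an assumption made for illustration only: the slide deck and the V-measure paper may define evaluation differently, and the function name `purity` and the toy inputs are hypothetical.

```python
from collections import Counter


def purity(cluster_ids, gold_labels):
    """Fraction of items whose gold label is the majority label of their cluster.

    cluster_ids and gold_labels are parallel, non-empty sequences:
    item i was put in cluster cluster_ids[i] and has gold label gold_labels[i].
    """
    label_counts = {}
    for cluster, label in zip(cluster_ids, gold_labels):
        # Count how often each gold label occurs inside each cluster.
        label_counts.setdefault(cluster, Counter())[label] += 1
    # Each cluster contributes the size of its largest gold-label group.
    majority_total = sum(max(c.values()) for c in label_counts.values())
    return majority_total / len(gold_labels)


# Hypothetical toy example: five documents, two clusters, two gold topics.
print(purity([0, 0, 1, 1, 1], ["a", "b", "b", "b", "a"]))
```

Note that purity rewards many small clusters (one cluster per item is perfectly pure), which is one reason measures that also account for completeness are worth reading about.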
Attendees
Scribe + Discussion presenters
Next week: Prof Marinov (Talk Monday 5pm and discussion Tuesday morning - prepare questions)
Code submission - Issues?
Machine Learning & toolkits
What is a joint probability distribution?
Overfitting
Cross validation
Point estimation
Curse of dimensionality
Bias and Variance
Starting simple and working your way up in model complexity
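To make two of the topics above concrete (cross validation, and starting with a simple model), here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not course material or a toolkit API: the helper names `k_fold_indices`, `cross_validated_mse`, and `fit_mean`, the fixed shuffle seed, and the toy data are all hypothetical.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Yield (train, test) index lists for k-fold cross validation.

    Assumes n_items >= k so that no fold is empty.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validated_mse(xs, ys, fit, k=5):
    """Average held-out mean squared error of `fit` over k folds."""
    errors = []
    for train, test in k_fold_indices(len(xs), k):
        # Fit only on the training part, evaluate only on the held-out part.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_error = sum((model(xs[i]) - ys[i]) ** 2 for i in test) / len(test)
        errors.append(fold_error)
    return sum(errors) / len(errors)


def fit_mean(xs, ys):
    """Simplest possible baseline: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


# Hypothetical toy data; a more complex model should beat this baseline
# on held-out data before it is worth keeping.
xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]
print(cross_validated_mse(xs, ys, fit_mean, k=5))
```

Any function with the same `fit(xs, ys) -> predictor` shape can be passed in place of `fit_mean`, which makes it easy to compare a more complex model against the baseline on the same folds.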
Preparation for Code Submission 2
Topics:
- MongoDB and data munging/massaging/wrangling, Hadoop & MapReduce
- Question Answering (Agichtein)
- Intro NLP, Russell and Norvig Chapter 18/19 "Intro to NLP" (maybe with entity linking)
- Summarization (Barzilay & Sauper)
- Visualization .... of what exactly?
Book I. Goodfellow, Y. Bengio, A. Courville "Deep Learning", MIT Press, 2017, ISBN 9780262035613
Chapter 5.1, 5.2, and 5.3
Familiarize yourself with Scikit-Learn
Martin Zinkevich. Rules of Machine Learning: Best Practices for ML Engineering (from Google) http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf
A Visual Introduction to Machine Learning. http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
Talk: Nathan Taggart. Machine Learning with Ponies (also used python) https://www.youtube.com/watch?v=xeAB10QgDW8
Scribe + Presenters
Reading notes: Please include a detailed discussion of how it relates to the TREC CAR Prototype.
Information Retrieval Paper discussion
Prototype planning
Book Text Data Management and Analysis
Chapter 6 - 6.3.1 and 6.4
Metzler, Donald, and W. Bruce Croft. "A Markov random field model for term dependencies." In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 472-479. ACM, 2005. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.1097&rep=rep1&type=pdf
Raiber, Fiana, and Oren Kurland. "Ranking document clusters using markov random fields." In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pp. 333-342. ACM, 2013. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.474.781&rep=rep1&type=pdf
Fang, Hui, Tao Tao, and ChengXiang Zhai. "A formal study of information retrieval heuristics." In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 49-56. ACM, 2004.
http://sifaka.cs.uiuc.edu/taotao/publications/sigir04.pdf
Lavrenko, Victor, and W. Bruce Croft. "Relevance based language models." In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 120-127. ACM, 2001. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.193.3687&rep=rep1&type=pdf
Familiarize yourself with one of the following toolkits:
- Lucene
- Terrier
- Galago (Secret Galago Docs)
Discussion Entity Linking
Tools: TagMe + AIDA
Implementation plan for next code submission
Shen, Wei, Jianyong Wang, and Jiawei Han. "Entity linking with a knowledge base: Issues, techniques, and solutions." IEEE Transactions on Knowledge and Data Engineering 27, no. 2 (2015): 443-460.
http://www.gntsuntechnologies.com/Projects/2015_java_ieee/10.pdf
Ferragina, P. and Scaiella, U., 2010, October. Tagme: on-the-fly annotation of short text fragments (by wikipedia entities). In Proceedings of the 19th ACM international conference on Information and knowledge management (pp. 1625-1628). ACM.
http://www.di.unipi.it/~ferragin/cikm2010.pdf
Ratinov, Lev, Dan Roth, Doug Downey, and Mike Anderson. "Local and global algorithms for disambiguation to wikipedia." In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 1375-1384. Association for Computational Linguistics, 2011.
http://cogcomp.cs.illinois.edu/papers/ChengRo13.pdf
Mihalcea, R. and Csomai, A., 2007, November. Wikify!: linking documents to encyclopedic knowledge. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management (pp. 233-242). ACM.
Hasibi, F., Balog, K. and Bratsberg, S.E., 2016, March. On the reproducibility of the TAGME entity linking system. In European Conference on Information Retrieval (pp. 436-449). Springer International Publishing.
Yaghoobzadeh, Y. and Schütze, H., 2015. Corpus-level fine-grained entity typing using contextual information. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 715-725).
https://www.aclweb.org/anthology/D/D15/D15-1083.pdf
Wang, H., Zheng, J., Ma, X., Fox, P. and Ji, H., 2015. Language and Domain Independent Entity Linking with Quantified Collective Validation. In EMNLP (pp. 695-704).
http://www.aclweb.org/website/old_anthology/D/D15/D15-1081.pdf
Liu, Xiaohua, Yitong Li, Haocheng Wu, Ming Zhou, Furu Wei, and Yi Lu. "Entity Linking for Tweets." In ACL (1), pp. 1304-1311. 2013.
http://www.aclweb.org/old_anthology/P/P13/P13-1128.pdf
Edgar Meij, Krisztian Balog and Daan Odijk. 2014. Entity Linking and Retrieval. Tutorial at WSDM2014, SIGIR2013, YSS2013 and WWW2013.
http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/
Roth, Dan, Heng Ji, Ming-Wei Chang, and Taylor Cassidy. "Wikification and Beyond: The Challenges of Entity and Concept Grounding." In ACL (Tutorial Abstracts), p. 7. 2014.
Familiarize yourself with the TagMe and/or AIDA entity linkers
Order of next topics?
Paper Discussion
How does this relate to TREC CAR?
Dalton, Jeffrey, Laura Dietz, and James Allan. "Entity query feature expansion using knowledge base links." In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pp. 365-374. ACM, 2014.
Brandão, Wladmir C., Rodrygo LT Santos, Nivio Ziviani, Edleno S. Moura, and Altigran S. Silva. "Learning to expand queries using entities." Journal of the Association for Information Science and Technology 65, no. 9 (2014): 1870-1883.
Blanco, Roi, Giuseppe Ottaviano, and Edgar Meij. "Fast and space-efficient entity linking for queries." In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pp. 179-188. ACM, 2015.
Liu, Xitong, and Hui Fang. "Latent entity space: a novel retrieval approach for entity-bearing queries." Information Retrieval Journal 18, no. 6 (2015): 473-503.
http://xtliu.com/pub/inrj15-les.pdf
Raviv, Hadas, David Carmel, and Oren Kurland. "A ranking framework for entity oriented search using Markov random fields." In Proceedings of the 1st Joint International Workshop on Entity-Oriented and Semantic Search, p. 1. ACM, 2012.
http://sme.technion.ac.il/~kurland/entityMRF.pdf
Hasibi, Faegheh, Krisztian Balog, and Svein Erik Bratsberg. "Entity linking in queries: Tasks and evaluation." In Proceedings of the 2015 International Conference on The Theory of Information Retrieval, pp. 171-180. ACM, 2015.
http://krisztianbalog.com/files/ictir2015-erd.pdf
Zhiltsov, Nikita, Alexander Kotov, and Fedor Nikolaev. "Fielded sequential dependence model for ad-hoc entity retrieval in the web of data." In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 253-262. ACM, 2015.
Dietz, Laura, Alexander Kotov, and Edgar Meij. "Utilizing Knowledge Graphs in Text-centric Information Retrieval." In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 815-816. ACM, 2017.
Slides are available online: http://github.com/laura-dietz/tutorial-utilizing-kg
Scribe
Late homework submissions
Paper discussion
Prepare for Thursday: First batch of evaluation results.
Read two papers of your choice.
You can choose between a very long but complete and easy-to-follow introductory read and graph clustering papers from the database, NLP, and machine learning communities.
Zhou, Yang, Hong Cheng, and Jeffrey Xu Yu. "Graph clustering based on structural/attribute similarities." Proceedings of the VLDB Endowment 2, no. 1 (2009): 718-729.
http://www1.se.cuhk.edu.hk/~hcheng/summer2010/paper/vldb09-175.pdf
Biemann, Chris. "Chinese whispers: an efficient graph clustering algorithm and its application to natural language processing problems." In Proceedings of the first workshop on graph based methods for natural language processing, pp. 73-80. Association for Computational Linguistics, 2006.
https://www.lt.informatik.tu-darmstadt.de/fileadmin/user_upload/Group_LangTech/publications/pre-langtech/Biemann_CW_TextGraph06.pdf
Flake, Gary William, Robert E. Tarjan, and Kostas Tsioutsiouliklis. "Graph clustering and minimum cut trees." Internet Mathematics 1, no. 4 (2004): 385-408.
http://projecteuclid.org/download/pdf_1/euclid.im/1109191029
Kulis, Brian, Sugato Basu, Inderjit Dhillon, and Raymond Mooney. "Semi-supervised graph clustering: a kernel approach." Machine learning 74, no. 1 (2009): 1-22.
Prototype 1 submission + Plan for Prototype 2 (submit by Wednesday)
Discussion: Relation Extraction
Please read:
1x Schema-based Relation Extraction
1x Open Relation Extraction
2x additional papers of your choice.
Bach, Nguyen, and Sameer Badaskar. "A review of relation extraction." Literature review for Language and Statistics II (2007). http://orb.essex.ac.uk/CE/CE807/Readings/A-survey-on-Relation-Extraction.pdf
Pantel, Patrick, and Marco Pennacchiotti. "Espresso: Leveraging generic patterns for automatically harvesting semantic relations." In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pp. 113-120. Association for Computational Linguistics, 2006. http://www.anthology.aclweb.org/P/P06/P06-1.pdf#page=153
Lin, D. and Pantel, P., 2001. Discovery of inference rules for question-answering. Natural Language Engineering, 7(04), pp.343-360. http://courses.cs.washington.edu/courses/cse573/08au/papers/pantel.pdf
Mintz, Mike, Steven Bills, Rion Snow, and Dan Jurafsky. "Distant supervision for relation extraction without labeled data." In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pp. 1003-1011. Association for Computational Linguistics, 2009. https://www.aclweb.org/anthology/P/P09/P09-1113.pdf
Riedel, Sebastian, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. "Relation extraction with matrix factorization and universal schemas." (2013). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.378.5619&rep=rep1&type=pdf#page=112
Etzioni, Oren, Michele Banko, Stephen Soderland, and Daniel S. Weld. "Open information extraction from the web." Communications of the ACM 51, no. 12 (2008): 68-74. http://www.cs.washington.edu/research/projects/aiweb/media/papers/tmpcLeDnr.pdf
Del Corro, Luciano, and Rainer Gemulla. "Clausie: clause-based open information extraction." In Proceedings of the 22nd international conference on World Wide Web, pp. 355-366. ACM, 2013. http://www2013.wwwconference.org/proceedings/p355.pdf
Scribe
Discussion: How to use Data Wrangling techniques to solve TREC CAR?
Paper presentation by Bahram
Discussion Prototype 2 Implementation Plans.
You can keep your reading notes brief.
NoSQL databases: a step to database scalability in web environment http://dl.acm.org/citation.cfm?id=2095583&dl=ACM&coll=DL
NoSQL databases: MongoDB vs Cassandra http://dl.acm.org/citation.cfm?id=2494447&dl=ACM&coll=DL
About MapReduce https://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/
First half of the Spark Streaming Programming Guide (everything before "Caching / Persistence"): https://spark.apache.org/docs/latest/streaming-programming-guide.html
Scribe
Group discussion: what in these papers can be used in TREC CAR?
Paper Presentation. Presenter: Colin
Questions regarding code submission Prototype 2
Upcoming events:
Monday: code submission Prototype 2
Tuesday: implementation plan Prototype 3
Voskarides, Nikos, Edgar Meij, Manos Tsagkias, Maarten de Rijke, and Wouter Weerkamp. "Learning to Explain Entity Relationships in Knowledge Graphs." In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 564–574, Beijing, China, July 26-31, 2015.
http://anthology.aclweb.org/P/P15/P15-1055.pdf
Schuhmacher, Michael, Benjamin Roth, Simone Paolo Ponzetto, and Laura Dietz. "Finding relevant relations in relevant documents." In European Conference on Information Retrieval, pp. 654-660. Springer International Publishing, 2016.
I originally pasted a URL to a different paper.
Here is the correct paper: https://ub-madoc.bib.uni-mannheim.de/41295/1/schuhmacher16a.pdf
Reinanda, Ridho, Edgar Meij, and Maarten de Rijke. "Mining, ranking and recommending entity aspects." In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 263-272. ACM, 2015. https://pdfs.semanticscholar.org/e667/c31119b6e56ea73cfeda8752bc5031025fd2.pdf
Blei, David M. "Probabilistic topic models." Communications of the ACM 55, no. 4 (2012): 77-84. http://www.cs.princeton.edu/~blei/papers/Blei2011.pdf
Kataria, Saurabh S., Krishnan S. Kumar, Rajeev R. Rastogi, Prithviraj Sen, and Srinivasan H. Sengamedu. "Entity disambiguation with hierarchical topic models." In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1037-1045. ACM, 2011.
Chang, Jonathan, Jordan L. Boyd-Graber, Sean Gerrish, Chong Wang, and David M. Blei. "Reading tea leaves: How humans interpret topic models." In NIPS, vol. 31, pp. 1-9. 2009. https://papers.nips.cc/paper/3700-reading-tea-leaves-how-humans-interpret-topic-models.pdf
Heinrich, Gregor. "Parameter estimation for text analysis." University of Leipzig, Tech. Rep (2008). http://faculty.cs.byu.edu/~ringger/CS601R/papers/Heinrich-GibbsLDA.pdf
Chapter 17 of book "Text Data Management and Analysis", Zhai & Massung, 2016.
Also see appendix A of the same book.
Li, Wei, and Andrew McCallum. "Pachinko allocation: DAG-structured mixture models of topic correlations." In Proceedings of the 23rd international conference on Machine learning, pp. 577-584. ACM, 2006. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.86.8142&rep=rep1&type=pdf
Dietz, Laura, Steffen Bickel, and Tobias Scheffer. "Unsupervised prediction of citation influences." In Proceedings of the 24th international conference on Machine learning, pp. 233-240. ACM, 2007. http://machinelearning.wustl.edu/mlpapers/paper_files/icml2007_DietzBS07.pdf
Dietz, Laura, Ben Gamari, John Guiver, Edward Snelson, and Ralf Herbrich. "De-Layering Social Networks by Shared Tastes of Friendships." In ICWSM. 2012. http://ciir.cs.umass.edu/~dietz/delayer/dietz-cameraready.pdf
Rosen-Zvi, Michal, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. "The author-topic model for authors and documents." In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pp. 487-494. AUAI Press, 2004. https://arxiv.org/pdf/1207.4169
Newman, David, Chaitanya Chemudugunta, and Padhraic Smyth. "Statistical entity-topic models." In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 680-686. ACM, 2006. http://datalab.ics.uci.edu/papers/rtpp331_newman.pdf
Balasubramanyan, Ramnath, and William W. Cohen. "Block-LDA: Jointly modeling entity-annotated text and entity-entity links." In Proceedings of the 2011 SIAM International Conference on Data Mining, pp. 450-461. Society for Industrial and Applied Mathematics, 2011. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.673.2347&rep=rep1&type=pdf
Chang, Jonathan, Jordan Boyd-Graber, and David M. Blei. "Connections between the lines: augmenting social networks with text." In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 169-178. ACM, 2009. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.684.3500&rep=rep1&type=pdf
Chapter 19 of book "Text Data Management and Analysis", Zhai & Massung, 2016.
Presenter: Reazul
I will be joining the discussion online.
Mandatory Reading
Chapter 3 in the book of Zhai and Massung, "Text Data Management and Analysis".
Bird, S., 2006, July. NLTK: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive presentation sessions (pp. 69-72). Association for Computational Linguistics. http://www.aclweb.org/anthology/P06-4#page=79
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. "The stanford corenlp natural language processing toolkit." In ACL (System Demonstrations), pp. 55-60. 2014. http://www.aclweb.org/website/old_anthology/P/P14/P14-5.pdf#page=67
Toolkits
Alonso, Omar, Daniel E. Rose, and Benjamin Stewart. "Crowdsourcing for relevance evaluation." In ACM SIGIR Forum, vol. 42, no. 2, pp. 9-15. ACM, 2008. http://www.cs.northwestern.edu/~pardo/courses/mmml/papers/collaborative_filtering/crowdsourcing_for_relevance_evaluation_SIGIR08.pdf
Kazai, Gabriella, and Natasa Milic-Frayling. "On the evaluation of the quality of relevance assessments collected through crowdsourcing." In SIGIR 2009 Workshop on the Future of IR Evaluation, p. 21. 2009. https://pdfs.semanticscholar.org/d631/31633e630d7d14d3d18d6ad0caf456c86cf7.pdf
Azzopardi, Leif, Maarten De Rijke, and Krisztian Balog. "Building simulated queries for known-item topics: an analysis using six european languages." In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 455-462. ACM, 2007. http://eprints.gla.ac.uk/3864/1/azzopardi3864.pdf
Savenkov, Denis, Scott Weitzner, and Eugene Agichtein. "Crowdsourcing for (almost) real-time question answering." In Workshop on Human-Computer Question Answering, NAACL. 2016. http://www.aclweb.org/anthology/W/W16/W16-0102.pdf
Amazon Mechanical Turk https://www.mturk.com/mturk/welcome
Crowdflower https://www.crowdflower.com/
Nenkova, Ani, and Kathleen McKeown. "A survey of text summarization techniques." In Mining text data, pp. 43-76. Springer US, 2012. https://pdfs.semanticscholar.org/8d7f/6dc8b0b9101580cc96f1f303d1eba3d590af.pdf
Blanco, R. and Lioma, C., 2012. Graph-based term weighting for information retrieval. Information retrieval, 15(1), pp.54-92. http://www.diku.dk/~c.lioma/publications/irj2012.pdf
Ouyang, You, Wenjie Li, Sujian Li, and Qin Lu. "Applying regression models to query-focused multi-document summarization." Information Processing & Management 47, no. 2 (2011): 227-237. https://www.researchgate.net/profile/Qin_Lu3/publication/220229610_Applying_regression_models_to_query-focused_multi-document_summarization/links/00b7d52f33e9ceb4f8000000.pdf
Chapter 16 of book "Text Data Management and Analysis", Zhai & Massung, 2016.
Bryan, feel free to suggest another paper.
Allam, Ali Mohamed Nabil, and Mohamed Hassan Haggag. "The question answering systems: A survey." International Journal of Research and Reviews in Information Sciences (IJRRIS) 2, no. 3 (2012). http://www.aliallam.net/upload/598575/documents/ECFF549932079694.pdf
Savenkov, Denis, and Eugene Agichtein. "When a knowledge base is not enough: Question answering over knowledge bases with external text data." In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 235-244. ACM, 2016. https://pdfs.semanticscholar.org/ee8e/a5af5cb957a912331c3fb0fd6f169ad79630.pdf
Gondek, D. C., Adam Lally, Aditya Kalyanpur, J. William Murdock, Pablo Ariel Duboué, Lei Zhang, Yue Pan, Z. M. Qiu, and Chris Welty. "A framework for merging and ranking of answers in DeepQA." IBM Journal of Research and Development 56, no. 3.4 (2012): 14-1. https://pdfs.semanticscholar.org/c094/4b6759e2e1a4026ef43936ee00c0ddb3d79a.pdf
Fan, James, Aditya Kalyanpur, David C. Gondek, and David A. Ferrucci. "Automatic knowledge extraction from documents." IBM Journal of Research and Development 56, no. 3.4 (2012): 5-1. http://brenocon.com/watson_special_issue/05%20automatic%20knowledge%20extration.pdf
Oh, J.H., Torisawa, K., Hashimoto, C., Iida, R., Tanaka, M. and Kloetzer, J., 2016, February. A semi-supervised learning approach to why-question answering. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (pp. 3022-3029). AAAI Press. http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/download/12208/12056
Tsur, Gilad, Yuval Pinter, Idan Szpektor, and David Carmel. "Identifying web queries with question intent." In Proceedings of the 25th International Conference on World Wide Web, pp. 783-793. International World Wide Web Conferences Steering Committee, 2016. http://www.cc.gatech.edu/~ypinter3/papers/2016_prefex-www-proc.pdf