Retrieving Knowledge from the Web
Laura Dietz (dietz@cs.unh.edu)
xkcd.com/1592/
What is Knowledge? (Pragmatic Definition)
What is an Entity?
Person, Place, Gene, Protein, Event, Thing: anything with an entry in a knowledge base (here: anything with a Wikipedia article)
Related entities
Types and categories
Name aliases
Cacao, Guarana
alkaloid, cocoa, caffeine
Theobromine
Bitter_compounds
United Kingdom
Theresa May
Article
Entity Index
United Kingdom
Conservative Party
20th-century women politicians
Theresa May
Prime Minister May
European Union
Euroscepticism in the United Kingdom
Brexit
Withdrawal from the European Union
United Kingdom
prime_minister_of
in_favor_of
untyped relationship, e.g. article link
What is a Knowledge Graph?
United Kingdom
Country
Open Information Needs
Requiring long, complex answers. Intended queries:
- drink water good
- dark chocolate health benefits
- causes conflict Middle East
- UK leaving Europe
- spent nuclear fuel
If yes, why? If not, why not? Causes? Involvements? Controversy? Backstory? What do I need to know to understand the answer?
xkcd.com/1592/
Health effects of chocolate
From Wikipedia, the free encyclopedia
Desperation, Pacification, Expectation, Acclamation, Realization, It's Fry's. Advertisement of Fry's 'Five Boys' milk chocolate

The health effects of chocolate refer to the possible positive and negative effects on health of eating chocolate. Unconstrained consumption of large quantities of any energy-rich food, such as chocolate, without a corresponding increase in activity, increases the risk of obesity. Raw chocolate is high in cocoa butter, a fat removed during chocolate refining, then added back in varying proportions during manufacturing. Manufacturers may add other fats, sugars, and powdered milk as well. Although considerable research has been conducted to evaluate the potential health benefits of consuming chocolate, there are insufficient studies to confirm any effect and no medical or regulatory authority has approved any health claim.

Contents: 1 Research (1.1 Acne, 1.2 Addiction, 1.3, 1.4 Heart and blood vessels, 1.5 Stimulant, 1.6 Weight gain), 2 Lead content, 3 Polyphenol content, 4 Other animals, 5 References
Provide More!
Query: dark chocolate health benefits

7 Proven Health Benefits of Dark Chocolate (No. 5 is Best)
authoritynutrition.com/7-health-benefits-dark-chocolate
Dark chocolate is loaded with nutrients that can positively affect your health. Made from the seed of the cocoa tree, it is one of the best sources of antioxidants on ...

Six Health Benefits of Dark Chocolate / Nutrition ...
www.fitday.com/.../6-health-benefits-of-dark-chocolate.html
Dark chocolate has recently been discovered to have a number of healthy benefits. While eating dark chocolate can lead to the health benefits described below ...

Pick Dark Chocolate for Health Benefits - WebMD - Better ...
www.webmd.com/diet/20120424/pick-dark-chocolate-health-benefits
24/04/2012 · Chocolate and Health Benefits: Study Details. Hong compared white chocolate, which has no cocoa solids, to regular dark chocolate containing 70% …

Dark Chocolate Is Healthy Chocolate - WebMD - Better ...
www.webmd.com/diet/20030827/dark-chocolate-is-healthy-chocolate
27/08/2003 · Dark Chocolate Is Healthy Chocolate. By Daniel J. DeNoon on August 27, 2003. WebMD News Archive. Dark Chocolate Has Health Benefits Not Seen in …

Health Benefits of Dark Chocolates - Mercola.com
articles.mercola.com/.../03/31/dark-chocolate-health-benefits.aspx
31/03/2014 · By Dr. Mercola. The health benefits of dark chocolate are all the rage right now, with increasing numbers of studies pointing to its …
Synthesize!
Bing News Search Results
Chocolate
facts about chocolate and health. how much chocolate is good for your health?

United_States
milk solids in europe dark chocolate must contain at least 35% chocolate liquor and have cacao or cocoa content of at least 43%. In the united states however the government requires a minimum of only 15%

Circulatory_system
cocoa flavanols have also been shown to have potential anti-inflammatory activities that are relevant to cardiovascular health with inflammation substances are formed which can produce adverse cardiovascular effects now dr shock will never let a chance to promote chocolate consumption slip

Theobromine
chocolate could alleviate some blood circulation problems in the body also increasing blood flow to the brain which could have benefits for memory and dementia theobromine is the main alkaloid in cocoa and dark chocolate some people say that the theobromine in dark chocolate works better for them

American_Heart_Association
of course this is not a joke. a study was made in the US by specialists and published in the journal of the american heart association

C-reactive_protein
cocoa and chocolate can modulate platelet function through a multitude of pathways. chocolate and c-reactive protein levels dark chocolate effect on platelet activity c-reactive protein and lipid profile

Italy
eating dark chocolate could help control diabetes and blood pressure, italian experts say

Yale_University
a research study in 2008 at yale university suggests that consumption by pregnant women of chocolate rich in the chemical could help prevent pre eclampsia

Health_effects_of_chocolate
chocolate has been a treat for thousands of years and in ancient civilizations was thought to be medicinal
Query 234
dark chocolate health benefits
Demo available:
Complex Answer Retrieval
1. Introduction
2. Complex Answer Retrieval
3. Approaches: Utilizing KGs for Text IR
4. Machine Learning for Latent Entities
5. Conclusion
Given query Q (= open domain topic), automatically compose an encyclopedic article.
predominant facts about topic
Heading
more details and stories
Query
Query
Knowledge Graph+Web Materials
Advantage:Access to near-infinite material on the Web
Vision: Query-specific Wikipedia Construction
Task: TREC Complex Answer Retrieval
Given: query Q (= open domain topic) and an outline of headings (H1, ..., Hn). For every heading, return a ranking of passages.
predominant facts about topic
Heading
more details and stories
Query
Query
Heading
Query
Heading
TREC Complex Answer Retrieval Data Set
Task: For each heading, rank paragraphs.
Eval 1: Article reconstruction
Eval 2: Relevance judgments (by NIST)
Original article(Wikipedia)
Outline
Paragraphcorpus
Ground truth(qrels)
Data online: http://trec-car.cs.unh.edu
Held out from participants
held out for evaluation
Datasets for Download
Paragraph Corpus: paragraphcorpus-v1.5: dedup. paragraphs with links, 20 mio
Test topics: benchmarkY1test.public-v1.5: 1843 queries (test set)
Training data (Articles, Truth, Outlines):
- benchmarkY1train-v1.5: 1583 queries, 5 folds
- test200-v1.5: 1680 queries (fold 0 only)
- train-v1.5: 2,608,000 queries, 5 folds
- unprocessedtrain-v1.5: raw data from all 5 folds
Text-based Approaches
Simple approach: Q' = Query + Heading 1 + Heading 1.1 + ...
Candidate Method: rank passages with a keyword search method (Okapi BM25)
Supervised Re-Ranking:
- Duet Model (neural approach)
- Learning to Rank
Query
Heading 1
Heading 1.1
Outline
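The candidate step above can be sketched in a few lines; the BM25 implementation and the toy passages below are an illustration, not the actual TREC CAR pipeline.

```python
import math
from collections import Counter

def bm25_rank(query_terms, docs, k1=1.2, b=0.75):
    """Rank documents by Okapi BM25 score for the expanded query Q'."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency of each query term
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return sorted(range(N), key=lambda i: -scores[i])

# Q' = query plus heading terms from the outline
q_prime = "dark chocolate health benefits heart blood vessels".split()
passages = [
    "cocoa flavanols may support heart health".split(),
    "the beatles are a famous uk band".split(),
    "dark chocolate benefits blood vessels".split(),
]
ranking = bm25_rank(q_prime, passages)  # best-matching passage first
```

A re-ranker (Duet, learning to rank) would then rescore the top of this candidate ranking.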
Training
MAP on holdout fold
Number of training samples(experimentation environment)
Rocchio: Classification on Headings (see WikiKreator [Banerjee15])
Training data + time essential!
Candidate + LTR(Reranking, Candidate) best
Candidate + Reranking worse than Candidate
End-to-end Performance
TREC CAR v1.5 http://trec-car.cs.unh.edu
Train in Minutes
keyword expansion
word embeddings
deep learning
keyword match
Great Problem, but: How to solve it?
Issue: many relevant passages do not contain query terms. Reminder: we want long answers! For complex answers, it helps to have:
- a deeper understanding of the text
- relevant concepts, entities, and relations
- an explanation of why something is relevant
Approaches: Utilizing KGs for Text IR
1. Introduction
2. Complex Answer Retrieval
3. Approaches: Utilizing KGs for Text IR
4. Machine Learning for Latent Entities
5. Conclusion
Mention
Entity
Task: Entity Linking (aka Wikification)
Entity linking algorithms detect entity mentions in text and align them to their knowledge base entry.
Link
Category: Food
sweet, brown, dark
Chocolate
Theobromine
chocolate
Query: dark chocolate health benefits
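A minimal entity-linking sketch, assuming a toy alias dictionary; real linkers disambiguate candidates using context rather than naively taking the first one.

```python
# hypothetical alias dictionary: surface form -> candidate KB entries
aliases = {
    "dark chocolate": ["Chocolate"],
    "chocolate": ["Chocolate"],
    "theobromine": ["Theobromine"],
    "uk": ["United_Kingdom"],
}

def link_entities(text):
    """Greedy longest-match spotting of entity mentions, linking each
    mention to its first candidate entity."""
    tokens = text.lower().split()
    links, i = [], 0
    while i < len(tokens):
        for n in (3, 2, 1):                      # try the longest n-gram first
            mention = " ".join(tokens[i:i + n])
            if mention in aliases:
                links.append((mention, aliases[mention][0]))
                i += n
                break
        else:
            i += 1
    return links

links = link_entities("dark chocolate health benefits")
```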
Bag-of-words -> Bag-of-entities
Keyword search, where entities are keywords. Advantages:
- resolve synonyms
- resolve ambiguity
But: misses relevant entities
Q'=
(Query Entities)
Many Important Entities are not Mentioned
Query EU UK relations
dark chocolatehealth benefits
Queryentities
Relevantentities
EU
Brexit
Theresa May
chocolate
health
Theobromine
circulatory system
dementia
Named Entities
Concept Entities
UK
Document Retrieval with Entities
Query
Documents
Entities
Entities believed to be relevant
Text we want to rank
Document Retrieval with Entities
Q
Q: dark chocolate health benefits
Q
pretend these are relevant
Theobromine
[annotated passage: "theobromine benefits ...sweet... cocoa bean", with spans tagged as name / query term / article term / name]
Category: Food
sweet, brown, dark
Chocolate
Theobromine
Machine Learning for Latent Entities
1. Introduction
2. Complex Answer Retrieval
3. Approaches: Utilizing KGs for Text IR
4. Machine Learning for Latent Entities
5. Conclusion
Machine Learning / Probabilistic Models
Three approaches based on similar ideas:
- Dalton: Entity Query Feature Expansion
- Xiong: EsdRank
- Liu: Latent Entity Space
An edge represents a measure of compatibility or similarity.
One possible value for E: no ground truth!
One possible value for D: ground truth available (TREC)
Probabilistic model with random variables Q,E,D.
Latent Entity Space [Liu IRJ15]
Wide range of experiments on which similaritymeasure / data source combination works best.
similarity of LM(q) and LM(e)
similarity of LM(e) and LM(d)
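A toy sketch of the Latent Entity Space scoring idea: sum over entities of sim(LM(q), LM(e)) times sim(LM(e), LM(d)). The dot-product similarity and the entity descriptions are illustrative assumptions, not Liu's exact estimators.

```python
from collections import Counter

def lm(tokens):
    """Maximum-likelihood unigram language model."""
    tf = Counter(tokens)
    n = len(tokens)
    return {w: c / n for w, c in tf.items()}

def sim(lm_a, lm_b):
    """Dot product of two unigram distributions (one of many choices)."""
    return sum(p * lm_b.get(w, 0.0) for w, p in lm_a.items())

def les_score(query, doc, entity_texts):
    """score(d) = sum_e sim(LM(q), LM(e)) * sim(LM(e), LM(d))."""
    q_lm, d_lm = lm(query), lm(doc)
    return sum(sim(q_lm, lm(text)) * sim(lm(text), d_lm)
               for text in entity_texts.values())

# hypothetical entity descriptions (e.g. from Wikipedia articles)
entities = {
    "Theobromine": "theobromine alkaloid in cocoa and chocolate".split(),
    "Obesity": "obesity risk of energy rich food".split(),
}
score = les_score("dark chocolate health benefits".split(),
                  "cocoa alkaloid theobromine may benefit health".split(),
                  entities)
```

Note how the document is rewarded even though it shares few query terms: the entity acts as a bridge between query and document language models.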
Entity Query Feature Expansion [Dalton SIGIR14]
Combine features then use standard learning to rank (MAP)
n x m features!
n different ways to compute p(q|e)
m different ways to compute p(e|d)
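The n x m feature construction can be sketched as a cross product of component estimators; the two p(q|e) estimators and one p(e|d) estimator below are hypothetical stand-ins for Dalton's actual feature set.

```python
def eqfe_features(query, entity, doc, pq_estimators, pe_estimators):
    """One feature per (p(q|e) estimator, p(e|d) estimator) pair: n * m total."""
    return [pq(query, entity) * pe(entity, doc)
            for pq in pq_estimators
            for pe in pe_estimators]

# hypothetical component estimators
pq_estimators = [
    lambda q, e: 1.0 if e.lower() in q.lower() else 0.0,          # exact name match
    lambda q, e: len(set(q.split()) & set(e.lower().split())) / len(q.split()),
]
pe_estimators = [
    lambda e, d: d.count(e.lower()) / max(len(d.split()), 1),     # mention frequency
]

feats = eqfe_features("dark chocolate health benefits", "chocolate",
                      "dark chocolate contains cocoa",
                      pq_estimators, pe_estimators)
```

The resulting feature vector is then fed to a standard learning-to-rank model.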
EsdRank [Xiong CIKM15]
Discriminative probabilistic model based on generalized linear models + an EM algorithm for learning weights w1, w2. Only n+m features! But needs custom learning code.
Relation to Query / Latent Concept Expansion
Various vocabularies, but all represented by sets
[Plot: performance of expansion variants combining vocabularies (W words, E entities, A aliases, C categories, T types, M mentions) with sources (rm: relevance model, kb: knowledge base, ecm: entity context model) at various expansion sizes, plus entity-link features]
[Bar chart: MAP on Robust04 for sdm, rm, wikiRm1, and EQFE]
Entity Query Feature Expansion [Dalton SIGIR14]
Results on Robust04 benchmark (ad hoc document retrieval)
combined features
Conclusion
1. Introduction
2. Complex Answer Retrieval
3. Approaches: Utilizing KGs for Text IR
4. Machine Learning for Latent Entities
5. Conclusion
Future & Ongoing Work
- Collection building for historic events: Nanni, Ponzetto, Dietz. Building entity-centric event collections. JCDL '17.
- Finding relevant relations: Kadry, Dietz. Open Relation Extraction for Support Passage Retrieval: Merit and Open Issues. SIGIR '17.
- Understanding the message of images: Weiland, Hulpus, Ponzetto, Dietz. Using object detection, NLP, and knowledge bases to understand the message of images. MMM '17.
- Domain-specific entity linking: Nanni, Zhao, Ponzetto, Dietz. Enhancing Domain-Specific Entity Linking in DH. DH '17.
- Query-specific Wikipedia construction: in preparation.
Conclusion: Retrieving Knowledge from the Web
Many "prob-portunities" when retrieving detailed answers:
- Relevant KG edges/elements?
- Relevant contexts of entities?
- Relevant entity aspects?
Slides online: www.cs.unh.edu/~dietz
xkcd.com/1592/
KG4IR Workshop at SIGIR (+ mailing list)
TREC Complex Answer Retrieval track
Tutorial: Utilizing KGs for Text-centric IR
Looking for collaborators!
http://kg4ir.github.io
http://trec-car.cs.unh.edu
github.com/laura-dietz/tutorial-utilizing-kg
dietz@cs.unh.edu
Entity Ranking Evaluation on ClueWeb12
Evaluation Data: http://rewq.dswlab.de/
[Bar chart: MAP for context models full, wiki, docs, entity-8, entity-50, types; methods RMdoc, Obj IRwiki, L2R, Types. L2R combines everything!]
Query Expansion with Uncertainties
Taking uncertainty and confidences into account.
Ambiguity of names
uncertainty of links
[Raviv SIGIR16, Liu IRJ15]
uncertainty of words
Query: dark chocolate health benefits
Category: Food
sweet, brown, dark
Chocolate
Theobromine
Query Entities through Entity Linking
Retrieve entities from knowledge graphto obtain ranking of entities (with score)
Q
1st
3rd
2nd
Cocoa_bean
Theobromine
Relevant Entities through Object Retrieval
Notation: Search Index
[Pound10, Balog11, Zhiltsov15, Dalton14, Xiong15]
1. Retrieve document ranking
2. Entity-link documents in top K
3. Derive distribution over entities (bag of entities; see Relevance Model / RM3)
Issue: entities are not necessarily near query terms.
1st
3rd
2nd
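Steps 1 to 3 above can be sketched as follows: aggregate entity links from the top-k retrieved documents, weighted by retrieval score, into a distribution over entities (in the spirit of relevance models; the retrieval output below is hypothetical).

```python
from collections import Counter

def entity_relevance_model(retrieved, k=10):
    """Distribution over entities from the entity-linked top-k documents,
    weighting each entity by the retrieval scores of the documents it
    appears in (cf. Relevance Model / RM3)."""
    weights = Counter()
    for doc_entities, score in retrieved[:k]:
        for e in doc_entities:
            weights[e] += score
    total = sum(weights.values())
    return {e: w / total for e, w in weights.items()}

# hypothetical output: (entities linked in doc, retrieval score), ranked
retrieved = [
    (["Chocolate", "Theobromine"], 2.0),
    (["Chocolate", "Obesity"], 1.0),
]
p_e = entity_relevance_model(retrieved)
```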
Relevant Entities through Pseudo-Rel. Feedback
[Lavrenko01; Dalton14, Liu15, Schuhmacher15]
Q
pretend these are relevant
Relev. Entities through Proximity to Query Words
Using distance between entity mentions and query words q as a measure for relevance.
q
[Petkova & Croft 07; Liu & Fang 15]
1. Collect contexts of entity links
2. Concatenate link contexts into one pseudo-doc per entity
3. Given query Q, retrieve pseudo-docs, thereby ranking entities
Theobromine
dark chocolate chocolate health chocolate
Relevant Entities through Entity Context Model
q
q
Q
[Dalton 14, Liu 15]
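The pseudo-document construction of the entity context model can be sketched as below; the token positions and link offsets are assumed toy input from an entity linker.

```python
from collections import defaultdict

def build_pseudo_docs(linked_docs, window=1):
    """Concatenate the context window around every link to an entity into
    one pseudo-document per entity; retrieving these pseudo-docs with the
    query then yields a ranking of entities."""
    pseudo = defaultdict(list)
    for tokens, links in linked_docs:          # links: (token position, entity)
        for pos, entity in links:
            lo, hi = max(0, pos - window), pos + window + 1
            pseudo[entity].extend(tokens[lo:hi])
    return dict(pseudo)

docs = [
    ("dark chocolate helps health".split(), [(1, "Chocolate")]),
    ("theobromine in chocolate".split(), [(2, "Chocolate"), (0, "Theobromine")]),
]
pseudo = build_pseudo_docs(docs)
```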
Retrieval Models over Terms, Names, Links
use your favorite retrieval model here!
Entity Link
[annotated passage: spans tagged as name / query term / article term / name]
Category: Food
sweet, brown, dark
Chocolate
Theobromine
Query: dark chocolate health benefits
Category: Food
sweet, brown, dark
Chocolate
Theobromine
So Far: Entities as Tags
But knowledge graphs contain so much more information!
names
types and categories
links and relations
How can we make use of the information?
Knowledge Graph Expansion
1. Introduction
2. Vision
3. Approaches: Utilizing KGs for Text IR
4. Knowledge Graph Expansion
5. Relation Extraction
6. Entity Aspects
7. Conclusion
Using Relations and Types with Entity Links
inferred as relevant because of link
inferred as relevant because of same type
originally relevant
Should this doc be promoted in the ranking?
Entity Link
Link
Brexit
UK
France
Brexit
France
[Hasibi 16, Wordnet: Kotov 12]
Using Knowledge Graph Structure
inferred as relevant because of link
inferred as relevant because of same type
originally relevant
Link
Brexit
UK
France
Document Retrieval with (More) Entities
Query
Documents
Entities
Entities known or assumed to be relevant
Docs we want to rank
UK
EU
Brexit
France
Aberdeen
Boston et al 2013: Wikimantic: Toward effective ...
Weight entities by:
M: how well E's article content matches the query
MR: how often E is linked by others (PageRank)
Method             F1 on TREC QA
content            76.92
content + d*graph  79.47  (d=0.0001)
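The content-plus-graph combination can be sketched as below; the per-entity scores M and MR are made-up toy values, not numbers from the paper.

```python
def combined_weight(content_match, pagerank, d=0.0001):
    """Wikimantic-style combination: M + d * MR, with the small d from
    the slide so that content match dominates."""
    return content_match + d * pagerank

# hypothetical per-entity scores (M: content match, MR: PageRank)
entities = {
    "Chocolate":   {"M": 0.8, "MR": 120.0},
    "The_Beatles": {"M": 0.1, "MR": 900.0},
}
ranked = sorted(entities,
                key=lambda e: -combined_weight(entities[e]["M"],
                                               entities[e]["MR"]))
```

With d this small, a highly central but off-topic entity cannot outrank a good content match.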
KG expansion: A Potential Issue
Example query: EU UK relations. Consider: the connection is correct, but it is not relevant in the context of "UK" as in "EU relations". If we promoted docs because they talk about The Beatles, we would hurt the ranking quality.
UK
The Beatles
General Approach: Graph Expansion
Many connections in a knowledge graph. Only few are relevant! Expanding with non-relevant entities leads to low-precision rankings.
UK
EU
Brexit
France
The Beatles
EU law
Weight Edges / Nodes in the Knowledge Graph
Popularity measures:
- graph walks: PageRank / HITS
- degree
Connectivity measures (seeds):
- shortest paths
- entity relatedness
Graph clustering
Issue: these do not consider the query.
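A minimal power-iteration PageRank over a toy KG adjacency list; as the slide notes, this popularity measure does not consider the query at all.

```python
def pagerank(graph, damping=0.85, iters=50):
    """Power-iteration PageRank over an adjacency list {node: [out-neighbors]}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, out in graph.items():
            share = rank[v] / (len(out) if out else n)
            targets = out if out else nodes     # dangling node: spread uniformly
            for u in targets:
                new[u] += damping * share
        rank = new
    return rank

# toy knowledge graph around the running example
kg = {
    "UK": ["EU", "Brexit", "The_Beatles"],
    "EU": ["UK", "Brexit"],
    "Brexit": ["UK", "EU"],
    "The_Beatles": ["UK"],
}
pr = pagerank(kg)
```

Here The Beatles score low only because of graph structure; for the query "EU UK relations" they should score low because of relevance, which PageRank cannot express.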
Entity Aspects and the Graph Structure
An open issue remains:
- entities have multiple aspects
- graph = overlay of all aspects
Growth of KGs leads to:
- better coverage of relevant facts
- many more spurious facts :-(
When relevant != popular: how to tell which edges are relevant?
Relation Extraction (for Relevant Relations)
1. Introduction2. Vision3. Approaches: Utilizing KGs for Text IR4. Knowledge Graph Expansion5. Relation Extraction6. Entity Aspects7. Conclusion
Fix: RM3 + Graph Walk
UK - The Beatles: can be solved with pseudo-relevance feedback.
1. Retrieve documents for Q
2. Delete edges to entities that are not mentioned
But: non-relevant relations remain and lead to erroneous entity expansions.
Relation Extraction. Research question: relevant documents + extraction = relevant relations?
Task: Extracting Relevant Relations
works_for
works_for
[Schuhmacher 16]
Q
rf:founded_by
Eben_Upton
Premier_Farnell
United_Kingdom
Broadcom
University_of_Cambridge
rf:member_of
rf:member_of
rf:headquarters
England
Harriet_Green
dbp:membership
rf:member_of
rf:headquarters
dbop:almaMater
Reuters
rf:headquarters
Raspberry_Pi_Foundation
rf:member_of
Goal: relations need to be relevant and correct. Query: Raspberry Pi
Relevant Relations through Relevant Documents
not relevant
relevant
dbp knowledge base
rf relation extraction
Big Question: Edge Relevance
How to infer which other connected entities/nodes are relevant for the information need Q? ...and therefore safe for:
- expansion
- promotion in entity ranking?
Not just those:
- with many connections (PageRank)
- mentioned in feedback docs
- extracted with relation extraction
Big Question: Context Relevance
How to infer which contexts of entity links are relevant for the information need Q? ...and therefore safe for:
- expansion
- promotion in passage ranking?
Not just those with:
- popular words (RM3)
- frequent entity mentions
Entity Aspects
1. Introduction2. Vision3. Approaches: Utilizing KGs for Text IR4. Knowledge Graph Expansion5. Relation Extraction6. Entity Aspects7. Conclusion
Entity Aspects
Danger: an entity is relevant, but only because of one aspect => many non-relevant aspects of relevant entities.
Example aspects about UK:
- still a member of the European Union
- is a constitutional monarchy
- the Raspberry Pi was invented in the UK
- there are many great UK bands
Depending on the query, some are relevant, some not.
How to Represent Entity Aspects?
As terms? As types? As is-a? Related entities? Relations? Language Model
UK bands
brexit
UK member of "European Union"
UK as a European country
UK Theresa_May
Theresa_May prime_minister_of UK
p(brexit)=0.4, p(leave)=0.25, p(immigration)=0.10
[Reinanda SIGIR15, Liu IRJ15, Prasojo CIKM15]
Entity Aspects: Using KG ...
UK
Theresa May
prime_minister_of
bands
EuropeanUnion
UK Theresa_May; Theresa_May prime_minister_of UK
UK bands
UK member_of European_Union; UK europe
TM
Entity Aspects: Using KG and Text
UK
is a member of the
UK
UK
UK
Theresa May
prime_minister_of
bands
Many bands are very good
is the Prime Minister of the
EuropeanUnion
UK bands
TM
EU
UK Theresa_May; Theresa_May prime_minister_of UK
UK member_of European_Union; UK europe
Entity Aspects: Infer Relevance, Match, Extract
Use KG + text to model, for each relevant entity:
- what are the different aspects of the entity?
- which aspects are relevant?
- how are relevant aspects best represented?
Generic pattern:
1. Information extraction
2. Relevance prediction
3. Matching (inverse extraction)
Entity Aspects as Terms
UK
UK
bands
Many bands are very good
UK bands
Passage language model:
- pseudo-relevance feedback
- context of entity links
- proximity to query terms
[Blanco10, Dalton14, Liu15, Petkova07]
Language model from article / description [Bendersky12, Dalton14, Liu15]
UK
prime_minister_of
Relation Extraction:
- supervised extraction from text [Schuhmacher ECIR16]
Infer & Extract Aspects
Entity Aspects through Relations (Triples)
Theresa_May prime_minister_of UK
Feature-based retrieval:
- relation terms
- cosine of word vectors [Voskarides ACL15]
Match Aspects
UK
TheresaMay
movies
UK
is the Prime Minister of the
TM
UK
Conclusion
1. Introduction
2. Vision
3. Approaches: Utilizing KGs for Text IR
4. Knowledge Graph Expansion
5. Relation Extraction
6. Entity Aspects
7. Conclusion
Using Types and Categories
1. Matching entities in documents
2. Find relevant entities
3. Graph expansion
4. Entity types
5. Combination of multiple sources
6. Machine learning
7. Entity aspects
Entity Types Inferred through Entity Links
Which types are relevant?
How to match types to documents?
majority types among entities
prefer docs with entities of this type
a) same-type entities [Kaptein CIKM10]
Method       Full Text  Link  Type+Link
MAP on INEX  0.03       0.09  0.13
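Majority-type voting (approach a) can be sketched as below; the entity-to-type map is a toy stand-in for KG type assignments.

```python
from collections import Counter

def majority_types(entities, types_of, top=2):
    """Each retrieved entity votes for its KG types; keep the majority
    types, then prefer docs containing entities of those types."""
    votes = Counter()
    for e in entities:
        votes.update(types_of.get(e, []))
    return [t for t, _ in votes.most_common(top)]

# hypothetical KG type assignments
types_of = {
    "Theresa_May": ["Politician", "Person"],
    "Angela_Merkel": ["Politician", "Person", "Scientist"],
    "UK": ["Country"],
}
top_types = majority_types(["Theresa_May", "Angela_Merkel", "UK"], types_of)
```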
Entity Types through Text Classification
Which types are relevant?
How to match types to documents?
classify query terms with naive Bayes
classify documents with naive Bayes
b) term classifier [Xiong CIKM15]
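The term classifier (approach b) can be sketched as a multinomial naive Bayes over toy training examples; this illustrates the idea, not Xiong's actual feature set.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """Multinomial naive Bayes with add-one smoothing.
    examples: list of (token list, type label)."""
    class_counts = Counter(lbl for _, lbl in examples)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, lbl in examples:
        term_counts[lbl].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(examples) for c, n in class_counts.items()}
    V = len(vocab)
    cond = {c: {w: (term_counts[c][w] + 1) / (sum(term_counts[c].values()) + V)
                for w in vocab}
            for c in class_counts}
    return priors, cond, vocab

def classify(tokens, priors, cond, vocab):
    """Return the type with the highest posterior log-probability."""
    best, best_lp = None, float("-inf")
    for c in priors:
        lp = math.log(priors[c]) + sum(math.log(cond[c][w])
                                       for w in tokens if w in vocab)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

train = [
    ("chocolate cocoa sweet".split(), "Food"),
    ("prime minister election".split(), "Politician"),
]
priors, cond, vocab = train_nb(train)
label = classify("cocoa chocolate".split(), priors, cond, vocab)
```

The same classifier is applied to query terms and to documents, and relevance is scored by agreement between the predicted types.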