Entity Aspect Link collection of Wikipedia Paragraphs (EAL v0.9)

We provide a large dataset of entity aspect annotations for the English Wikipedia (compatible with TREC CAR v2.0).

Download the archive here: trec-car-aspect-linked-corpus.tar (8.3GB)

License

The dataset is made available under Creative Commons Attribution Share Alike.

Entity Aspect Link collection of Wikipedia Paragraphs by Laura Dietz is licensed under CC BY-SA 4.0

This dataset is based on CC-SA work of

What are Entity Aspects and Entity Aspect Links?

Many entity linking tools can identify mentions of Wikipedia entities in text. Knowing which entities are mentioned in a text provides explicit semantics for reasoning about the text’s meaning. Hence, entity links have been shown to lead to significant improvements in a wide range of tasks. However, many entities have different aspects that can be referred to; for example, “Oysters” can be referenced in the context of food, aquatic species, or ecosystem restoration. An entity link refers to a Wikipedia page and hence will not differentiate which of these aspects is referred to in the context.

In contrast, entity aspects refer to different aspects of an entity, and hence differentiate between Oysters/as_food and Oysters/ecosystem_services. Entity aspect links connect the context of an entity mention to the closest-matching entity aspect referenced in that context. For example, if a text passage discusses oysters, entity aspect linking will associate the surface text “oyster” with one of its aspects, e.g. Oysters/as_food. By using an explicit catalog of entity aspects, entity aspect linking identifies fine-grained semantics of text, differentiating which aspects of the mentioned entities are discussed.

Entity Aspect Catalog and Training Data for Entity Aspect Linking

In this work, we build on a large catalog of entity aspects, which is harvested from the top-level sections of each entity’s Wikipedia article. At CIKM 2020, we provided an entity aspect catalog harvested from an English Wikipedia dump of January 2020, after cleaning and quality control.

The CIKM 2020 resource also includes training and evaluation data for an entity aspect linking tool. The dataset is derived from hyperlinks on Wikipedia that point to a section on another Wikipedia article.

Furthermore, we conducted experiments using the entity aspect linking features described by Nanni et al. (2018), and provide reference baselines for further research. The efficacy study demonstrates that a simple feature-based model, trained with list-wise learning-to-rank, can reliably predict entity aspect links.

Our entity aspect linking resource released at CIKM 2020 consists of

This resource: Entity Aspect Link annotations for TREC CAR.

Here we provide a novel resource of entity aspect link annotations of all Wikipedia pages, using the aspect catalog and trained aspect linking approach.

Annotated TREC CAR v2.0 / Wikipedia Collection

This release was derived from the v2.3 data release of the TREC Complex Answer Retrieval track, which is based on a Wikipedia dump from December 2016.

This resource is intended to support the development of novel retrieval methods within the TREC CAR passage retrieval, entity retrieval, and article arrangement tasks.

As there is a high degree of copy-and-pasted paragraphs across Wikipedia pages, we provide aspect link annotations for all paragraphs in the paragraph corpus (paragraphCorpus.v2.0.tar.xz). Each paragraph has a unique ID, which is referenced in the Wikipedia articles of the various TREC CAR collections, qrels, and ranking submissions. To obtain entity-aspect-linked Wikipedia articles, iterate through the articles of the complete unfiltered Wikipedia dump unprocessedAllButBenchmark.v2.1.tar.xz, and replace each paragraph with its annotated counterpart by matching paragraph IDs.
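The replacement step can be sketched as a simple lookup by paragraph ID. This is a minimal sketch, not the code used to build the resource: the `Paragraph` and `Article` classes below are hypothetical stand-ins, and an article is simplified to a flat list of paragraphs, whereas real TREC CAR articles nest paragraphs inside sections.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Paragraph:
    """Hypothetical stand-in: a paragraph ID plus its content."""
    para_id: str
    bodies: List[object] = field(default_factory=list)


@dataclass
class Article:
    """Hypothetical stand-in: an article as a flat list of paragraphs."""
    title: str
    paragraphs: List[Paragraph]


def index_by_id(annotated: List[Paragraph]) -> Dict[str, Paragraph]:
    """Build a lookup table from paragraph ID to the annotated paragraph."""
    return {p.para_id: p for p in annotated}


def replace_paragraphs(article: Article,
                       annotated_by_id: Dict[str, Paragraph]) -> Article:
    """Swap in the annotated version of each paragraph when one exists.

    Paragraphs without an annotated counterpart are kept unchanged.
    """
    swapped = [annotated_by_id.get(p.para_id, p) for p in article.paragraphs]
    return Article(article.title, swapped)


# Usage with made-up IDs and content.
annotated = index_by_id([Paragraph("id-1", ["annotated text"])])
article = Article("Example", [Paragraph("id-1", ["raw text"]),
                              Paragraph("id-2", ["other text"])])
linked = replace_paragraphs(article, annotated)
```

Because the same paragraph ID can occur in several articles, building the index once and reusing it across all articles avoids re-reading the annotated corpus.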

Note that this dump does not contain pages representing CAR queries - these are provided separately with the

Example paragraph originating from the Wikipedia article on William Ayscough.

{000001ae0be1235060c10e6edc99f7791e86a04c} Ayscough was murdered on 29 June 1450 by an angry mob during Jack Cade’s rebellion, as he had married Henry VI and the deeply unpopular Margaret of Anjou.

Entity Aspect Links:

The dataset can be read with the “trec-car-tools” library, which is available for Java/Maven and Python/PyPI.

Open the archive as a paragraph corpus.

Aspect link information is represented in the “Link Section” field of an entity link.

Here is a Java code snippet showing how to access the information.


// A paragraph is represented as a list of bodies, each either plain text or a link.
for (Data.ParaBody body : paragraph.getBodies()) {
    if (body instanceof Data.ParaLink) { // found a link
        Data.ParaLink paraLink = (Data.ParaLink) body;
        String entityAspect = paraLink.getLinkSection(); // entity aspect (as in section name) <---
        String entityPageId = paraLink.getPageId();      // entity id
        String entityPageName = paraLink.getPage();      // title name of entity
        String anchorText = paraLink.getAnchorText();    // surface text of the entity mention
    }
    if (body instanceof Data.ParaText) { // plain text
        // ...
    }
}

The Python code follows analogously.
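As a rough illustration of the Python side, the sketch below collects the aspect links of one paragraph. It is self-contained and does not import trec-car-tools: `ParaText`, `ParaLink`, and `Paragraph` are hypothetical stand-ins, and the attribute names (`link_section`, `pageid`, `page`, `anchor_text`, `bodies`) are assumed to mirror the Java getters above, so verify them against the installed version of the library. The example values are made up.

```python
from collections import namedtuple

# Hypothetical stand-ins for the library's paragraph classes (assumed attribute names).
ParaText = namedtuple("ParaText", ["text"])
ParaLink = namedtuple("ParaLink", ["page", "link_section", "pageid", "anchor_text"])
Paragraph = namedtuple("Paragraph", ["para_id", "bodies"])


def aspect_links(paragraph):
    """Return (anchor text, entity id, entity title, entity aspect) per link.

    Bodies are told apart by duck typing: anything carrying a
    `link_section` attribute is treated as a link, everything else
    as plain text.
    """
    links = []
    for body in paragraph.bodies:
        if hasattr(body, "link_section"):  # found a link
            links.append((body.anchor_text,    # surface text of the mention
                          body.pageid,         # entity id
                          body.page,           # title name of the entity
                          body.link_section))  # entity aspect (section name)
    return links


# Usage with illustrative values.
para = Paragraph("p1", [
    ParaText("We ate "),
    ParaLink(page="Oysters", link_section="as_food",
             pageid="Oysters", anchor_text="oysters"),
    ParaText(" today."),
])
for anchor, entity_id, title, aspect in aspect_links(para):
    print(anchor, entity_id, title, aspect)
```

Duck typing is used here so that the same function would work on the library's own objects if they expose these attributes; an `isinstance` check against the library's classes would be the closer analogue of the Java snippet.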

Used Entity Aspect Linking Model Details

Our entity aspect linking model is based on features described in Nanni et al. (2018) and Ramsdell and Dietz (2020).

We train the model on the “train-small” aspect link training set using rank-lips (list-wise learning to rank, with coordinate ascent, optimizing for MAP).

The trained aspect linking model uses text, word embeddings, and entity similarity features between context (sentence and paragraph) and aspect content and aspect name (i.e., section heading).

P@1 on the validation dataset: 0.70
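For reference, P@1 is the fraction of entity mentions for which the top-ranked aspect is the correct one. The sketch below shows the computation under assumed data structures (a ranking per mention and one gold aspect per mention); the function name and the example IDs are illustrative and not part of the released tooling.

```python
def precision_at_1(rankings, gold):
    """Compute P@1 over a set of entity mentions.

    rankings: dict mapping a mention id to its candidate aspect ids,
              sorted from best to worst score.
    gold:     dict mapping a mention id to the correct aspect id.
    """
    if not rankings:
        return 0.0
    hits = sum(1 for mention, ranked in rankings.items()
               if ranked and ranked[0] == gold.get(mention))
    return hits / len(rankings)


# Usage with made-up mentions: one of two top predictions is correct.
rankings = {
    "m1": ["Oysters/as_food", "Oysters/ecosystem_services"],
    "m2": ["Oysters/as_food", "Oysters/ecosystem_services"],
}
gold = {"m1": "Oysters/as_food", "m2": "Oysters/ecosystem_services"}
score = precision_at_1(rankings, gold)
```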

Hyperparameters (tuned on the validation dataset):

Publications that use this collection

Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant No. 1846017. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.