Authors: Jordan Ramsdell and Laura Dietz.
The test collection and all associated data are released under a Creative Commons Attribution-ShareAlike 4.0 International License.
The collection of 1 million EAL instances is provided in the following (disjoint) partitions.
We only provide EAL instances from section hyperlinks that meet our quality criteria.
We provide the EAL collection as gzipped JSONL files. Each line contains one EAL instance as JSON. The JSON format is documented here. Instead of unzipping the files, we recommend opening the jsonl.gz files with a gzip stream.
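For example, a minimal Python sketch for streaming the instances (the file name below is a placeholder, not the released file name):

```python
import gzip
import json

# Stream EAL instances straight from the gzipped JSONL file instead of
# unpacking it on disk; the file name is a placeholder.
with gzip.open("eal-collection.jsonl.gz", "rt", encoding="utf-8") as stream:
    for line in stream:
        instance = json.loads(line)  # one EAL instance per line
        # ... work with the instance here ...
```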
We provide each feature in the form of a TREC run file, where the EAL-instance ID is the query and the aspect ID is the document. In this work we generate one feature from the score field of the run file. These feature run files are located in baselines/features-paragraph and baselines/features-sentence.
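A rough sketch of how one such run file could be loaded into a feature table, assuming the standard six-column TREC run layout:

```python
from collections import defaultdict

def read_feature_run(path):
    """Read one feature from a TREC run file.

    Assumes the usual six-column layout
    'query Q0 document rank score run_name', where the query column holds
    the EAL-instance ID and the document column holds the aspect ID.
    """
    feature = defaultdict(dict)
    with open(path, encoding="utf-8") as run_file:
        for line in run_file:
            eal_id, _q0, aspect_id, _rank, score, _name = line.split()
            feature[eal_id][aspect_id] = float(score)
    return feature
```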
Features are combined with learning to rank, training on the train-small subset. Trained models are located in baselines/experiment-*/trained-models/. The file naming scheme is $train--$featureset--$model.model.
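The same double-dash convention is used for the run and eval files described below, so the components can be recovered with a simple split. A small illustrative sketch (the example file name is made up):

```python
from pathlib import Path

def parse_model_name(path):
    """Unpack a trained-model file name of the form
    $train--$featureset--$model.model into its three components."""
    train, featureset, model = Path(path).stem.split("--")
    return train, featureset, model

# e.g. parse_model_name("train-small--all-features--listnet.model")
# (this file name is invented for illustration only)
```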
The resulting run files are located in baselines/experiment-*/runs-paragraph and baselines/runs-sentence. The file naming scheme is $train--$test--$model.run. We include results of two list-wise learning-to-rank toolkits:
The quality of the resulting run files is evaluated with trec_eval -c -q -m all_trec, including query-by-query results to compute standard error (and other analyses). The evaluation files are located in baselines/experiment-*/eval-paragraph and baselines/experiment-*/eval-sentence. The file naming scheme is $train--$test--$model.eval.
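Because the per-query rows are included, the standard error of a measure can be computed directly from an .eval file. A minimal sketch, assuming the usual three-column 'measure query value' layout of trec_eval -q output and MAP as the measure of interest:

```python
import math

def per_query_scores(eval_path, measure="map"):
    """Collect query-by-query values of one measure from a trec_eval -q
    output file, where each line reads 'measure  query  value'."""
    scores = []
    with open(eval_path, encoding="utf-8") as eval_file:
        for line in eval_file:
            parts = line.split()
            # skip malformed lines and the aggregate row (query id 'all')
            if len(parts) == 3 and parts[0] == measure and parts[1] != "all":
                scores.append(float(parts[2]))
    return scores

def mean_and_standard_error(scores):
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)
    return mean, math.sqrt(variance / len(scores))
```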
We provide the fielded corpus statistics used for our BM25 and TF-IDF models. These were created from a random sample of 200k Wiki-2020 pages, tokenized and lemmatized with Stanford's CoreNLP version 3.9.2 (the same tokenizer used by Nanni et al.).
The first two lines of the corpus statistics contain the following meta information:
The corpus statistics are located in baselines/corpus_stats.csv.
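As a hint at how these statistics enter the ranking models, here is a generic Okapi-style BM25 sketch. The document frequencies, document count, and average length would come from baselines/corpus_stats.csv, whose exact column layout is not repeated here, so loading is left out; the parameter defaults (k1=1.2, b=0.75) are illustrative rather than the settings used in our experiments:

```python
import math

def bm25_term_weight(tf, df, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi-style BM25 weight of one term in one document (or field).

    tf          term frequency in the document
    df          document frequency of the term, from the corpus statistics
    num_docs    number of documents in the statistics sample
    doc_len     length of the document
    avg_doc_len average document length, from the corpus statistics
    """
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))  # common idf variant
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

def bm25_score(query_terms, doc_term_freqs, df_lookup, num_docs, avg_doc_len):
    """Sum BM25 weights over the query terms; doc_term_freqs maps term -> tf."""
    doc_len = sum(doc_term_freqs.values())
    return sum(
        bm25_term_weight(doc_term_freqs.get(t, 0), df_lookup.get(t, 0),
                         num_docs, doc_len, avg_doc_len)
        for t in query_terms
        if df_lookup.get(t, 0) > 0
    )
```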
To facilitate comparison, we offer a re-released version of Nanni’s 201 EAL benchmark using our jsonl.gz format. Information not available in the original nanni-201 datasets is left empty (e.g., entity offsets).
Corpus statistics for Nanni’s 201 test set are created from all EAL instances.
The dataset can be used in one of the following experimental setups:
- nanni-201
- trained on train-small and/or train-remaining; then tested on nanni-201
- trained on train-small and train-remaining; then tested on nanni-201
- trained on train-remaining; then tested on nanni-201
Entity-aspect-linking-2020 by Jordan Ramsdell and Laura Dietz is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at http://trec-car.cs.unh.edu/datareleases/v2.4-release.html, a work at www.wikipedia.org, and a work at https://federiconanni.com/entity-aspect-linking/.