Rank-lips is designed to combine different retrieval models in a supervised way, also known as “learning to rank”. We interface with the most commonly used “trec_eval run file format”, which represents rankings for a set of benchmark queries in one large file.
Our implementation of the MAP (mean-average precision) evaluation metric is identical to the one in trec_eval.
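For reference, the standard definition of this metric (with R denoting the number of relevant documents in the qrels for query q) is:
AP(q) = (1/R) * sum over ranks k at which a relevant document appears of Precision@k
MAP = mean of AP(q) over all benchmark queries q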
A model is trained from features (given as input run files) and a ground truth (given as qrels). The filename of each input run is used as the feature name. Up to two features are derived from one line in a run file: the rank score (feature variant FeatScore) and/or the inverse rank (feature variant FeatRecipRank).
Command: rank-lips train -d $TRAIN_FEATURE_DIR -f RUN_A -f RUN_B -f RUN_C -q QRELS -O $OUT_DIR -o OUT_PREFIX -e "experiment 1"
Missing features will be set to 0; please see below for fine-grained control over default feature values.
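For reference, a line in a trec_eval-style run file has the form "query_id Q0 doc_id rank score run_name", and a qrels line has the form "query_id 0 doc_id relevance". As a hypothetical illustration, suppose the file RUN_A contains the line
Q101 Q0 doc42 3 12.7 my-method
and the qrels file contains
Q101 0 doc42 1
From the run line, rank-lips would derive a score feature value of 12.7 (FeatScore) and, presumably, an inverse-rank value of 1/3 ≈ 0.33 (FeatRecipRank) for the feature named RUN_A.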
By default, both feature variants are included. You can restrict to a single feature variant with:
--feature-variant FeatScore : use only the score
--feature-variant FeatRecipRank : use only the inverse of the rank
You can enable z-score normalization with --z-score. If z-score normalization is activated during training, it will automatically be applied during prediction, using the mean and standard deviation of the training set.
In general, whichever settings were enabled during training will automatically be applied during prediction (they are stored in the model file).
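For example, to train on the score variant only, with z-score normalization enabled (paths and run files as in the command above):
Command: rank-lips train -d $TRAIN_FEATURE_DIR -f RUN_A -f RUN_B -f RUN_C -q QRELS -O $OUT_DIR -o OUT_PREFIX -e "experiment 1" --feature-variant FeatScore --z-score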
If cross-validation is enabled, the data will be automatically split into folds (five by default); a model is trained for each fold and used to predict on that fold's test data.
Command: rank-lips train [...] --train-cv
You can specify the number of folds with --folds K (default: 5).
If you intend to re-use the trained per-fold models, you will have to take precautions to only apply them to held-out queries (i.e., those not used for training). For verification, rank-lips can export the holdout queries along with the model using the option --save-heldout-queries-in-model
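For example, a 10-fold cross-validation run that also archives the held-out queries in each per-fold model (a hypothetical combination of the flags documented above):
Command: rank-lips train -d $TRAIN_FEATURE_DIR -f RUN_A -f RUN_B -f RUN_C -q QRELS -O $OUT_DIR -o OUT_PREFIX -e "experiment 1" --train-cv --folds 10 --save-heldout-queries-in-model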
Once a rank-lips model is trained, you can use it to predict rankings on a separate test set.
The model file format has changed compared to version 1.0. To load v1.0 models, include the command line flag --is-v10-model.
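For example (the predict command and its arguments are described below; OLD_MODEL_V10 is a placeholder for a version 1.0 model file):
Command: rank-lips predict -m OLD_MODEL_V10 -d FEATURE_DIR -f RUN_A -f RUN_B -f RUN_C -O OUT_DIR -o OUT_PREFIX --is-v10-model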
Rank-lips’ training procedure can be adjusted to your needs through command line parameters. The parameter settings are included for archival purposes in the JSON of a rank-lips model.
By default, rank-lips will use 5 restarts and 5 folds. You can change this with:
-r,--restarts N : number of restarts per fold/model (the model with the best training performance will be chosen), default: 5
--folds K : number of folds (cross-validation only), default: 5
Rank-lips uses coordinate ascent to find the parameters of a linear model that yield the best MAP score on the training set. To detect convergence, we use the relative change in MAP score from epoch to epoch and stop when the relative change is less than FACTOR. An initial number of epochs can be dropped, and an upper limit on the number of iterations can be set.
--convergence-threshold FACTOR : being converged means that the relative change in MAP between iterations is less than FACTOR, default: 0.1
--convergence-max-iter ITER : maximum number of iterations after which training is stopped (use to avoid endless loops), default: 1000
--convergence-drop-initial-iterations ITER : number of initial iterations to disregard before convergence is monitored, default: 0
Whenever the training data is large (e.g., beyond 100 queries with 1000 documents), one training epoch may take a very long time without offering much utility to the parameter optimization. By default, rank-lips will use mini-batches of SIZE training queries to determine the gradient; every STEPS epochs, a new batch of queries will be chosen. (This technique is a form of stochastic gradient descent.) To diagnose convergence, the MAP score on the full training set is used. Since this evaluation can be potentially expensive, it is skipped for EVAL many mini-batch iterations.
--mini-batch-size SIZE : number of mini-batch training queries, default: 100
--mini-batch-steps STEPS : iterations per mini-batch, default: 1
--mini-batch-eval EVAL : number of mini-batches before the next training evaluation, default: 0
We ask you to set a friendly name for the experiment being conducted. This experiment name will be archived in the model file.
-e,--experiment FRIENDLY_NAME : experiment name (will be archived in the model file)
Multi-threading is enabled with:
--threads J : enable multi-threading with J threads, default: 1
If you train a model on features and then use it to predict on the identical feature set, you will obtain the training MAP score; please do not use such numbers in your paper. Only report numbers on a held-out test set.
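Putting the training parameters together, a training invocation might look like the following (a hypothetical combination of the flags documented above; paths and run files as in the earlier example):
Command: rank-lips train -d $TRAIN_FEATURE_DIR -f RUN_A -f RUN_B -f RUN_C -q QRELS -O $OUT_DIR -o OUT_PREFIX -e "experiment 1" --restarts 10 --convergence-threshold 0.01 --convergence-max-iter 500 --mini-batch-size 50 --threads 4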
Command: rank-lips predict -m MODEL -d FEATURE_DIR -f RUN_A -f RUN_B -f RUN_C [-q QRELS] -O OUT_DIR -o OUT_PREFIX
For the case where not all features are defined for all query/document combinations, it is strongly recommended to set a default feature value, which will be used to fill in missing feature values.
We offer three mutually exclusive default feature modes:
--default-any-feature-value VALUE : when any feature is missing for a query/document pair, this value will be used as the feature value
--default-feature-variant-value KEY=VALUE : default value for each feature variant in KEY=VALUE format without spaces, example: --default-feature-variant-value FeatScore=-9999.999
--default-feature-value KEY=VALUE : default value for each feature in FNAME-FVariant=VALUE format without spaces, example: --default-feature-value FeatureA-FeatScore=-9999.999
To set default values for multiple features / feature variants, please repeat the parameter, e.g. --default-feature-value FeatureA-FeatScore=-10.0 --default-feature-value FeatureB-FeatScore=0.0
If no default feature option is passed on the command line, rank-lips will set any missing feature to 0.0.
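For example, a prediction run that fills in missing score features with a large negative default value (a hypothetical combination of the flags documented above):
Command: rank-lips predict -m MODEL -d FEATURE_DIR -f RUN_A -f RUN_B -f RUN_C -O OUT_DIR -o OUT_PREFIX --default-feature-variant-value FeatScore=-9999.999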