Rank-Lips: Usage

Rank-lips is designed to combine different retrieval models in a supervised way, also known as "learning to rank". We interface with the most commonly used trec_eval run-file format, which represents rankings for a set of benchmark queries in one large file.

Our implementation of the MAP (mean-average precision) evaluation metric is identical to the one in trec_eval.
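To illustrate the metric (this is a sketch of the standard trec_eval definition, not rank-lips' actual Haskell implementation): average precision for one query is the mean of the precision values at each rank where a relevant document appears, divided by the total number of relevant documents; MAP averages this over all queries.

```python
def average_precision(ranking, relevant):
    """AP for one query.

    `ranking` is a list of document ids in rank order; `relevant` is the
    set of relevant document ids from the qrels. As in trec_eval, the
    sum of precisions at relevant ranks is divided by the total number
    of relevant documents.
    """
    hits = 0
    precision_sum = 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(rankings, qrels):
    """MAP: mean of per-query AP over all queries in the qrels."""
    return sum(average_precision(rankings.get(q, []), rel)
               for q, rel in qrels.items()) / len(qrels)
```

For example, a ranking [d1, d2, d3] with relevant set {d1, d3} scores AP = (1/1 + 2/3) / 2 = 5/6.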

Training

A model is trained from features (as input run files) and a ground truth (as qrels). The filename of each input run will be used as a feature name. Up to two features are derived from one line in the run file: the rank score (feature variant FeatScore) and/or the reciprocal rank (FeatRecipRank).
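The derivation of the two feature variants can be sketched as follows (an illustration only, assuming the standard six-field trec_eval run-file layout of query, literal "Q0", document, rank, score, and run name):

```python
def features_from_run_line(line):
    """Derive both feature variants from one trec_eval run-file line.

    Assumed field layout: query Q0 document rank score run-name.
    """
    query, _q0, doc, rank, score, _run = line.split()
    return query, doc, {
        "FeatScore": float(score),         # the retrieval score, as-is
        "FeatRecipRank": 1.0 / int(rank),  # reciprocal of the rank
    }
```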

Command: rank-lips train -d $TRAIN_FEATURE_DIR -f RUN_A -f RUN_B -f RUN_C -q QRELS -O $OUT_DIR -o OUT_PREFIX -e "experiment 1"

Missing features will be set to 0; please see below for fine-grained control over default feature values.

By default, both feature variants are included. You can restrict the run to a single feature variant via a command-line option.

You can enable z-score normalization with --z-score. If z-score normalization is activated during training, it will automatically be applied during prediction, using the mean and standard deviation of the training set.
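A minimal sketch of this normalization (assuming population statistics; rank-lips may use the sample standard deviation instead): statistics are fit on the training set and then reused unchanged at prediction time.

```python
import statistics


def zscore_params(train_values):
    """Fit mean and (population) standard deviation on the training set."""
    return statistics.mean(train_values), statistics.pstdev(train_values)


def apply_zscore(values, mean, stdev):
    """Normalize a feature column with the *training* statistics.

    A zero stdev (constant feature) maps everything to 0.0 to avoid
    division by zero.
    """
    if stdev == 0.0:
        return [0.0 for _ in values]
    return [(v - mean) / stdev for v in values]
```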

In general, whichever settings were enabled during training will automatically be applied during prediction (they are stored in the model file).

Training with Cross Validation

If cross-validation is enabled, the data will be automatically split into five folds, training a model per fold and predicting on that fold's held-out test queries.

Command: rank-lips train [...] --train-cv

You can specify the number of folds used with --folds K (default is 5).

If you intend to re-use the trained per-fold models, you will have to take precautions to only apply them to held-out queries (i.e., those not used for training). For verification, rank-lips can export the held-out queries along with the model using the option --save-heldout-queries-in-model.

Prediction

Once a rank-lips model is trained, you can use it to predict rankings on a separate test set.

The model file format has changed compared to version 1.0. To load v1.0 models, include the command line flag --is-v10-model.

Optimization Parameters for Training

Rank-lips’ training procedure can be adjusted to your needs through command-line parameters. The parameter settings are archived in the JSON of the rank-lips model file.

By default, rank-lips will use 5 restarts and 5 folds. You can change these with command-line options.

Rank-lips uses coordinate ascent to find the parameters of a linear model that yield the best MAP score on the training set. To detect convergence, we use the relative change in MAP score from epoch to epoch, and stop when the relative change is less than FACTOR. An initial number of epochs can be dropped, and an upper limit on the number of iterations can be set.
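The convergence test can be sketched as follows (an illustration of the criterion described above; the variable names are assumptions, with `factor` standing in for FACTOR):

```python
def has_converged(prev_map, curr_map, factor=1e-4):
    """Stop when the relative epoch-to-epoch MAP change drops below
    `factor` (FACTOR in the text)."""
    if prev_map == 0.0:
        # No baseline yet; keep optimizing.
        return False
    return abs(curr_map - prev_map) / prev_map < factor
```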

Whenever the training data is large (e.g., beyond 100 queries with 1000 documents each), one training epoch may take a very long time without offering much utility to the parameter optimization. By default, rank-lips will use mini-batches of SIZE training queries to determine the gradient; every STEPS epochs, a new batch of queries will be chosen. (This technique is a form of stochastic gradient descent.) To diagnose convergence, the MAP score on the full training set will be used. Since this evaluation can be expensive, it is skipped for EVAL many mini-batch iterations.
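The mini-batch schedule can be sketched like this (illustration only; `size` and `steps` stand in for SIZE and STEPS, and the sampling strategy is an assumption):

```python
import random


def minibatch_schedule(queries, size, steps, epochs, seed=0):
    """Yield (epoch, batch) pairs: every `steps` epochs, sample a fresh
    mini-batch of `size` training queries to determine the gradient."""
    rng = random.Random(seed)
    batch = None
    for epoch in range(epochs):
        if epoch % steps == 0:
            batch = rng.sample(queries, min(size, len(queries)))
        yield epoch, batch
```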

We ask you to set a friendly name for the experiment being conducted. This experiment name will be archived in the model file.

Multi-threading can be enabled with a command-line option.

If you train a model on a feature set and then use it to predict on the identical feature set, you will obtain the training MAP score; please do not use such numbers in your paper. Only report numbers on a held-out test set.

Command: rank-lips predict -m MODEL -d FEATURE_DIR -f RUN_A -f RUN_B -f RUN_C [-q QRELS] -O OUT_DIR -o OUT_PREFIX

Default Feature Values

When not all features are defined for all query/document combinations, it is strongly recommended to set a default feature value, which will be used to fill in the missing feature values.

We offer three mutually exclusive default feature modes:

To set default values for multiple features / feature-variants, please repeat the parameter, e.g. --default-feature-value FeatureA-FeatScore=-10.0 --default-feature-value FeatureB-FeatScore=0.0

If no default feature option is passed on the command line, rank-lips will set any missing feature to 0.0.
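The fill-in behavior can be sketched as follows (an illustration of the semantics described above, not rank-lips' actual code; `defaults` mirrors the per-feature values given via --default-feature-value):

```python
def fill_defaults(feature_vector, all_features, defaults=None):
    """Complete a query/document feature vector.

    Any feature missing from `feature_vector` takes its value from
    `defaults` (keyed by names like "FeatureA-FeatScore"); features
    without an entry fall back to 0.0, matching rank-lips' behavior
    when no default-feature option is passed.
    """
    defaults = defaults or {}
    return {f: feature_vector.get(f, defaults.get(f, 0.0))
            for f in all_features}
```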