Rank-Lips: Example

A small example to illustrate the use of [rank-lips](index.html).

Download and unpack the archive rank-lips-example.tar.gz, or set it up manually:

  1. Create a data directory and download the training/test qrels here.

  2. Create a subdirectory for train-features (FeatureA, FeatureB, FeatureC).

  3. Create a subdirectory for test-features (FeatureA, FeatureB, FeatureC).

Since the feature filenames need to be consistent between training and testing, you have to place train and test features in separate directories.
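
Each feature is a TREC-style run file whose filename doubles as the feature name. For illustration, train-features/FeatureA might begin like this (standard run format: query id, iteration, document id, rank, score, run name; the scores here are made up):

    Q1 Q0 doc1 1 12.5 FeatureA
    Q1 Q0 doc2 2 7.3 FeatureA
    Q2 Q0 doc4 1 3.1 FeatureA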

This example has three training queries; however, only queries with both positive and negative training data can be used to determine a gradient. In this example, Q3 has only one positive example and no negative training data, which is why rank-lips will remove it. (During prediction, such queries are included.)

The qrel file does not need to be complete; missing entries are treated as non-relevant (negative), just as in trec_eval.
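
For reference, the qrel format is the standard TREC one: query id, an unused field (0), document id, and relevance grade. A hypothetical train.qrel matching the description above could look like this; note that Q3 contributes only a positive entry:

    Q1 0 doc1 1
    Q1 0 doc2 0
    Q2 0 doc4 1
    Q2 0 doc5 0
    Q3 0 doc3 1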

The following commands are provided in run.sh.

Training

  1. Test with the command rank-lips train. For a full explanation of the command-line parameters, see rank-lips train --help.

    Command: rank-lips train -d ./train-features -q train.qrel -e 'my first rank-lips experiment' -O ./out -o train-try1 --z-score --default-any-feature-value 0.0 --convergence-threshold 0.001 --mini-batch-size 1000 --folds 2 --restarts 10 --save-heldout-queries-in-model --feature-variant FeatScore

  2. You will see information about training progress and restarts as well as the final training MAP score in the output.

    Output:
    [...]
    full restart 3 iteration 3, score 0.9166666666666666 -> 0.9166666666666666 rel 0.0
    full restart 4 iteration 1, score 0.725 -> 0.85 rel 0.14705882352941177
    full restart 4 iteration 2, score 0.85 -> 0.9166666666666666 rel 7.272727272727271e-2
    full restart 4 iteration 3, score 0.9166666666666666 -> 0.9166666666666666 rel 0.0
    Model train train metric 0.9166666666666666 MAP.
    Written model train to file "./out/train-try1-model-train.json" .
    Model train test metric 0.9166666666666666 MAP.
    dumped all models and rankings
    
  3. The $OUT directory will contain new files: the trained model train-try1-model-train.json and the predicted run on the training set, train-try1-run-train.run.

  4. Explore trained model weights by inspecting the model JSON.

      "rankLipsTrainedModel": {
        "FeatureC-FeatScore": -9.533877827456385,
        "FeatureB-FeatScore": 2.102726478972383,
        "FeatureA-FeatScore": -0.14793830970606084
      }
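
    If you have jq installed, a one-liner like the following sorts the weights by magnitude (a convenience sketch, assuming the weights sit under the rankLipsTrainedModel key shown above):

      jq '.rankLipsTrainedModel | to_entries | sort_by(.value * .value) | reverse' ./out/train-try1-model-train.json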

Predicting

  1. Test with the command rank-lips predict. Note that a slightly different set of parameters applies (see rank-lips predict --help).

    Command: rank-lips predict -d ./test-features -q test.qrel -O ./out -o predict-try1 -m ./out/train-try1-model-train.json --feature-variant FeatScore

  2. In the output you will see information about the test MAP score.

    Model predict test metric 0.41666666666666663 MAP.

  3. The $OUT directory will contain a predicted test run file, test-run-predict.run. Each line gives the query id, the iteration marker (Q0), the document id, the rank, the score, and the run name:

    Q10 Q0 doc2 1 1.731316541796714 l2r predict
    Q10 Q0 doc3 2 1.7202359601271957 l2r predict
    Q10 Q0 doc1 3 0.6309246349602737 l2r predict
    Q11 Q0 doc6 1 1.8836096930688682 l2r predict
    Q11 Q0 doc7 2 1.8808550789709522 l2r predict
    Q11 Q0 doc8 3 1.8359348755751346 l2r predict
    Q11 Q0 doc9 4 1.7806539721177292 l2r predict
    Q11 Q0 doc10 5 1.7072597968074712 l2r predict
    Q11 Q0 doc5 6 1.4889951940566715 l2r predict
    Q11 Q0 doc4 7 1.4024090651371584 l2r predict
  4. Verify the MAP score with trec_eval -m map test.qrel out/test-run-predict.run:

    map                     all     0.4167
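
    To break this down by query, you can additionally pass trec_eval's -q flag, which prints per-query AP lines before the overall average:

      trec_eval -q -m map test.qrel out/test-run-predict.run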

Cross-Validation

The example has only two usable training queries; therefore, only two folds are used (one query each).

  1. You can also train with cross-validation enabled using the command rank-lips train -O $OUT -o "cv-example" -d $TRAIN_FEATURES -q $QREL --train-cv; this will perform normal training AND cross-validated training.

Command: rank-lips train --train-cv -d ./train-features -q train.qrel -e 'my first rank-lips experiment' -O ./out -o cv-try1 --z-score --default-any-feature-value 0.0 --convergence-threshold 0.001 --mini-batch-size 1000 --folds 2 --restarts 10 --save-heldout-queries-in-model --feature-variant FeatScore

  2. The output contains information about both the regular training (training MAP) and the cross-validated MAP ("test-test"):

    [...]
    full restart 4 iteration 1, score 0.875 -> 0.9166666666666666 rel 4.5454545454545414e-2
    full restart 4 iteration 2, score 0.9166666666666666 -> 0.9166666666666666 rel 0.0
    FoldIdx 1 restart 4 iteration 2, score 1.0 -> 1.0 rel 0.0
    Model fold-1-best test metric 0.45 MAP.
    Model fold-1-best train metric 1.0 MAP.
    Written model fold-1-best to file "./out/cv-try1-model-fold-1-best.json" .
    full restart 4 iteration 3, score 0.9166666666666666 -> 0.9166666666666666 rel 0.0
    Model train test metric 0.9166666666666666 MAP.
    Model test test metric 0.475 MAP.
    Model train train metric 0.9166666666666666 MAP.
    Written model train to file "./out/cv-try1-model-train.json" .
    dumped all models and rankings

  3. The predicted ranking:

    Q1 Q0 doc3 1 0.8124722217132947 l2r test
    Q1 Q0 doc2 2 0.3361474803069571 l2r test
    Q1 Q0 doc1 3 -2.3958635914467803 l2r test
    Q2 Q0 doc9 1 0.6736768036795666 l2r test
    Q2 Q0 doc8 2 0.5940152842882087 l2r test
    Q2 Q0 doc7 3 0.5721800884716158 l2r test
    Q2 Q0 doc6 4 0.34189284390689123 l2r test
    Q2 Q0 doc5 5 0.22965047894959037 l2r test
    Q2 Q0 doc4 6 -0.23827775903504964 l2r test

  4. Training MAP and CV test MAP can be seen in the output:

    Model train train metric 0.9166666666666666 MAP.
    Model test test metric 0.39166666666666666 MAP.

  5. A comparison with trec_eval yields the same result:

    trec_eval -m map train.qrel ./out/cv-try1-run-test.run
    map                     all     0.3917

  6. Please note: in cross-validation mode, only queries with both positive and negative instances are used (in our example, Q3 is dropped). This also affects test performance: trec_eval -c averages over all queries in the qrels, counting dropped queries as zero, so the MAP becomes 0.3917 × 2/3 ≈ 0.2611:

    trec_eval -c -m map train.qrel ./out/cv-try1-run-test.run
    map                     all     0.2611

  7. You will find three more models in the $OUT directory: one trained on the whole set (*-model-train.json) and one for each of the two folds (*-model-fold-$k-best.json). You will also find more run files (one per model, *-run-fold-$k-best.run) and a cross-validated run file for testing, *-run-test.run.

More Options

  1. Enabling only features A and B with -f FeatureA -f FeatureB (these are filenames in the feature directory). The reported feature dimension is the number of selected features times the number of feature variants, here 2 × 1 = 2.

Command: rank-lips train -d ./train-features -q train.qrel -e 'my first rank-lips experiment' -O ./out -o train-try-feature-subset --z-score --convergence-threshold 0.001 --mini-batch-size 1000 -f FeatureA -f FeatureB --feature-variant FeatScore

    loadRunFiles FeatureA FeatureB
    Feature dimension: 2
  2. Disabling z-score normalization during training may lead to a less resilient model. (To disable, delete the --z-score option from the command above.)

Command: rank-lips train -d ./train-features -q train.qrel -e 'my first rank-lips experiment' -O ./out -o train-try-feature-subset --convergence-threshold 0.001 --mini-batch-size 1000 -f FeatureA -f FeatureB --feature-variant FeatScore

  "rankLipsTrainedModel": {
    "FeatureA-FeatRecipRank": 2.0027944151298925e-05,
    "FeatureB-FeatRecipRank": 2.0027944151298925e-05,
    "FeatureC-FeatScore": -2.309760786659935,
    "FeatureC-FeatRecipRank": -0.7356527425044204,
    "FeatureB-FeatScore": 5.3752568906776315,
    "FeatureA-FeatScore": -0.009620874528474738
  }
  1. Enabling different feature variants with --feature-variant FeatScore --feature-variant FeatRecipRank

Command: /rank-lips train -d ./train-features -q train.qrel -e 'my first rank-lips experiment' -O ./out -o train-try-feature-variants --z-score --convergence-threshold 0.001 --mini-batch-size 1000 --feature-variant FeatScore --feature-variant FeatRecipRank

  4. Explore changing the training parameters, such as the convergence threshold and mini-batch size, e.g. --convergence-threshold 0.0001 --mini-batch-size 1.

We recommend using a validation set to choose these parameters, trading off speed against performance. In this example we chose a mini-batch of a single query (1) because we only have two training queries in total; typical mini-batch sizes are 10, 100, or 1000.

Command: rank-lips train -d ./train-features -q train.qrel -e 'my first rank-lips experiment' -O ./out -o train-try-convergence-mini-batch --z-score --convergence-threshold 0.0001 --mini-batch-size 1 --feature-variant FeatScore --feature-variant FeatRecipRank

  5. In the case where not all features are defined for all query/document combinations, we highly recommend at least setting a default value that is used for any feature (--default-any-feature-value). Finer control over defaults can be exercised per feature variant…

Command: rank-lips train -d ./train-features-with-missing -q train.qrel -e 'my first rank-lips experiment' -O ./out -o train-missing-features-variants --feature-variant FeatScore --feature-variant FeatRecipRank --default-feature-variant-value FeatRecipRank=100.0 --default-feature-variant-value FeatScore=-9999.99

… or per-feature (which is a combination of run filename and feature variant).

Command: rank-lips train -d ./train-features-with-missing -q train.qrel -e 'my first rank-lips experiment' -O ./out -o train-missing-features --feature-variant FeatScore -f FeatureA -f FeatureB --default-feature-value FeatureA-FeatScore=-10.0 --default-feature-value FeatureB-FeatScore=0.0
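
For illustration, "missing" here simply means that a feature file has no line for some query/document pair. If a hypothetical train-features-with-missing/FeatureA contained only

    Q1 Q0 doc1 1 12.5 FeatureA

then for every other query/document pair the feature FeatureA-FeatScore would fall back to the configured default (-10.0 in the command above).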