CRAAM  2.0.0
Robust and Approximate Markov Decision Processes
CRAAM Documentation

Introduction

Craam is a header-only C++ library for solving Markov decision processes with regular, robust, or optimistic objectives. The optimistic objective is the opposite of the robust one: nature chooses the best possible realization of the uncertain values. The library also provides tools for basic simulation, for constructing MDPs from samples, and for value function approximation. The supported objectives are infinite-horizon discounted MDPs, finite-horizon MDPs, and stochastic shortest path problems [Puterman2005]. The library assumes maximization over actions, and the numbers of states and actions must be finite.

The library is built around two main data structures: MDP and RMDP. MDP is the standard model that consists of states $\mathcal{S}$ and actions $\mathcal{A}$. The robust solution of an MDP satisfies, for example, the following Bellman optimality equation:

\[ v(s) = \max_{a \in \mathcal{A}} \min_{p \in \Delta} \left\{ \sum_{s'\in\mathcal{S}} p(s') ( r(s,a,s') + \gamma \, \, v(s') ) ~:~ \|p - P(s,a,\cdot) \| \le \psi, \; p \ll P(s,a,\cdot) \right\}~. \]

Note that $p$ is constrained to be absolutely continuous with respect to $P(s,a,\cdot)$. This is a hard requirement for all choices of the ambiguity (or uncertainty) set.
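
For instance, when the norm above is the $L_1$ norm, the inner minimization becomes a linear program over distributions supported on the nominal transition. This is only an illustrative instance of the general equation above, written with $z(s')$ abbreviating the discounted value of reaching $s'$:

\[ \min_{p \in \Delta} \left\{ \sum_{s'\in\mathcal{S}} p(s')\, z(s') ~:~ \sum_{s'\in\mathcal{S}} |p(s') - P(s,a,s')| \le \psi, \; p(s') = 0 \text{ whenever } P(s,a,s') = 0 \right\}, \qquad z(s') = r(s,a,s') + \gamma\, v(s') ~. \]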

The RMDP model adds a set of outcomes that model the possible actions taken by nature. In that case, the robust solution may, for example, satisfy the following Bellman optimality equation:

\[ v(s) = \max_{a \in \mathcal{A}} \min_{o \in \mathcal{O}} \sum_{s'\in\mathcal{S}} P(s,a,o,s') ( r(s,a,o,s') + \gamma \, v(s') ) ~. \]

Using outcomes makes it more convenient to capture correlations between the ambiguity in rewards and the uncertainty in transition probabilities. It also make it much easier to represent uncertainties that lie in small-dimensional vector spaces. The equation above uses the worst outcome, but in general distributions over outcomes are supported.
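
The following is a minimal sketch of building an RMDP in which each state-action pair has two outcomes. It assumes that RMDP can be constructed with a state count in the same way as MDP in the example further below, and that modeltools.hpp provides an add_transition overload taking an outcome index with the argument order (model, from state, action, outcome, to state, probability, reward); verify both against the headers before relying on this sketch.

#include "craam/RMDP.hpp"
#include "craam/modeltools.hpp"
using namespace craam;

int main(){
    // two states; each state-action pair has two outcomes that nature chooses between
    RMDP rmdp(2);

    // state 0, action 0: the outcomes differ in the reward
    add_transition(rmdp, 0, 0, 0, 1, 1.0, 0.0);  // outcome 0: move to state 1, reward 0
    add_transition(rmdp, 0, 0, 1, 1, 1.0, 1.0);  // outcome 1: move to state 1, reward 1

    // state 1, action 0: the outcomes differ in the transition
    add_transition(rmdp, 1, 0, 0, 0, 1.0, 0.5);  // outcome 0: return to state 0
    add_transition(rmdp, 1, 0, 1, 1, 1.0, 0.5);  // outcome 1: stay in state 1

    // a robust solution would then be computed with rsolve_vi or rsolve_mpi
    // (see algorithms/robust_values.hpp), with nature minimizing over the outcomes
    return 0;
}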

The available algorithms are value iteration and modified policy iteration. The library supports both the plain worst-case outcome and a worst case with respect to a base distribution.

Installation and Build Instructions

See the README.rst

Getting Started

See the online documentation or generate it locally as described above.

Unit tests provide some examples of how to use the library. For simple end-to-end examples, see tests/benchmark.cpp and test/dev.cpp; the targets BENCH and DEV build them, respectively.

The main models supported are the MDP and RMDP data structures described in the introduction.

The regular value-function based methods are in the header algorithms/values.hpp and the robust versions are in the header algorithms/robust_values.hpp. There are four main value-function based methods:

Method       Algorithm
solve_vi     Gauss-Seidel value iteration; runs in a single thread.
solve_mpi    Jacobi modified policy iteration; parallelized with OpenMP. Modified policy iteration is generally vastly more efficient than value iteration.
rsolve_vi    Like solve_vi, but also supports robust, risk-averse, or optimistic objectives.
rsolve_mpi   Like solve_mpi, but also supports robust, risk-averse, or optimistic objectives.

These methods can be applied to either an MDP or an RMDP.
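
As a sketch of how the robust variants are invoked, the code below solves the small MDP from the example further down with an L1-constrained nature. The nature specification algorithms::nats::robust_l1u and the exact signature of rsolve_mpi are assumptions here; check algorithms/robust_values.hpp for the actual names and argument order in your version of the library.

#include "craam/RMDP.hpp"
#include "craam/modeltools.hpp"
#include "craam/algorithms/robust_values.hpp"
#include <iostream>
using namespace craam;
using namespace std;

int main(){
    // same three-state MDP as in the example below
    MDP mdp(3);
    add_transition(mdp, 0, 0, 0, 1, 0);
    add_transition(mdp, 1, 0, 0, 1, 1);
    add_transition(mdp, 2, 0, 1, 1, 1);
    add_transition(mdp, 0, 1, 1, 1, 0);
    add_transition(mdp, 1, 1, 2, 1, 0);
    add_transition(mdp, 2, 1, 2, 1, 1.1);

    // ASSUMPTION: the nature specification below (an L1 ball of radius 0.2 around the
    // nominal transition probabilities) and the rsolve_mpi signature are guesses;
    // verify both against algorithms/robust_values.hpp
    auto&& sol = algorithms::rsolve_mpi(mdp, 0.9, algorithms::nats::robust_l1u(0.2));

    for(auto v : sol.valuefunction)
        cout << v << " ";
    return 0;
}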

The header algorithms/occupancies.hpp provides tools for converting the MDP to a transition matrix and computing the occupancy frequencies.

There are tools for building simulators and sampling from simulations in the header Simulation.hpp and methods for handling samples in Samples.hpp.

The following is a simple example of formulating and solving a small MDP.

#include "craam/RMDP.hpp"
#include "craam/modeltools.hpp"
#include "craam/algorithms/values.hpp"
#include <iostream>
#include <vector>
using namespace craam;
using namespace std;
int main(){
MDP mdp(3);
// transitions for action 0
add_transition(mdp,0,0,0,1,0);
add_transition(mdp,1,0,0,1,1);
add_transition(mdp,2,0,1,1,1);
// transitions for action 1
add_transition(mdp,0,1,1,1,0);
add_transition(mdp,1,1,2,1,0);
add_transition(mdp,2,1,2,1,1.1);
// solve using Jacobi value iteration
auto&& re = algorithms::solve_mpi(mdp,0.9);
for(auto v : re.valuefunction){
cout << v << " ";
}
return 0;
}

To compile the file, run:

$ g++ -fopenmp -std=c++14 -I<path_to_top_craam_folder> simple.cpp
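
For reference, the exact optimal value function of the MDP above is $v = (8.91, 9.9, 11)$: state 2 can earn the reward $1.1$ forever, so $v(2) = 1.1 / (1 - 0.9) = 11$, and the remaining values follow from the Bellman equation with discount $0.9$. The program should therefore print values close to 8.91 9.9 11, up to the solver's stopping tolerance.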

Supported Use Cases

  1. Formulate an uncertain MDP
  2. Compute a solution to an uncertain MDP
  3. Compute value of a fixed policy
  4. Compute an occupancy frequency
  5. Simulate transitions of an MDP
  6. Construct MDP from samples
  7. Simulate a general domain

General Assumptions

As noted in the introduction, the library maximizes over actions, and the numbers of states and actions must be finite.

References

[Filar1997] Filar, J., & Vrieze, K. (1997). Competitive Markov decision processes. Springer.

[Puterman2005] Puterman, M. L. (2005). Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons.

[Iyengar2005] Iyengar, G. N. G. (2005). Robust dynamic programming. Mathematics of Operations Research, 30(2), 1–29.

[Petrik2014] Petrik, M., & Subramanian, D. (2014). RAAM: The benefits of robustness in approximating aggregated MDPs in reinforcement learning. In Neural Information Processing Systems (NIPS).

[Petrik2016] Petrik, M., & Luss, R. (2016). Interpretable Policies for Dynamic Product Recommendations. In Uncertainty in Artificial Intelligence (UAI).