CRAAM
2.0.0
Robust and Approximate Markov Decision Processes
Craam is a header-only C++ library for solving Markov decision processes with regular, robust, or optimistic objectives. The optimistic objective is the opposite of the robust one: nature chooses the best possible realization of the uncertain values. The library also provides tools for basic simulation, for constructing MDPs from samples, and for value function approximation. Supported objectives are infinite-horizon discounted MDPs, finite-horizon MDPs, and basic stochastic shortest path methods [Puterman2005]. The library assumes maximization over actions, and the numbers of states and actions must be finite.
The library is built around two main data structures: MDP and RMDP. MDP is the standard model that consists of states $\mathcal{S}$ and actions $\mathcal{A}$. The robust solution for an MDP would satisfy, for example, the following Bellman optimality equation:
\[ v(s) = \max_{a \in \mathcal{A}} \min_{p \in \Delta} \left\{ \sum_{s'\in\mathcal{S}} p(s') ( r(s,a,s') + \gamma \, \, v(s') ) ~:~ \|p - P(s,a,\cdot) \| \le \psi, \; p \ll P(s,a,\cdot) \right\}~. \]
Note that $p$ is constrained to be absolutely continuous with respect to $P(s,a,\cdot)$. This is a hard requirement for all choices of ambiguity (or uncertainty) sets.
The RMDP model adds a set of outcomes that model possible actions that can be taken by nature. In that case, the robust solution may, for example, satisfy the following Bellman optimality equation:
\[ v(s) = \max_{a \in \mathcal{A}} \min_{o \in \mathcal{O}} \sum_{s'\in\mathcal{S}} P(s,a,o,s') ( r(s,a,o,s') + \gamma \, v(s') ) ~. \]
Using outcomes makes it more convenient to capture correlations between the ambiguity in rewards and the uncertainty in transition probabilities. It also makes it much easier to represent uncertainties that lie in low-dimensional vector spaces. The equation above uses the worst outcome, but in general, distributions over outcomes are supported.
The available algorithms are value iteration and modified policy iteration. The library supports both the plain worst-case outcome method and a worst case with respect to a base distribution.
See the README.rst. See the online documentation or generate it locally as described above.
Unit tests provide some examples of how to use the library. For simple end-to-end examples, see tests/benchmark.cpp and test/dev.cpp; the targets BENCH and DEV build them, respectively.
The main models supported are:

- craam::MDP: a plain MDP with no specific definition of ambiguity (it can nevertheless be used to compute robust solutions)
- craam::RMDP: an augmented model that adds nature's actions (so-called outcomes) to the model for convenience
- craam::impl::MDPIR: an MDP with implementability constraints; see [Petrik2016]

The regular value-function-based methods are in the header algorithms/values.hpp and the robust versions are in the header algorithms/robust_values.hpp. There are four main value-function-based methods:
Method | Algorithm |
---|---|
solve_vi | Gauss-Seidel value iteration; runs in a single thread. |
solve_mpi | Jacobi modified policy iteration; parallelized with OpenMP. Modified policy iteration is generally vastly more efficient than value iteration. |
rsolve_vi | Like solve_vi above, but also supports robust, risk-averse, or optimistic objectives. |
rsolve_mpi | Like solve_mpi above, but also supports robust, risk-averse, or optimistic objectives. |
These methods can be applied to either an MDP or an RMDP.
The header algorithms/occupancies.hpp provides tools for converting an MDP to a transition matrix and for computing occupancy frequencies.
There are tools for building simulators and sampling from simulations in the header Simulation.hpp, and methods for handling samples in Samples.hpp.
The following is a simple example of formulating and solving a small MDP.
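The example code itself is not included on this page; the sketch below shows the general shape such an example takes. The header paths, the `add_transition` helper, the `solve_mpi` signature, and the `valuefunction` field are assumptions based on the method table above, and should be checked against the library's headers before use:

```cpp
// Hypothetical sketch -- header paths and exact signatures are assumptions,
// not verified against the library.
#include "craam/RMDP.hpp"
#include "craam/algorithms/values.hpp"

#include <iostream>

int main() {
    craam::MDP mdp;

    // A two-state MDP: add_transition(mdp, from, action, to, prob, reward).
    craam::add_transition(mdp, 0, 0, 0, 0.5, 1.0);
    craam::add_transition(mdp, 0, 0, 1, 0.5, 2.0);
    craam::add_transition(mdp, 1, 0, 0, 1.0, 0.0);

    // Solve with Jacobi modified policy iteration (see the table above).
    auto solution = craam::algorithms::solve_mpi(mdp, 0.9);

    for (auto v : solution.valuefunction)
        std::cout << v << std::endl;
    return 0;
}
```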
To compile the example, point the compiler at the CRAAM headers; since the library is header-only, no additional linking is required.
[Filar1997] Filar, J., & Vrieze, K. (1997). Competitive Markov decision processes. Springer.
[Puterman2005] Puterman, M. L. (2005). Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons.
[Iyengar2005] Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research, 30(2), 257-280.
[Petrik2014] Petrik, M., & Subramanian, S. (2014). RAAM: The benefits of robustness in approximating aggregated MDPs in reinforcement learning. In Neural Information Processing Systems (NIPS).
[Petrik2016] Petrik, M., & Luss, R. (2016). Interpretable Policies for Dynamic Product Recommendations. In Uncertainty in Artificial Intelligence (UAI).