CRAAM
2.0.0
Robust and Approximate Markov Decision Processes
Craam is a header-only C++ library for solving Markov decision processes with regular, robust, or optimistic objectives. The optimistic objective is the opposite of the robust one: nature chooses the best possible realization of the uncertain values. The library also provides tools for basic simulation, for constructing MDPs from samples, and for value function approximation. Supported objectives are infinite-horizon discounted MDPs, finite-horizon MDPs, and basic stochastic shortest path methods [Puterman2005]. The library assumes maximization over actions, and the numbers of states and actions must be finite.
The library is built around two main data structures: MDP and RMDP. MDP is the standard model that consists of states $\mathcal{S}$ and actions $\mathcal{A}$. The robust solution for an MDP would satisfy, for example, the following Bellman optimality equation:
\[ v(s) = \max_{a \in \mathcal{A}} \min_{p \in \Delta} \left\{ \sum_{s'\in\mathcal{S}} p(s') ( r(s,a,s') + \gamma \, \, v(s') ) ~:~ \|p - P(s,a,\cdot) \| \le \psi, \; p \ll P(s,a,\cdot) \right\}~. \]
Note that $p$ is constrained to be absolutely continuous with respect to $P(s,a,\cdot)$. This is a hard requirement for all choices of ambiguity (or uncertainty) sets.
The RMDP model adds a set of outcomes that model possible actions that can be taken by nature. In that case, the robust solution may, for example, satisfy the following Bellman optimality equation:
\[ v(s) = \max_{a \in \mathcal{A}} \min_{o \in \mathcal{O}} \sum_{s'\in\mathcal{S}} P(s,a,o,s') ( r(s,a,o,s') + \gamma \, v(s') ) ~. \]
Using outcomes makes it more convenient to capture correlations between the ambiguity in rewards and the uncertainty in transition probabilities. It also makes it much easier to represent uncertainties that lie in low-dimensional vector spaces. The equation above uses the worst outcome, but in general, distributions over outcomes are supported.
The available algorithms are value iteration and modified policy iteration. The library supports both the plain worst-case outcome method and a worst case with respect to a base distribution.
See the README.rst. See the online documentation or generate it locally as described above.
Unit tests provide some examples of how to use the library. For simple end-to-end examples, see tests/benchmark.cpp and test/dev.cpp; the targets BENCH and DEV build them, respectively.
The main models supported are:

- craam::MDP: a plain MDP with no specific definition of ambiguity (it can nevertheless be used to compute robust solutions)
- craam::RMDP: an augmented model that adds nature's actions (so-called outcomes) to the model for convenience
- craam::impl::MDPIR: an MDP with implementability constraints; see [Petrik2016]

The regular value-function-based methods are in the header algorithms/values.hpp and the robust versions are in the header algorithms/robust_values.hpp. There are four main value-function-based methods:
Method | Algorithm |
---|---|
solve_vi | Gauss-Seidel value iteration; runs in a single thread. |
solve_mpi | Jacobi modified policy iteration; parallelized with OpenMP. Modified policy iteration is generally vastly more efficient than value iteration. |
rsolve_vi | Like solve_vi above, but also supports robust, risk-averse, or optimistic objectives. |
rsolve_mpi | Like solve_mpi above, but also supports robust, risk-averse, or optimistic objectives. |
These methods can be applied to either an MDP or an RMDP.
The header algorithms/occupancies.hpp provides tools for converting an MDP to a transition matrix and for computing occupancy frequencies.
There are tools for building simulators and sampling from simulations in the header Simulation.hpp, and methods for handling samples in Samples.hpp.
The following is a simple example of formulating and solving a small MDP.
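The example code itself is not included on this page; the sketch below shows the general shape such an example takes. The header paths, the `add_transition` helper, the `solve_mpi` signature, and the `valuefunction` field are assumptions based on the method table above, and should be checked against the library's headers before use:

```cpp
// Hypothetical sketch -- header paths and exact signatures are assumptions,
// not verified against the library.
#include "craam/RMDP.hpp"
#include "craam/algorithms/values.hpp"

#include <iostream>

int main() {
    craam::MDP mdp;

    // A two-state MDP: add_transition(mdp, from, action, to, prob, reward).
    craam::add_transition(mdp, 0, 0, 0, 0.5, 1.0);
    craam::add_transition(mdp, 0, 0, 1, 0.5, 2.0);
    craam::add_transition(mdp, 1, 0, 0, 1.0, 0.0);

    // Solve with Jacobi modified policy iteration (see the table above).
    auto solution = craam::algorithms::solve_mpi(mdp, 0.9);

    for (auto v : solution.valuefunction)
        std::cout << v << std::endl;
    return 0;
}
```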
To compile the example, point the compiler at the CRAAM headers; since the library is header-only, no additional linking is required.
[Filar1997] Filar, J., & Vrieze, K. (1997). Competitive Markov decision processes. Springer.
[Puterman2005] Puterman, M. L. (2005). Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons.
[Iyengar2005] Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research, 30(2), 257-280.
[Petrik2014] Petrik, M., & Subramanian, S. (2014). RAAM: The benefits of robustness in approximating aggregated MDPs in reinforcement learning. In Neural Information Processing Systems (NIPS).
[Petrik2016] Petrik, M., & Luss, R. (2016). Interpretable Policies for Dynamic Product Recommendations. In Uncertainty in Artificial Intelligence (UAI).