V. CASE STUDY AND EXPERIMENTS
We require an abstract representation of the human’s commands
as a strategy to use our synthesis approach in a shared control
scenario, We now discuss how such strategies may be obtained
using inverse reinforcement learning and report on case study
A. Experimental setting
We consider two scenarios, the first of which is the wheelchair
scenario from Fig. 1. We model the wheelchair scenario inside
an interactive Python environment.
In the second scenario, we use a tool called AMASE1, which
is used to simulate multi-unmanned aircraft vehicles (UAV)
missions. Its graphical user interfaces allow humans to send
commands to the one or multiple vehicles at run time. It
includes three main programs: a simulator, a data playback
tool, and a scenario setup tool.
We use the model checker PRISM 19 to verify if the
computed strategies satisfy the specification. We use the LP
solver Gurobi 14 to check the feasibility of the LP problems
that is given in Section IV. We also implemented the greedy
approach for strategy repair in 15.
B. Data collection
We asked five participants to accomplish tasks in the wheelchair
scenario. The goal is moving the wheelchair to a target cell in
gridworld while never occupying the same cell as the moving
obstacle. Similarly, three participants performed the surveillance
task in the AMASE environment.
From the data obtained from each participant, we compute
an individual randomized human strategy ?h via maximum-
entropy inverse reinforcement learning (MEIRL) 28. Refer-
ence 16 uses inverse reinforcement learning to reason about
the human’s commands in a shared control scenario from
human’s demonstrations. However, they lack formal guarantees
on the robot’s execution. In 25, inverse reinforcement learning
is used to distinguish human intents with respect to different
tasks. We show the work flow of the case study in Figure 4.
In our setting, we denote each sample as one particular
command of the participant, and we assume that the participant
issues the command to satisfy the specification. Under this
assumption, we can bound the probability of a possible
deviation from the actual intent with respect to the number of
samples using Hoeffding’s inequality for the resulting strategy,
see 27 for details. Using these bounds, we can determine
the required number of command to get an approximation
of a typical human behavior. The probability of a possible
deviation from the human behavior is given by O(exp(?n?2)),
where n is the number of commands from the human and ?
is the upper bound on the deviation between the probability
of satisfying the specification with the true human strategy
and the probability obtained by the strategy that is computed
by inverse reinforcement learning. For example, to ensure an
upper bound ? = 0.05 on the deviation of the probability
Fig. 4. The setting of the case study for the shared control simulation.
of satisfying the specification with a probability of 0.99, we
require 1060 demonstrations from the human.
We design the blending function by assigning a lower weight
in the human strategy at states where the human strategy
yields a lower probability of reaching the target set. Using
this function, we create the autonomy strategy ?a and pass
(together with the blending function) back to the environment.
Note that the blended strategy ?ha satisfies the specification,
by Theorem 1.
The size of the gridworld in Fig. 1 is variable, and we
generate a number of randomly moving (e.g., the vacuum
cleaner) and stationary obstacles. An agent (e.g., the wheelchair)
moves in the gridworld according to the commands from a
human. For the gridworld scenario, we construct an MDP and
the states of the MDP represents the positions of the agent and
the obstacles. The actions in the MDP induce changes in the
position of the agent.
The safety specification specifies that the agent has to reach
a target cell while not crashing into an obstacle with a certain
probability ? ? 0, 1, formally P??(¬crash U target).
First, we report results for one particular participant in a
gridworld scenario with a 8× 8 grid and one moving obstacle.
The resulting MDP has 2304 states. We compute the human
strategy using MEIRL with the features, e. g., the components
of the cost function of the human, giving the distance to the
obstacle and the goal state.
We instantiate the safety specification with ? = 0.7, which
means the target should be reached with at least a probability
of 0.7. The human strategy ?h induces a probability of 0.546
to satisfy the specification. That is, it does not satisfy the
We compute the repaired strategy ?ha using the greedy and
the QCP approach, and both strategies satisfies the specification
by inducing a probability of satisfying the specification larger
than ?. On the one hand, the maximum deviation between
the human strategy ?h and ?ha is 0.15 with the LP approach,
which implies that the strategy of the human and the autonomy
protocol deviates at most 15% for all states and actions. On
the other hand, the maximum deviation between the human
strategy ?h and the blended strategy ?ha is 0.03 with the QCP
approach. The results shows that the QCP approach computes
a blended strategy that induces more similar strategy to the
human strategy compared to the LP approach.
(a) Strategy ?h (b) Strategy ?ah (c) Strategy ?a
Fig. 5. Graphical representation of the obtained human, blended, and autonomy
strategy in the grid.
SCALABILITY RESULTS FOR THE GRIDWORLD EXAMPLE.
grid states trans. LP synth. ?LP QCP synth. ?QCP
8× 8 2, 304 36, 864 14.12 0.15 31.49 0.03
10× 10 3, 600 57, 600 23.80 0.24 44.61 0.04
12× 12 14, 400 230, 400 250.78 0.33 452.27 0.05
To finally assess the scalability of our approach, consider
Table I. We generated MDPs for different gridworlds with
different number of states and number of obstacles. We list
the number of states in the MDP (labeled as “states”) and
the number of transitions (labeled as “trans”). We report on
the time that the synthesis process took with the LP approach
and QCP approach (labeled as “LP synth.” and “QCP synth”),
which includes the time of solving the LP or QCPs measured
in seconds. It also includes the model checking times using
PRISM for the LP approach. To represent the optimality of the
synthesis, we list the maximal deviation between the repaired
strategy and the human strategy for the LP and QCP approach
(labeled as “?LP” and “?QCP”). In all of the examples, we
observe that the strategies with the QCP approach yields
autonomy strategies with less deviation to the human strategy
while having similar computation time with the LP approach.
D. UAV mission planning
Similar to the gridworld scenario, we generate an MDP, in
which that of the states MDP denotes the position of the agents
and the obstacles in a AMASE scenario. Consider an AMASE
scenario in Fig. 6. In this scenario, the specification or the
mission of the agent (blue UAV) is to keep surveilling the
green regions (labeled as w1, w2, w3) while avoiding restricted
operating zones (labeled as “ROZ1, ROZ2”) and enemy agents
(purple and green UAVs). As we consider reachability problems,
we asked the participants to visit the regions in a sequence,
i.e., visiting the first region, then second, and then the third
region. After visiting the third region, the task is to visit the
first region again to perform the surveillance.
For example, if the last visited region is w3, then the
safety specification in this scenario is P??((¬crash ?
¬ROZ) U target), where ROZ is to visit the ROZ areas
and target is visiting w1.
We synthesize the autonomy protocol on the AMASE
scenario with two enemy agents that induces an MDP with
15625 states. We use the same blending function and same
(a) An AMASE simulator. (b) The GUI of AMASE.
Fig. 6. An example of an AMASE scenario.
threshold ? = 0.7 as in the gridworld example. The features
to compute the human strategy with MEIRL are given by
the distance to the closest ROZ, enemy agents and the target
The human strategy ?h induces a probability of 0.163 to
satisfy the specification, and it violates the specification like
in the gridworld example. Similar to the gridworld example,
we compute the repaired strategy ?ha with the LP and the
QCP approach, and both strategies satisfy the specification. On
the one hand, the maximum deviation between ?h and ?ha is
0.42 with the LP approach, which means the strategies of the
human and the autonomy protocol are significantly different in
some states of the MDP. On the other hand, the QCP approach
yields a repaired strategy ?ha that is more similar to the human
strategy ?h with the maximum deviation being 0.06. The time
of the synthesis procedure with the LP approach is 481.31
seconds and the computation time with the QCP approach
is 749.18 seconds, showing the trade-offs between the LP
approach and the QCP approach. We see that, the LP approach
can compute a feasible solution slightly faster, however the
resulting blended strategy may be less similar to the human
strategy compared to the QCP approach.
VI. CONCLUSION AND CRITIQUE
We introduced a formal approach to synthesize an autonomy
protocol in a shared control setting subject to probabilistic
temporal logic specifications. The proposed approach utilizes
inverse reinforcement learning to compute an abstraction of
a human’s behavior as a randomized strategy in a Markov
decision process. We designed an autonomy protocol such that
the resulting robot strategy satisfies safety and performance
specifications given in probabilistic temporal logic. We also
ensured that the resulting robot behavior is as similar to the
behavior induced by the human’s commands as possible. We
synthesized the robot behavior using quasiconvex programming.
We showed the practical usability of our approach through
case studies involving autonomous wheelchair navigation and
unmanned aerial vehicle planning.
There is a number of limitations and also possible extensions
of the proposed approach. First of all, we computed an
globally optimal strategy by bisection, which requires checking
feasibility of a number of linear programming problems. A
convex formulation of the shared control synthesis problem
would make computing the globally optimal strategy more
We assumed that the human’s commands are consistent
through the whole execution, i. e., the human issues each
command to satisfy the specification. Also, this assumption
implies the human does not consider assistance from the robot
while providing commands – and in particular, the human does
not adapt the strategy to the assistance. It may be possible
to extend the approach to handle non-consistent commands
by utilizing additional side information, such as the task
Finally, in order to generalize the proposed approach to
other task domains, it is worth to explore transfer learning 21
techniques. Such techniques will allow us to handle different
scenarios without requiring to relearn the human strategy from
the human’s commands.
1 Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse
reinforcement learning. In ICML, page 1. ACM, 2004.
2 Christel Baier and Joost-Pieter Katoen. Principles of Model Checking.
The MIT Press, 2008.
3 Mihir Bellare and Phillip Rogaway. The complexity of approximating
a nonlinear program. In Complexity in numerical optimization, pages
16–32. World Scientific, 1993.
4 Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cam-
bridge University Press, New York, NY, USA, 2004.
5 Taolue Chen, Yuan Feng, David S. Rosenblum, and Guoxin Su.
Perturbation analysis in verification of discrete-time Markov chains.
In CONCUR, volume 8704 of LNCS, pages 218–233. Springer, 2014.
6 Anca D. Dragan and Siddhartha S. Srinivasa. Formalizing assistive
teleoperation. In Robotics: Science and Systems, 2012.
7 Anca D. Dragan and Siddhartha S. Srinivasa. A policy-blending
formalism for shared control. I. J. Robotic Res., 32(7):790–805, 2013.
8 Andrew Fagg, Michael Rosenstein, Robert Platt, and Roderic Grupen.
Extracting user intent in mixed initiative teleoperator control. In
Intelligent Systems Technical Conference, page 6309, 2004.
9 Lu Feng, Clemens Wiltsche, Laura Humphrey, and Ufuk Topcu. Synthesis
of human-in-the-loop control protocols for autonomous systems. IEEE
Transactions on Automation Science and Engineering, 13(2):450–462,
10 Vojte?ch Forejt, Marta Kwiatkowska, Gethin Norman, David Parker, and
Hongyang Qu. Quantitative multi-objective verification for probabilistic
systems. In TACAS, pages 112–127. Springer, 2011.
11 Roland Fryer and Matthew O Jackson. A categorical model of cognition
and biased decision making. The BE Journal of Theoretical Economics,
12 Jie Fu and Ufuk Topcu. Synthesis of shared autonomy policies with
temporal logic specifications. IEEE Transactions on Automation Science
and Engineering, 13(1):7–17, 2016.
13 F. Gala?n, M. Nuttin, E. Lew, P. W. Ferrez, G. Vanacker, J. Philips,
and J. del R. Milla?n. A brain-actuated wheelchair: Asynchronous and
non-invasive brain-computer interfaces for continuous control of robots.
Clinical Neurophysiology, 119(9):2159–2169, 2016/05/28.
14 Gurobi Optimization, Inc. Gurobi optimizer reference manual.
15 Nils Jansen, Murat Cubuktepe, and Ufuk Topcu. Synthesis of shared
control protocols with provable safety and performance guarantees. In
ACC, pages 1866–1873. IEEE, 2017.
16 Shervin Javdani, J Andrew Bagnell, and Siddhartha Srinivasa. Shared
autonomy via hindsight optimization. In Robotics: Science and Systems,
17 Dae-Jin Kim, Rebekah Hazlett-Knudsen, Heather Culver-Godfrey, Greta
Rucks, Tara Cunningham, David Portee, John Bricout, Zhao Wang, and
Aman Behal. How autonomy impacts performance and satisfaction:
Results from a study with spinal cord injured subjects using an assistive
robot. IEEE Transactions on Systems, Man, and Cybernetics-Part A:
Systems and Humans, 42(1):2–14, 2012.
18 Jonathan Kofman, Xianghai Wu, Timothy J Luu, and Siddharth Verma.
Teleoperation of a robot manipulator using a vision-based human-robot
interface. IEEE transactions on industrial electronics, 52(5):1206–1219,
19 Marta Kwiatkowska, Gethin Norman, and David Parker. PRISM 4.0:
Verification of probabilistic real-time systems. In CAV, volume 6806 of
LNCS, pages 585–591. Springer, 2011.
20 Adam Leeper, Kaijen Hsiao, Matei Ciocarlie, Leila Takayama, and David
Gossow. Strategies for human-in-the-loop robotic grasping. In HRI, pages
1–8. IEEE, 2012.
21 Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE
Transactions on knowledge and data engineering, 22(10):1345–1359,
22 Shashank Pathak, Erika A?braha?m, Nils Jansen, Armando Tacchella,
and Joost-Pieter Katoen. A greedy approach for the efficient repair of
stochastic models. In NFM, volume 9058 of LNCS, pages 295–309.
23 Martin L. Puterman. Markov Decision Processes: Discrete Stochastic
Dynamic Programming. John Wiley and Sons, 1994.
24 Martin L Puterman. Markov decision processes: discrete stochastic
dynamic programming. John Wiley ; Sons, 2014.
25 Constantin A Rothkopf and Dana H Ballard. Modular inverse rein-
forcement learning for visuomotor behavior. Biological cybernetics,
26 Jian Shen, Javier Ibanez-Guzman, Teck Chew Ng, and Boon Seng
Chew. A collaborative-shared control system with safe obstacle avoidance
capability. In Robotics, Automation and Mechatronics, volume 1, pages
119–123. IEEE, 2004.
27 Brian D Ziebart. Modeling purposeful adaptive behavior with the principle
of maximum causal entropy. 2010.
28 Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey.
Maximum entropy inverse reinforcement learning. 2008.