V. CASE STUDY AND EXPERIMENTS

We require an abstract representation of the human’s commands

as a strategy to use our synthesis approach in a shared control

scenario, We now discuss how such strategies may be obtained

using inverse reinforcement learning and report on case study

results.

A. Experimental setting

We consider two scenarios, the first of which is the wheelchair

scenario from Fig. 1. We model the wheelchair scenario inside

an interactive Python environment.

In the second scenario, we use a tool called AMASE1, which

is used to simulate multi-unmanned aircraft vehicles (UAV)

missions. Its graphical user interfaces allow humans to send

commands to the one or multiple vehicles at run time. It

includes three main programs: a simulator, a data playback

tool, and a scenario setup tool.

We use the model checker PRISM 19 to verify if the

computed strategies satisfy the specification. We use the LP

solver Gurobi 14 to check the feasibility of the LP problems

that is given in Section IV. We also implemented the greedy

approach for strategy repair in 15.

B. Data collection

We asked five participants to accomplish tasks in the wheelchair

scenario. The goal is moving the wheelchair to a target cell in

gridworld while never occupying the same cell as the moving

obstacle. Similarly, three participants performed the surveillance

task in the AMASE environment.

From the data obtained from each participant, we compute

an individual randomized human strategy ?h via maximum-

entropy inverse reinforcement learning (MEIRL) 28. Refer-

ence 16 uses inverse reinforcement learning to reason about

the human’s commands in a shared control scenario from

human’s demonstrations. However, they lack formal guarantees

on the robot’s execution. In 25, inverse reinforcement learning

is used to distinguish human intents with respect to different

tasks. We show the work flow of the case study in Figure 4.

In our setting, we denote each sample as one particular

command of the participant, and we assume that the participant

issues the command to satisfy the specification. Under this

assumption, we can bound the probability of a possible

deviation from the actual intent with respect to the number of

samples using Hoeffding’s inequality for the resulting strategy,

see 27 for details. Using these bounds, we can determine

the required number of command to get an approximation

of a typical human behavior. The probability of a possible

deviation from the human behavior is given by O(exp(?n?2)),

where n is the number of commands from the human and ?

is the upper bound on the deviation between the probability

of satisfying the specification with the true human strategy

and the probability obtained by the strategy that is computed

by inverse reinforcement learning. For example, to ensure an

upper bound ? = 0.05 on the deviation of the probability

1https://github.com/afrl-rq/OpenAMASE

Fig. 4. The setting of the case study for the shared control simulation.

of satisfying the specification with a probability of 0.99, we

require 1060 demonstrations from the human.

We design the blending function by assigning a lower weight

in the human strategy at states where the human strategy

yields a lower probability of reaching the target set. Using

this function, we create the autonomy strategy ?a and pass

(together with the blending function) back to the environment.

Note that the blended strategy ?ha satisfies the specification,

by Theorem 1.

C. Gridworld

The size of the gridworld in Fig. 1 is variable, and we

generate a number of randomly moving (e.g., the vacuum

cleaner) and stationary obstacles. An agent (e.g., the wheelchair)

moves in the gridworld according to the commands from a

human. For the gridworld scenario, we construct an MDP and

the states of the MDP represents the positions of the agent and

the obstacles. The actions in the MDP induce changes in the

position of the agent.

The safety specification specifies that the agent has to reach

a target cell while not crashing into an obstacle with a certain

probability ? ? 0, 1, formally P??(¬crash U target).

First, we report results for one particular participant in a

gridworld scenario with a 8× 8 grid and one moving obstacle.

The resulting MDP has 2304 states. We compute the human

strategy using MEIRL with the features, e. g., the components

of the cost function of the human, giving the distance to the

obstacle and the goal state.

We instantiate the safety specification with ? = 0.7, which

means the target should be reached with at least a probability

of 0.7. The human strategy ?h induces a probability of 0.546

to satisfy the specification. That is, it does not satisfy the

specification.

We compute the repaired strategy ?ha using the greedy and

the QCP approach, and both strategies satisfies the specification

by inducing a probability of satisfying the specification larger

than ?. On the one hand, the maximum deviation between

the human strategy ?h and ?ha is 0.15 with the LP approach,

which implies that the strategy of the human and the autonomy

protocol deviates at most 15% for all states and actions. On

the other hand, the maximum deviation between the human

strategy ?h and the blended strategy ?ha is 0.03 with the QCP

approach. The results shows that the QCP approach computes

a blended strategy that induces more similar strategy to the

human strategy compared to the LP approach.

(a) Strategy ?h (b) Strategy ?ah (c) Strategy ?a

Fig. 5. Graphical representation of the obtained human, blended, and autonomy

strategy in the grid.

TABLE I

SCALABILITY RESULTS FOR THE GRIDWORLD EXAMPLE.

grid states trans. LP synth. ?LP QCP synth. ?QCP

8× 8 2, 304 36, 864 14.12 0.15 31.49 0.03

10× 10 3, 600 57, 600 23.80 0.24 44.61 0.04

12× 12 14, 400 230, 400 250.78 0.33 452.27 0.05

To finally assess the scalability of our approach, consider

Table I. We generated MDPs for different gridworlds with

different number of states and number of obstacles. We list

the number of states in the MDP (labeled as “states”) and

the number of transitions (labeled as “trans”). We report on

the time that the synthesis process took with the LP approach

and QCP approach (labeled as “LP synth.” and “QCP synth”),

which includes the time of solving the LP or QCPs measured

in seconds. It also includes the model checking times using

PRISM for the LP approach. To represent the optimality of the

synthesis, we list the maximal deviation between the repaired

strategy and the human strategy for the LP and QCP approach

(labeled as “?LP” and “?QCP”). In all of the examples, we

observe that the strategies with the QCP approach yields

autonomy strategies with less deviation to the human strategy

while having similar computation time with the LP approach.

D. UAV mission planning

Similar to the gridworld scenario, we generate an MDP, in

which that of the states MDP denotes the position of the agents

and the obstacles in a AMASE scenario. Consider an AMASE

scenario in Fig. 6. In this scenario, the specification or the

mission of the agent (blue UAV) is to keep surveilling the

green regions (labeled as w1, w2, w3) while avoiding restricted

operating zones (labeled as “ROZ1, ROZ2”) and enemy agents

(purple and green UAVs). As we consider reachability problems,

we asked the participants to visit the regions in a sequence,

i.e., visiting the first region, then second, and then the third

region. After visiting the third region, the task is to visit the

first region again to perform the surveillance.

For example, if the last visited region is w3, then the

safety specification in this scenario is P??((¬crash ?

¬ROZ) U target), where ROZ is to visit the ROZ areas

and target is visiting w1.

We synthesize the autonomy protocol on the AMASE

scenario with two enemy agents that induces an MDP with

15625 states. We use the same blending function and same

(a) An AMASE simulator. (b) The GUI of AMASE.

Fig. 6. An example of an AMASE scenario.

threshold ? = 0.7 as in the gridworld example. The features

to compute the human strategy with MEIRL are given by

the distance to the closest ROZ, enemy agents and the target

region.

The human strategy ?h induces a probability of 0.163 to

satisfy the specification, and it violates the specification like

in the gridworld example. Similar to the gridworld example,

we compute the repaired strategy ?ha with the LP and the

QCP approach, and both strategies satisfy the specification. On

the one hand, the maximum deviation between ?h and ?ha is

0.42 with the LP approach, which means the strategies of the

human and the autonomy protocol are significantly different in

some states of the MDP. On the other hand, the QCP approach

yields a repaired strategy ?ha that is more similar to the human

strategy ?h with the maximum deviation being 0.06. The time

of the synthesis procedure with the LP approach is 481.31

seconds and the computation time with the QCP approach

is 749.18 seconds, showing the trade-offs between the LP

approach and the QCP approach. We see that, the LP approach

can compute a feasible solution slightly faster, however the

resulting blended strategy may be less similar to the human

strategy compared to the QCP approach.

VI. CONCLUSION AND CRITIQUE

We introduced a formal approach to synthesize an autonomy

protocol in a shared control setting subject to probabilistic

temporal logic specifications. The proposed approach utilizes

inverse reinforcement learning to compute an abstraction of

a human’s behavior as a randomized strategy in a Markov

decision process. We designed an autonomy protocol such that

the resulting robot strategy satisfies safety and performance

specifications given in probabilistic temporal logic. We also

ensured that the resulting robot behavior is as similar to the

behavior induced by the human’s commands as possible. We

synthesized the robot behavior using quasiconvex programming.

We showed the practical usability of our approach through

case studies involving autonomous wheelchair navigation and

unmanned aerial vehicle planning.

There is a number of limitations and also possible extensions

of the proposed approach. First of all, we computed an

globally optimal strategy by bisection, which requires checking

feasibility of a number of linear programming problems. A

convex formulation of the shared control synthesis problem

would make computing the globally optimal strategy more

efficient.

We assumed that the human’s commands are consistent

through the whole execution, i. e., the human issues each

command to satisfy the specification. Also, this assumption

implies the human does not consider assistance from the robot

while providing commands – and in particular, the human does

not adapt the strategy to the assistance. It may be possible

to extend the approach to handle non-consistent commands

by utilizing additional side information, such as the task

specifications.

Finally, in order to generalize the proposed approach to

other task domains, it is worth to explore transfer learning 21

techniques. Such techniques will allow us to handle different

scenarios without requiring to relearn the human strategy from

the human’s commands.

REFERENCES

1 Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse

reinforcement learning. In ICML, page 1. ACM, 2004.

2 Christel Baier and Joost-Pieter Katoen. Principles of Model Checking.

The MIT Press, 2008.

3 Mihir Bellare and Phillip Rogaway. The complexity of approximating

a nonlinear program. In Complexity in numerical optimization, pages

16–32. World Scientific, 1993.

4 Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cam-

bridge University Press, New York, NY, USA, 2004.

5 Taolue Chen, Yuan Feng, David S. Rosenblum, and Guoxin Su.

Perturbation analysis in verification of discrete-time Markov chains.

In CONCUR, volume 8704 of LNCS, pages 218–233. Springer, 2014.

6 Anca D. Dragan and Siddhartha S. Srinivasa. Formalizing assistive

teleoperation. In Robotics: Science and Systems, 2012.

7 Anca D. Dragan and Siddhartha S. Srinivasa. A policy-blending

formalism for shared control. I. J. Robotic Res., 32(7):790–805, 2013.

8 Andrew Fagg, Michael Rosenstein, Robert Platt, and Roderic Grupen.

Extracting user intent in mixed initiative teleoperator control. In

Intelligent Systems Technical Conference, page 6309, 2004.

9 Lu Feng, Clemens Wiltsche, Laura Humphrey, and Ufuk Topcu. Synthesis

of human-in-the-loop control protocols for autonomous systems. IEEE

Transactions on Automation Science and Engineering, 13(2):450–462,

2016.

10 Vojte?ch Forejt, Marta Kwiatkowska, Gethin Norman, David Parker, and

Hongyang Qu. Quantitative multi-objective verification for probabilistic

systems. In TACAS, pages 112–127. Springer, 2011.

11 Roland Fryer and Matthew O Jackson. A categorical model of cognition

and biased decision making. The BE Journal of Theoretical Economics,

8(1).

12 Jie Fu and Ufuk Topcu. Synthesis of shared autonomy policies with

temporal logic specifications. IEEE Transactions on Automation Science

and Engineering, 13(1):7–17, 2016.

13 F. Gala?n, M. Nuttin, E. Lew, P. W. Ferrez, G. Vanacker, J. Philips,

and J. del R. Milla?n. A brain-actuated wheelchair: Asynchronous and

non-invasive brain-computer interfaces for continuous control of robots.

Clinical Neurophysiology, 119(9):2159–2169, 2016/05/28.

14 Gurobi Optimization, Inc. Gurobi optimizer reference manual.

url=http://www.gurobi.com, 2013.

15 Nils Jansen, Murat Cubuktepe, and Ufuk Topcu. Synthesis of shared

control protocols with provable safety and performance guarantees. In

ACC, pages 1866–1873. IEEE, 2017.

16 Shervin Javdani, J Andrew Bagnell, and Siddhartha Srinivasa. Shared

autonomy via hindsight optimization. In Robotics: Science and Systems,

2015.

17 Dae-Jin Kim, Rebekah Hazlett-Knudsen, Heather Culver-Godfrey, Greta

Rucks, Tara Cunningham, David Portee, John Bricout, Zhao Wang, and

Aman Behal. How autonomy impacts performance and satisfaction:

Results from a study with spinal cord injured subjects using an assistive

robot. IEEE Transactions on Systems, Man, and Cybernetics-Part A:

Systems and Humans, 42(1):2–14, 2012.

18 Jonathan Kofman, Xianghai Wu, Timothy J Luu, and Siddharth Verma.

Teleoperation of a robot manipulator using a vision-based human-robot

interface. IEEE transactions on industrial electronics, 52(5):1206–1219,

2005.

19 Marta Kwiatkowska, Gethin Norman, and David Parker. PRISM 4.0:

Verification of probabilistic real-time systems. In CAV, volume 6806 of

LNCS, pages 585–591. Springer, 2011.

20 Adam Leeper, Kaijen Hsiao, Matei Ciocarlie, Leila Takayama, and David

Gossow. Strategies for human-in-the-loop robotic grasping. In HRI, pages

1–8. IEEE, 2012.

21 Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE

Transactions on knowledge and data engineering, 22(10):1345–1359,

2010.

22 Shashank Pathak, Erika A?braha?m, Nils Jansen, Armando Tacchella,

and Joost-Pieter Katoen. A greedy approach for the efficient repair of

stochastic models. In NFM, volume 9058 of LNCS, pages 295–309.

Springer, 2015.

23 Martin L. Puterman. Markov Decision Processes: Discrete Stochastic

Dynamic Programming. John Wiley and Sons, 1994.

24 Martin L Puterman. Markov decision processes: discrete stochastic

dynamic programming. John Wiley ; Sons, 2014.

25 Constantin A Rothkopf and Dana H Ballard. Modular inverse rein-

forcement learning for visuomotor behavior. Biological cybernetics,

107(4):477–490, 2013.

26 Jian Shen, Javier Ibanez-Guzman, Teck Chew Ng, and Boon Seng

Chew. A collaborative-shared control system with safe obstacle avoidance

capability. In Robotics, Automation and Mechatronics, volume 1, pages

119–123. IEEE, 2004.

27 Brian D Ziebart. Modeling purposeful adaptive behavior with the principle

of maximum causal entropy. 2010.

28 Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey.

Maximum entropy inverse reinforcement learning. 2008.