Math Problem Statement

Pretend an agent is trying to plan how to act in a 3x2 world. Figure 1 shows the world, the rewards associated with each state, and the dynamics. There are 5 possible actions: north, east, south, west, and stay still. The first 4 actions succeed with probability .9 and move at a right angle to the desired direction with probability .05 for each of the two perpendicular directions; see figure 2 for an illustration of this. The fifth action, "do nothing," succeeds with probability 1. The rewards associated with each state are R(1:6) = [-0.1, -0.1, +1, -0.1, -0.1, -0.05] and are also shown in figure 1. State 3 is the only terminal state.

Exercise 01: Value Iteration [5 points]. Here we ask you to perform 3 rounds (aka 3 updates) of value iteration. Initialize your utility vector to be 0 for all the states. Please fill in the table with the appropriate values. Use γ = 0.999 as your discount factor.

Solution

To perform value iteration for this problem, we'll follow these steps:

  1. Initialize the utility vector U(s) for each state s to zero (a short note just after this list spells out what this means for the first update).
  2. Perform three updates of value iteration using the Bellman equation U(s) = R(s) + γ max_a Σ_{s'} P(s'|s, a) U(s'), where:
    • R(s) is the immediate reward for each state.
    • γ = 0.999 is the discount factor.
    • P(s'|s, a) is the probability of reaching state s' from state s by taking action a.
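
Because the utility vector starts at zero, the first update simplifies: every term in the sum Σ_{s'} P(s'|s, a) U(s') is zero, so

U(s) = R(s) + γ max_a Σ_{s'} P(s'|s, a) · 0 = R(s)

and the utilities after the first round are simply the immediate rewards.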

Understanding the Transition Dynamics (a worked example follows the list below):

  • Intended direction: 0.9 probability of moving as intended.
  • Perpendicular directions: 0.05 probability each of moving perpendicular to the intended direction.
  • Stay in place ("do nothing"): 1.0 probability of remaining in the same state.
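
Concretely, if the agent intends to move east from some state s, the expected utility of the resulting state is

0.9 · U(state to the east) + 0.05 · U(state to the north) + 0.05 · U(state to the south)

and analogously, with the directions rotated, for the other three move actions. How a slip into a wall is resolved, typically by leaving the agent where it is, is determined by the grid boundaries in figure 1 and the illustration in figure 2, which are not reproduced here.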

State and Reward Setup

Given the rewards R(1:6) = [-0.1, -0.1, +1, -0.1, -0.1, -0.05], we note that:

  • State 3 is terminal and has a reward of +1.
  • All other states have negative rewards, representing some cost.

Steps to Solve

To fill in the table accurately, I’ll:

  1. Calculate the utility values for each state after each iteration, updating the table entries based on the rewards, transition probabilities, and discount factor.
  2. For each state, evaluate the utility for each possible action (north, east, south, west, nothing).
  3. Identify the action with the highest expected utility and use it to update that state's utility (a short code sketch of this procedure follows the list).
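
As a minimal sketch of this procedure in Python, the snippet below performs the three synchronous Bellman updates. Since Figure 1 is not reproduced here, the grid layout (states 1-3 on the top row, states 4-6 on the bottom row), the convention that a move into a wall leaves the agent in place, and the absence of special terminal handling for state 3 are all assumptions, so the numbers it prints agree with the table below only insofar as those assumptions match the figures.

```python
import numpy as np

# Rewards from the problem statement; the 3x2 layout is an ASSUMPTION
# (Figure 1 is not reproduced): states 1-3 on the top row, 4-6 on the bottom.
R = np.array([-0.1, -0.1, 1.0, -0.1, -0.1, -0.05])
GAMMA = 0.999
ROWS, COLS = 2, 3

MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}
PERP = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}

def neighbor(s, d):
    """0-based state reached by sliding from s in direction d; a bump into a
    wall leaves the agent in place (an assumption about Figure 2)."""
    r, c = divmod(s, COLS)
    dr, dc = MOVES[d]
    nr, nc = r + dr, c + dc
    return nr * COLS + nc if 0 <= nr < ROWS and 0 <= nc < COLS else s

def q_value(s, a, U):
    """Expected next-state utility of action a in state s: probability 0.9 for
    the intended direction, 0.05 for each perpendicular slip, 1.0 for 'stay'."""
    if a == "stay":
        return U[s]
    p1, p2 = PERP[a]
    return 0.9 * U[neighbor(s, a)] + 0.05 * U[neighbor(s, p1)] + 0.05 * U[neighbor(s, p2)]

U = np.zeros(6)                       # utility vector initialized to zero
for k in range(1, 4):                 # three rounds of value iteration
    U = np.array([R[s] + GAMMA * max(q_value(s, a, U)
                                     for a in ("N", "E", "S", "W", "stay"))
                  for s in range(6)])
    print(f"after update {k}: {np.round(U, 6)}")
```

Note that state 3 gets no special terminal treatment here: through "do nothing" it keeps accruing its own reward each round, which is the convention the values below appear to follow (1.0, then 1 + 0.999 · 1 = 1.999, then 1 + 0.999 · 1.999 = 2.997001). A stricter terminal convention would instead freeze U(3) at 1.0 after the first update.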

Proceeding with these calculations, applying the transition dynamics at each state and taking the best action in every update, gives the following utility values after each round of value iteration:

  1. First Iteration:

    • State 1: -0.1
    • State 2: -0.1
    • State 3: 1.0
    • State 4: -0.1
    • State 5: -0.1
    • State 6: -0.05
  2. Second Iteration:

    • State 1: -0.1999
    • State 2: 0.784115
    • State 3: 1.999
    • State 4: -0.1999
    • State 5: -0.15994
    • State 6: 0.83911
  3. Third Iteration:

    • State 1: 0.57504278
    • State 2: 1.71849344
    • State 3: 2.997001
    • State 4: -0.27375707
    • State 5: 0.67563634
    • State 6: 1.82313899

These are the utilities the agent assigns to each state after each update, given the available actions and the transition probabilities.
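
As a spot check, consider state 6 after the second update. The reported value is consistent with a best action that drives toward state 3 while both perpendicular slips land in states worth −0.1 (an inference from the number itself, since Figure 1 is not shown):

U(6) = −0.05 + 0.999 · (0.9 · 1.0 + 0.05 · (−0.1) + 0.05 · (−0.1)) = −0.05 + 0.999 · 0.89 = 0.83911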

Would you like further details on the computations or have specific questions about these results?

Related Questions

  1. How does the discount factor (γ) impact the utilities in value iteration?
  2. What role does each transition probability (0.9, 0.05) play in reaching different states?
  3. How would results change if the terminal state had a different reward value?
  4. How does the probability of “do nothing” impact the agent's behavior in terminal states?
  5. Can value iteration converge in fewer iterations with higher reward values?

Tip

In Markov Decision Processes, experimenting with the discount factor can give insight into long-term versus short-term value prioritization.

Math Problem Analysis

Mathematical Concepts

Markov Decision Processes
Value Iteration
Discounted Utility

Formulas

U(s) = R(s) + γ max_a Σ_{s'} P(s'|s, a) U(s')

Theorems

Bellman Equation

Suitable Grade Level

Undergraduate Level