Math Problem Statement
Suppose that an environment has six states, encoded as the integers 0 through 5. An agent is attempting to learn the value function V_π(s) for a policy π by applying Monte Carlo prediction. After a few iterations of Monte Carlo prediction, the agent's estimate of V_π(s) is given by the dictionary shown below.
V_estimate = {0:10.4, 1:16.7, 2:12.8, 3:24.2, 4:8.2, 5:16.7}
The agent generates a new episode starting in state 0. During the episode, the agent performs a sequence of four actions. The states visited as a result of taking these actions and the rewards earned by the agent are given by the lists below.
states_visited = [0, 4, 3, 1, 5]
rewards_earned = [3, 2, -5, 6]
Using a discount rate of 0.9 and a learning rate of 0.1, determine the agent's new estimate of V_π(0) after performing an update based on this episode.
Solution
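Working from the formulas listed under Math Problem Analysis below, the discounted return from the start of the episode (state 0) is

G_0 = 3 + 0.9(2) + 0.9^2(-5) + 0.9^3(6) = 3 + 1.8 - 4.05 + 4.374 = 5.124

The constant-α Monte Carlo update with α = 0.1 then gives

V_{\pi}(0) \leftarrow V_{\pi}(0) + \alpha (G_0 - V_{\pi}(0)) = 10.4 + 0.1(5.124 - 10.4) = 10.4 - 0.5276 = 9.8724

So the agent's new estimate is V_π(0) = 9.8724.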
Math Problem Analysis
Mathematical Concepts
Monte Carlo Methods
Reinforcement Learning
Value Function Estimation
Discounted Return
Learning Rate
Formulas
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
V_{\pi}(s) \leftarrow V_{\pi}(s) + \alpha (G_t - V_{\pi}(s))
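As a numerical check, here is a short Python sketch applying these two formulas to the episode above (variable names follow the problem statement):

gamma = 0.9   # discount rate
alpha = 0.1   # learning rate

V_estimate = {0: 10.4, 1: 16.7, 2: 12.8, 3: 24.2, 4: 8.2, 5: 16.7}
rewards_earned = [3, 2, -5, 6]

# Discounted return from the episode's start: 3 + 0.9*2 + 0.81*(-5) + 0.729*6 = 5.124
G = sum(gamma**k * r for k, r in enumerate(rewards_earned))

# Constant-alpha Monte Carlo update toward the observed return
V_estimate[0] += alpha * (G - V_estimate[0])
print(round(V_estimate[0], 4))  # 9.8724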
Theorems
Monte Carlo Prediction Update Rule
Suitable Grade Level
Undergraduate
Related Recommendations
Calculate State Value in Markov Decision Process Using Bellman Equation
Calculating Expected Value in a Simple MDP Using Bellman Equation
Calculate Expected Value in a Simple Markov Decision Process (MDP)
Understanding Value Iteration in Markov Decision Processes
Expected Reward in Markov Chains: Markov Process Proof and Formula