CPSC 502 -- Fall 2006

Non-Assignment 5

- Give the feature-based representation of the *MW* and *RHM* features.
- Give the STRIPS representations for the pickup mail and deliver mail actions.
- What are the errors on lecture 11.2, page 4?
- What are the errors on lecture 11.3, page 5?
- Suppose the robot cannot carry both coffee and mail at the same time. Give two different ways that the CSP that represents the planning problem can be changed to reflect this. Test it, by giving a problem where the answer is different when the robot has this limitation than when it doesn't.
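As a reminder of the general shape of a STRIPS action (not an answer to any question above), an action can be represented as a set of preconditions plus a set of effects. The `unlock_door` action and its features below are hypothetical illustrations, not part of the delivery-robot domain:

```python
# A minimal, generic STRIPS-style action record. The "unlock_door" action
# and its features are hypothetical illustrations, not the domain above.
unlock_door = {
    "preconditions": {"HasKey": True, "DoorLocked": True},
    "effects":       {"DoorLocked": False},
}

def applicable(action, state):
    """A STRIPS action is applicable when every precondition holds in the state."""
    return all(state.get(f) == v for f, v in action["preconditions"].items())

def apply_action(action, state):
    """Applying an action overwrites exactly the features named in its effects;
    all other features keep their values (the STRIPS assumption)."""
    new_state = dict(state)
    new_state.update(action["effects"])
    return new_state

s0 = {"HasKey": True, "DoorLocked": True}
s1 = apply_action(unlock_door, s0)  # DoorLocked becomes False; HasKey unchanged
```

The carrying restriction in the last question would show up as an extra precondition (or an extra constraint in the CSP encoding), not as a change to this general format.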

- Do exercise 12.2
- Do exercise 12.4
- Do exercise 12.5

- Suppose our Q-learning agent, with fixed learning rate *alpha* and discount *gamma*, was in state 34, did action 7, received reward 3, and ended up in state 65. What value(s) get updated? Give an expression for the new value. (You need to be as specific as possible.)
- In temporal difference learning (e.g., Q-learning), to get the average of a sequence of k values, we let *alpha = 1/k*. Explain why it may be advantageous to keep *alpha* fixed in the context of reinforcement learning.
- Suppose someone suggested using *alpha_k = 10.0/(9.0 + k)*. Explain why it is of this form (e.g., why *9 + k* on the bottom?). Would you expect it to work well? Explain.
- Explain what happens in reinforcement learning if the agent always chooses the action that maximizes the Q-value. Suggest two ways that can force the agent to explore.
- In MDPs and reinforcement learning, explain why we often use discounting of future rewards. How would an agent act differently if the discount factor were 0.6 as opposed to 0.9?
- What is the main difference between asynchronous value iteration and standard value iteration? Why does asynchronous value iteration often work better than standard value iteration?
- What is the relationship between asynchronous value iteration and Q-learning?
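For reference when thinking about the Q-learning questions above, the standard one-step update with a fixed learning rate has the following shape. The states, actions, and numbers below are arbitrary illustrations, not the answer to any question:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, alpha, gamma, actions):
    """One temporal-difference (Q-learning) step:
       Q[s,a] <- Q[s,a] + alpha * (r + gamma * max_a' Q[s',a'] - Q[s,a])."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)  # all Q-values start at 0
q_update(Q, s=0, a=0, r=1.0, s_next=1, alpha=0.5, gamma=0.9, actions=[0, 1])
# With all Q-values initially 0, Q[(0, 0)] becomes 0.5 * 1.0 = 0.5.
```

Note that only the single Q-value for the (state, action) pair that was actually experienced changes; this is the key observation for the "state 34, action 7" question.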

There are monsters at the squares marked with an *X*. Each monster, randomly and independently, at each time step, checks whether the robot is on its square. If the robot is on the square when the monster checks, the robot receives a reward of *-10* (i.e., it loses 10 points). At the centre point the monster checks at each time with probability *p_2 = 0.4*; at the other 4 squares marked with an

The robot has 8 actions corresponding to the 8 neighbouring squares. The diagonal moves are noisy; there is a *p_4 = 0.6* probability of going in the direction chosen, and an equal chance of going to each of the 4 neighbouring squares that are closest to the desired direction. The vertical and horizontal moves are also noisy; there is a probability

There is a discount factor of *p_6 = 0.9*.

Assume that the rewards are immediate on entering a state (i.e., if the robot enters a state where there is a monster, it gets the (negative) reward on entering the state, and if the robot enters the state where there is a treasure it gets the reward on entering the state, even if the treasure arrives at the same time).
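The noisy diagonal moves described above can be sketched as a distribution over outcome directions. The success probability *p_4 = 0.6* is taken from the description; the choice of which four directions count as "closest" to an up-right move is an illustrative assumption:

```python
def diagonal_outcomes(p_success=0.6):
    """Outcome distribution for one noisy diagonal move, e.g. "up-right":
    the chosen direction with probability p_success, and each of the four
    closest neighbouring directions with an equal share of the remainder.
    (The neighbour set here is an illustrative assumption.)"""
    neighbours = ["up", "right", "up-left", "down-right"]
    q = (1.0 - p_success) / len(neighbours)
    dist = {"up-right": p_success}
    dist.update({d: q for d in neighbours})
    return dist

dist = diagonal_outcomes()
assert abs(sum(dist.values()) - 1.0) < 1e-9  # a valid probability distribution
```

Each expected-value term in the value-iteration and Q-learning questions below sums over an outcome distribution of exactly this form.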

Suppose we are using the inefficient state space representation with 125 states.

Suppose we are running value iteration, and have the following values for each state. [The numbers in the squares represent the values of those states; empty squares have a zero value. It is irrelevant to this question how these values got there.]

where the treasure is at the circle. There are also states for the treasures at the other four corners.

Consider the next step of value iteration. For the state *s_13*, which is marked by * in the above figure, and the action *a_2*, which is "up", what value is assigned to *Q[s_13, a_2]* on the next iteration of value iteration? You need to show all working, but don't need to do any arithmetic (i.e., leave it as an expression). Explain each term in your expression.
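For reference, one standard textbook form of the value-iteration update for a Q-value, with rewards received on entering a state (as assumed above), is:

```latex
Q[s,a] \leftarrow \sum_{s'} P(s' \mid s, a)\,\bigl( R(s') + \gamma \max_{a'} Q[s',a'] \bigr)
```

The specific successor states *s'*, their probabilities, and their current values must be read off the figure and the noisy-action description; this sketch only fixes the shape of the expression.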

- Suppose a monster did not appear at any time during any of these experiences. What Q-values are updated during Q-learning based on this experience? Explain what values they get assigned. You should assume that *alpha_k = 1/k*.
- Suppose that, at some later time, the robot revisits the same states: *s1* to *s2* to *s3* to *s4* to *s5*, and hasn't visited any of these states in between (i.e., this is the second time visiting any of these states). Suppose this time the monster appears, so that the robot gets a penalty. What Q-values have their values changed? What are their new values?
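The visit-count learning rate *alpha_k = 1/k* assumed in the questions above can be sketched as follows. The states, rewards, and discount below are arbitrary illustrations, not the experience from the question:

```python
from collections import defaultdict

Q = defaultdict(float)     # Q-values, initially 0
visits = defaultdict(int)  # k: how many times each (s, a) has been updated

def q_update_counted(s, a, r, s_next, gamma, actions):
    """Q-learning step with alpha_k = 1/k, where k counts updates of (s, a).
    With this schedule, Q[s,a] is the running average of the targets seen."""
    visits[(s, a)] += 1
    alpha = 1.0 / visits[(s, a)]
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# First visit (k=1): alpha = 1, so Q[(0, 0)] jumps to the full target (here 0).
q_update_counted(0, 0, r=0.0, s_next=1, gamma=0.9, actions=[0])
# Second visit (k=2): alpha = 1/2, so Q averages the old value and the new
# target: 0 + 0.5 * (-10 - 0) = -5.
q_update_counted(0, 0, r=-10.0, s_next=1, gamma=0.9, actions=[0])
```

This illustrates why a second visit with a penalty moves the Q-value only halfway towards the new target, which is the pattern the last question is probing.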

David Poole