Experiment 1: Exploring the value of prior expectations and relative comparisons
Environment design. For our first set of experiments, we use the simulated physical space shown by the gridworld environment in Fig 1a. The gridworld is a popular testbed for RL studies [22, 31, 55, 56] and offers a straightforward way to study and model sequential decision-making. A single agent resides in the gridworld and can choose between five actions: Up, Down, Right, Left, and Stay. Upon taking an action, with 90% probability the agent moves one step in the direction of the intended action, and with 10% probability it randomly adopts one of the four other actions. The thick dark grey lines in the figure represent obstacles (walls) that the agent cannot cross (regardless of the movement action), so it has to navigate through gaps in the walls to move between adjacent subspaces. At the beginning of training, one location (randomly picked from the four states in the top-right area of the gridworld; possible locations shown in the green box in the figure) contains the food, and the agent receives an objective reward of +1 whenever it is in the food state. For generality, we also include poison and sinkhole states, which the agent should learn to avoid, in random locations in the environment (2 each in environments of size 7 × 7 and 4 each in environments of size 13 × 13). The agent receives an objective reward of −1 at the poison states; at sinkhole states it receives an objective reward of 0, but these states are very hard to get out of (the agent stays in the state with 95% probability regardless of the chosen action). While our motivation for including these states was to study avoidance behavior, our qualitative results do not depend on the number of poison and sinkhole states in the environment; the S1 Text replicates our results in environments that contain no poison or sinkhole states. The agent receives an objective reward of 0 at all other states. The agent’s starting location is one of the four states in the bottom-left area of the gridworld, opposite the food quadrant (possible locations shown in the yellow box in the figure).
Fig 1. Environment design. (a) The two-dimensional gridworld environment used in Experiment 1. (b) To study the properties of the optimal reward, we made several modifications to the gridworld environment. Top row: In the one-time learning environment, the agent could choose to stay in the food location constantly after reaching it. In the lifetime learning environment, the agent was teleported to a random location in the gridworld as soon as it reached the food state. Middle row: In the stationary environment, the food remained in the same location throughout the agent’s lifetime. In the non-stationary environment, the food changed its location during the agent’s lifetime. Bottom row: We used a gridworld of size 7 × 7 to simulate a dense reward setting. To simulate a sparse reward setting, we increased the size of the gridworld to 13 × 13. https://doi.org/10.1371/journal.pcbi.1010316.g001

A significant advantage of the gridworld environment is that we can easily modulate it to make the task harder or easier for the agent to solve (see also Fig 1b). The simplest environment we use is a stationary gridworld of size 7 × 7, where the optimal policy takes 12 steps on average to reach the food state from the start state. In this environment, once the agent reaches the food state, it can choose to stay in the food location constantly and keep accumulating objective rewards (Exp 1a, first simulation); thus, this environment essentially requires only one-time learning. We then modify the environment so that the agent teleports to a random location in the gridworld as soon as it reaches the food state. This environment requires lifetime learning, as the agent has to learn how to reach the food state from any random state in the environment (Exp 1a, second simulation). After this, we increase the difficulty significantly via two important modifications. In Exp 1b, we make the environment non-stationary: the food changes its location during the agent’s lifetime, so the agent has to re-learn the optimal policy whenever the environment changes. In Exp 1c, we increase the size of the environment from 7 × 7 to 13 × 13, which doubles the number of steps the optimal policy takes to reach the food state and simulates a sparser reward setting. Both flexible behavioral change and learning via delayed rewards are important aspects of intelligent behavior, and these settings often pose significant challenges to standard model-free RL [22, 44, 48, 57, 58]. By simulating these environments, we can study whether prior expectations and comparisons might help overcome these challenges.
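To make the environment dynamics concrete, the sketch below implements a minimal version of this gridworld. It is an illustrative reconstruction from the description above, not the authors’ code: the class name GridWorld, the placeholder wall, poison, and sinkhole coordinates, and the reward-on-arrival convention are our own assumptions.

```python
import numpy as np

# Minimal sketch of the gridworld in Fig 1a (illustrative; wall/poison/sinkhole
# coordinates are placeholders, not the layout used in the paper).
ACTIONS = {"up": (-1, 0), "down": (1, 0), "right": (0, 1), "left": (0, -1), "stay": (0, 0)}

class GridWorld:
    def __init__(self, size=7, food=(0, 6), poison=((3, 1), (4, 5)), sinkholes=((5, 3), (1, 2)),
                 walls=frozenset(), slip=0.10, sticky=0.95, teleport_on_food=False, seed=0):
        self.size, self.food = size, food
        self.poison, self.sinkholes, self.walls = set(poison), set(sinkholes), walls
        self.slip, self.sticky, self.teleport = slip, sticky, teleport_on_food
        self.rng = np.random.default_rng(seed)
        self.state = (size - 1, 0)  # start in the bottom-left area, opposite the food quadrant

    def step(self, action):
        # With probability `slip`, one of the four other actions is executed instead.
        if self.rng.random() < self.slip:
            action = str(self.rng.choice([a for a in ACTIONS if a != action]))
        # Sinkhole states are hard to leave: the agent stays put with probability `sticky`.
        if self.state in self.sinkholes and self.rng.random() < self.sticky:
            return self.state, 0.0
        dr, dc = ACTIONS[action]
        nxt = (self.state[0] + dr, self.state[1] + dc)
        # Moves off the grid or through a wall leave the agent where it is.
        if not (0 <= nxt[0] < self.size and 0 <= nxt[1] < self.size) or (self.state, nxt) in self.walls:
            nxt = self.state
        # Objective reward: +1 at the food state, -1 at poison states, 0 elsewhere.
        reward = 1.0 if nxt == self.food else (-1.0 if nxt in self.poison else 0.0)
        # Lifetime-learning variant: teleport to a random location after reaching the food.
        if self.teleport and nxt == self.food:
            self.state = (int(self.rng.integers(self.size)), int(self.rng.integers(self.size)))
        else:
            self.state = nxt
        return nxt, reward
```

Under these assumptions, the Exp 1a variants correspond to teleport_on_food=False (one-time learning) versus teleport_on_food=True (lifetime learning).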
Exp 1a: Comparison provides an exploration incentive and improves learning significantly. We begin by simulating a simple 7 × 7 stationary environment that requires one-time learning. Ideally, the agent should find the food state as quickly as possible and then stay in that state for the rest of its lifetime. To evaluate how well the different reward functions fulfill the designer’s objective, we compare the average cumulative objective reward at the end of the respective agents’ lifetimes. The reward functions we consider can be classified into seven categories (refer to Table 1). For all analyses that follow, we report the best performing agent (along with the corresponding parameter values) within each of these reward function categories.
Table 1. Categories of reward functions. The reward functions we consider can be classified into seven categories. First is ‘Objective only’, where the reward function depends only on the first component, w1. Similarly, ‘Expect only’ is the function that depends only on the second component, w2, and ‘Compare only’ is the function that depends solely on the third component, w3. Then, we have the functions that combine two components: ‘Objective+Expect’, ‘Objective+Compare’, and ‘Expect+Compare’. Finally, we have the reward function, ‘All’, that depends on all three components. https://doi.org/10.1371/journal.pcbi.1010316.t001

Fig 2a plots the mean cumulative objective reward of the best agents from each reward function category (α = 0.9, ϵ = 0.01 for all agents). We find that the ‘Compare only’ agent (w3 = 0.4, ρ = 0.9) obtains the highest cumulative objective reward (M = 2097.02, SD = 219.35), more than the standard reward-based agent, ‘Objective only’ (M = 1321.89, SD = 644.61; w1 = 0.3), as well as the expectation-based agent, ‘Expect only’ (M = 1447.47, SD = 588.75; w2 = 0.8). Further, the ‘Compare only’ agent outperforms the ‘Objective+Expect’ agent (M = 1322.42, SD = 733.86; w1 = 1.0, w2 = 1.0), and performs equivalently to the ‘Objective+Compare’ agent (M = 2095.44, SD = 219.97; w1 = 0.4, w3 = 0.4, ρ = 0.9), the ‘Expect+Compare’ agent (M = 2067.29, SD = 242.58; w2 = 0.1, w3 = 0.8, ρ = 0.9), and the ‘All’ agent (M = 2066.72, SD = 242.62; w1 = 0.7, w2 = 0.4, w3 = 0.8, ρ = 0.9).
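As a rough illustration of how these categories can be generated from a single composite reward, the sketch below combines the three components as a weighted sum: the objective reward itself, a prior-expectation term (objective reward minus a learned expectation), and a comparison term (objective reward minus a fixed aspiration level ρ). The exact functional forms and expectation update are specified in the Methods; the weighted-sum form, the helper name subjective_reward, and the exponential-average update below are our own simplifying assumptions.

```python
from collections import defaultdict

def subjective_reward(r_obj, state, expectation, w1=0.0, w2=0.0, w3=0.0, rho=0.0, beta=0.1):
    """Composite subjective reward (illustrative weighted-sum form).

    'Objective only' -> w1 > 0, w2 = w3 = 0; 'Expect only' -> only w2 > 0;
    'Compare only'   -> only w3 > 0 (rho is the aspiration level);
    two-component categories and 'All' set several weights at once.
    """
    expect_term = r_obj - expectation[state]   # surprise relative to the learned expectation
    compare_term = r_obj - rho                 # comparison against the aspiration level
    r_sub = w1 * r_obj + w2 * expect_term + w3 * compare_term
    # Assumed update of the per-state expectation (a simple exponential average).
    expectation[state] += beta * (r_obj - expectation[state])
    return r_sub

# A 'Compare only' agent with aspiration level rho = 0.95 (as in the 4-state example below):
expectation = defaultdict(float)
print(subjective_reward(0.0, "s1", expectation, w3=1.0, rho=0.95))  # non-food state: 0 - 0.95 = -0.95
print(subjective_reward(1.0, "s4", expectation, w3=1.0, rho=0.95))  # food state: 1 - 0.95 = +0.05
```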
Fig 2. Comparison improves learning in simple dense, stationary environments. (a) Mean cumulative objective reward attained by the different agents in a distribution of 7 × 7 stationary environments requiring one-time learning (lifetime = 2500 steps). Here, relative comparison significantly improves performance and the ‘Compare only’ agent obtains the highest cumulative objective reward. (b) Average visit counts of the ‘Objective only’, ‘Expect only’, and ‘Compare only’ agents (darker color represents higher visit counts and vice versa). Compared to the ‘Objective only’ and the ‘Expect only’ agents, the ‘Compare only’ agent spends very little time visiting the non-food states in the world. (c) Simulation of the agents’ behavior in a simple 4-state environment. The ‘Compare only’ agent assigns a negative value to any non-food state it visits (due to its aspiration level), which encourages it to visit novel states in the environment. This allows the agent to find the food location very quickly. The ‘Objective only’ and ‘Expect only’ agents primarily rely on random exploration and find the food location more slowly (α = 0.1 for all agents). (d) Mean cumulative objective reward attained by the different agents in a distribution of 7 × 7 stationary environments requiring lifetime learning (lifetime = 12500 steps). The ‘Compare only’ agent again obtains the highest cumulative objective reward. (e) Time course of the cumulative objective reward attained by the different agents in the 7 × 7 environment requiring lifetime learning (left) and the time course of the cumulative subjective reward experienced by the different agents (right). (f) Left: Mean cumulative objective reward attained by the ‘Compare only’ agent as a function of its aspiration level in the lifetime learning environment. The performance of the agent drops if the aspiration level is set too high or too low (the optimal aspiration level is marked in yellow). Right: Mean cumulative subjective reward of the agent as a function of its aspiration level (the optimal aspiration level is marked in yellow). https://doi.org/10.1371/journal.pcbi.1010316.g002

To provide some intuition about why the ‘Compare only’ agent performs so well, we simulate the agents’ behavior in a simple 4-state environment. This environment contains no obstacles, poison, or sinkhole states, and the agent always moves with 100% probability in the direction of the intended action. The agent has three movement actions: Right, Left, and Stay; its starting location is state s1 and the food is located in state s4. As in the main gridworld experiment, all non-food states provide an objective reward of 0 and the food state provides an objective reward of +1. The agent can again stay at the food location constantly, so the environment requires only one-time learning. Fig 2(c) shows a visual representation of one instance of the agent-environment interaction history along with the subsequently learnt state values for the ‘Objective only’, ‘Expect only’, and ‘Compare only’ agents. We see that near the beginning of its lifetime (step = 5), the ‘Compare only’ agent assigns a negative value to s1. This is because this state gives an objective reward of 0 but the agent’s aspiration level (= 0.95) is higher than this value; thus, the agent receives a subjective reward of −0.95 at this state. This encourages the agent to move to s2, as the current value of s2 is 0 and greater than the value of s1.
After reaching s2, the agent again assigns a negative value to s2, then moves to s3 and assigns it a negative value as well, and then moves to s4 where the food is located. Upon reaching the food state, the agent receives a positive subjective reward of 0.05 (since r4 = 1 and ρ = 0.95), which encourages it to stay in the food location for the remainder of its lifetime. In contrast, in the absence of exploration bonuses such as novelty bonuses or optimistic initialization, the ‘Objective only’ and ‘Expect only’ agents assign a value of 0 to both s1 and s2 at the beginning of their lifetimes (since they both receive a subjective reward of 0 at these states). Consequently, at s2, these agents are equally likely to move to s1 as to s3 (since the value of all these states is 0). As a result, they take longer to eventually reach the food location. In sum, relative comparison helps learning because it provides an exploration incentive—the agent quickly comes to assign a negative value to any non-food state it visits (due to its aspiration level), which indirectly encourages the agent to visit novel states.

We suggest that it is this difference in the way these agents learn and explore that helps the ‘Compare only’ agent obtain higher cumulative objective reward in the larger and more complex gridworld environment. To illustrate this, Fig 2b plots the average number of times the agents visit the different states of the gridworld during their lifetimes. We observe that in contrast to the ‘Objective only’ and the ‘Expect only’ agents, the ‘Compare only’ agent spends less time visiting the non-food states in the world. The ‘Compare only’ agent avoids the non-food states it has already visited (as the value of those states is negative) and instead prefers to visit states it has not yet visited (as the value of those states is zero), allowing it to find the food location very quickly. On the other hand, until the ‘Objective only’ and ‘Expect only’ agents first visit the food state, they assign a value of zero to all non-food states regardless of their visit counts (except the poison states, which give a negative objective reward of −1). Thus, these agents are equally likely to move to a previously visited state as to a novel state, which results in them finding the food location more slowly (as they rely primarily on random exploration).

To complete our analysis of the dense, stationary environment, our next simulation considers a 7 × 7 environment that requires lifetime learning, where the agent is teleported to a random location in the world as soon as it reaches the food state (α = 0.5, ϵ = 0.1 for all agents). We find that in this setting (Fig 2d), the ‘Compare only’ agent (w3 = 0.5, ρ = 0.05) again accumulates the highest cumulative objective reward (M = 771.84, SD = 180.48), greater than the ‘Objective only’ (M = 383.38, SD = 184.67; w1 = 0.7) and the ‘Expect only’ agents (M = 384.68, SD = 215.28; w2 = 0.1). Further, the ‘Compare only’ agent performs equivalently to the ‘Objective+Compare’ agent (M = 763.51, SD = 169.32; w1 = 0.1, w3 = 0.8, ρ = 0.05), the ‘Expect+Compare’ agent (M = 756.56, SD = 179.07; w2 = 0.1, w3 = 0.8, ρ = 0.01), and the ‘All’ agent (M = 752.45, SD = 205.88; w1 = 0.7, w2 = 0.1, w3 = 0.5, ρ = 0.01). Fig 2e (left plot) further demonstrates the learning difference between the ‘Objective only’, ‘Expect only’, and ‘Compare only’ agents.
The ‘Compare only’ agent learns faster and attains higher cumulative objective reward than the other agents. Fig 2e (right plot) shows the difference in the subjective rewards of the three agents throughout their lifetimes. The subjective reward of the ‘Objective only’ agent is, naturally, proportional to the objective reward it receives in its lifetime; in some sense, the ‘Objective only’ agent experiences happiness in proportion to the objective reward it receives. The ‘Expect only’ agent, apart from small boosts in happiness (which occur on the first few food-state visits), maintains a steady state of happiness throughout its lifetime, akin to the hedonic treadmill. The ‘Compare only’ agent experiences negative subjective reward in the early stages of training, i.e., it can be thought of as being more unhappy in the beginning (because of the initial visits to the non-food states). However, this provides an exploration incentive, and the agent comes to visit the food state more regularly, which eventually leads to higher subjective reward, i.e., in some sense, its happiness rises after it learns a good policy. Taken together, these simulations suggest that given a distribution of dense, stationary environments (requiring either one-time or lifetime learning), a reward function based on comparison to a well-chosen aspiration level optimizes the course of learning.
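The exploration effect described above can be reproduced with a few lines of tabular learning on the 4-state chain. The sketch below is an illustrative reconstruction rather than the authors’ agent: it assumes the agent receives the compare-only subjective reward at its current state, performs a simple TD-style value update for that state, and then moves greedily to the highest-valued reachable state, with ties broken at random.

```python
import numpy as np

# Illustrative sketch of the 4-state chain in Fig 2c (states s1..s4, food at s4).
# Not the authors' exact agent: receive the subjective reward at the current state,
# update that state's value, then move greedily to the best reachable state.
def run_chain(rho, steps=20, alpha=0.1, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    V = np.zeros(4)          # value estimates for s1..s4
    s, first_food = 0, None  # start at s1
    for t in range(steps):
        r_obj = 1.0 if s == 3 else 0.0        # +1 at the food state, 0 elsewhere
        r_sub = r_obj - rho                   # 'Compare only' subjective reward
        reachable = sorted({max(s - 1, 0), s, min(s + 1, 3)})
        V[s] += alpha * (r_sub + gamma * max(V[n] for n in reachable) - V[s])
        best = max(V[n] for n in reachable)
        s = int(rng.choice([n for n in reachable if V[n] == best]))  # greedy, random tie-break
        if s == 3 and first_food is None:
            first_food = t
    return V, first_food

print(run_chain(rho=0.95))  # compare-only: non-food values turn negative, food found quickly
print(run_chain(rho=0.0))   # objective-only analogue: values stay 0 until a random walk finds the food
```

Under these assumptions, run_chain(rho=0.95) reproduces the described behaviour: visited non-food states acquire negative values, so the greedy agent is pushed toward unvisited states and reaches the food within a handful of steps, whereas the rho = 0 analogue of an ‘Objective only’ agent drifts at random.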
Optimal aspiration level and the trade-off between objective and subjective rewards. While the above results show that comparison serves as a useful learning signal, it is important to note that the aspiration level of comparison-based agents needs to be set appropriately in order for the agents to act optimally. Fig 2f plots the average cumulative objective reward obtained by the ‘Compare only’ agent as a function of the aspiration level in the dense, stationary, lifetime learning environment and shows that the performance of the agent is lowered if the aspiration level is set too high or too low. If the aspiration level is set too high, then the agent assigns large negative values to all the states it visits (since the subjective reward received at those states is very negative). This can cause the agent to become pessimistic in its exploration strategy and learn a sub-optimal policy. On the other hand, if the agent’s aspiration level is too low, then it learns more slowly, as it is not encouraged to explore novel states. We can also study the relationship between the aspiration level and the experienced subjective rewards of an agent. Agents with very high aspiration levels accumulate large negative subjective rewards in their lifetimes (Fig 2f). Conversely, agents with very low aspiration levels end up accumulating large positive subjective rewards in their lifetimes. However, neither kind of agent is well calibrated to the statistics of the environment. For example, an agent with an aspiration level of −0.1 will be deluded: it will keep visiting states that provide an objective reward of 0 (since they give a subjective reward of 0.1) and will most likely never discover the food location. Thus, agents that experience too many positive subjective rewards or too many negative subjective rewards do not obtain high objective rewards. In some sense, this perhaps suggests that being either too happy or too unhappy results in poor performance, and agents that obtain the highest cumulative objective reward tend to experience a moderate amount of unhappiness in their lifetimes. While our definition of happiness is obviously very simple, this analysis demonstrates the trade-off an agent designer faces between maximizing the subjective reward accumulated by an agent and the cumulative objective reward accrued by that agent.
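The ‘deluded’ low-aspiration case can be checked directly with the run_chain sketch above (our illustrative reconstruction, not the authors’ simulation): with a negative aspiration level, the first visited non-food state immediately acquires a positive value, so the greedy agent never leaves it and never finds the food.

```python
# With rho = -0.1, the start state's value becomes positive (0 - (-0.1) = +0.1 per visit),
# so the greedy agent keeps re-visiting it; first_food remains None over the whole run.
values, first_food = run_chain(rho=-0.1, steps=100)
print(values, first_food)
```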
Exp 1b: Prior expectation and comparison help deal with non-stationarity. For our next simulation, we study the properties of the optimal reward function in a non-stationary environment. As before, at the beginning of the agent’s lifetime, the food is randomly located in one of the four states in the top-right area of the 7 × 7 gridworld. Once the agent reaches the food location, it can choose to stay there constantly (i.e., it is not teleported to a random location in the world). However, every 1250 steps the food changes its location and moves to one of the other corners of the gridworld, requiring the agent to continue exploring in order to find the new food location. Fig 3a plots the mean cumulative objective reward of the best performing agents from each reward category. Here, the ‘Objective only’ agent (w1 = 0.4, α = dynamic, ϵ = 0.1) performs very poorly, obtaining the lowest objective reward (M = 840.10, SD = 578.24). It is outperformed by both the ‘Expect only’ agent (M = 1483.05, SD = 766.75; w2 = 0.9, α = 0.9, ϵ = 0.1) and the ‘Compare only’ agent (M = 2669.08, SD = 539.81; w3 = 0.2, ρ = 0.9, α = 0.1, ϵ = 0.1). While the ‘Compare only’ agent outperforms the ‘Expect only’ agent, we find that the ‘Expect+Compare’ agent (M = 2846.06, SD = 587.77; w2 = 0.1, w3 = 0.6, ρ = 0.9, α = dynamic, ϵ = 0.1) performs better than the ‘Compare only’ agent, suggesting that both prior expectation and comparison help in dealing with non-stationarity.
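If one were using the GridWorld sketch from the environment-design section above, the non-stationarity of Exp 1b can be expressed as a small relocation rule; the helper name and corner coordinates below are illustrative assumptions, not the paper’s code.

```python
import random

def maybe_move_food(env, step, period=1250, corners=((0, 0), (0, 6), (6, 0), (6, 6))):
    """Every `period` steps, move the food to one of the other corner regions
    of the gridworld (illustrative helper for the GridWorld sketch above)."""
    if step > 0 and step % period == 0:
        env.food = random.choice([c for c in corners if c != env.food])
```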
Fig 3. Prior expectation and comparison make an agent robust to changes in the environment. (a) Mean cumulative objective reward attained by the agents in a distribution of 7 × 7 non-stationary environments (lifetime = 5000 steps). Both prior expectation and relative comparisons are helpful in dealing with non-stationarity. (b) Agents’ behavior in a simple 4-state non-stationary environment. By step = 50, the ‘Objective only’ agent assigns a considerably higher value to the food state compared to the ‘Compare only’ and ‘Expect+Compare’ agents. At step = 51, when the food changes its location, the ‘Objective only’ agent receives a subjective reward of 0 at state s4 and takes a long time to lower the value of this state. Even by step = 100, it is not able to discover the new food location. In contrast, after the food changes location, the ‘Compare only’ and ‘Expect+Compare’ agents receive large negative subjective rewards at state s4, which reduces their value estimates of s4 very quickly. This encourages them to visit other states and enables them to discover the new food location very quickly. (c) Graph showing how the value of the food state changes as a function of the visit counts for the different agents. While the state value converges for all three agents, the ‘Objective only’ agent ends up assigning a very high value to the food state because it receives a subjective reward of 1 at each visit. The ‘Compare only’ and ‘Expect+Compare’ agents receive lower subjective rewards and hence the converged state value is considerably lower for these agents. (d) Average reward rate of the various agents during their lifetimes on the 7 × 7 gridworld environment (the food changes its location after every 1250 steps). (e) Left: Mean cumulative objective reward attained by the ‘Expect only’ agent as a function of w2 in the non-stationary environment. The performance of the agent drops if the weight is too high or too low (the optimal w2 value is marked in yellow). Right: Mean cumulative subjective reward of the ‘Expect only’ agent as a function of w2. https://doi.org/10.1371/journal.pcbi.1010316.g003

To gain an intuition about why these factors help in non-stationary environments, we again look at the agents’ behavior in the previously described simple 4-state environment. As before, the agent’s starting location is s1 and the food is located at s4. To simulate non-stationarity, at step = 50 the food changes its location from s4 to s1 and stays there until the end of the agent’s lifetime (lifetime = 100). Fig 3b shows a visual representation of one instance of the agent-environment interaction history (starting from step = 49) for the ‘Objective only’, ‘Compare only’, and ‘Expect+Compare’ agents. We first see that the agents differ in how they update the value of a rewarding state as a function of the number of times they visit that state. For example, at step = 50, the ‘Objective only’ agent’s value estimate for s4 is considerably higher than the ‘Compare only’ and ‘Expect+Compare’ agents’ estimates. This is because the ‘Objective only’ agent receives a subjective reward of +1 each time it visits the food state, whereas the ‘Compare only’ agent only receives a subjective reward of 0.05 (ρ = 0.95). The ‘Expect+Compare’ agent receives a subjective reward of 1.05 at the first visit but then receives a subjective reward close to 0 on subsequent visits.
More generally, the converged state value estimate for the food state differs substantially across the three agents. As shown in Fig 3c, given γ = 0.99, the state value estimate of the food state converges to 100 for the ‘Objective only’ agent, whereas it converges to 5 for the ‘Compare only’ agent and to 1.04 for the ‘Expect+Compare’ agent (see also the S1 Text for a derivation of the convergence and the respective upper and lower bounds for the different agents). At step = 50, the food location changes from s4 to s1, but all three agents stay at s4 because the estimated value of s4 is higher than the estimated value of s3. The ‘Objective only’ agent then receives a subjective reward of 0, which reduces its estimated value of s4. However, because the previous estimated value of s4 is so high, it takes a long time for the value of s4 to drop below that of s3, and the agent therefore remains at s4 for a long period. The ‘Compare only’ agent receives a large negative subjective reward upon visiting s4 (= −0.95), and because its value estimate of s4 is not very high, the estimate drops very quickly, prompting the agent to explore new locations and eventually find the new food location at s1. The ‘Expect+Compare’ agent receives an even larger negative subjective reward (the Expect component contributes −1.04 and the Compare component −0.95), so its value estimate of s4 drops even more quickly and it finds the new food location faster than the other agents.

The simulation results for the 4-state environment are consistent with the behavior of the agents in the 7 × 7 gridworld experiment. Fig 3d plots the average reward rate of the ‘Objective only’, ‘Expect only’, ‘Compare only’, and ‘Expect+Compare’ agents during their lifetimes on the gridworld environment. Here, the food changes its location every 1250 steps, and the ‘Objective only’ agent is not able to learn and discover the new location of the food. The ‘Expect only’ agent is better able to deal with the change in the environment, as it is eventually able to find the new location of the food. The ‘Compare only’ agent also handles the change very well and comfortably outperforms the ‘Expect only’ agent, primarily because it is more efficient in its exploration (see also the previous section). Finally, the ‘Expect+Compare’ agent improves on the ‘Compare only’ agent, as it is able to find the new food location faster. These results suggest that in non-stationary environments, both prior expectations and relative comparisons are valuable components, as they help an agent quickly ‘move on’ from states that used to be rewarding in the past.
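The intuition behind these limits is a geometric series (a sketch of the argument only; the full derivation is in the S1 Text): if the agent keeps choosing Stay at the food state and receives a constant per-step subjective reward r_sub there, its value estimate converges to the discounted sum of that constant stream,

\[
V(s_{\mathrm{food}}) \;=\; \sum_{t=0}^{\infty} \gamma^{t}\, r_{\mathrm{sub}} \;=\; \frac{r_{\mathrm{sub}}}{1-\gamma}.
\]

With γ = 0.99, this gives 1/0.01 = 100 for the ‘Objective only’ agent (r_sub = 1) and 0.05/0.01 = 5 for the ‘Compare only’ agent (r_sub = 1 − 0.95 = 0.05). The smaller ‘Expect+Compare’ limit additionally depends on the steady-state value of the learned expectation term and is derived in the S1 Text.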
Optimal expectations. Similar to the relationship between the aspiration level and the subjective reward experienced by comparison-based agents, the experienced subjective reward and performance of the ‘Expect only’ agent also depend considerably on the value of w2 (especially in the non-stationary environment). Fig 3e (left) plots the mean cumulative objective reward obtained by the ‘Expect only’ agent in the non-stationary environment as a function of w2, showing that performance is optimized at an intermediate point where the prior expectation is neither too high nor too low. By contrast with this non-monotonicity, Fig 3e (right) shows that agents with very low expectations (low w2) obtain high subjective reward in their lifetimes despite not attaining high cumulative objective reward, whereas agents with high expectations (high w2) accumulate large negative subjective reward, again without attaining high objective reward. In some sense, this suggests that agents with low expectations are very happy in their lifetimes without performing well, and agents with very high expectations are very unhappy while also not performing well. Agents with moderate expectations, on the other hand, tend to obtain the highest cumulative objective reward while experiencing some amount of unhappiness in their lifetimes. This further demonstrates the trade-off an agent designer faces between maximizing the cumulative objective reward and the happiness experienced by an agent. Note that the agent designer does not face this trade-off for the ‘Objective only’ agent, whose happiness is directly proportional to its objective reward: the ‘Objective only’ agent that accrues the highest cumulative objective reward is also the happiest such agent.
Exp 1c: Reward sparsity requires controlling comparisons. We now study the properties of the optimal reward function in a sparser reward environment, using a 13 × 13 gridworld that requires lifetime learning, where the agent is teleported to a random location in the environment as soon as it reaches the food state. As shown in Fig 4a, the ‘Objective only’ agent (w1 = 0.9), the ‘Expect only’ agent (w2 = 0.2), and the ‘Objective+Expect’ agent (w1 = 0.1, w2 = 0.6) perform very poorly and attain low cumulative objective reward in their lifetimes (α = 0.5, ϵ = 0.1 for all three agents). This is not surprising, as reinforcement learning (and exploration more generally) in sparsely rewarded environments is known to be a challenging problem [22, 44, 48, 57, 58]. The ‘Compare only’ agent (w3 = 0.1, ρ = 0.001, α = 0.5, ϵ = 0.1) performs relatively well (M = 102.86, SD = 39.91) and obtains higher cumulative objective reward than the ‘Objective only’, ‘Expect only’, and ‘Objective+Expect’ agents. Adding the Expect component to the ‘Compare only’ agent is very helpful: both the ‘Expect+Compare’ agent (M = 130.14, SD = 43.49; w2 = 0.6, w3 = 0.1, ρ = 0.01, α = 0.7, ϵ = 0.1) and the ‘All’ agent (M = 133.58, SD = 48.21; w1 = 0.9, w2 = 0.9, w3 = 0.3, ρ = 0.01, α = 0.7, ϵ = 0.1) obtain the highest cumulative objective reward and perform better than the ‘Compare only’ agent.
Fig 4. Relative comparisons can lead to undesirable behavior in sparsely rewarded environments. (a) Mean cumulative objective reward attained by the agents in a distribution of 13 × 13 stationary environments requiring lifetime learning (lifetime = 12500 steps). While the ‘Compare only’ agent performs relatively well, it is significantly outperformed by the ‘Expect+Compare’ and the ‘All’ agents. (b) Visualization of the visit counts and the learnt policy of the ‘Compare only’ agent for states near the food state. The agent does not visit the food state as often as it visits some of the nearby non-rewarding states (highlighted in yellow). The agent’s learnt policy suggests that it has developed a form of aversion to the food state, as it takes a needlessly long route to reach the food state. (c) Graph showing how the value of the starting state (which provides an objective reward of 0) changes as a function of the visit counts for the ‘Compare only’ and ‘Expect+Compare’ agents. As the ‘Compare only’ agent keeps re-visiting the starting state, it keeps assigning a lower value to this state (due to its aspiration level). Due to its prior expectations, the ‘Expect+Compare’ agent prevents this value from becoming too negative. (d) Development and prevention of aversion in the simple 4-state environment (the agent is teleported to s1 after reaching the food state). Each interaction shows the agent’s current estimate of the best action to take at each state and its estimated Q-value of taking that action at that state. Here, the aspiration level of the agents is deliberately set very high. The ‘Compare only’ agent develops an aversion to the food state (at step = 60 and step = 80), whereas the ‘Expect+Compare’ agent does not exhibit any aversion behavior. (e) Visualization of the visit counts of the ‘Compare only’ and the ‘Expect+Compare’ agents (darker shade represents greater visit counts and vice versa). At the 6000th timestep, the visit counts of the two agents are comparable. At the 8000th timestep, the ‘Compare only’ agent develops aversions and visits states near the food state more often than it visits the food state itself. https://doi.org/10.1371/journal.pcbi.1010316.g004

To understand why the Compare component is not sufficient by itself to maximize cumulative objective reward, we construct a simplified version of the previous gridworld environment: we remove all poison and sinkhole states, and instead of being teleported to a random state in the world, the agent is always teleported back to the starting state whenever it reaches the food state. The ‘Compare only’ agent (M = 132.16, SD = 15.51) again performs worse than the ‘Expect+Compare’ agent (M = 145.15, SD = 11.13) as well as the ‘All’ agent (M = 152.65, SD = 14.06) in this setting. Fig 4b shows one instance of the visit counts and the learnt policy of the ‘Compare only’ agent at the end of its lifetime in this environment. The agent visits the states next to the food state quite often, but it does not visit the food state as much as it visits these nearby non-rewarding states. The learnt policy of the agent is even more surprising: the agent learns a policy that encourages it to visit the food state, but this policy is rather sub-optimal, as the agent takes an unnecessarily long route to reach the food state.
For example, if the agent is in the state immediately to the left of the food state, then following the learnt policy it would take 3 steps to reach the food state from that location (whereas the optimal policy would reach the food state in just 1 step, by taking the action ‘right’). One explanation for this behavior is that the ‘Compare only’ agent develops some form of ‘aversion’ to the food state. Whenever the agent visits the food state, it is teleported back to the starting state. The starting state locations give a negative subjective reward to the agent (since ρ = 0.001), and the agent quickly assigns a negative value to these states. Once the agent starts visiting the food location more frequently, it also inadvertently visits the starting states (due to teleportation) and keeps assigning an even lower value to these states (they provide a subjective reward of −0.001 at each visit). Thus, the agent eventually starts avoiding the food state in order to avoid going back and re-visiting the highly negatively valued starting states. The ‘Expect+Compare’ agent does not develop an aversion to the food state because the Expect component ensures that the value estimates of the starting state locations do not become too negative—whenever the Compare component reduces the value of the starting state, the Expect component induces a positive subjective reward (since it expects a negative reward at the state but instead receives an objective reward of 0). This is also shown in Fig 4c, which plots the value estimate of the starting state location as a function of the visit count for the two agents (see also the S1 Text for a derivation of the convergence and the difference in the upper and lower bounds of the starting-state values for the two agents).

We illustrate how aversion can develop, and how the Expect component helps the Compare component, using the 4-state environment. The agent’s starting location is s1, and it is always teleported back to this location whenever it visits the food location at s4. To simulate aversion, we deliberately set the aspiration level of the agents to 0.6, considerably higher than the optimal aspiration level (= 0.2). Fig 4d shows a visualization of one instance of the agents’ interaction history with the environment. For each interaction, we show the agent’s current policy, i.e., the estimated best action to take at each state (where ‘<’ corresponds to taking the action ‘left’, ‘>’ corresponds to taking the action ‘right’, and ‘+’ corresponds to taking the action ‘stay’), and the estimated value of taking that action at that state. In the beginning, at step = 20, the learnt policy of the ‘Compare only’ agent is optimal and it takes the correct action at each state. At step = 40, the agent still follows the optimal policy, but the value of taking the action ‘right’ at states s1 and s2 is highly negative (and much lower than it was at step = 20). At step = 60, the agent’s policy becomes sub-optimal: it estimates the best action to take at s1 to be ‘left’ and the best action to take at s2 to be ‘stay’. Eventually, at step = 100, the agent learns to take the correct actions at all the states and its policy becomes optimal again. The ‘Expect+Compare’ agent does not exhibit any aversion behavior, and its learnt policy remains optimal throughout its lifetime.
These simulations highlight an important drawback of the ‘Compare only’ agent: when comparisons are left unchecked, they can lead to too much pessimism, and the agent might end up learning a sub-optimal policy, especially in settings where rewards are sparse.
The importance of dynamic aspiration levels. Until now, we have considered ‘Compare only’ agents that have a fixed aspiration level throughout their lifetime. While a fixed aspiration level can help the agent learn in a variety of densely rewarded settings, the agent fails to perform as well in the sparse reward setting. In a sparse reward environment, the fixed aspiration level can be useful in the initial stages, as it helps the agent find the food location relatively quickly. However, after a certain amount of time, when the agent starts to visit the food state regularly, the same aspiration level becomes too high for optimal learning. Fig 4e plots one instance of the visit counts of the ‘Compare only’ and the ‘Expect+Compare’ agents at the 6000th and 8000th timesteps. At the 6000th timestep, the two agents are comparable and there is very little difference in the visit count plots. At the 8000th timestep, the ‘Compare only’ agent starts to show the aversion behavior and visits states near the food state more often than it visits the food state itself, suggesting that aversion develops primarily during the later stages of its lifetime. While one way to address this shortcoming is to include the Expect component, another is to use a dynamic aspiration level, i.e., an aspiration level that changes during the agent’s lifetime. We find that for the ‘Compare only’ agent, an effective strategy is to start with the optimal aspiration level (i.e., the fixed value of the aspiration level that resulted in the highest cumulative objective reward) and then lower the aspiration level by 10% midway through the agent’s lifetime. Starting with the optimal aspiration level allows efficient exploration of the environment at the beginning of the lifetime. The lowered aspiration level then ensures that the values of the non-food states do not become too negative and the agent does not develop any aversion to the food state. This strategy considerably improves the agent: the comparison-based agent with the dynamic aspiration level (M = 145.15, SD = 11.13) performs better than the ‘Compare only’ agent with the fixed aspiration level as well as the ‘Expect+Compare’ agent. An important point to note is that the above strategy for setting the dynamic aspiration level is specific to this particular environment—just like a reward function, the aspiration level needs to reflect the statistics of the environment. Here, the right strategy was to lower the aspiration level after a certain amount of time; other settings might require a different strategy altogether.
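As a concrete (and deliberately simple) reading of this schedule, the helper below keeps the aspiration level at its best fixed value for the first half of the lifetime and lowers it by 10% afterwards; the function name and the example value rho_init = 0.2 (taken from the 4-state illustration above) are our own, and other environments may call for a different schedule.

```python
def aspiration_schedule(step, lifetime, rho_init, drop_frac=0.10):
    """Dynamic aspiration level: hold the best fixed value for the first half of
    the lifetime, then lower it by `drop_frac` (10% here) for the second half."""
    return rho_init if step < lifetime // 2 else rho_init * (1.0 - drop_frac)

# Example: rho stays at 0.2 for the first half of a 12500-step lifetime
# and is 10% lower (0.18) from the midpoint onwards.
for t in (0, 6249, 6250, 12499):
    print(t, aspiration_schedule(t, lifetime=12500, rho_init=0.2))
```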