
Essay 31: Striatum as Timeout

Let’s return to the task of essay 16 on give-up time in foraging, which covered food search with a timeout. At first the animal uses a general roaming search and if it smells a food odor, it switches to a targeted seek following the odor with chemotaxis. If the animal finds food in the odor plume, it eats the food, but if it doesn’t find food, it will eventually give up and avoid the local area before returning to the roaming search.

Search state machine. Roam is the starting state, switching to seek when it detects odor, and switching to avoid after a timeout.
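To make the switching logic concrete, here is a minimal Python sketch of that state machine. The state names and transition inputs are illustrative only, not the essay's actual simulation code.

```python
from enum import Enum, auto

class Mode(Enum):
    ROAM = auto()   # general roaming search
    SEEK = auto()   # follow the odor toward its source
    AVOID = auto()  # leave the local area after giving up

def step(mode, odor_detected, found_food, timeout_expired, avoid_done):
    """One update of the roam/seek/avoid state machine."""
    if mode == Mode.ROAM and odor_detected:
        return Mode.SEEK
    if mode == Mode.SEEK:
        if found_food:
            return Mode.ROAM       # eat, then resume roaming
        if timeout_expired:
            return Mode.AVOID      # give up on this plume
    if mode == Mode.AVOID and avoid_done:
        return Mode.ROAM
    return mode
```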

For another attempt at the problem, let’s take the striatum (basal ganglia) as implementing the timeout portion of this task using the neurotransmitter adenosine as a timeout signal and incorporating the multiple action path discussion from essay 30 on RTPA. Adenosine is a byproduct of ATP breakdown and is a measure of cellular activity. With sufficiently high adenosine, the striatum switches from the active seek path to an avoidance path. These circuits are where caffeine works to suppress the adenosine timeout, allowing for longer concentration.

Mollusk navigation

As mentioned in essay 30, the mollusk sea slug has a food search circuit with a similar logic to what we need here. The animal seeks food odors when it’s hungry, but it avoids food odors when it’s not hungry [Gillette and Brown 2015].

Mollusk food search circuit, illustrating a hunger-modulated switchboard. When the animal is not hungry, the switchboard reverses the odor-to-motor links, turning the animal away from food.

This essay uses the same idea but replaces the hunger modulation with a timeout. When the timeout occurs, the circuit switches from a food seek action path to a food avoid action path.

Odor action paths

Two odor-following action paths exist in the lamprey, one using Hb.m (medial habenula) and one using V.pt (posterior tuberculum). The lamprey Ob.m (medial olfactory bulb) projects to both, and each projects onward to a different locomotor path [Derjean et al 2010]: Hb.m to R.ip (interpeduncular nucleus) and V.pt to MLR (midbrain locomotor region). The Hb.m path is a chemotaxis path following a temporal gradient. The zebrafish also has Ob projections to Hb and V.pt [Imamura et al 2020], [Kermen et al 2013].

Dual odor-seeking action paths in the lamprey and zebrafish. Hb (habenula), Ob.m (medial olfactory bulb), V.pt (posterior tuberculum).

Further complicating the paths, the Hb.m itself contains both an odor seeking path and an odor avoiding path [Beretta et al 2012], [Chen et al 2019]. Similarly, Hb.m has dual action paths for social winning and losing [Okamoto et al 2021]. So, this essay could use the dual paths within Hb.m instead of contrasting Hb.m with V.pt, but the larger contrast should make the simulation easier to follow.

This essay’s simulation makes some important simplifications. The Hb to R.ip path is a temporal gradient path used for chemotaxis, phototaxis, and thermotaxis. In a real-world marine environment, odor diffusion and water turbulence are much more complicated, producing clumpier plumes and making simple gradient ascent more difficult [Hengenius et al 2021]. Because this essay is only focused on the switchboard effect, this simplification should be fine.
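As a concrete illustration of the simplified temporal-gradient chemotaxis, here is a small Python sketch: go roughly straight while the odor is rising, turn when it falls. The gain and noise values are arbitrary placeholders rather than anything fitted to real behavior.

```python
import random

def chemotaxis_turn(odor_now, odor_prev, turn_gain=2.0, wobble=0.1):
    """Temporal-gradient chemotaxis: small wobble while the odor increases,
    a larger random turn while it decreases. Returns a turn angle in radians."""
    gradient = odor_now - odor_prev
    if gradient >= 0:
        return random.gauss(0.0, wobble)              # keep heading
    return random.gauss(0.0, wobble) + turn_gain * random.choice([-1.0, 1.0])
```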

Striatum action paths with adenosine timeout

The timeout circuit uses the striatum, which has two paths: one selecting the main action, and the second either stopping the action, or selecting an opposing action [Zhai et al 2023]. The two paths are distinguished by their responsiveness to dopamine with S.d1 (striatal projection with D1 G-s stimulating) or S.d2 (striatal projection with D2 G-i inhibiting) marking the active and alternate paths respectively. This model is a simplification of the mammalian striatum where the two paths interact in a more complicated fashion [Cui et al 2013].

Essay odor seek with timeout circuit. The seek path flows from Ob, through S.d1 to P.v, to V.pt. The avoid path flows from Ob, through S.d2 to P.v, to Hb. Ad (adenosine), Hb (habenula), Ob (olfactory bulb), P.v (ventral pallidum), S.d1 (striatum D1 projection neuron), S.d2 (striatum D2 projection neuron), V.pt (posterior tuberculum)

As mentioned, the two action paths are the seek path from Ob to V.pt and the avoid path from Ob to Hb. For the timeout and switchboard, the Ob has a secondary projection to the striatum. Although this circuit is meant as a proto-vertebrate simplification, Ob does project to S.ot (olfactory tubercle) and to the equivalent in zebrafish [Kermen et al 2013].

The timeout is managed by adenosine, which is a neurotransmitter derived from ATP and a measure of neural activity. The striatum has three sub-circuits for this kind of functionality, which I’ll cover in order of complexity.

S.d1 and adenosine inhibition

The first circuit only uses the direct S.d1 path with adenosine as the timeout mechanism. When the animal follows an odor, the Ob to S.d1 signal enables the seek action. As a timeout, ATP from neural activity degrades to adenosine, and the buildup of adenosine is a decent measure of activity over time. The longer the animal seeks, the more adenosine builds up. If the Ob projection axon contains an A1i (adenosine G-i inhibitory) receptor, the adenosine will inhibit the release of glutamate from Ob, which will eventually self-disable the seek action.

S.d1 action path inhibited by adenosine buildup as a timeout. A1i (adenosine G-i inhibitory receptor), Ad (adenosine), mGlu5q (metabotropic glutamate G-q receptor), Ob (olfactory bulb), S.d1 (D1-type striatal projection neuron)

In practice, the striatum uses astrocytes to manage the glutamate release. An astrocyte that envelops the synapse measures glutamate release with an mGlu5q (metabotropic glutamate with G-q/11 binding) receptor and accumulates internal calcium [Cavaccini et al 2020]. The astrocyte’s calcium triggers an adenosine release as a gliotransmitter, making the adenosine level a timeout measure of glutamate activity. The presynaptic A1i receptor then inhibits the Ob signal. The timeframe is on the order of 5 to 20 minutes with a recovery of about 60 minutes, although the precise timing is probably variable. Interestingly, the timeout is a log function of activity instead of a linear measure [Ma et al 2022].
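A minimal sketch of that timeout, assuming a simple leaky accumulator can stand in for the astrocyte calcium/adenosine chemistry. The constants are placeholders, not the measured 5-to-60-minute timescales, and the accumulation here is linear rather than the log relation reported above.

```python
class AdenosineTimeout:
    """Toy astrocyte-mediated timeout: glutamate activity accumulates as
    adenosine, and adenosine acting on presynaptic A1i gates off the
    Ob -> S.d1 signal once it crosses a threshold."""

    def __init__(self, accumulate=0.01, decay=0.001, a1i_threshold=1.0):
        self.adenosine = 0.0
        self.accumulate = accumulate      # buildup per step of seek activity
        self.decay = decay                # slow recovery toward baseline
        self.a1i_threshold = a1i_threshold

    def step(self, glutamate_activity):
        # astrocyte integrates glutamate (mGlu5q -> calcium -> adenosine release)
        self.adenosine += self.accumulate * glutamate_activity
        self.adenosine -= self.decay * self.adenosine
        # presynaptic A1i inhibition: gate the Ob signal off after the timeout
        gate = 0.0 if self.adenosine > self.a1i_threshold else 1.0
        return gate * glutamate_activity
```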

This circuit doesn’t depend on the postsynaptic S.d1 firing [Cavaccini et al 2020], which contrasts with the next LTD (long term depression) circuit which only inhibits the axon if the S.d1 projection neuron fires.

S.d1 presynaptic LTD using eCB

S.d1 self-activating LTD uses retrotransmission to inhibit its own input, with eCB (endocannabinoids) as the retrograde transmitter. Like the astrocyte in the previous circuit, S.d1 uses an mGlu5q receptor to trigger eCB release, but it also requires that S.d1 fire, as triggered by the NMDA glutamate receptor. The axon receives the eCB retrotransmission with a CB1i (cannabinoid G-i inhibitory) receptor and triggers presynaptic LTD [Shen et al 2008], [Wu et al 2015]. Like the previous circuit, the timeframe seems to be on the order of 10 minutes, lasting for 30 to 60 minutes.

S.d1 LTD circuit. A coincidence of glutamate detection with mGlu5q and S.d1 activation with NMDA triggers eCB release, which activates CB1i leading to presynaptic LTD. CB1i (cannabinoid G-i inhibitory receptor), mGlu5q (glutamate G-q receptor), Ob (olfactory bulb), S.d1 (striatum D1-type projection neuron).

This circuit inhibits itself over time without using adenosine or astrocytes. In the full striatum circuit, high dopamine levels suppress this LTD, meaning that dopamine inhibits the timeout [Shen et al 2008].
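Here is a toy update rule for that LTD, assuming the coincidence and dopamine effects can be reduced to simple conditions. The depression rate and dopamine threshold are invented for illustration.

```python
def ecb_ltd_update(weight, presynaptic_glutamate, postsynaptic_spike,
                   dopamine, ltd_rate=0.05, dopamine_block=0.5):
    """Sketch of eCB presynaptic LTD: depression requires the coincidence of
    presynaptic glutamate (mGlu5q) and postsynaptic S.d1 firing (NMDA), and
    high dopamine suppresses the depression."""
    if presynaptic_glutamate and postsynaptic_spike and dopamine < dopamine_block:
        weight -= ltd_rate * weight   # retrograde eCB -> CB1i -> weaker release
    return weight
```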

The next circuit adds the S.d2 path, which uses adenosine and self-activity to trigger postsynaptic LTP.

S.d2 postsynaptic LTP via A2a.s

Consider a third circuit that has the benefits of both previous circuits because it uses adenosine as a timer managed by astrocytes and is also specific to postsynaptic activity. In addition, it allows for a second action path, changing the circuit from a Go/NoGo system to a Go/Avoid action pair. This circuit uses LTP (long term potentiation) on the S.d2 striatum neurons.

Timeout circuit using postsynaptic LTP at the S.d2 neuron and adenosine as a timeout signal. As adenosine accumulates, it stimulates S.d2, which both disables S.d1 and drives the avoid path. A2a.s (adenosine G-s stimulatory receptor), Ad (adenosine), mGlu5q (glutamate G-q metabotropic receptor), Ob (olfactory bulb), S.d1 (striatum D1-type projection neuron), S.d2 (striatum D2-type projection neuron)

When the odor first arrives, Ob activates the S.d1 path, seeking toward the odor. S.d1 is activated instead of S.d2 because of dopamine. In this simple model, the Ob itself could provide the initial dopamine, like the C. elegans odor-detecting neurons, the tunicate’s coronal cells, or the dual glutamate and dopamine neurons in Vta (ventral tegmental area).

As time goes on, adenosine from the astrocyte builds up, which stimulates the S.d2 A2a.s (adenosine G-s stimulatory receptor) until it overcomes dopamine suppression and increases the S.d2 activity with LTP [Shen et al 2008]. Once S.d2 activates, it suppresses S.d1 [Chen et al 2023] and drives the avoid path.
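A rough sketch of the resulting Go/Avoid switch, assuming the dopamine and adenosine influences can be summarized as competing linear drives. The weights and the roam fallback are illustrative assumptions, not part of the circuit description.

```python
def select_action_path(dopamine, adenosine, d1_weight=1.0, d2_weight=1.0):
    """Toy Go/Avoid selector: dopamine favors the S.d1 seek path, accumulated
    adenosine (via A2a.s) favors the S.d2 avoid path, and an active S.d2
    suppresses S.d1."""
    d1_drive = d1_weight * dopamine - adenosine    # D1 boosted by DA, cut by adenosine
    d2_drive = d2_weight * adenosine - dopamine    # A2a.s boosts S.d2, D2 (G-i) cut by DA
    if d2_drive > d1_drive and d2_drive > 0:
        return "avoid"    # S.d2 wins: suppress S.d1, drive the Hb avoid path
    if d1_drive > 0:
        return "seek"     # S.d1 wins: drive the V.pt seek path
    return "roam"
```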

The combination of these circuits looks like it’s precisely what the essay needs.

Simulation

In the simulation, when the animal is hunting food and finds a food odor plume, it directly seeks toward the center and eats if it finds food. In the screenshot below, the animal is eating.

Simulation showing the animal eating food after seeking the odor plume.

Satiation disables the food seek. This might sound obvious, but hunger gating of food seeking requires specific satiety circuits wired into any seek path that’s food specific, which means the involvement of H.l (lateral hypothalamus) and related areas like H.arc (arcuate hypothalamus) and H.pv (periventricular hypothalamus). And, of course, the simulation needs code to enable food odor seek only when the animal is searching for food.

The next screenshot shows the central problem of the essay, when the animal seeks a food odor but there’s no food at the center.

Screenshot showing the animal stuck in the middle of the food odor plume before the timeout.

Without a timeout, the animal circles the center of the food odor plume endlessly. After a timeout, the animal actively leaves the plume and avoids that specific odor until the timeout decays.

Screenshot showing the animal escaping from the odor plume after the timeout.

This system is somewhat complex because of the need for hysteresis. A too-simple solution with a single threshold can oscillate: as soon as the animal starts leaving, the timeout decays, which re-enables the food seek, which quickly times out again, repeating. Instead, the system needs to make re-enabling the food seek more difficult after a timeout.

But that adds a secondary issue: if starting a food seek requires the timeout level to be below a low threshold, then an ongoing seek needs a higher give-up threshold so the seek can be sustained while the timeout signal builds. In other words, sustaining a seek must be easier than starting one. This hysteresis and seek sustain presumably needs to be handled by the actual striatum circuit.
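In code, this is essentially Schmitt-trigger style hysteresis on the timeout signal. A minimal sketch, with illustrative thresholds:

```python
class GiveUpHysteresis:
    """Two-threshold hysteresis: giving up requires the timeout level to rise
    above a high threshold, and seeking restarts only after it has decayed
    below a lower one, preventing rapid oscillation between the two paths."""

    def __init__(self, give_up_at=1.0, reseek_below=0.3):
        self.give_up_at = give_up_at
        self.reseek_below = reseek_below
        self.seeking = True

    def update(self, timeout_level):
        if self.seeking and timeout_level > self.give_up_at:
            self.seeking = False      # switch to avoid
        elif not self.seeking and timeout_level < self.reseek_below:
            self.seeking = True       # timeout has decayed enough to retry
        return self.seeking
```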

Discussion

I think this essay shows that using the striatum as an action timeout for food seek is a plausible application. The circuit is relatively simple and effective, improving search by avoiding failed areas.

However, the simulation does raise some issues, particularly the hysteresis problem. If the striatum does provide a timeout along these lines, it must somehow solve the hysteresis problem. While the animal is seeking, the ongoing LTP/LTD inhibition needs a high threshold before it stops the seek, but once avoidance starts, there needs to be a high barrier to returning to seek, to avoid oscillations between the two action paths.

Because LTD/LTP is a relatively long chemical process (minutes) internal to the neurons, as opposed to an instant switch in the simulation, the delay itself might be sufficient to solve the oscillation problem. It’s also possible that some of the more complicated parts of the circuit, such as P.ge (globus pallidus) and its feedback to the striatum, or H.stn (subthalamic nucleus), affect sustaining or breaking the seek and so handle the hysteresis problem.

The simulation also reinforced the absolute requirement that action paths need to be modulated by internal state like hunger. For the seek paths, both Hb.m and V.pt are heavily modulated by H.l and other hypothalamic hunger and satiety signals.

As expected, the simulation also illustrated the need for context information separate from the target odor. While the food odor is timed out, the animal can’t search the other odor plume because this essay’s animal can’t distinguish between the odor plumes, and therefore avoids both odors. With a long timeout and many odor plumes, this delays the food search. A future enhancement is to add context to the timeout. If the animal can timeout a specific odor plume, it can search alternatives even if the food odor itself is identical.

References

Beretta CA, Dross N, Gutierrez-Triana JA, Ryu S, Carl M. Habenula circuit development: past, present, and future. Front Neurosci. 2012 Apr 23;6:51. 

Cavaccini A, Durkee C, Kofuji P, Tonini R, Araque A. Astrocyte Signaling Gates Long-Term Depression at Corticostriatal Synapses of the Direct Pathway. J Neurosci. 2020 Jul 22;40(30):5757-5768. 

Chen JF, Choi DS, Cunha RA. Striatopallidal adenosine A2A receptor modulation of goal-directed behavior: Homeostatic control with cognitive flexibility. Neuropharmacology. 2023 Mar 15;226:109421. 

Chen WY, Peng XL, Deng QS, Chen MJ, Du JL, Zhang BB. Role of Olfactorily Responsive Neurons in the Right Dorsal Habenula-Ventral Interpeduncular Nucleus Pathway in Food-Seeking Behaviors of Larval Zebrafish. Neuroscience. 2019 Apr 15;404:259-267. 

Cui G, Jun SB, Jin X, Pham MD, Vogel SS, Lovinger DM, Costa RM. Concurrent activation of striatal direct and indirect pathways during action initiation. Nature. 2013 Feb 14;494(7436):238-42.

Derjean D, Moussaddy A, Atallah E, St-Pierre M, Auclair F, Chang S, Ren X, Zielinski B, Dubuc R. A novel neural substrate for the transformation of olfactory inputs into motor output. PLoS Biol. 2010 Dec 21;8(12):e1000567. 

Gillette R, Brown JW. The Sea Slug, Pleurobranchaea californica: A Signpost Species in the Evolution of Complex Nervous Systems and Behavior. Integr Comp Biol. 2015 Dec;55(6):1058-69. 

Hengenius JB, Connor EG, Crimaldi JP, Urban NN, Ermentrout GB. Olfactory navigation in the real world: Simple local search strategies for turbulent environments. J Theor Biol. 2021 May 7;516:110607.

Imamura F, Ito A, LaFever BJ. Subpopulations of Projection Neurons in the Olfactory Bulb. Front Neural Circuits. 2020 Aug 28;14:561822. 

Kermen F, Franco LM, Wyatt C, Yaksi E. Neural circuits mediating olfactory-driven behavior in fish. Front Neural Circuits. 2013 Apr 11;7:62.

Ma L, Day-Cooney J, Benavides OJ, Muniak MA, Qin M, Ding JB, Mao T, Zhong H. Locomotion activates PKA through dopamine and adenosine in striatal neurons. Nature. 2022 Nov;611(7937):762-768.

Okamoto H, Cherng BW, Nakajo H, Chou MY, Kinoshita M. Habenula as the experience-dependent controlling switchboard of behavior and attention in social conflict and learning. Curr Opin Neurobiol. 2021 Jun;68:36-43. 

Shen W, Flajolet M, Greengard P, Surmeier DJ. Dichotomous dopaminergic control of striatal synaptic plasticity. Science. 2008 Aug 8;321(5890):848-51. 

Wu YW, Kim JI, Tawfik VL, Lalchandani RR, Scherrer G, Ding JB. Input- and cell-type-specific endocannabinoid-dependent LTD in the striatum. Cell Rep. 2015 Jan 6;10(1):75-87. 

Zhai S, Cui Q, Simmons DV, Surmeier DJ. Distributed dopaminergic signaling in the basal ganglia and its relationship to motor disability in Parkinson’s disease. Curr Opin Neurobiol. 2023 Dec;83:102798.

16: Give-up Time in Foraging

The essay 16 simulation is a foraging slug that follows odors to food and must give up on an odor when the odor plume doesn’t have food. Foraging researchers treat the give-up time as a measurable value in optimal foraging, in the context of the marginal value theorem (MVT), which tells when an animal should give up [Charnov 1976]. This post is a somewhat disorganized collection of issues related to implementing the internal state needed for give-up time.

Giving up on an odor

The odor-following task finds food by following a promising odor. A naive implementation with a Braitenberg vehicle circuit [Braitenberg 1984], as early evolution might have tried, has the fatal flaw that the animal can’t give up on an odor. The circuit always approaches the odor.

Braitenberg vehicles for approach and avoid.
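For reference, a Braitenberg approach/avoid pair can be sketched in a few lines. The crossed versus uncrossed wiring follows the standard vehicle 2 description; the speeds and gains are arbitrary.

```python
def braitenberg_approach(left_sensor, right_sensor, base_speed=1.0, gain=1.0):
    """Crossed wiring: each sensor drives the opposite wheel, so the vehicle
    turns toward the stronger odor."""
    left_wheel = base_speed + gain * right_sensor
    right_wheel = base_speed + gain * left_sensor
    return left_wheel, right_wheel

def braitenberg_avoid(left_sensor, right_sensor, base_speed=1.0, gain=1.0):
    """Uncrossed wiring: each sensor drives the wheel on its own side, so the
    vehicle turns away from the stronger odor."""
    left_wheel = base_speed + gain * left_sensor
    right_wheel = base_speed + gain * right_sensor
    return left_wheel, right_wheel
```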

Since early evolution requires simplicity, a simple solution is adding a timer, possibly based on habituation, possibly a non-habituation timer. For example, synaptic LTD (long term depression) might make the circuit ignore the sensor after some time. Or an explicit timer might trigger an inhibition state.

State diagram for the odor-following task with give-up timer. Blue is stateful; beige is stateless.

In the diagram, the beige nodes are stateless stimulus-response transitions. The blue area is internal state required to implement the timers. This post is loosely centered around exploring the state for give-up timing.

Fruit fly mushroom body neurons

Consider a sub-circuit of the mushroom body, focusing on the Kenyon cell (KC) to mushroom body output neuron (MBON) synapses, and the dopamine neuron (DAN) that modulates it. For simplicity, I’m ignoring the KC fanout/fanin and ignoring the habituation layer between odor sensors and the KC, as if the animal was an ancestral Precambrian animal.

Give-up timing might be implemented either in the synapses in blue between the KC and MBON, or potentially in circuits feeding into the DAN. The blue synapses can depress over time (LTD) when receiving odor input [Berry et al. 2012], with a time interval on the order of 10-20 minutes. Alternatively, the timeout might occur in circuitry before the DAN and use dopamine to signal giving up.

In mammals, the second option involving a dopamine spike might signal a give-up time. Although the reward-prediction error (RPE) in the Vta (ventral tegmental area) is typically interpreted as a reinforcement-learning signal, it could also signal a give-up time.

Mammalian analogy

In mammals, a give-up signal might be a combination of some or all of several neurotransmitters: dopamine (DA), serotonin (5HT), acetylcholine (ACh), and possibly norepinephrine (NE).

Dopamine has a characteristic phasic dip when the animal decides no reward will come. Many researchers consider this no-reward dip to be a reward-prediction error (RPE) in the sense of reinforcement learning [Schultz 1997].

One of the many serotonin functions appears patience-related [Lottem et al. 2018], [Miyazaki et al. 2014]. Serotonin ramps while the animal is persevering at the task and rapidly drops when the animal gives up. Serotonin is also required for reversal learning, although this may be unrelated.

Acetylcholine (ACh) is required for task switching. Since giving-up is a component of task switching, ACh likely plays some role in the circuit.

[Aston-Jones and Cohen 2005] suggest a related role for norepinephrine in patience, impatience, and decision making.

On the one hand, having essentially all the important modulatory neurotransmitters involved in this problem doesn’t give a simple answer. On the other hand, the involvement of all of them in give-up timing may be an indication of how much neural circuitry is devoted to this problem.

Mammalian RPE circuitry

The following is a partial(!) diagram of the mammalian patience/failure learning circuit, assuming the RPE signal detected in DA/Vta is related to give-up time. The skeleton of the circuit is highly conserved: almost all of it exists in all vertebrates, with the possible exception of the cortical areas F.vm (ventromedial prefrontal cortex) and C.sma (supplemental motor area). For simplicity, the diagram doesn’t include the ACh (V.ldt/V.ppt) and NE (V.lc) circuits. The circuit’s center is the lateral habenula, which is associated with a non-reward failure signal.

Partial reward-error circuit in the mammalian brain.

Key: T.pf (parafascicular thalamus), T.pv (paraventricular thalamus), C.sma (supplementary motor area cortex), F.vm (ventromedial prefrontal cortex), A.bl (basolateral amygdala), E.hc (hippocampus), Ob (olfactory bulb), O.pir (piriform cortex), S.v (ventral striatum/nucleus accumbens), S.a (central amygdala), S.ls (lateral septum), S.ot (olfactory tubercle), P.hb (habenula-projecting pallidum), P.a (bed nucleus of the stria terminalis), Hb.l (lateral habenula), H.l (lateral hypothalamus), H.stn (subthalamic nucleus), Poa (preoptic area), Vta (ventral tegmental area), V.dr (dorsal raphe), DA (dopamine), 5HT (serotonin). Blue – limbic, Red – striatal/pallidal, Beige – cortical, Green – thalamus.

Some observations, without going into too much detail. First, the hypothalamus and preoptic area are heavily involved in the circuit, which suggests their centrality and possibly primitive origin. Second, in mammals the patience/give-up circuit has access to many sophisticated timing and accumulator circuits, including C.sma and F.ofc (orbital frontal cortex), as well as value estimators like A.bl and context from episodic memory in E.hc (hippocampus). Thus, essentially all of the limbic system projects to Hb.l (lateral habenula), a key node in the circuit.

Although the olfactory path (Ob to O.pir to S.ot to P.hb to Hb.l) is the most directly comparable to the fruit fly mushroom body, it’s almost certainly convergent evolution instead of a direct relation.

The most important point of this diagram is to show that mammalian give-up timing and RPE are so much more complex than the fruit fly’s that results from mammalian studies don’t give much information about the fruit fly, although the reverse is certainly possible.

Reward prediction error (RPE)

Reward prediction error (RPE) itself is technically just an encoding of a reward result. A reward signal could either represent the reward directly or as a difference from a reference reward, for example the average reward. Computational reinforcement learning (RL) calls this signal RPE because RL is focused on the prediction, not the signal. But an alternative perspective from the marginal value theorem (MVT) of foraging theory [Charnov 1976] suggests the animal uses the RPE signal to decide when to give up.

The MVT suggests that an animal should give up on a patch when the current reward rate is lower than the average reward rate in the environment. If the RPE’s comparison reward is the average reward, then a positive RPE suggests the animal should stay in the current patch, and a negative RPE says the animal should consider giving up.
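That rule is simple enough to state directly in code. This sketch assumes the environment average is tracked as a running estimate, which is one plausible implementation rather than anything prescribed by the MVT literature.

```python
def should_give_up(patch_reward_rate, environment_average_rate):
    """MVT-style give-up rule: leave the patch when its current reward rate
    falls below the environment's average rate. Framed as RPE, a negative
    value (patch minus average) suggests giving up."""
    rpe = patch_reward_rate - environment_average_rate
    return rpe < 0

def update_environment_average(average_rate, latest_rate, learning_rate=0.05):
    """Running estimate of the environment's average reward rate."""
    return average_rate + learning_rate * (latest_rate - average_rate)
```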

In mammals, [Montague et al. 1996] propose that RPE is used like computational reinforcement learning, specifically temporal difference (TD) learning, partly because they argue that TD can handle interval timing, which is related to the give up time that I need. However, TD’s timing representation requires a big increase in complexity.

Computational models

To see where the complexity of time comes from, let’s step back and consider computational models used by both RL and the Turing machine. While the Turing machine might seem too formal here, I think it’s useful to explore using a formal model for practical designs.

Reinforcement learning machine and Turing machine abstract diagram.

Both models above abstract the program into stateless transition tables. RL uses an intermediate value function followed by a policy table [Sutton and Barto 2018]. The Turing machine’s state lives in the current state variable (basically an integer) and the infinite tape. RL exports its entire state to the environment, making no distinction between internal state like a give-up timer and the external environment. Note the strong contrast with a neural model, where every synapse can hold short-term or long-term state.
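A toy version of the tabular RL machine makes the point about externalized state. The tiny state space and names here are invented for illustration, not part of any standard library.

```python
# Minimal tabular RL "machine" in the sense of the diagram: a value table and
# a policy table, both stateless lookups keyed by an external state label.
value = {"roam": 0.0, "seek": 0.5, "eat": 1.0}                      # value function table
policy = {"roam": "wander", "seek": "follow_odor", "eat": "stop"}   # policy table

def act(state):
    """Pure table lookup: any memory, including a give-up timer, must be
    encoded in the environment's state label, not inside this function."""
    return policy[state], value[state]
```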

Unlike the Turing machine, the RL machine diagram is a simplification because researchers do explore beyond the static tabular model, such as using deep-learning representations for the functions. The TD algorithm itself doesn’t follow the model strictly because it updates the value and policy tables dynamically, which can create memory-like effects early in training.

The larger issue in this post’s topic is the representation of time. Both reinforcement learning and the Turing machine represent time as state transitions with a discrete ticking time clock. An interval timer or give-up timer is represented by states for each tick in the countdown.

State machine timeout

The give-up timeout is an illustration of the difference between neural circuits and state machines. In neural circuits, a single synapse can support a timeout using LTD (or STD), with biochemical processes decreasing synapse strength over time. In the fruit fly KC to MBON synapse, the timescale is on the order of minutes (“interval” timing), but neural timers can implement many timescales, from fractions of a second to hours and a full day (circadian).

State machines can implement timeouts as states and state transitions. Since state machines are clock based (tick-based), each transition occurs on a discrete, integral tick. For example, a timeout might look like the following diagram:

Portion of state machine for timeout.

This state isn’t a counter variable; it’s a tiny part of a state machine transition table. State machine complexity explodes with each added capability. If this timeout part of the state machine is 4 bits representing 9 states, and another mostly-independent part of the state machine has another 4 bits with 10 states, the total state machine would need 8 bits with 90-ish states, depending on the interactions between the two components, because a state machine is one big table. So, while a Turing machine can theoretically implement any computation, in practice only relatively small state machines are usable.
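A sketch of what that looks like as an explicit table, plus the product-state count for combining independent components (using the example's 9 and 10). The state names are illustrative.

```python
TIMEOUT_STEPS = 9

def build_timeout_table():
    """One explicit state per remaining tick: wait_9 -> wait_8 -> ... -> timed_out."""
    table = {f"wait_{t}": f"wait_{t - 1}" for t in range(TIMEOUT_STEPS, 1, -1)}
    table["wait_1"] = "timed_out"
    return table

def combined_state_count(component_sizes):
    """Product-state blow-up when independent components share one table."""
    total = 1
    for n in component_sizes:
        total *= n
    return total

# e.g. combined_state_count([9, 10]) -> 90
```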

Searle’s Chinese room

The tabular nature of state machines raises the philosophical thought experiment of Searle’s Chinese room, as an argument against computer understanding.

Searle’s Chinese room is a philosophical argument against any computational implementation of meaningful cognition. Searle imagines a person who doesn’t understand Chinese in a huge library with lookup books containing a response to every possible Chinese conversation. When the person receives a message, they find the corresponding phrase in one of the books and write the proper response. So, the person in the Chinese room holds a conversation in Chinese without understanding a single word.

For clarity, the room’s lookup function covers the entire conversation up to the last sentence, not just a sentence-to-sentence lookup. Essentially, it’s like the input to the attention/transformer deep learning used in something like ChatGPT (with the difference that ChatGPT is non-tabular). Because the input includes the conversational context, it can handle contextual continuity in the conversation.

The intuition behind the Chinese room is interesting because it’s an intuition against tabular state-transition systems like state machines, the Turing machine, and the reinforcement learning machine above. Searle’s intuition is basically: since computer systems are all Turing computable, and Turing machines are tabular, but tabular lookup is an absurd notion of understanding Chinese (table intuition), computer systems can never understand conversation. “The same arguments [Chinese room] would apply to … any Turing machine simulation of human mental processes.” [Searle 1980].

Temporal difference learning

TD learning can represent timeouts, as used in [Montague et al. 1996] to argue for TD as a model for the striatum, but this model doesn’t work at all for the fruit fly because each time step represents a new state, and therefore needs a new parameter for the value function. Since the fruit fly mushroom body only has 24 neurons, it’s implausible for each neuron to represent a new time step. Since the mammalian striatum is much larger (millions of neurons), it can encode many more values, but the low information rate from RPE (less than 1Hz), makes learning difficult.
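To see where the per-time-step parameters come from, here is a minimal TD(0) sketch over a "complete serial compound" time representation, in the spirit of the Montague-style model: each time step after the cue gets its own feature and therefore its own weight. The trial length and learning parameters are arbitrary illustration.

```python
import numpy as np

T = 20                      # time steps per trial
w = np.zeros(T)             # one value weight per time step
alpha, gamma = 0.1, 0.98    # learning rate, discount
reward_time = 15            # step at which reward arrives (if at all)

def run_trial(rewarded=True):
    """One trial of TD(0) over the tapped-delay-line time representation."""
    for t in range(T - 1):
        r = 1.0 if (rewarded and t == reward_time) else 0.0
        delta = r + gamma * w[t + 1] - w[t]    # TD error, the RPE analog
        w[t] += alpha * delta
```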

These difficulties don’t mean TD is entirely wrong, or that some ideas from TD don’t apply to the striatum, but they do mean that a naive TD model of the striatum might have trouble working at any significant scale.

State in VLSI circuits

Although possibly a digression, I think it’s interesting to compare state in VLSI circuits (microprocessors) to both neurons, and the reinforcement learning machine and Turing machine. In some ways, state in VLSI resembles neurons more than it does formal computing models.

VLSI logic and state. Blue is state (latches) and beige is logic.

The φ1 and φ2 are clock signals needed together with the latch state to make the system work. The clocks and latches act like the gates in an airlock or a water lock on a river. In a water lock, only one gate is open at a time to prevent the water from rushing through. In the VLSI circuit, only one latch phase is active at a time to keep the logic bits from mixing together. Some neuroscience proposals like [Hasselmo and Eichenbaum 2005] have a similar architecture for the hippocampus (E.hc) for similar reasons (keeping memory retrieval from mixing up memory encoding).

In a synapse the slower signals like NMDA, plateau potentials, and modulating neurotransmitters and neuropeptides have latch-like properties because their activation is slower, integrative, and more stable compared to the fast neurotransmitters. In that sense, the slower transmission is a state element (or a memory element). If memory is a hierarchy of increasing durations, these slower signals are at the bottom, but they are nevertheless a form of memory.

The point of this digression is to illustrate that the formal machine state model is unusual and possibly unnatural, even when describing electronic circuits. That’s not to say that those models are useless. In fact, they’re very helpful as mental models at smaller scales, but in larger implementations, the complexity of the necessary state machines limits their value as an architectural model.

Conclusion

This post is mostly a collection of attempts to understand why the RPE model bothers me as unworkable, not a complete argument. As mentioned above, I have no issues with a reward signal relative to a predicted reward, or with using that differential signal for memory and learning. Both seem quite plausible. What doesn’t work for me is the jump to particular reinforcement learning models like temporal difference, adding external signals like SCS, without taking into account the complexities and difficulties of truly implementing reinforcement learning. This post tries to explain some of the reasons for that skepticism.

References

Aston-Jones, Gary, and Jonathan D. Cohen. An integrative theory of locus coeruleus-norepinephrine function: adaptive gain and optimal performance. Annu. Rev. Neurosci. 28 (2005): 403-450.

Berry, Jacob A., et al. Dopamine is required for learning and forgetting in Drosophila. Neuron 74.3 (2012): 530-542.

Braitenberg, V. (1984). Vehicles: Experiments in synthetic psychology. Cambridge, MA: MIT Press.

Charnov, Eric L. “Optimal foraging, the marginal value theorem.” Theoretical population biology 9.2 (1976): 129-136.

Hasselmo ME, Eichenbaum H. Hippocampal mechanisms for the context-dependent retrieval of episodes. Neural Netw. 2005 Nov;18(9):1172-90.

Lottem, Eran, et al. Activation of serotonin neurons promotes active persistence in a probabilistic foraging task. Nature communications 9.1 (2018): 1000.

Miyazaki, Kayoko W., et al. Optogenetic activation of dorsal raphe serotonin neurons enhances patience for future rewards. Current Biology 24.17 (2014): 2033-2040.

Montague, P. Read, Peter Dayan, and Terrence J. Sejnowski. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of neuroscience 16.5 (1996): 1936-1947.

Schultz, Wolfram. Dopamine neurons and their role in reward mechanisms. Current opinion in neurobiology 7.2 (1997): 191-197.

Searle, John (1980). Minds, Brains and Programs. Behavioral and Brain Sciences, 3 (3): 417–457.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
