
HTM Reinforcement Learning


One of the important areas that Numenta's implementation of HTM hasn't covered yet is reinforcement learning (RL).  This is because reinforcement learning requires a system that is capable of taking actions based on input (i.e. sensory-motor inference), which is an area Numenta is still actively researching.

Numenta's approach to HTM is to refer to neuroscience to understand how the cortex functions, and make their implementations true to their biological counterparts.  While I think this is definitely the correct long-term approach, I am not bound by this requirement for my project.  I have thought out my own HTM-based sensory-motor and RL design.  It is most likely quite different from the biological systems, but should work (based on my current understanding of HTM concepts).

At a high level, the basic idea is to have three high-order sequence memory layers.  The first is a standard temporal memory layer, for learning patterns and context from sensory input.  This layer projects into the second layer.  The second layer receives input from motor commands, and projects into the third layer.  The third layer receives reinforcement input (reward/punishment).

Each layer's input passes through the usual spatial pooler to select the active columns (i.e. different columns are active in each layer).  Neurons in the first layer are typical, growing distal dendrites connecting to other neurons in the same layer.  The second and third layers are a bit different.  Neurons in the second layer grow distal dendrites connecting to neurons in the first layer.  Similarly, neurons in the third layer grow distal dendrites connecting to neurons in the second layer.

With this setup, the first layer can make inferences about what sensory information will come next based on the current context.  Columns represent the input, and neurons within the columns represent the context.  This is the typical HTM process as implemented by Numenta.

The second layer makes inferences about what motor commands will come next based on the current sensory context.  Columns in this layer represent the motor commands, and neurons within the columns represent the context.

The third layer makes inferences about what rewards or punishments will come next based on the current sensory-motor context from the second layer.  Columns represent the reinforcement, and neurons within the columns represent the context.  The score for a particular set of motor commands in a given context is the immediate reward/punishment in that state plus the predicted rewards/punishments of possible next actions.  The system can then take actions based on what it predicts will happen.
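To make the scoring idea concrete, here is a minimal sketch in plain Python. It is not HTM code: the function names are hypothetical, and summarizing the "possible next actions" by their best predicted score is an assumption made for illustration (the description above does not say how those predictions are combined).

```python
def action_value(immediate, predicted_next):
    """Score an action as immediate reinforcement plus predicted future reinforcement.

    immediate      -- reward (positive) or punishment (negative) seen in this state
    predicted_next -- predicted reinforcement for each possible next action
    """
    # Assumption: the future is summarized by the best predicted next action.
    best_next = max(predicted_next) if predicted_next else 0.0
    return immediate + best_next


def choose_action(candidates):
    """Pick the candidate with the highest score.

    candidates -- dict mapping action name to (immediate, predicted_next)
    """
    return max(candidates, key=lambda name: action_value(*candidates[name]))
```

With these definitions, an action that is punished immediately can still win if it is predicted to lead somewhere rewarding.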

Besides rewards and punishments, I have also introduced the concept of "novelty".  These columns represent the level of unknown outcomes a particular action might lead to (i.e. future actions down a particular path that the system has not yet tried).  The purpose of this is to allow the system to explore actions it hasn't tried yet, rather than always repeating the first positive action it found in a particular context.

The system will have a curiosity level that grows over time, and is reduced any time it does something novel.  The more novel a path is, the more the system's curiosity is satisfied.  A combination of novelty score and curiosity level can eventually outweigh punishments that the system has encountered in the past, and cause it to try a particular action again in order to explore subsequent actions down that negative path that it hasn't tried yet (and which could lead to rewards).
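A rough sketch of how curiosity and novelty could interact is below. The class, the linear growth rate, and the `reinforcement + novelty * curiosity` formula are all assumptions chosen for illustration; the description above only says that the two combine and can eventually outweigh a past punishment.

```python
class Curiosity:
    """Curiosity level that grows each time step and drops when something novel is done."""

    def __init__(self, growth=0.5):
        self.level = 0.0
        self.growth = growth  # hypothetical growth per time step

    def tick(self):
        # Curiosity builds up over time.
        self.level += self.growth

    def satisfy(self, novelty):
        # The more novel the path taken, the more curiosity is satisfied.
        self.level = max(0.0, self.level - novelty)


def path_score(reinforcement, novelty, curiosity_level):
    # Assumed combination: novelty only matters in proportion to current curiosity.
    return reinforcement + novelty * curiosity_level
```

With zero curiosity, a punished but novel path (reinforcement -1.0, novelty 1.0) scores below a mildly rewarding familiar one (0.2, 0.0). After a few ticks the ordering flips, which is the re-exploration behavior described above.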

I have been active on Numenta's forum lately, but keep forgetting to post on my own forum :)  Let me give a quick progress update on how things are going.

The biggest epiphany for me came from realizing that the concepts of "imagination" and "curiosity" (which were the most biologically implausible elements of my original design) can be simulated by existing functions of a spatial pooler.

Spatial poolers currently simulate inhibition by selecting a percentage of columns that best connect to the current input space, and only those columns activate. A slight modification of this function allows it to replace my earlier concept of "imagination" -- selecting a percentage of columns that best connect to the most positive reinforcement input space, and only those activate. The columns in the motor layer map to the motor commands, so the winning columns drive what actions are taken.
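The modified selection step might look something like the sketch below. It is a toy stand-in, not Numenta's spatial pooler: the function name, the use of plain sets for connections, and the tie-breaking rule are assumptions. Each motor column is scored by how many of its connected input bits fall in the "most positive" part of the reinforcement input space, and a fixed fraction of columns wins.

```python
def select_columns(connections, target_bits, sparsity):
    """Return the indices of the winning columns.

    connections -- list of sets; connections[c] holds the input bits column c connects to
    target_bits -- set of input bits representing the most positive reinforcement
    sparsity    -- fraction of columns allowed to become active
    """
    k = max(1, int(len(connections) * sparsity))
    # Overlap = number of connected bits that land in the target region.
    overlaps = [len(conn & target_bits) for conn in connections]
    # Highest overlap first; ties broken by lower column index (an arbitrary choice).
    ranked = sorted(range(len(connections)), key=lambda c: (-overlaps[c], c))
    return sorted(ranked[:k])
```

For example, with four columns and the three rightmost reinforcement bits as the target, `select_columns([{0, 1}, {8, 9}, {4, 5}, {7, 9}], {7, 8, 9}, 0.5)` picks columns 1 and 3, and those would be the motor commands that fire.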

Spatial poolers also have a function for "boosting", which allows columns that haven't been used in a while to slowly accumulate a higher score, and eventually win out over other columns that have been used more frequently. This can be used to replace my earlier concept of "curiosity". Actions the system hasn't tried in a while, such as new actions or those which previously resulted in a negative reinforcement, will eventually be tried again, allowing the system to explore and re-attempt actions that could lead to new outcomes.
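Here is a small sketch of boosting acting as curiosity. It uses a simplified additive boost that resets when a column wins; that is an assumption for illustration and not the boosting formula used in Numenta's spatial pooler. The class and parameter names are hypothetical.

```python
class BoostedSelector:
    """Pick top-scoring columns, boosting columns that have not won recently."""

    def __init__(self, n_columns, boost_step=0.5):
        self.boost = [0.0] * n_columns
        self.boost_step = boost_step  # hypothetical boost gained per missed turn

    def select(self, scores, k=1):
        boosted = [s + b for s, b in zip(scores, self.boost)]
        # Highest boosted score first; ties go to the lower column index.
        ranked = sorted(range(len(scores)), key=lambda c: (-boosted[c], c))
        winners = ranked[:k]
        for c in range(len(scores)):
            if c in winners:
                self.boost[c] = 0.0  # used columns lose their boost
            else:
                self.boost[c] += self.boost_step  # unused columns accumulate it
        return winners
```

With two actions scored 1.0 and -1.0, the positive action wins five times in a row, then the accumulated boost lets the previously punished action win once before the cycle starts over.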

I drew up a diagram to help visualize what the current design looks like:

The sequence and feature/location layers are complementary -- both use the same spatial pooler (the same columns activate for both layers), i.e. both receive proximal input from the sensors. The sequence layer receives distal input from other cells in its own layer, while the feature/location layer receives distal input from an array of cells representing an allocentric location.

The motor layer receives proximal input from the reinforcement layer, via the modified spatial pooler which chooses a percentage of motor columns which have the highest reinforcement score with boosting. This layer receives distal input from active cells in both the sequence layer and the feature/location layer. Columns represent motor commands, while cells in the column represent the sensory context.

Columns in the reinforcement layer represent how positive or negative a reinforcement is. In my implementation, I am using columns to the left to represent more negative reinforcement, while columns to the right represent more positive reinforcement (with columns near the center being neutral). This is just to make it easier to visualize. Columns represent positivity/negativity, and cells in the columns represent sensory-motor context. Cells in this layer receive distal input from active cells in the motor layer.
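One simple way to get that left-to-right layout is to slide a contiguous block of active columns across the layer according to the reinforcement value. This is only a sketch under assumptions: the function name, the [-1, 1] value range, the column count, and the block width are all made up for the example.

```python
def encode_reinforcement(value, n_columns=21, width=5):
    """Map a reinforcement value in [-1, 1] to a block of active column indices.

    -1.0 activates the leftmost columns, +1.0 the rightmost, 0.0 the center.
    """
    value = max(-1.0, min(1.0, value))  # clamp out-of-range input
    # Position of the block's left edge, scaled to the room available.
    start = round((value + 1.0) / 2.0 * (n_columns - width))
    return list(range(start, start + width))
```

With the defaults, a strong punishment activates columns 0 through 4, a neutral result activates columns 8 through 12 around the center, and a strong reward activates columns 16 through 20.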

My current design utilizes a two-layer circuit to pool reinforcement input. This tweak eliminates the need to extend reinforcement predictions backwards through time (handled now by a function of the temporal pooler), allowing the implementation to align even more closely with traditional HTM concepts.  Output from the reinforcement pooling layer is passed through the modified spatial pooler, which chooses a percentage of the motor columns which best map to the most positive reinforcement, with boosting.

There is still some more tweaking to do, but it is definitely starting to come together.  The most recent changes came from watching the HTM Chat with Jeff.  One is the association of the sequence and feature/location layers. Location input itself, however, is currently just an array of input cells representing an allocentric location, which the feature/location layer connects to distally. Egocentric location is still missing, as well as tighter feedback between the two regions.  The other idea from Jeff's slides is the two-layer circuit, which gave me the idea for configuring reinforcement feedback with a pooling layer.

