NN Peak Fitness
I have been experimenting with NNs and managed to create a pretty good, if simple, self-learning NN. It controls a dot in a small 2D world and is supposed to find a dot of food without bumping into walls. What I notice is that the NN starts by walking around randomly but steadily learns to go towards the food. After a while, though, it seems to peak and then begins behaving strangely: walking into walls, always walking in the same direction, etc.
Now I am very curious as to why. I've heard people talk about this before but never really understood it; it all seems a bit weird.
Yeah, that would be some useful info...
It's a normal, unbiased feed-forward network. Only one layer containing four nodes (one for each action: move up/down/left/right). Each of the four nodes has 16 weighted inputs: 12 for sight (input values being: wall = -1, empty = 0, food = 1) and 4 for smell. The action chosen is naturally the one with the highest value after all the inputs have been summed up.
As for learning, I am sure there is some fancy word to describe it, but I don't know it...
After carrying out the action the NN receives a result: negative if it went into a wall, positive if it found the food, and slightly negative if it didn't accomplish anything. This result is then used to adjust the weights. The node of the chosen action gets its weights adjusted using the result multiplied by the impact each weight had on the resulting sum. The nodes of the actions that weren't chosen get their weights adjusted using the negative result.
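In rough Python-style pseudocode, the idea is something like this (heavily simplified; the names, learning rate, and the exact reading of "impact" are just for illustration):

    import numpy as np

    n_inputs, n_actions = 16, 4
    weights = np.random.uniform(-1.0, 1.0, (n_actions, n_inputs))  # one row of weights per action node

    def choose_action(inputs):
        sums = weights @ inputs        # weighted sum for each action node
        return int(np.argmax(sums))    # pick the action with the highest sum

    def adjust(inputs, action, result, rate=0.01):
        # chosen action: "impact" is taken here as the input value that fed each weight
        weights[action] += rate * result * inputs
        # actions that weren't chosen get the negative result instead
        for a in range(n_actions):
            if a != action:
                weights[a] += rate * -result * inputs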
I hope that makes sense.
It sounds like you're trying to use reinforcement learning to learn a function that maps states to actions. The way it is described, I am not surprised that it isn't working well.
Your network is very limited in the functions it can learn: no bias input, no hidden layers. It can be represented by a single 4 by 16 matrix. The input is a 16 by 1 vector, and the output is found by multiplying the input by the matrix to get the value associated with each of the 4 outputs. The elements of the input vector that represent sight are further limited to the set {-1, 0, 1}. You didn't describe what the smell input is. Is smell binary, or does it represent how close the food is? Does smell drift around corners or travel only in a straight line? The richness of the input will affect the quality of the behavior that can be learned.
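To make that concrete, here is a tiny numpy sketch of everything your network computes (the numbers are made up):

    import numpy as np

    W = np.random.uniform(-1.0, 1.0, (4, 16))  # 4 actions by 16 inputs; no bias, no hidden layer
    x = np.zeros(16)                           # sight entries are -1, 0, or 1; the smell encoding is unclear
    x[0] = 1.0                                 # e.g. food visible in the first sight direction
    action_values = W @ x                      # one number per action
    best_action = int(action_values.argmax())  # the move the agent takes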
One of the challenges of reinforcement learning is that the reward signal is delayed. If you're 100 meters away from the target, and you move 1 meter towards the target with every step, that's 99 steps where you are getting a slight penalty. Even if the agent somehow manages to move all 100 steps without deciding that its current strategy isn't paying off, it will only get reinforcement on that last step. There's no way for it to learn that all the steps leading up to the last step were also part of the strategy involved in getting the reward.
Your current setup might work if you use supervised learning, play the game yourself, and then have it train based on your behavior to attempt to copy what you did. It might not even be able to copy your strategy if its inputs are not rich enough for it to unambiguously figure out what caused you to act the way you did.
If you want to stick with reinforcement learning, try starting with a much simpler world, such as a world with a finite number of states. Each state might be one position on a board. The agent just tries to learn the best action to take in each state. If an action leads to a reward, then the value of that action in that state is increased. State-action values are also updated based on the value of the state they lead to: if an action leads to a state where the best action has some positive value, then that action's value is updated to be closer to that value. This also works with non-deterministic actions, actions that have a chance of different possible outcomes.
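In code, that kind of tabular update looks roughly like this (a Q-learning-style sketch; the learning rate and discount factor are arbitrary choices):

    n_states, n_actions = 25, 4                       # e.g. a 5 by 5 board with 4 moves
    Q = [[0.0] * n_actions for _ in range(n_states)]  # one value per state-action pair
    alpha, gamma = 0.1, 0.9                           # learning rate and discount factor

    def update(state, action, reward, next_state):
        # nudge the state-action value toward the reward plus the best value of the state it led to
        best_next = max(Q[next_state])
        Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])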
Exploration is another difficult issue. Even if the agent finds a good strategy (called a policy in reinforcement learning jargon), what if there is an even better one? If the agent always does what it thinks is the best move, it might never explore other possibilities that are potentially better. A chance to select a non-optimal move might be added, or possibly something even more sophisticated, to get the agent to explore other behaviors.
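The simplest version of that is epsilon-greedy selection, where the agent occasionally takes a random action instead of the one it currently rates highest, for example:

    import random

    def pick_action(values, epsilon=0.1):
        # values is the list of action values for the current state
        if random.random() < epsilon:
            return random.randrange(len(values))                  # explore
        return max(range(len(values)), key=lambda a: values[a])   # exploit the current best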
Now after all those issues, there is the issue of trying to train using a function approximator that is normally trained by supervised learning, like a neural network. One problem with neural networks is that their initial state is random, and they can learn in somewhat unpredictable ways. The agent might encounter an entirely novel state, but because of the way neural networks work, the network will still assign a value to that state even though it has never explored where it might lead. This can cause updates to go in absolutely the wrong direction; for example, a state where all actions lead to a reward might be evaluated as very negative, so the value of the action that led there will be decreased, based on this faulty evaluation. It also depends on how you propagate values: in some algorithms you actually propagate values back through several of the previous state-action pairs to try to get them to converge faster. With a function approximator like a neural network, convergence is not guaranteed. In the reinforcement learning literature, other approximators are often used, frequently based on dividing the state space or state-action pair space into discrete bins and then learning values for them.
So anyway, this can obviously be a very complicated topic. There's a book on the subject, available for free on the internet somewhere, that goes into detail on the things I've talked about here if you want to try to figure this stuff out.
As for why your current system behaves as it does, I think you should look at the edge weights in your network to try to figure it out. My guess is that the weights are either all going towards 0 or towards infinity, so that eventually the magnitude of the difference between them causes the system to always pick the same output.
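A quick way to check that, assuming you can get at the weight values from your code, is to log their magnitude every so often and watch the trend:

    import numpy as np

    def log_weights(weights, step):
        # steadily growing or collapsing numbers here would explain the 'always the same move' behavior
        print(step, float(np.abs(weights).max()), float(np.abs(weights).mean()))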
Quote: Original post by Vorpy
One of the challenges of reinforcement learning is that the reward signal is delayed. If you're 100 meters away from the target, and you move 1 meter towards the target with every step, that's 99 steps where you are getting a slight penalty. Even if the agent somehow manages to move all 100 steps without deciding that its current strategy isn't paying off, it will only get reinforcement on that last step. There's no way for it to learn that all the steps leading up to the last step were also part of the strategy involved in getting the reward.
For the OP:
There are a few acceptable solutions to this problem. In discrete state spaces you can use value iteration (or equivalently policy iteration), both of which rely on the use of an expected utility/penalty over states. You can also rely on less formalised techniques such as Holland's 'Bucket Brigade' algorithm. In continuous state spaces the problem is harder and for many problems computationally intractable without gross simplification. The details of these techniques are too complex to describe in a text-based forum. If you're interested in references, I can point you in the right direction.
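Just to give a flavour of the discrete case: value iteration repeatedly backs up expected utilities over states until they converge. A toy deterministic sketch (next_state and reward stand in for whatever transition and reward model your world defines):

    def value_iteration(states, actions, next_state, reward, gamma=0.9, sweeps=100):
        # V[s] converges toward the best achievable discounted return from state s
        V = {s: 0.0 for s in states}
        for _ in range(sweeps):
            for s in states:
                V[s] = max(reward(s, a) + gamma * V[next_state(s, a)] for a in actions)
        return V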
Quote: Original post by Vorpy
...there's a book on the subject, available for free on the internet somewhere, that goes into detail on the things I've talked about here if you want to try to figure this stuff out.
By chance, would you know the name of that book?
Quote: Original post by Timkin
There are a few acceptable solutions to this problem. In discrete state spaces you can use value iteration (or equivalently policy iteration), both of which rely on the use of an expected utility/penalty over states. You can also rely on less formalised techniques such as Holland's 'Bucket Brigade' algorithm. In continuous state spaces the problem is harder and for many problems computationally intractable without gross simplification. The details of these techniques are too complex to describe in a text-based forum. If you're interested in references, I can point you in the right direction.
Thank you, that sounds very interesting. I remember reading about Neural Bucket Brigades at some point...
Quote: Original post by SnOrfys
Quote: Original post by Vorpy
...there's a book on the subject, available for free on the internet somewhere, that goes into detail on the things I've talked about here if you want to try to figure this stuff out.
By chance, would you know the name of that book?
Found it, Reinforcement Learning: An Introduction.