Partially observable Markov decision processes (POMDPs) provide a mathematical framework for agent planning in stochastic, partially observable environments. The classic Bayesian optimal solution can be obtained by transforming the problem into a Markov decision process (MDP) over belief states. However, because the belief-state space is continuous, the resulting problem is highly intractable. Many practical heuristic-based methods have been proposed, but most of them require complete prior knowledge of the environment. This article presents a memory-based reinforcement learning algorithm, namely Reinforcement-based U-Tree, which not only learns state transitions from experience, but also builds its own state model from raw sensor inputs. The article describes an enhancement of the original U-Tree's state-generation process that makes the generated model more compact, and demonstrates its performance on a car-driving task with 31,224 world states. The article also presents a modification to the statistical test for reward estimation, which allows the algorithm to be benchmarked against model-based algorithms on a set of well-known POMDP problems.
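For context, the belief-state transformation mentioned above is the standard Bayesian filter over the hidden state; the following is a textbook sketch rather than a formulation taken from this article, and it assumes the usual POMDP components (a state set $S$, transition function $T$, and observation function $O$), none of which are spelled out in the abstract itself. After taking action $a$ in belief $b$ and receiving observation $o$, the updated belief over next states $s'$ is

\[
  b'(s') \;=\; \frac{O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)}{\Pr(o \mid b, a)},
  \qquad
  \Pr(o \mid b, a) \;=\; \sum_{s' \in S} O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s).
\]

Because $b$ ranges over the continuous probability simplex on $S$, the induced belief-state MDP has a continuous state space, which is the source of the intractability the abstract refers to.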