Anyone who has worked with java in a high end application will be well aware of the double edged sword that is java garbage collection. When it works – it is awesome but when it doesn’t – it is an absolute nightmare. We work on a ticketing system where it is imperative that the system is as near real-time as possible. The biggest issue that we have found is the running of of memory in the JVM which causes a stop the world garbage collection, which results in cluster failures since an individual node is inaccessible for long enough that it is kicked out of the cluster.
There are various ways to combat this issue and the first instinct would be suggest that there is a memory leak. After eliminating this as a possibility, the next challenge was to identify where the memory was being taken up. This took some time and effort and the hibernate second level cache was identified. We were storing far too much in the second level cache.