Understand HST Enterprise Caching
Introduction
Goal
Understand how the enterprise caching features in Bloomreach Experience Manager's delivery tier work and what their added value is over the standard caching in Bloomreach Experience Manager.
Background
Enterprise Caching makes Bloomreach Experience Manager's page caching more powerful and lets a delivery tier cluster work more efficiently by reusing cached pages between cluster nodes. In addition, it allows for domain-specific optimization through cluster-wide caching.
This page explains what features Enterprise Caching provides, how they work, and what their added value is over Bloomreach Experience Manager's standard caching features.
For practical information on how to use Enterprise Caching, see Enable and Configure HST Enterprise Caching and Monitor HST Enterprise Caching.
HST Enterprise Caching Features
Enterprise Caching provides three main features:
- Stale Page Caching in versions 14.x, 15.7, 16.1 and up
- Second Level Page Caching in versions 14.x (deprecated)
- Generic Cluster-Wide Caching in versions 14.x (deprecated)
Stale Page Caching allows the delivery tier to serve many visitors a stale page while just 1 visitor waits for the page to be recreated. Once the cached page has been refreshed with the recreated one, all new visitors get the fresh page.
Second Level Page Caching allows the delivery tier to serve a cached page for much longer than Bloomreach Experience Manager's standard page cache (see Explanation). In addition, only 1 node per cluster needs to create a (cached) page, after which that page can be served by all cluster nodes.
Second Level Page Caching and Stale Page Caching work seamlessly with Hippo Relevance (personalized pages).
Generic Cluster-Wide Caching can be used for domain-specific optimization. For example, if you don't want every cluster node to execute the same expensive remote REST call, you can have 1 cluster node execute the request and store the result in the shared cache. Another use case is keeping track of visitor information (for example the click path) when you do not want to use sticky sessions or want to stay stateless (and thus cannot store the information in an HTTP session). In that case, you can store the visitor information in the cluster-wide cache.
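The first use case above can be sketched roughly as follows. This is a minimal illustration, not the product's API: ClusterCache, InMemoryClusterCache, RatesNode and the "exchange-rates" key are hypothetical names, and an in-memory map stands in for the shared store so the example runs on its own. It also does not prevent two nodes from making the remote call at exactly the same moment.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical stand-in for a cluster-wide cache; a real one would live outside the JVM. */
interface ClusterCache {
    String get(String key);
    void put(String key, String value);
}

/** In-memory substitute so the sketch is runnable without any infrastructure. */
class InMemoryClusterCache implements ClusterCache {
    private final Map<String, String> store = new ConcurrentHashMap<>();
    public String get(String key) { return store.get(key); }
    public void put(String key, String value) { store.put(key, value); }
}

/** One "cluster node" that consults the shared cache before making the expensive call. */
class RatesNode {
    private final ClusterCache shared;
    int remoteCalls = 0; // how often this node actually called the remote service

    RatesNode(ClusterCache shared) { this.shared = shared; }

    String exchangeRates(Supplier<String> remoteCall) {
        String cached = shared.get("exchange-rates");
        if (cached != null) {
            return cached; // another node or an earlier request already fetched it
        }
        remoteCalls++;
        String fresh = remoteCall.get();
        shared.put("exchange-rates", fresh);
        return fresh;
    }
}

class ClusterCacheDemo {
    public static void main(String[] args) {
        ClusterCache shared = new InMemoryClusterCache();
        RatesNode nodeA = new RatesNode(shared);
        RatesNode nodeB = new RatesNode(shared);
        System.out.println(nodeA.exchangeRates(() -> "EUR/USD=1.10"));
        // nodeB finds the value in the shared cache and never runs its supplier
        System.out.println(nodeB.exchangeRates(() -> "not used"));
        System.out.println("remote calls: A=" + nodeA.remoteCalls + ", B=" + nodeB.remoteCalls);
    }
}
```

The click path use case would follow the same pattern, with the visitor identifier as part of the cache key instead of a fixed key.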
All three caches work independently from each other. Depending on an implementation project's exact requirements, only those caches that are needed can be enabled.
Explanation of Page Cache Improvements with Enterprise Caching
Why do you need the Enterprise Cache add-on for improved page caching when Bloomreach Experience Manager already provides page caching? The open source page caching (First Level Page Caching) is great for serving 10,000 homepages per second instead of, for example, 500 homepages per second. It enormously optimizes pages that already perform well. However, it is less suited for slow pages. The reason for this is that the open source page caching is quite volatile: on any content change in the repository, the entire page cache is cleared because the delivery tier does not know which pages are potentially affected by the content change.
You could argue that instead of clearing the entire cache, we could keep serving the stale pages for, say, 5 more minutes, before recreating them. In other words, instead of flushing the cache, cache pages with a time to live of 5 minutes. The problem with this approach is that even if it is acceptable that changes only become visible on the live website after 5 minutes, it is most likely unacceptable that a single visitor gets to see alternating results for the same page for some time, depending on which cluster node handles the request. If you are fine with cluster node affinity for visitors (even if you don't use HTTP sessions), you could use the First Level Page Cache with
pageCache.timeToLiveSeconds = 300
pageCache.clearOnContentChange = false
as described in HST Page Caching.
With Stale Page Caching enabled, the average latency of high-traffic pages decreases dramatically when content changes occur concurrently, and the thundering herd protection is strengthened. To see why, first consider the open source First Level Cache: it is a blocking cache with thundering herd protection. This means that if 100 different visitors request the same page, only 1 request builds the page, and once it has been created, all 100 visitors instantly get the same response. However, if the page takes long to create, for example because of a slow remote REST call, all 100 visitors have to wait, each occupying a connection thread from the container. If many other pages are requested by other visitors at the same time, the container can run out of threads and even the connection queue bounded by acceptCount can fill up, after which new requests are refused. This relates to Tomcat's maxConnections/acceptCount and maxThreads settings.
So even with First Level Page Caching and thundering herd protection, you can run into problems when your pages take long to generate.
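The blocking behavior described above can be illustrated with a small sketch. It is not the delivery tier's actual implementation: BlockingPageCache, slowRender and the builds counter are hypothetical names, and the cache here never expires or flushes entries. It only shows the idea that one request renders a missing page while all other requests for the same page wait for that result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical blocking cache: one caller builds a missing entry, the others wait for it. */
class BlockingPageCache {
    private final ConcurrentHashMap<String, CompletableFuture<String>> entries =
            new ConcurrentHashMap<>();
    final AtomicInteger builds = new AtomicInteger(); // number of actual renders

    String get(String path, Function<String, String> renderer) {
        CompletableFuture<String> mine = new CompletableFuture<>();
        CompletableFuture<String> existing = entries.putIfAbsent(path, mine);
        if (existing != null) {
            // Another request is rendering (or has rendered) this page: block until done.
            return existing.join();
        }
        try {
            builds.incrementAndGet();
            String page = renderer.apply(path);
            mine.complete(page); // releases every waiting request at once
            return page;
        } catch (RuntimeException e) {
            entries.remove(path, mine); // let a later request retry
            mine.completeExceptionally(e);
            throw e;
        }
    }

    /** Simulates a slow page, for example one that waits on a remote REST call. */
    static String slowRender(String path) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "<html>" + path + "</html>";
    }

    public static void main(String[] args) throws Exception {
        BlockingPageCache cache = new BlockingPageCache();
        ExecutorService pool = Executors.newFixedThreadPool(100);
        List<Callable<String>> visitors = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            visitors.add(() -> cache.get("/home", BlockingPageCache::slowRender));
        }
        pool.invokeAll(visitors); // all 100 visitors hold a thread while the page renders
        pool.shutdown();
        System.out.println("renders: " + cache.builds.get());
    }
}
```

Note that in main all 100 worker threads stay occupied for the full render time even though only one of them does any work, which is the weakness that Stale Page Caching addresses.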
With Stale Page Caching enabled, the aforementioned problem is mitigated significantly. As before, when 100 visitors request the same page, only 1 request passes the thundering herd protection. But before this request (re)creates the page, it first restores the stale page from the Stale Page Cache into the First Level Cache (assuming the page has been rendered at least once before). Directly after the stale response has been restored in the First Level Cache, the blocking (thundering herd protection) on the First Level Cache is lifted, so the 99 waiting visitors are instantly served the stale page. Meanwhile, the request that restored the page from the stale cache continues to render a fresh version of the page. Once it has finished, it stores the fresh version of the page in both the First Level Cache and the Stale Page Cache.
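The sequence above can be sketched as follows. This is a simplified illustration under stated assumptions, not the product's code: StaleServingCache, onContentChange and the per-path lock objects are hypothetical, both caches are plain in-memory maps, and a content change is modeled as clearing only the first level.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch: serve the stale copy to waiters while one request re-renders. */
class StaleServingCache {
    private final Map<String, String> firstLevel = new ConcurrentHashMap<>();
    private final Map<String, String> stalePages = new ConcurrentHashMap<>();
    private final Map<String, Object> locks = new ConcurrentHashMap<>();

    /** Models a content change: the first level is flushed, the stale copies survive. */
    void onContentChange() {
        firstLevel.clear();
    }

    String get(String path, Function<String, String> renderer) {
        String hit = firstLevel.get(path);
        if (hit != null) {
            return hit;
        }
        Object lock = locks.computeIfAbsent(path, p -> new Object());
        synchronized (lock) { // thundering herd protection: one request at a time per path
            hit = firstLevel.get(path);
            if (hit != null) {
                return hit; // restored or rebuilt while this request was waiting
            }
            String stale = stalePages.get(path);
            if (stale == null) {
                // Never rendered before: everyone has to wait for the first render.
                return store(path, renderer.apply(path));
            }
            firstLevel.put(path, stale); // restore the stale page, then release the lock
        }
        // Only the restoring request gets here; it renders outside the lock so that
        // waiting and new visitors are served the stale page in the meantime.
        return store(path, renderer.apply(path));
    }

    private String store(String path, String page) {
        firstLevel.put(path, page);
        stalePages.put(path, page);
        return page;
    }

    public static void main(String[] args) {
        StaleServingCache cache = new StaleServingCache();
        cache.get("/news", p -> "version 1");
        cache.onContentChange();
        String fresh = cache.get("/news", p -> {
            // A visitor arriving during the re-render is served the stale page.
            System.out.println("during render: " + cache.get(p, x -> "not used"));
            return "version 2";
        });
        System.out.println("after render: " + fresh);
    }
}
```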
Deprecated caches
The Second Level Page Cache and Generic Cluster-Wide Cache are still available on Bloomreach Experience Manager version 14, but they are deprecated and have been removed from version 15 onwards, because they were tied to an outdated Redis integration and were not effectively used.
With Second Level Page Caching enabled, slow pages can be cached much better because every cluster node serves the same created page, and the page can in general be cached much longer because different cluster nodes won't serve different variants of the same page. This means that if you configure the second level cache entries to have a time to live of, say, 180 seconds, the cached page can be served for 3 minutes and every cluster node serves the exact same cached page. Content changes might thus take at most 3 minutes to become visible on a live site. If 3 minutes is too long, 1 minute might be acceptable.
Note that in general cached responses still come from the First Level Cache. Only when there is a cache miss in the First Level Cache is a lookup done in the (cluster-wide) Second Level Cache. If this cache has a response, that response is (re)stored in the First Level Cache with an adjusted time to live, because the entry might, for example, already have lived for 120 seconds of its 180-second time to live.
Combining Second Level Page Caching and Stale Page Caching with First Level Caching makes page caching more effective and more generically applicable than First Level Caching alone.
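The lookup order and the adjusted time to live can be illustrated with a short sketch. TwoLevelLookup, Entry and the explicit now parameter are hypothetical and chosen only to make the arithmetic visible; plain maps stand in for both caches.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a first level lookup that falls back to a shared second level. */
class TwoLevelLookup {

    /** A cached page with its creation time and time to live, both in milliseconds. */
    static class Entry {
        final String page;
        final long createdAtMillis;
        final long ttlMillis;

        Entry(String page, long createdAtMillis, long ttlMillis) {
            this.page = page;
            this.createdAtMillis = createdAtMillis;
            this.ttlMillis = ttlMillis;
        }

        /** Time this entry may still be served, never negative. */
        long remaining(long now) {
            return Math.max(0, createdAtMillis + ttlMillis - now);
        }
    }

    final Map<String, Entry> firstLevel = new HashMap<>();
    final Map<String, Entry> secondLevel = new HashMap<>(); // stands in for the cluster-wide cache

    /** Returns the cached page, or null when neither level has a live entry. */
    String lookup(String path, long now) {
        Entry local = firstLevel.get(path);
        if (local != null && local.remaining(now) > 0) {
            return local.page;
        }
        Entry shared = secondLevel.get(path);
        if (shared == null || shared.remaining(now) == 0) {
            return null; // miss on both levels: the caller has to render the page
        }
        // Re-store locally with only the time the shared entry has left,
        // for example 180 s - 120 s = 60 s.
        firstLevel.put(path, new Entry(shared.page, now, shared.remaining(now)));
        return shared.page;
    }
}
```

Storing only the remaining time keeps every node from serving the page beyond the moment the shared entry itself expires.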