Optimize Repository Queries with Date Range Constraints for Performance
Avoid slow date range repository queries by specifying a date resolution.
Jackrabbit, on which Hippo Repository is based, performs poorly on date range queries (queries with a comparison constraint on properties of type Date). The slowdown is noticeable even for a limited set of documents. The problem is described in the blog post "Make your date range queries in Jackrabbit go faster", which also includes a workaround using a derived data function.
This page describes how to avoid poor performance when using date ranges in direct repository queries. For a similar approach using Hippo's delivery tier API, see "Optimize Delivery Tier API Queries with Date Range Constraints for Performance".
Date Range Constraints
An example of an XPath query using a date range constraint is:
//element(*,custom:document) [@custom:date >=xs:dateTime('2009-01-01T03:23:54.234Z') and @custom:date <=xs:dateTime('2013-01-01T06:41:30.056Z')] order by @custom:date descending
The xs:dateTime(...) constraint fragment is generated from a target timestamp like this:
final Calendar calendar = ...;
String xsDateTimeFormat = session.getValueFactory().createValue(calendar).getString();
The constraints above result in range queries with millisecond resolution, which can be very slow or even cause out-of-memory conditions. Therefore, always use one of the supported resolutions described later in this document.
Write Fast and Scalable Date Range Queries
Typically, a resolution of days, months or even years is sufficient for date range queries for documents on a website or in the CMS. For example, a visitor wants to limit the search results to documents published between 03-03-2013 and 04-03-2013, or documents last modified in 2012 or 2013.
Hippo Repository indexes all Calendar JCR properties just like Jackrabbit does, and additionally indexes them at coarser resolutions to support fast date range queries. The supported fast resolutions are:
- year
- month
- day
- hour
The utility class DateTools of the Hippo Repository API simplifies the creation of XPath date range constraints. There are two helper methods that result in fast date range constraints:
String DateTools#getPropertyForResolution(String property, org.hippoecm.repository.util.DateTools.Resolution resolution)
String DateTools#createXPathConstraint(javax.jcr.Session session, java.util.Calendar calendar, org.hippoecm.repository.util.DateTools.Resolution roundDateBy)
getPropertyForResolution returns the name of the indexed property that holds the document's Date property at the desired resolution; this is the property the query evaluates. createXPathConstraint transforms the date to compare against (given by calendar) into an XPath fragment, rounded to the desired resolution.
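To make "applying the desired resolution" concrete, the following is a minimal sketch of rounding a timestamp down to the start of its year, month, day or hour using only java.time. It is not Hippo's implementation: the class name, the Resolution enum and the roundDown method are hypothetical, and rounding in UTC is an assumption.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;

/** Illustrative only: not part of the Hippo Repository API. */
final class ResolutionRoundingSketch {

    /** Hypothetical stand-in for the resolutions discussed in the text. */
    enum Resolution { YEAR, MONTH, DAY, HOUR }

    /** Rounds a timestamp down to the start of the given resolution, in UTC (assumed). */
    static Instant roundDown(Instant timestamp, Resolution resolution) {
        ZonedDateTime utc = timestamp.atZone(ZoneOffset.UTC);
        switch (resolution) {
            case YEAR:
                return utc.withDayOfYear(1).truncatedTo(ChronoUnit.DAYS).toInstant();
            case MONTH:
                return utc.withDayOfMonth(1).truncatedTo(ChronoUnit.DAYS).toInstant();
            case DAY:
                return utc.truncatedTo(ChronoUnit.DAYS).toInstant();
            case HOUR:
                return utc.truncatedTo(ChronoUnit.HOURS).toInstant();
            default:
                throw new IllegalArgumentException("Unsupported resolution: " + resolution);
        }
    }

    public static void main(String[] args) {
        Instant exact = Instant.parse("2009-01-01T04:04:56.456Z");
        for (Resolution resolution : Resolution.values()) {
            System.out.println(resolution + " -> " + roundDown(exact, resolution));
        }
    }
}
```

Many distinct millisecond timestamps collapse onto one rounded value, which is what keeps a range query over the rounded property small.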
Example: Fast Date Range with Resolution "Day"
Assume that we want to rewrite the XPath query
//element(*,custom:document) [@custom:date >=xs:dateTime('2009-01-01T04:04:56.456Z')] order by @custom:date descending
to a fast date range query using resolution "day". This can be done as follows:
Calendar calendar = ...; // 2009-01-01T04:04:56.456Z
// the custom:date property for resolution 'day'
String xpathProperty = DateTools.getPropertyForResolution("custom:date", DateTools.Resolution.DAY);
// the xpath constraint for custom:date for resolution 'day'
String xpathDate = DateTools.createXPathConstraint(session, calendar, DateTools.Resolution.DAY);
String xpath = "//element(*,custom:document) [@" + xpathProperty + " >= " + xpathDate + "] order by @custom:date descending";
The code above results in the following XPath query:
//element(*,custom:document) [@custom:date____day >=xs:dateTime('2009-01-01T00:00:00.000Z')] order by @custom:date descending
- The property to do the range query on has changed into @custom:date____day.
- The value in xs:dateTime is rounded to day: it now ends with T00:00:00.000Z.
- The order by is done on the original @custom:date, so the sorting is still done on exact timestamps.
The rewritten query executes fast and scales to millions of documents. The coarser the resolution, the faster the query becomes. With the resolution set to year, a date range query typically executes even faster than the same search without a range constraint.
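Note that a query with a resolution returns at least as many results as the same query with a finer resolution or without a resolution. This can be seen in the "day" resolution example above, where
//element(*,custom:document) [@custom:date >=xs:dateTime('2009-01-01T04:04:56.456Z')] order by @custom:date descending
got translated into
//element(*,custom:document) [@custom:date____day >=xs:dateTime('2009-01-01T00:00:00.000Z')] order by @custom:date descending
A document with custom:date = 2009-01-01T03:04:56.456Z (an hour before 2009-01-01T04:04:56.456Z) does match the query based on "day" resolution, but not the one without resolution.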
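The difference in matches can be checked with a small sketch that compares the same document date against the same lower bound, once exactly and once after rounding both sides down to the day. The class and method names are hypothetical, and day boundaries are assumed to be in UTC; this models the comparison only, not how the repository evaluates the query.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

/** Illustrative only: models the two comparisons, not the repository's query engine. */
final class DayResolutionMatchSketch {

    /** Exact comparison, as in the query without a resolution. */
    static boolean matchesExact(Instant documentDate, Instant lowerBound) {
        return !documentDate.isBefore(lowerBound);
    }

    /** Both sides are rounded down to the start of the UTC day before comparing. */
    static boolean matchesByDay(Instant documentDate, Instant lowerBound) {
        Instant documentDay = documentDate.truncatedTo(ChronoUnit.DAYS);
        Instant boundDay = lowerBound.truncatedTo(ChronoUnit.DAYS);
        return !documentDay.isBefore(boundDay);
    }

    public static void main(String[] args) {
        Instant lowerBound = Instant.parse("2009-01-01T04:04:56.456Z");
        Instant documentDate = Instant.parse("2009-01-01T03:04:56.456Z");
        System.out.println("exact: " + matchesExact(documentDate, lowerBound));
        System.out.println("day:   " + matchesByDay(documentDate, lowerBound));
    }
}
```

For the document dated an hour before the bound, the exact comparison is false and the day-based comparison is true. If the extra matches matter, the calling code can filter results on the exact timestamp afterwards.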
Query support for documents on a specific Year, Month, Day or Hour
The fast scalable range query support described above can (and should) also be used for queries like
- All documents last modified in 2013
- All documents with publication date in February 2013
- All documents modified on 2013-02-03
For queries within a specific year, month, day or hour, the query can be turned into an equals comparison. For example, the first query above can (and should) be written as:
Calendar any2013Date = Calendar.getInstance();
any2013Date.set(Calendar.YEAR, 2013);
// the custom:date property for resolution 'Year'
String xpathProperty = DateTools.getPropertyForResolution("custom:date", DateTools.Resolution.YEAR);
// the xpath constraint for custom:date for resolution 'Year'
String xpathDate = DateTools.createXPathConstraint(session, any2013Date, DateTools.Resolution.YEAR);
// note below a = and not a range!
String xpath = "//element(*,custom:document) [@" + xpathProperty + " = " + xpathDate + "]";
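The reason an equals comparison is enough can be sketched for the "February 2013" case: every timestamp inside the month rounds to the same month value, and no timestamp outside it does. The class and method names below are hypothetical, and month boundaries are assumed to be in UTC.

```java
import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;

/** Illustrative only: shows why equality on a rounded value selects one whole month. */
final class MonthEqualitySketch {

    /** Reduces a timestamp to its year and month in UTC (assumed). */
    static YearMonth monthOf(Instant timestamp) {
        return YearMonth.from(timestamp.atZone(ZoneOffset.UTC));
    }

    /** True when both timestamps fall in the same calendar month. */
    static boolean inSameMonth(Instant documentDate, Instant anyDateInTargetMonth) {
        return monthOf(documentDate).equals(monthOf(anyDateInTargetMonth));
    }

    public static void main(String[] args) {
        Instant anyFebruary2013Date = Instant.parse("2013-02-15T12:00:00Z");
        System.out.println(inSameMonth(Instant.parse("2013-02-01T00:00:00Z"), anyFebruary2013Date));
        System.out.println(inSameMonth(Instant.parse("2013-03-01T00:00:00Z"), anyFebruary2013Date));
    }
}
```

Any date inside the target period works as the comparison value, which is why the Calendar passed to the helper only needs the relevant fields set.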
Detailed explanation of the Date Range Query performance problem in Jackrabbit
When a Date property in Jackrabbit gets indexed in Lucene, its exact (unrounded) value is used. Hippo documents store several timestamps as Date properties with millisecond resolution, such as creationDate, lastModifiedDate or publicationDate.
A range query in Lucene results in a query expansion, similar to a BooleanQuery, in which all unique values in the specified range are OR-ed. When the property on which the range query is performed is a timestamp, this results in one OR term per document. Suppose 100,000 documents each have a unique timestamp for creationDate, and a range query specifies both a lower and an upper bound. Jackrabbit treats the < and > comparisons as separate ranges, so instead of expanding only the values between the bounds, it expands each range separately, and together they cover all terms. The result is an OR query with 100,000 terms, which does not perform.
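The effect of rounding on the number of expanded terms can be estimated with a simplified model that counts distinct values at millisecond and at day resolution. It does not use Lucene or Jackrabbit; the class name, the method names and the evenly spaced test data are all assumptions made for illustration.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: a simplified model of unique index terms, without Lucene. */
final class TermCountSketch {

    /** Counts distinct values after truncating each timestamp to the given unit. */
    static int distinctTerms(long[] epochMillis, ChronoUnit unit) {
        Set<Instant> terms = new HashSet<>();
        for (long millis : epochMillis) {
            terms.add(Instant.ofEpochMilli(millis).truncatedTo(unit));
        }
        return terms.size();
    }

    /** Generates unique timestamps, evenly spaced from the given start (synthetic data). */
    static long[] evenlySpaced(Instant start, int count, long stepMillis) {
        long[] result = new long[count];
        long first = start.toEpochMilli();
        for (int i = 0; i < count; i++) {
            result[i] = first + i * stepMillis;
        }
        return result;
    }

    public static void main(String[] args) {
        // 100,000 synthetic documents, one created every 1,000 seconds.
        long[] creationDates = evenlySpaced(Instant.parse("2009-01-01T00:00:00Z"), 100_000, 1_000_000L);
        System.out.println("millisecond terms: " + distinctTerms(creationDates, ChronoUnit.MILLIS));
        System.out.println("day terms:         " + distinctTerms(creationDates, ChronoUnit.DAYS));
    }
}
```

With this synthetic data the millisecond count equals the number of documents, while the day count is bounded by the number of days the data spans, regardless of how many documents there are.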
To solve the problem outlined above, Hippo Repository ships out of the box with rounded timestamps, on which you can do fast, efficient and scalable date range queries.
Note that Lucene 2.9 and higher supports NumericRangeQuery (TrieRangeQuery) for fast range queries in general, for example on date fields. However, supporting this in Jackrabbit is nontrivial and has not been done (yet). Also, while NumericRangeQuery is far more efficient than a normal range query on dates, it is by far not as efficient as the range queries on resolution supported by Hippo Repository.