Zero-Code Integration with Enterprise Search
Kenan Salic
2019-06-06
Over the last couple of years, we've reinvented the wheel for enterprise search integrations with Bloomreach Experience Manager (brXM) numerous times. Every time, a better solution seems to become available. Having promised myself that I would not write another single line of code during my next projects, this integration nevertheless seemed inevitable.
Even though brXM currently supports search out of the box, it is quite limited (content search powered by brSM is pending). The OOTB search (Lucene) is rather basic; an enterprise search solution is therefore needed to utilize more advanced features.
There are several reasons why projects require more advanced features and should consider an enterprise search solution:
- Better Search Results
- Personalization
- Managing Synonyms
- Content from multiple sources
- Insights
- Highlights
- Advanced Querying
In one of my last projects, the request arose again to integrate with an enterprise search engine, an integration very similar to that of Elasticsearch or Apache Solr. This time, I was thinking about doing it differently than before: leverage everything we currently have, i.e. existing plugins and practices, and use them efficiently. The goal is to write zero code. For this article, I was very much inspired by the article previously posted by Woonsan in 2015 about how to integrate with Elasticsearch. I've used the same starting concept of routing the message from brXM to a message broker. This blog post uses a different approach to route the index from the broker to the search solution.
The full solution is here: https://github.com/ksalic/brxm-zero-code-search-integration
Before architecting a solution, we need to be aware of how the data (content) gets from the Experience Manager into the search solution. There are two main principles:
Event-Based Indexing:
Whenever a document is published in brXM, that particular document should be indexed in the search solution. And whenever a document is depublished, the corresponding entry should be removed from the search index.
Full Index:
Index all published documents on demand, on a schedule, or at any particular time. This is usually used for the initial index, or in the event the index has been deleted unintentionally.
In this next section, I want to spend some time drawing up an architecture for a proposed solution and creating an inventory of plugins and practices to use, starting with a solution for event-based indexing.
We need to get content during publication of a document from brXM over to the enterprise search solution.
Currently, we have several options at our disposal. We could write a custom daemon module and listen to workflow events using the event bus. We could also hook into the SCXML document workflow and do a similar setup. Choosing between the event bus and the SCXML workflow is usually a matter of whether the feature needs to be synchronous (SCXML) or can be asynchronous (event bus). In my case, it does not matter whether the event is processed synchronously or asynchronously.
The only requirement that is important to me is a solution which does not require any code and therefore little maintenance. Again, and this is very important to me, I do not want to write a single line of code, or at least as little as possible.
Luckily, we do have a solution for a no-code setup. Some time ago, my colleague Woonsan created an extension for Apache Camel to interoperate with Hippo events, such as publication events.
Apache Camel is an open source framework for message-oriented middleware with a rule-based routing and mediation engine that provides a Java object-based implementation of the Enterprise Integration Patterns using an application programming interface to configure routing and mediation rules.
Camel uses a Java Domain Specific Language or DSL for creating Enterprise Integration Patterns or Routes. Camel also supports a Spring-based XML configuration which is very suitable for my no-code setup. I can write the whole integration in configuration.
Although I have chosen to write XML rather than code, this whole integration could just as easily have been written with the Java DSL. The Java DSL is quite useful for injecting business logic anywhere in the routes and can easily coexist with XML configuration.
In my initial setup I will leverage the Apache Camel component to receive the workflow events and route these towards the search solution.
To separate concerns, i.e. to keep the processing of event-based indexing and full indexing (which can be intensive) away from the experience platform, I am introducing a new component (microservice) into the architecture which takes care of all operations regarding indexing. I'm calling this the "search service". The search service will receive workflow events from brXM, and possibly also events from other sources, to build a search index from multiple sources. I've chosen Spring Boot to create this search service.
There are several ways to secure the connection between brXM and the search service. Because the search service is a microservice based on Spring Boot, I could use REST endpoints and secure these with Spring Security.
To be more resilient, more secure, and to write less code, I'm introducing another component into the architecture, one that lets applications communicate without being tightly coupled to one another: an enterprise message broker, e.g. ActiveMQ, Qpid, RabbitMQ, Azure Service Bus, etc.
In a recent project we used Azure Service Bus. It is quite easy to use the same setup (route) with an alternative message broker such as ActiveMQ.
You can instantiate an ActiveMQ instance easily through a cloud provider such as Amazon MQ: https://aws.amazon.com/amazon-mq/.
Or install Apache ActiveMQ locally; for that, please follow the guide at http://activemq.apache.org/getting-started.html. But here are simplified steps, for this demo and testing purposes only:
- Download the latest ActiveMQ binary from http://activemq.apache.org/download.html.
- Extract the downloaded archive file to the project root folder. So, for example, you will have an 'apache-activemq-x.x.x' subfolder in the project root folder.
- Open a new command line console and move to the ./apache-activemq-x.x.x/bin/ folder.
- Run the following to start in console mode (you can type Ctrl+C to stop it):
  $ ./activemq console
- You can check/monitor messages in the ActiveMQ Admin Console. Visit http://localhost:8161/admin/ (log in with admin/admin by default).
The demo project is using a cloud demo instance on Amazon MQ.
Azure Service Bus and ActiveMQ both support JMS over the AMQP protocol. This means the same connection client can be used for different message brokers, making the Camel route identical for each broker used in the architecture. We will get to the route and client configuration soon.
The next part of the architecture is to define and model the data. Usually, with this type of integration, we send the data directly as a JSON object to the enterprise search solution. So how will we convert the workflow event object of a CMS document to a JSON document? And how do we route this to the search solution? Because the workflow event is triggered asynchronously (outside the HST request context) in brXM, it is not that easy to generate the link of the page on which the document is displayed (the link creator requires a request context). Not impossible, but it would definitely require some additional business logic to make this work. An alternative is to solve this architecturally by exposing a REST endpoint in the delivery tier of brXM. This REST endpoint needs to expose a single document as JSON or XML based on its ID (derived from the workflow event) for event-based indexing, and it also needs to expose a full list of documents for the full index:
Event-Based Indexing: retrieve a single published document as JSON by its identifier, e.g. GET /api/documents/{id}.
Full Index: retrieve the full, paginated list of published documents, e.g. GET /api/documents.
Again, several options are available: there is the Content REST API, the Plain REST API, the OAI-PMH plugin and the Content HAL plugin. It is great to have so many choices of CaaS offerings with brXM.
Quick analysis: the Plain REST API introduces additional code; even though most of it is generated by Essentials, we would still need to tweak it. The choice is therefore between the Content REST API, the OAI-PMH plugin and the Content HAL plugin, all three of which can be used directly as-is.
While the OAI-PMH plugin is architecturally the best choice for this type of integration because of its features (e.g. the resumption token instead of pagination), I preferred the HAL plugin as it is more extensible and lets you control which fields to include in the API response that is routed towards the search index. For my project, which is an SPA project, I needed to change the HAL resource slightly to calculate the page paths in code to make it useful for SPA search.
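For orientation, the later routes assume a HAL-style collection response along these lines. This is only an illustrative sketch: the _meta and _embedded fields match what the routes below read, while the document fields themselves (title, path) are made-up examples and depend on how the plugin is configured:
{
  "_meta": { "totalSize": 42 },
  "_embedded": {
    "documents": [
      {
        "_meta": { "id": "a1b2c3d4-...." },
        "title": "Example document",
        "path": "/news/example-document"
      }
    ]
  }
}
The single-document endpoint would then return one such document object, which the publish route further down sends to the search index as-is.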
Time to... code! Or rather, let's start putting everything together in a working solution.
We will first start with routing the event object to the message broker.
We can configure the Camel route in the CMS application or in the delivery tier application. Prior to brXM 13, I personally would have chosen to instantiate the route context in the delivery tier (Spring dependencies, HST API, etc.). But now that the HST is part of the CMS, it does not matter that much anymore.
In the demo, I've created the route context in the CMS application.
<?xml version="1.0" encoding="UTF-8"?>
<camelContext xmlns="http://camel.apache.org/schema/spring" id="search-integration">
  <route id="Route-HippoEvent-to-MB">
    <!-- Subscribe to publish/depublish events as JSON from the HippoEventBus. -->
    <from uri="hippoevent:?category=workflow&amp;action=publish,depublish" />
    <!-- Convert the JSON message to a String. -->
    <convertBodyTo type="java.lang.String" />
    <!-- Leave a log entry. -->
    <to uri="log:org.onehippo.forge.camel.demo?level=INFO" />
    <!-- Route the String message to the message broker. -->
    <to uri="amqp:topic:content-search" />
  </route>
</camelContext>
The route starts from an event in the repository. With the Hippo events Camel plugin, there is fine-grained control over which events are allowed to progress through the route, by adding options as query parameters.
The body of the route after the first statement is a HippoEvent object:
org.onehippo.cms7.event.HippoEvent
The second statement converts the HippoEvent object to a String; after leaving a log entry, the String is routed as a message towards the message broker (the final statement).
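To give an idea of what ends up on the topic: the stringified event is a small JSON object. The sketch below is illustrative (the values are made up), but the action and subjectId fields are the ones the search service routes rely on later:
{
  "action": "publish",
  "category": "workflow",
  "subjectId": "a1b2c3d4-....",
  "user": "admin"
}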
With some additional tools, the message broker, in our case ActiveMQ, can be monitored to check if these messages are being delivered successfully to the queue or topic.
The client can be set up in Spring XML by configuring a JMS connection factory and connecting it to the AMQP component for Camel:
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="...">
  <bean id="jmsConnectionFactory" class="org.apache.qpid.jms.JmsConnectionFactory">
    <property name="remoteURI" value="amqps://${jms.remoteuri}?amqp.traceFrames=true&amp;amqp.idleTimeout=120000" />
    <property name="username" value="${jms.username}" />
    <property name="password" value="${jms.password}" />
  </bean>
  <bean id="jmsCachingConnectionFactory" class="org.springframework.jms.connection.CachingConnectionFactory">
    <property name="targetConnectionFactory" ref="jmsConnectionFactory" />
  </bean>
  <bean id="jmsConfig" class="org.apache.camel.component.jms.JmsConfiguration">
    <property name="connectionFactory" ref="jmsCachingConnectionFactory" />
    <property name="cacheLevelName" value="CACHE_CONSUMER" />
  </bean>
  <bean id="amqp" class="org.apache.camel.component.amqp.AMQPComponent">
    <property name="configuration" ref="jmsConfig" />
  </bean>
</beans>
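The ${jms.*} placeholders above, and the ${brxm.hal.url} and ${elasticsearch.host} placeholders used further on, need to resolve to values for your own environment (for example via application.properties in the Spring Boot search service). The values below are purely hypothetical:
# Hypothetical example values; replace with your own broker, brXM and Elasticsearch endpoints.
jms.remoteuri=my-broker.example.com:5671
jms.username=camel-demo
jms.password=change-me
brxm.hal.url=localhost:8080/site
elasticsearch.host=localhost:9200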
Next up is to install and configure the HAL plugin.
The brXM part is done now. Next up is the search service microservice.
For this part, we simply take one of the Apache Camel examples and make it our own:
https://github.com/apache/camel/tree/master/examples/camel-example-spring-boot-xml
The Spring Boot application from the example is simple. We modify the XML of the route so that the message from the message broker is polled by the search service, the subjectId (canonical handle ID) of the workflow event is used as the identifier to retrieve the full JSON document (including the page link) from the HAL API, and the response of the API is routed to the enterprise search solution, which in our case will be Elasticsearch.
<?xml version="1.0" encoding="UTF-8"?>
<camelContext xmlns="http://camel.apache.org/schema/spring" id="SearchService">
  <route id="Topic-to-Search">
    <from uri="amqp:topic:content-search" />
    <unmarshal ref="json-map" />
    <choice>
      <when>
        <simple>${body[action]} == 'publish'</simple>
        <log message="publish ${body}" />
        <setProperty propertyName="indexId">
          <simple>${body[subjectId]}</simple>
        </setProperty>
        <setHeader headerName="CamelHttpMethod">
          <constant>GET</constant>
        </setHeader>
        <setBody>
          <constant>null</constant>
        </setBody>
        <recipientList>
          <simple>http4://${properties:brxm.hal.url}/api/documents/${property.indexId}</simple>
        </recipientList>
        <convertBodyTo type="java.lang.String" />
        <setHeader headerName="indexId">
          <simple>${property.indexId}</simple>
        </setHeader>
        <to uri="elasticsearch-rest://esDevCluster?operation=INDEX&amp;indexName=content&amp;indexType=document" />
      </when>
    </choice>
  </route>
</camelContext>
In the above example, to retrieve the JSON from the HAL API, we use Camel's http4 component for HTTP service calls, and the elasticsearch-rest component to index into Elasticsearch.
The client for Elasticsearch can be set up in a similar way as the JMS-over-AMQP client:
<?xml version="1.0" encoding="UTF-8"?>
<bean id="elasticsearch-rest" class="org.apache.camel.component.elasticsearch.ElasticsearchComponent">
  <property name="hostAddresses" value="${elasticsearch.host}" />
</bean>
The above example will index the document in the search solution. An additional snippet in the route checks for a depublish event and uses the identifier to remove the document from the index:
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="...">
  <camelContext xmlns="http://camel.apache.org/schema/spring" id="SearchService">
    <route id="Topic-to-Search">
      <from uri="amqp:topic:content-search" />
      <unmarshal ref="json-map" />
      <choice>
        <when>
          <simple>${body[action]} == 'publish'</simple>
          <!-- ... publish handling as shown above ... -->
        </when>
        <when>
          <simple>${body[action]} == 'depublish'</simple>
          <log message="depublish ${body}" />
          <setHeader headerName="indexId">
            <simple>${body[subjectId]}</simple>
          </setHeader>
          <setBody>
            <simple>${body[subjectId]}</simple>
          </setBody>
          <to uri="elasticsearch-rest://esDevCluster?operation=DELETE&amp;indexName=content&amp;indexType=document" />
        </when>
      </choice>
    </route>
  </camelContext>
</beans>
What lies above covers event-based indexing. Next up is the full index, used for the initial indexing run. Usually, we start from a situation where published documents already exist and need to be indexed. It could also be that every now and then we want to trigger a full reindex.
For this situation we can have the following route:
<?xml version="1.0" encoding="UTF-8"?>
<route id="Initial-Index">
  <from uri="timer://runOnce?repeatCount=1&amp;delay=5000" />
  <recipientList>
    <simple>http4://${properties:brxm.hal.url}/api/documents</simple>
  </recipientList>
  <unmarshal ref="json-map" />
  <loop>
    <simple>${body[_meta][totalSize]}</simple>
    <log message="http - ${properties:brxm.hal.url}/api/documents?_offset=${property.CamelLoopIndex}&amp;_limit=1" />
    <setHeader headerName="CamelHttpMethod">
      <constant>GET</constant>
    </setHeader>
    <setBody>
      <constant>null</constant>
    </setBody>
    <recipientList>
      <simple>http4://${properties:brxm.hal.url}/api/documents?_offset=${property.CamelLoopIndex}&amp;_limit=1</simple>
    </recipientList>
    <unmarshal ref="json-map" />
    <setBody>
      <simple>${body[_embedded][documents][0]}</simple>
    </setBody>
    <setHeader headerName="indexId">
      <simple>${body[_meta][id]}</simple>
    </setHeader>
    <marshal ref="json-map" />
    <log message="index ${header.indexId} with body ${body}" />
    <to uri="elasticsearch-rest://esDevCluster?operation=INDEX&amp;indexName=content&amp;indexType=document" />
  </loop>
</route>
This route will trigger a full initial index during startup of the search service by making use of the HAL paginated documents API to retrieve all published documents.
Additionally, with the Camel REST DSL we could also expose a REST endpoint to trigger a full index on demand (a sketch follows below). The REST DSL could also be used as middleware between the website and Elasticsearch, to sanitize and/or secure the Elasticsearch endpoints used for searching.
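As a rough sketch only: inside the search service's camelContext, something along these lines could expose a POST endpoint that kicks off a full index. The /reindex path and the direct:full-index endpoint are hypothetical names, and it assumes a REST-capable Camel component (for example camel-servlet) is on the classpath:
<!-- Hypothetical on-demand full index trigger via the Camel REST DSL. -->
<restConfiguration component="servlet" />
<rest path="/reindex">
  <post>
    <to uri="direct:full-index" />
  </post>
</rest>
<route id="Full-Index-On-Demand">
  <from uri="direct:full-index" />
  <log message="full reindex triggered" />
  <!-- Repeat the same steps as in the Initial-Index route above:
       page through the HAL documents API and index every document into Elasticsearch. -->
</route>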
Let’s check our Elastic Index to see if our documents have been indexed correctly.
Great Success!
As a bonus, to close the integration full circle, I've also included a search component in the demo, using CRISP to retrieve the search results from Elasticsearch. Using CRISP is the recommended practice for creating a REST client and requires the least amount of code.