Initialize and configure the Vector Store and Ingestion process
Overview
This guide walks you through the process of setting up and configuring the BrXM AI Vector Store and Ingestion Process.
The Vector Store relies on an external vector database. Currently, Redis and Postgres databases are supported, for on-premise projects only. Both can be tried out locally by running them in a container; for more information, see the Redis documentation for Redis and the PgVector container page for Postgres (PgVector).
A managed vector store for Search Agent is planned for Bloomreach Cloud in Q2 2026. Please reach out to your Account Manager for more information.
The Ingestion Process is a background process that runs within a CMS pod, listens for workflow events, and updates the Vector Store accordingly. It can run in:
- preview mode, where it reacts whenever a content item is saved or deleted, correspondingly updating or removing the vectors in the store. The unpublished variant of the content item is indexed, so the vector store contains preview content.
- live mode, where it reacts whenever a content item is published or taken offline, again updating or removing the vectors in the store. Only the published variant is indexed, so the vector store contains only published content.
During ingestion, the configured embedding model is asked to generate embeddings for a document, and the Vector Store is then updated with the new vectors. Generating embeddings has cost and time implications, but no performance implications, as it happens externally, on the model provider's side.
Installation
The Vector Store requires a running external Redis or Postgres instance. The Ingestion process is installed automatically when the AI module is installed.
Configure via properties files
To configure the Vector Store and Ingestion in a production-ready way, use properties files in any of the locations listed below. The order of this list is important: we look for properties files in all these locations, but if a property is found in more than one file, the property from the location higher in this list takes precedence.
- System properties passed on the command line
- A properties file named xm-ai-service.properties, visible in the classpath
- The project's platform.properties file
  - For Bloomreach Cloud implementations, see Set Environment Configuration Properties
  - For on-premise implementations, see HST-2 Container Configuration
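As a rough illustration of the precedence rule, assume the same property is defined in two of the locations above. The values are made up for the example, and the two files are shown in one block only for comparison:

```properties
# xm-ai-service.properties (on the classpath, higher in the list)
brxm.ai.vectorstore=Redis

# platform.properties (lower in the list, so this value is ignored
# as long as the property above is present)
brxm.ai.vectorstore=PgVector
```

Under the rule described above, the active store would resolve to Redis. Passing the property as a system property on the command line (for a JVM, typically -Dbrxm.ai.vectorstore=PgVector) would take precedence over both files.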
Multiplicity of configurations
Properties files can hold the configuration for multiple model providers and vector stores at once, but only one is active at a time.
Global Configuration options
The brxm.ai.vectorstore property is used to specify the name of the active Vector store. Possible values are:
- Redis
- PgVector
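A minimal sketch of how several stores might sit side by side in one properties file, with brxm.ai.vectorstore selecting the active one. The host name and JDBC URL are placeholders, and the remaining required properties of each store are omitted for brevity:

```properties
# Select the active Vector Store: Redis or PgVector
brxm.ai.vectorstore=Redis

# Redis settings (in use, because Redis is the active store)
brxm.ai.vectorstore.redis.host=myredis
brxm.ai.vectorstore.redis.port=6379

# PgVector settings (configured, but inactive until selected above)
brxm.ai.vectorstore.pgvector.url=jdbc:postgresql://myhost:5432/mydbname
```

Switching stores would then only require changing the value of brxm.ai.vectorstore.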
Ingestion Process Options
The Ingestion process is governed by the following properties:
| Property | Required | Type | Description | Default | Example |
|---|---|---|---|---|---|
| brxm.ai.ingest.mode | yes | enum | Ingestion operating mode. In preview mode, unpublished content is indexed upon save/rename/copy/move operations, while in live mode only published content is indexed, during publication (and scheduled publication). | preview or live | |
| brxm.ai.ingest.types | no | list of doctypes | Comma-separated list of fully qualified document types. Only content of those types will be indexed. If not provided, no content will be ingested. Removal from the store ignores this filter. | No types allowed | myproject:bannerdocument, hippogallery:exampleAssetSet |
| brxm.ai.ingest.include-dirs | no | list of paths | Comma-separated list of absolute paths. Unless the document is under one of those paths, it will be skipped from ingestion. Removal from the store ignores this filter. | Any path is considered included | /content/documents/myproject/banners/, /content/documents/myproject/news/ |
| brxm.ai.ingest.exclude-dirs | no | list of paths | Comma-separated list of absolute paths. If the document is under any of those paths, it will be skipped from ingestion. Removal from the store ignores this filter. | No path is considered excluded | /content/documents/myproject/taxonomies/, /content/documents/myproject/private/ |
| brxm.ai.ingest.initial-delay | no | integer (seconds) | How long after system startup, in seconds, the Ingestion process begins | 300 | 1000 |
| brxm.ai.ingest.interval | no | integer (seconds) | How often the process runs | 12 | 60 |
| brxm.ai.ingest.batch-size | no | integer | The ingestion processes multiple documents at once and sends them all together to the Vector Store. Reduce if your Vector Store gets overloaded. | 5 | 2 |
| brxm.ai.ingest.delay | no | integer (milliseconds) | Back-off time after processing each batch. Increase if your Vector Store gets overloaded. | 1000 | 10000 |
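The options above could be combined as in the following sketch, which assumes a live-mode setup that is throttled to ease load on the Vector Store. Document types and paths are taken from the Example column and stand in for your own project's values. The lists are written without spaces after the commas, since whitespace handling is not specified here:

```properties
# Index only published content
brxm.ai.ingest.mode=live

# Only documents of these types are ingested
brxm.ai.ingest.types=myproject:bannerdocument,hippogallery:exampleAssetSet

# Limit ingestion to selected folders and skip others
brxm.ai.ingest.include-dirs=/content/documents/myproject/banners/,/content/documents/myproject/news/
brxm.ai.ingest.exclude-dirs=/content/documents/myproject/taxonomies/,/content/documents/myproject/private/

# Start 300 seconds after startup, then run every 60 seconds
brxm.ai.ingest.initial-delay=300
brxm.ai.ingest.interval=60

# Smaller batches and a longer back-off (10 seconds) between batches
brxm.ai.ingest.batch-size=2
brxm.ai.ingest.delay=10000
```

Note that brxm.ai.ingest.types must be set for anything to be ingested, because the default allows no types.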
Redis options
| Property | Required | Type | Description | Default | Example |
|---|---|---|---|---|---|
| brxm.ai.vectorstore.redis.host | yes | url | The hostname of your Redis instance | myredis, 127.0.0.1 | |
| brxm.ai.vectorstore.redis.port | yes | integer | The port your Redis instance is listening on | 6379 | |
| brxm.ai.vectorstore.redis.index | yes | string | The index name that is used in Redis to store all your embeddings | myindex | |
| brxm.ai.vectorstore.redis.prefix | yes | string | Each entry in your index is prefixed with this string. Used for identification purposes. | my_prefix | |
| brxm.ai.vectorstore.redis.user | no | string | When stronger security is used, a user must be provided | | |
| brxm.ai.vectorstore.redis.password | no | string | When stronger security is used, a password must be provided | | |
| brxm.ai.vectorstore.redis.client-name | no | string | The name of the connecting application. Used for identification purposes. | myAppName | |
| brxm.ai.vectorstore.redis.timeout-millis | no | integer | Timeout, in milliseconds, for connections to the Redis database | latest Redis default | 5000 |
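Put together, a Redis setup with authentication might look like the sketch below. The user name and password are placeholders introduced for illustration; the other values come from the table above:

```properties
brxm.ai.vectorstore=Redis

# Required connection settings
brxm.ai.vectorstore.redis.host=myredis
brxm.ai.vectorstore.redis.port=6379
brxm.ai.vectorstore.redis.index=myindex
brxm.ai.vectorstore.redis.prefix=my_prefix

# Optional: credentials, only needed when Redis enforces authentication
# (placeholder values)
brxm.ai.vectorstore.redis.user=myuser
brxm.ai.vectorstore.redis.password=changeme

# Optional: client identification and connection timeout (5 seconds)
brxm.ai.vectorstore.redis.client-name=myAppName
brxm.ai.vectorstore.redis.timeout-millis=5000
```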
PgVector options
| Property | Required | Type | Description | Default | Example |
|---|---|---|---|---|---|
| brxm.ai.vectorstore.pgvector.url | yes | url | The connection string to PgVector | jdbc:postgresql://myhost:5432/mydbname | |
| brxm.ai.vectorstore.pgvector.username | yes | string | The username to connect to PgVector | | |
| brxm.ai.vectorstore.pgvector.password | yes | string | The password to connect to PgVector | | |
| brxm.ai.vectorstore.pgvector.dimensions | yes | integer | Embeddings dimension. The dimension is set on the embedding column upon initial table creation. If you change the dimensions, you have to re-create the vector_store table as well. | 1536 | |
| brxm.ai.vectorstore.pgvector.index-type | no | string | Nearest neighbor search index type. Options are: NONE (exact nearest neighbor search), IVFFlat (divides vectors into lists, then searches the subset of those lists closest to the query vector), and HNSW (creates a multilayer graph). | HNSW | |
| brxm.ai.vectorstore.pgvector.distance-type | no | string | Search distance type. If vectors are normalized to length 1, you can use EUCLIDEAN_DISTANCE or NEGATIVE_INNER_PRODUCT for best performance. | COSINE_DISTANCE | |
| brxm.ai.vectorstore.pgvector.remove-existing-vector-store-table | no | boolean | Deletes the existing vector_store table on startup | false | true |
| brxm.ai.vectorstore.pgvector.initialize-schema | no | boolean | Whether to initialize the required schema | false | true |
| brxm.ai.vectorstore.pgvector.schema-name | no | string | Vector store schema name | public | myschema |
| brxm.ai.vectorstore.pgvector.table-name | no | string | Vector store table name | vector_store | my_vector_table |
| brxm.ai.vectorstore.pgvector.schema-validation | no | boolean | Enables schema and table name validation to ensure they are valid and existing objects | false | true |
| brxm.ai.vectorstore.pgvector.max-document-batch-size | no | integer | Maximum number of documents to process in a single batch | 10000 | |
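A PgVector setup could be sketched as follows, assuming a first-time installation where the schema is created on startup. The user name and password are placeholders introduced for illustration; the other values come from the table above:

```properties
brxm.ai.vectorstore=PgVector

# Required connection settings (credentials are placeholders)
brxm.ai.vectorstore.pgvector.url=jdbc:postgresql://myhost:5432/mydbname
brxm.ai.vectorstore.pgvector.username=myuser
brxm.ai.vectorstore.pgvector.password=changeme

# Should match the vector size produced by the configured embedding model
brxm.ai.vectorstore.pgvector.dimensions=1536

# Optional: index and distance settings (these are the documented defaults)
brxm.ai.vectorstore.pgvector.index-type=HNSW
brxm.ai.vectorstore.pgvector.distance-type=COSINE_DISTANCE

# Optional: create the schema objects on startup, under custom names
brxm.ai.vectorstore.pgvector.initialize-schema=true
brxm.ai.vectorstore.pgvector.schema-name=myschema
brxm.ai.vectorstore.pgvector.table-name=my_vector_table
```

Leaving remove-existing-vector-store-table at its default of false keeps previously ingested vectors across restarts.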
Logging
To help with troubleshooting issues related to ingestion and processing of the ingestion queue, the following log levels can be set for <Logger name="com.bloomreach.xm.ai.service.impl.vector.ingest" level=.. />
Setting level to:
- info shows logs when ingestion happens and when the ingestion queue is used
- debug shows more detailed logs when ingestion happens, when the scheduler runs, and when the ingestion queue is used
- trace additionally shows logs from the event listeners in the CMS, the entry points that trigger an ingestion