Write an Updater Script
Introduction
Goal
Write a Groovy Updater Script to perform bulk changes to repository content.
Background
In order to perform bulk changes to existing content in a running repository, developers have the option to write updater scripts in the Groovy language. Updater scripts have access to the full JCR API.
With Great Power Comes Great Responsibility
Updater scripts can modify large parts of your repository. Use them with care.
Security
The scripts are executed via a custom Groovy ClassLoader which protects against obvious and trivial mistakes and misuse (for example, invoking System.exit()). However, this is not intended to provide a fully protected Groovy sandbox. This means that, technically, Groovy updater scripts can be used to execute external programs, possibly compromising the server environment. Therefore, protection against incorrect usage of Groovy updater scripts must be enforced by limiting access and usage to trusted developers and administrators only.
Create a New Script
Log into the CMS as admin.
Open the Admin perspective, select Updater Editor, and click on the New button.
Enter a Name for the script.
All other options are execution options; see Run an Updater Script for more information.
Implement NodeUpdateVisitor
Updater scripts are written in Groovy and must implement the interface NodeUpdateVisitor:
/**
 * Visitor for updating repository content. Replaces
 * {@link org.hippoecm.repository.ext.UpdaterModule}s for all update tasks
 * except backward incompatible node type changes.
 */
public interface NodeUpdateVisitor {

    /**
     * Allows initialization of this updater. Called before any other method is
     * called.
     *
     * @param session a JCR {@link Session} with system credentials
     * @throws RepositoryException when thrown, the updater will not be run by
     *                             the framework
     */
    void initialize(Session session) throws RepositoryException;

    /**
     * Update the given node.
     *
     * @param node the {@link Node} to be updated
     * @return <code>true</code> if the node was changed, <code>false</code>
     *         if not
     * @throws RepositoryException if an exception occurred while updating
     *                             the node
     */
    boolean doUpdate(Node node) throws RepositoryException;

    /**
     * Revert the given node. This method is intended to be the reverse of the
     * {@link #doUpdate} method.
     * It allows update runs to be reverted in case a problem arises due to the
     * update. The method should throw an {@link UnsupportedOperationException}
     * when it is not implemented.
     *
     * @param node the node to be reverted.
     * @return <code>true</code> if the node was changed, <code>false</code>
     *         if not
     * @throws RepositoryException if an exception occurred while reverting
     *                             the node
     * @throws UnsupportedOperationException if the method is not implemented
     */
    boolean undoUpdate(Node node) throws RepositoryException, UnsupportedOperationException;

    /**
     * Allows cleanup of resources held by this updater. Called after an
     * updater run was completed.
     */
    void destroy();
}
Most scripts will extend the base class BaseNodeUpdateVisitor, which provides a logger and default (no-op) implementations of the methods initialize and destroy.
The updater engine uses the visitor pattern. For each visited node, the updater engine will call the script method doUpdate. When the script modifies the node in any way, it should notify the updater engine by returning true from that method.
The default updater script only logs the paths of all visited nodes:
package org.hippoecm.frontend.plugins.cms.admin.updater

import org.onehippo.repository.update.BaseNodeUpdateVisitor
import javax.jcr.Node
import javax.jcr.RepositoryException
import javax.jcr.Session

class UpdaterTemplate extends BaseNodeUpdateVisitor {

    boolean logSkippedNodePaths() {
        return false // don't log skipped node paths
    }

    boolean skipCheckoutNodes() {
        return false // return true for readonly visitors and/or updates unrelated to versioned content
    }

    Node firstNode(final Session session) throws RepositoryException {
        return null // implement when using custom node selection/navigation
    }

    Node nextNode() throws RepositoryException {
        return null // implement when using custom node selection/navigation
    }

    boolean doUpdate(Node node) {
        log.debug "Updating node ${node.path}"
        return false
    }

    boolean undoUpdate(Node node) {
        throw new UnsupportedOperationException('Updater does not implement undoUpdate method')
    }
}
The node parameter is a javax.jcr.Node object with which to gain full JCR access to the repository.
See example 1 (Add a property) at Groovy Updater Scripts Examples for a basic implementation.
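As a minimal sketch of the contract described above (return true only when the node was actually changed), the following script adds a string property to every visited node that lacks it. The class name AddPropertySketch, the property name myproject:reviewed, and its value are hypothetical and chosen for illustration only; the sketch also assumes that the node type of the visited nodes allows this property.

```groovy
package org.hippoecm.frontend.plugins.cms.admin.updater

import org.onehippo.repository.update.BaseNodeUpdateVisitor
import javax.jcr.Node

class AddPropertySketch extends BaseNodeUpdateVisitor {

    // Hypothetical property name, used for illustration only
    static final String PROPERTY = 'myproject:reviewed'

    boolean doUpdate(Node node) {
        if (node.hasProperty(PROPERTY)) {
            log.debug "Skipping ${node.path}: property already present"
            return false   // nothing was changed
        }
        node.setProperty(PROPERTY, 'false')
        log.debug "Added ${PROPERTY} to ${node.path}"
        return true        // notify the updater engine that the node was modified
    }

    boolean undoUpdate(Node node) {
        throw new UnsupportedOperationException('Updater does not implement undoUpdate method')
    }
}
```

Checking for the property before setting it keeps the script safe to run more than once on the same nodes.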
Implement Optional Features
Parameters
If your updater script can be reused multiple times without modification of the source, it is useful to set parameters and let your script read the parameters instead of using hard-coded values.
Parameters can be specified in the execution options as a valid JSON string which defines a map of parameter name (String) and parameter value (Object) pairs.
In your script, you can access the parameters through the parametersMap variable. For example, if you set Parameters to { "basePath": "/content/documents/myproject/news", "tag" : "gogreen" }, you can access those parameters anywhere in your updater script (e.g., in the #initialize(Session) or #doUpdate(Node) method) as follows:
def basePath = parametersMap["basePath"]
def tag = parametersMap["tag"]
log.debug "basePath: ${basePath}, tag: ${tag}"
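A reusable script usually also has to cope with a parameter that was not supplied. The following standalone sketch shows one way to fall back to a default value. The helper name param and both default values are hypothetical, and the local parametersMap is only a stand-in for the variable that the updater engine provides. Whether the engine passes an empty map or null when no parameters are configured is not stated here, so the helper handles both.

```groovy
// Hypothetical helper: returns the named parameter, or the fallback when it is absent
def param(Map params, String name, Object fallback) {
    (params != null && params.containsKey(name)) ? params[name] : fallback
}

// Stand-in for the parametersMap variable provided by the updater engine;
// here only "basePath" was configured, "tag" was left out
def parametersMap = [basePath: '/content/documents/myproject/news']

def basePath = param(parametersMap, 'basePath', '/content/documents')
def tag = param(parametersMap, 'tag', 'untagged')

println "basePath: ${basePath}, tag: ${tag}"
// prints "basePath: /content/documents/myproject/news, tag: untagged"
```

In a real updater script, the helper would be a method of the visitor class and would read the engine-provided parametersMap directly.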
Undo
An updater script can support easy undo of its modifications by implementing the undoUpdate method. That method should revert a node back to the state before doUpdate was called.
Example 1 (Add a property) at Groovy Updater Scripts Examples implements undoUpdate.
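To illustrate how doUpdate and undoUpdate can mirror each other, here is a sketch of a script that renames a property and can rename it back. The class name and both property names are hypothetical. The sketch assumes a single-valued string property that is allowed under both names by the node type.

```groovy
package org.hippoecm.frontend.plugins.cms.admin.updater

import org.onehippo.repository.update.BaseNodeUpdateVisitor
import javax.jcr.Node

class RenamePropertySketch extends BaseNodeUpdateVisitor {

    // Hypothetical property names, used for illustration only
    static final String OLD_NAME = 'myproject:label'
    static final String NEW_NAME = 'myproject:tag'

    boolean doUpdate(Node node) {
        return move(node, OLD_NAME, NEW_NAME)
    }

    boolean undoUpdate(Node node) {
        // exact reverse of doUpdate
        return move(node, NEW_NAME, OLD_NAME)
    }

    // Moves a single-valued string property; returns true only if the node was changed
    private boolean move(Node node, String from, String to) {
        if (!node.hasProperty(from) || node.hasProperty(to)) {
            return false
        }
        def value = node.getProperty(from).getString()
        node.setProperty(to, value)
        node.getProperty(from).remove()
        return true
    }
}
```

Routing both directions through one helper keeps the two methods symmetric, so the undo cannot drift from the update when the script is edited later.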
Custom Node Visiting Logic
Typically, the nodes visited by the script are specified (in the execution options) by either an XPath query or a repository path. Alternatively, an updater script can provide the logic for navigating one or more nodes to visit, by implementing (overriding) the following two methods provided by the BaseNodeUpdateVisitor base class of the UpdaterTemplate script:
/**
 * Initiates the retrieval of the nodes when using custom, instead of path or xpath (query) based, node
 * selection/navigation, returning the first node to visit. Intended to be overridden, default implementation returns null.
 * @param session
 * @return first node to visit, or null if none found
 * @throws RepositoryException
 */
public Node firstNode(final Session session) throws RepositoryException {
    return null;
}

/**
 * Return a following node, when using custom, instead of path or xpath (query) based, node selection/navigation.
 * Intended to be overridden, default implementation returns null.
 * @return next node to visit, or null if none left
 * @throws RepositoryException
 */
public Node nextNode() throws RepositoryException {
    return null;
}
A contrived example (visiting all nodes of type hippo:document, which is similar to just specifying the XPath query //element(*, hippo:document)) is:
private NodeIterator nodeIterator;

Node firstNode(final Session session) throws RepositoryException {
    final javax.jcr.query.QueryManager queryManager = session.getWorkspace().getQueryManager();
    final javax.jcr.query.Query jcrQuery = queryManager.createQuery("//element(*, hippo:document)", "xpath");
    nodeIterator = jcrQuery.execute().getNodes();
    return nextNode();
}

Node nextNode() throws RepositoryException {
    return nodeIterator.hasNext() ? nodeIterator.next() : null;
}
The difference from a repository path or XPath query based updater is that those first query and iterate through all nodes to be visited before calling the script's doUpdate(Node) method, while in the above example that method is invoked during the query iteration, which may be more efficient in some use cases. In addition, a long-running updater script written this way can be cancelled during both the query iteration and the node update process, whereas otherwise cancelling is only possible during the node update process.
A different approach that is sometimes used, but is not advisable, is to select the rep:root node with an XPath query and implement all custom processing within the single doUpdate method call. This works, but the script cannot be cancelled.
Override Default Behavior
BaseNodeUpdateVisitor provides two boolean methods whose default behavior can sometimes be worth overriding:
skipCheckoutNodes(): by default (returning false) before visiting a node through the doUpdate method, it will be checked out if necessary to ensure updating the node actually is allowed. If however the updater script only is used for querying and reporting, or performing updates unrelated to versionable content, then unnecessarily checking out nodes can cause substantial overhead. In that case, this method can be modified (overridden) to return true instead.
/**
 * Overridable boolean function to indicate if node checkout can be skipped (default false)
 * @return true if node checkout can be skipped (e.g. for readonly visitors and/or updates unrelated to versioned content)
 */
public boolean skipCheckoutNodes() {
    return false;
}
logSkippedNodePaths(): by default (returning true) all visited node paths for which the doUpdate method returned false are (also) logged as a separate audit trail in the repository. If a substantial number of nodes is skipped and the audit trail is not needed, this method can be overridden to return false instead.
/**
 * Overridable boolean function to indicate if skipped node paths should be logged (default true)
 * @return true if skipped node paths should be logged
 */
public boolean logSkippedNodePaths() {
    return true;
}
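As a sketch of when both overrides make sense together, consider a read-only reporting script: it never modifies a node, so checkout is unnecessary, and since doUpdate always returns false, the skipped-path audit trail would list every visited node. The class name and the property name myproject:title are hypothetical.

```groovy
package org.hippoecm.frontend.plugins.cms.admin.updater

import org.onehippo.repository.update.BaseNodeUpdateVisitor
import javax.jcr.Node

class ReportMissingTitleSketch extends BaseNodeUpdateVisitor {

    boolean skipCheckoutNodes() {
        return true    // read-only visitor: no need to check out nodes
    }

    boolean logSkippedNodePaths() {
        return false   // every node is reported as skipped, so the audit trail adds nothing
    }

    boolean doUpdate(Node node) {
        // Hypothetical property name, used for illustration only
        if (!node.hasProperty('myproject:title')) {
            log.info "Node without title: ${node.path}"
        }
        return false   // the node is never modified
    }

    boolean undoUpdate(Node node) {
        throw new UnsupportedOperationException('Updater does not implement undoUpdate method')
    }
}
```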
Manually Report Updated/Skipped/Failed Nodes
By default, the updater engine automatically records the updated, skipped, or failed count on every invocation of the #doUpdate(Node) method. So, if each unit of work in your updater script corresponds to one node of the iteration configured by path or query, this automatic recording and batch processing by the updater engine should be good enough.
However, if your updater script does not follow the node iteration configured by path or query but instead runs its own query and iterates over nodes manually, the generated report will not reflect what the script really did. Such a script cannot take advantage of the 'Dry run' option, and its execution is not controlled by the updater engine's batch processing and batch size configuration either. Even worse, it may cause significant system overhead (e.g., excessive memory consumption) due to uncontrolled batch updates.
To address this problem, an updater script can report the updated/skipped/failed nodes manually using the visitorContext variable (of type org.onehippo.repository.update.NodeUpdateVisitorContext).
Here's an example using visitorContext to report the updated news document count after changing a field in a manual node iteration:
/**
 * ExampleNewsDocumentDateFieldUpdateDemoVisitor is a script that does manual node iteration
 * in an original iteration cycle and reports updated node manually in order to be aligned
 * with the built-in batch commit/revert feature of the updater engine for demonstration purpose.
 */
package org.hippoecm.frontend.plugins.cms.admin.updater

import org.onehippo.repository.update.BaseNodeUpdateVisitor
import java.util.*
import javax.jcr.*
import javax.jcr.query.*

class ExampleNewsDocumentDateFieldUpdateDemoVisitor extends BaseNodeUpdateVisitor {

    boolean doUpdate(Node node) {
        log.debug "Visiting node at ${node.path} just as an entry point in this demo."

        // new date field value from the current time
        def now = Calendar.getInstance()

        // do manual query and node iteration
        def query = node.session.workspace.queryManager.createQuery("//element(*,demosite:newsdocument)", "xpath")
        def result = query.execute()

        for (NodeIterator nodeIt = result.getNodes(); nodeIt.hasNext(); ) {
            def newsNode = nodeIt.nextNode()
            newsNode.setProperty("demosite:date", now)
            // report updated to the engine manually here.
            visitorContext.reportUpdated(newsNode.path)
        }

        return false
    }

    boolean undoUpdate(Node node) {
        throw new UnsupportedOperationException('Updater does not implement undoUpdate method')
    }
}
The example above invokes the visitorContext.reportUpdated(path) method after setting the "demosite:date" property. This makes the updater engine aware of how many nodes were updated, so it can do the batch processing (either saving or discarding the session) properly based on the batch size configuration.
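The text above mentions skipped and failed nodes as well, so a manual iteration would normally report all three outcomes. The sketch below is a variation of the example that does so. It assumes that NodeUpdateVisitorContext offers reportSkipped(path) and reportFailed(path) alongside reportUpdated(path); check the API of your version before relying on this. The class name and the rule to skip nodes that already carry a date are hypothetical.

```groovy
package org.hippoecm.frontend.plugins.cms.admin.updater

import org.onehippo.repository.update.BaseNodeUpdateVisitor
import javax.jcr.Node
import javax.jcr.RepositoryException

class ReportAllOutcomesSketch extends BaseNodeUpdateVisitor {

    boolean doUpdate(Node node) {
        def now = Calendar.getInstance()
        def query = node.session.workspace.queryManager.createQuery("//element(*,demosite:newsdocument)", "xpath")
        def nodeIt = query.execute().getNodes()

        while (nodeIt.hasNext()) {
            def newsNode = nodeIt.nextNode()
            def path = newsNode.path
            try {
                if (newsNode.hasProperty("demosite:date")) {
                    // hypothetical rule: leave nodes that already have a date untouched
                    visitorContext.reportSkipped(path)   // assumed method, see lead-in
                    continue
                }
                newsNode.setProperty("demosite:date", now)
                visitorContext.reportUpdated(path)
            } catch (RepositoryException e) {
                log.warn "Failed to update ${path}: ${e.message}"
                visitorContext.reportFailed(path)        // assumed method, see lead-in
            }
        }
        return false
    }

    boolean undoUpdate(Node node) {
        throw new UnsupportedOperationException('Updater does not implement undoUpdate method')
    }
}
```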
Remarks
Default Imports
By default all of the main JCR API packages are already imported by the script classloader: javax.jcr, javax.jcr.nodetype, javax.jcr.security, and javax.jcr.version. You should not have to import package members explicitly anymore.
Restrictions
Some basic restrictions apply to the calls you can make and the classes you can use from your script. Interaction with the local filesystem has been disabled. The following classes cannot be used: java.io.File, java.io.FileDescriptor, java.io.FileInputStream, java.io.FileOutputStream, java.io.FileWriter, java.io.FileReader, along with the following packages: java.nio.file, java.net, javax.net, javax.net.ssl. It is also not possible to use reflection: calling Class.forName is illegal and you can't use the package java.lang.reflect. Calling System.exit is also prevented.
There can be additional limitations with respect to the accessible classpath when automatically executing an updater script at startup (see Run an Updater Script), depending on the environment in which it is executed.
In a delivery-tier-only environment, only the functionality provided by the Hippo Repository might be available on the classpath.
Portability
Scripts executed from within the Updater Editor use a classloader in the CMS application context. Therefore, all libraries packaged with your CMS application are available to your script. If, however, you wish to develop scripts that can be reused in multiple projects, you should take care not to use libraries that are only packaged with one particular project. The safest bet is to only use libraries and APIs that are available in the shared class loader, but the availability of libraries such as commons-collections and guava can be depended on with some confidence as well.
Furthermore, for scripts executed automatically during startup (see Run an Updater Script), possibly only classes in the Repository context are available in a delivery-tier-only environment.