Over the past 9 months my colleagues and I have been developing an enterprise portal using JBoss Portal and Alfresco. Currently, Alfresco does not offer out-of-the-box content portlets, so we created our own. Working closely with Alfresco, we chose to use their new WCM product for its user sandbox and workflow approval features. This article explains some interesting findings related to performance and scalability when integrating with Alfresco web content.
We have a fairly simple physical architecture: two heavy-duty 64-bit servers, each running RHEL, two instances of JBoss Portal 2.7, and a single instance of Alfresco 3.0. The result is a total of four clustered portal nodes and two clustered Alfresco nodes. Each portal node configuration points to its local Alfresco instance. The idea was that we would improve performance either by leveraging the loopback network interface or by eliminating the network altogether by running Alfresco within the same application server as the portal. Using Alfresco webscripts, we implemented several REST-based services to handle all the integration points. Our approach couldn't be more lightweight.
Our solution uses both Alfresco DM and WCM. In the clustered environment, we began to notice several issues with clustering WCM. Most notably, content was not replicating between nodes. Alfresco informed us that clustering in WCM was broken and that the 3.1 release would resolve these issues. With only a few weeks before deploying to production, we decided to launch with a single shared instance of Alfresco until we had carefully tested 3.1.
Our REST services still performed very well. Each portlet making a remote call executed in less than 50 milliseconds. Server-side processing of content-heavy pages took about 300 milliseconds. I was reluctant to add a layer of content caching because performance was decent even after we switched to the shared Alfresco node. I figured I'd wait and see what load testing would turn up.
We used Apache JMeter to run a thorough set of tests against the portal. The use cases were about half content pages and half custom application pages. We started out running a load test with 200 concurrent users with 5-15 seconds between requests. The results were as expected. Content pages took 300-400 milliseconds and application pages took anywhere from 500 to 1500 milliseconds. We proceeded to run a 400 concurrent user test with 5-15 seconds between requests. This time I was surprised by the results. The response times of the application pages doubled to 3000 milliseconds and the content page response times averaged about 12 seconds - a huge increase!
So this raises the question: when going from 200 to 400 concurrent users, why did the performance degrade so significantly? While running the 200 user test there were about 2,000 context switches per second. The 400 user test made the number of context switches leap to about 24,000 per second. Any level that high will hamper performance across the board. It's hard to say whether the inefficiencies were caused by Alfresco or simply by the number of fine-grained remote calls the portal was making.
With these results, it's clear that caching content is essential to performance. It's easy to implement (we used EHCache) but it does raise some interesting questions. How long is the user willing to wait to see their content changes in the portal once they are approved? If the timeout is too short, will performance still be impacted? I decided to keep the timeout short with a 5 minute setting. Even though it is a short duration, no matter how many users request the content within 5 minutes, the portal will only need to retrieve it once, which essentially removes the bottleneck entirely. As the solution evolves, we will evict content immediately after an update.
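To make the timeout idea concrete, here is a minimal sketch of a time-to-live cache in plain Java. It is not our EHCache configuration and not an EHCache API; the class name `TtlContentCache`, the injected clock, and the loader function are all hypothetical names chosen for illustration. It shows the two behaviors discussed above: content is fetched at most once per key within the timeout window, and an entry can be evicted on demand after an update.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical time-to-live cache for portlet content (illustrative only, not EHCache). */
final class TtlContentCache {

    /** Cached content together with the time at which it stops being valid. */
    private static final class Entry {
        final String content;
        final long expiresAtMillis;

        Entry(String content, long expiresAtMillis) {
            this.content = content;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<String, Entry> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock;              // assumption: injected so tests can control time
    private final Function<String, String> loader; // assumption: stands in for the remote REST call

    TtlContentCache(long ttlMillis, LongSupplier clock, Function<String, String> loader) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        this.loader = loader;
    }

    /** Returns cached content, invoking the loader only when the entry is missing or expired. */
    String get(String path) {
        long now = clock.getAsLong();
        // compute() is atomic per key, so concurrent requests for the same path
        // trigger a single load. The load runs while the key is locked, which is
        // acceptable for a sketch but worth revisiting for slow remote calls.
        Entry entry = entries.compute(path, (key, existing) -> {
            if (existing != null && now < existing.expiresAtMillis) {
                return existing;
            }
            return new Entry(loader.apply(key), now + ttlMillis);
        });
        return entry.content;
    }

    /** Drops an entry immediately, for example after content has been approved and updated. */
    void evict(String path) {
        entries.remove(path);
    }
}
```

With a 5 minute timeout this could be constructed as `new TtlContentCache(5 * 60 * 1000L, System::currentTimeMillis, path -> fetchFromAlfresco(path))`, where `fetchFromAlfresco` is a placeholder for whatever method performs the REST call.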
Re-running the 400 user test yielded better results. Not only did our content portlets execute in ~1 millisecond by pulling the content out of memory, but the response times on those pages also went down to under 1 second.
The whole experience reminded me of Martin Fowler's First Law of Distributed Object Design: Don't distribute. The nature of a portal, running several discrete applications in a shared environment, often makes performance an issue. Even if remote calls seem lightweight, developers should always consider caching in order to avoid them. It's all about being a good constituent in a portal environment.