Case Study: Middleware Health Check

Project Description

An online invoicing and payment management company (the client) requested a WebSphere Systems Health Check to ensure its systems were running optimally and were prepared for an expected increase in volume. The project scope called for onsite current-state data collection and analysis, followed by offsite detailed data analysis and preparation and delivery of a white paper. Our consultant then performed the recommended changes (WebSphere patches and version upgrades).

The Situation

The client described its systems as “running well,” but was concerned that problems would surface under the additional load it anticipated (an estimated 700% increase); memory consumption was a particular concern. To complete the Health Check, TxMQ worked with client representatives to gain access to the production Linux boxes, web servers, application servers, portal servers, database servers and LDAP servers. Additional client representatives were available for application-specific questions.

The Response

Monitoring of the components had to be completed during a normal production day. The web servers, application servers, database server, directory server and Tomcat server all needed to be monitored for several hours. A normal production day typically showed a low volume of transactions, so when monitoring began the statistics were all very normal; resource usage on the boxes was very low. Log files were extracted from the web servers, directory server, database server, deployment manager, application servers and Tomcat (batch) server. Verbose garbage collection was enabled on one of the application servers for analysis, and a javacore and heap dump were generated on an application server to analyze threads and find potential memory leaks.
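For readers who want a sense of how this step is typically done, the following is a minimal wsadmin (Jython) sketch for producing a javacore and heap dump and enabling verbose garbage collection on a WebSphere 6.1-era application server. The server name 'server1' is a placeholder, and the exact procedure used on the client's systems may have differed.

```python
# Minimal sketch -- run with: wsadmin.sh -lang jython -f dumps.py
# 'server1' is a placeholder name, not the client's actual topology.

# Locate the live JVM MBean for the target application server.
jvm = AdminControl.completeObjectName('type=JVM,process=server1,*')

# Generate a javacore (thread dump) to examine thread states.
AdminControl.invoke(jvm, 'dumpThreads')

# Generate a heap dump (IBM JDK) to hunt for potential memory leaks.
AdminControl.invoke(jvm, 'generateHeapDump')

# Enable verbose garbage collection in the server's JVM configuration;
# the setting takes effect on the next server restart.
jvmConfig = AdminConfig.list('JavaVirtualMachine',
                             AdminConfig.getid('/Server:server1/'))
AdminConfig.modify(jvmConfig, [['verboseModeGarbageCollection', 'true']])
AdminConfig.save()
```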

Monitoring and analysis tool options were discussed with the client. TxMQ recommended additional IBM tools and gave a tutorial on the WebSphere Performance Viewer (built into the WebSphere Admin Console). In addition, TxMQ’s consultant sent members of the client’s development team links to download IBM’s Heap Analyzer and Log Analyzer (both very useful for analyzing WAS system logs and heap dumps). TxMQ’s consultants then met with the client’s development and QA staff to debrief them on the data gathered.
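The Performance Viewer only displays data when the Performance Monitoring Infrastructure (PMI) is collecting it. As a hedged illustration (the cell, node and server names below are placeholders, not the client's), PMI can be enabled from wsadmin roughly like this:

```python
# Sketch -- run with: wsadmin.sh -lang jython -f enable_pmi.py
# 'MyCell', 'MyNode' and 'server1' are placeholder names.

serverId = AdminConfig.getid('/Cell:MyCell/Node:MyNode/Server:server1/')
pmi = AdminConfig.list('PMIService', serverId)

# Turn PMI on and collect the 'extended' statistic set so the
# Performance Viewer has meaningful counters to display.
AdminConfig.modify(pmi, [['enable', 'true'], ['statisticSet', 'extended']])
AdminConfig.save()
```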

The Results

Overall, the architecture was sound and running well, but the WebSphere software had not been patched for several years and code-related errors filled the system logs. There were many potential memory leaks, which could cause serious response-time and stability problems as the application scaled to more users.

The QA team ran stress tests indicating that response times would degrade very quickly as more users were added. Further, the WebSphere software and web server plugin were at version 6.1.0, which is vulnerable to many known security risks.

The HTTP access and error logs had no unusual or excessive entries. The http_plugin logs were very large and were therefore rotated, which made it faster and easier to access the most recent activity.

One of the web servers was using much more memory than the other, although the two should have been configured identically. The production application servers were monitored over a three-day period and didn’t exhibit any outward signs of stress: CPU usage was very low, memory was not maxed out, and the threads and pools were minimally used. There were a few configuration errors and warnings to research, but the Admin Console settings were all well within norms.

Items of concern:

1) A large number of application-code-related errors in the logs; and
2) Memory consumption that grew dramatically during the day.

These conditions can be caused by unapplied software patches and code-related issues. In a 24-hour period, Portal node 3 experienced 66 errors and 227 warnings in the SystemOut log and 1,396 errors in the SystemErr log. These errors consume system resources, cause unpredictable application behavior, and can lead to hung threads and memory leaks.

The production database server was not stressed; it had plenty of available CPU, memory and disk space. The DB2 diagnostic log, however, had recorded around 4,536 errors and 17,854 warnings in the previous few months.

The Tivoli Directory Server was not stressed, with plenty of available CPU, memory and disk space. Its SystemOut log recorded 107 errors and 8 warnings in the previous year; many of these could be fixed by applying the latest Tivoli Directory Server patch (6.1.0.53).

The Batch Job (Tomcat) server was not stressed, with plenty of available CPU, memory and disk space, but its catalina.out log file was 64 MB and contained many errors and warnings.
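As a hedged illustration of how such counts can be tallied (the assumption of the default WebSphere "basic" log format and the log path are ours, not the client's), a short script can scan a SystemOut.log for error and warning records:

```python
#!/usr/bin/env python
# Illustrative sketch: count error (E) and warning (W) records in a
# WebSphere SystemOut.log written in the default "basic" format, e.g.
# [9/20/13 10:12:34:567 EDT] 0000001c ServletWrappe E  com.example ...
# The default path below is a placeholder; adjust for your environment.

import re
import sys

# [timestamp] threadId shortName eventType ...
RECORD = re.compile(r'^\[[^\]]+\] \S+ \S+ +([EWIO]) ')

def count_events(path):
    counts = {'E': 0, 'W': 0}
    with open(path) as log:
        for line in log:
            match = RECORD.match(line)
            if match and match.group(1) in counts:
                counts[match.group(1)] += 1
    return counts

if __name__ == '__main__':
    path = sys.argv[1] if len(sys.argv) > 1 else 'SystemOut.log'
    totals = count_events(path)
    print('errors=%d warnings=%d' % (totals['E'], totals['W']))
```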

The Health Check written analysis was delivered to the client with recommended patches and application classes to investigate for errors and memory leaks. In addition, a long-term plan was outlined to upgrade to a newer version of WebSphere Application Server and migrate off WebSphere Portal Server (since its features were not needed).

Photo courtesy of Flickr contributor Tristan Schmurr