A large financial services firm (the client) grew and began experiencing debilitating outages and application slowdowns, which were blamed on its WebSphere Application Server and the wider WebSphere infrastructure. The client and IBM called TxMQ to investigate, identify the cause of the problem, and determine what systems, technology and solution to put in place to prevent the problems from recurring, while also allowing the client to scale for continued planned growth.
As transaction volume increased and more locations came online, the situation worsened, at times shutting down access completely at some sites. TxMQ proposed a one-week onsite current-state analysis, followed by one week of configuration change testing in the QA region and then one week of document preparation and presentation.
The primary function of the application is to support the financial processes of around 550 company locations; the average number of terminals per location is around five, so more than 2,500 connections are possible. Our client suspected the HTTP sessions were large and thus interfering with their ability to turn on session replication. The code running on the multiple WebSphere Application Servers was Java/JEE with a Chordiant framework (an older version that is no longer in support). There are 48 external web services, including CNU and Veritec, most of them batch-oriented. The Oracle database was running on an IBM p770 and was heavily utilized during slowdowns. Slowdowns could not be simulated in the pre-production environment; transaction types and workflow testing had not been automated (a future project would do just that).
TxMQ’s team met with the members of the client’s team responsible for the WebSphere production environment. There were two IBM HTTP web servers with NetScaler as the front-end IP sprayer; the web servers ran at 3-5% CPU and were not suspected to be a bottleneck. The web servers distributed requests round-robin to multiple WebSphere Application Servers, all configured the same way except that two servers hosted a few small additional applications. The application servers ran on IBM JS21 blade servers, which were approximately 10 years old. Recent diagnostics had indicated a 60% session overhead (garbage collection), so more memory was added to the servers (12 GB total per server) and the WebSphere JVM heap size was increased to 4 GB; some performance improvement was realized. The daily processing peaks were from 11 am to 1 pm and from 4 to 5 pm, with Fridays the busiest. Oracle 11g served as the database, with one instance for operational processing and a separate instance for DR and BI processing; the client drivers were version 10g.
Our team met with client team members to discuss the AIX environment. The client ran multiple monitoring tools, so data was available to analyze the situation at multiple levels. The blade servers were suspected to be underpowered for the application and future growth. Our consultants learned that an upgrade of the servers to IBM Power PS700 blades was planned for the first quarter of the next year. The client also indicated that the HTTP sessions might be very large and that the database was experiencing a heavy load, possibly from untuned SQL statements or DB locks.
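The garbage collection overhead mentioned in the diagnostics above can be made concrete with a small calculation. The following is a minimal sketch only, assuming the 60% figure refers to the share of wall-clock time spent in garbage collection pauses; the class name, method name and sample pause values are hypothetical and are not taken from the client's logs or tooling.

```java
import java.util.List;

/** Sketch: GC overhead as the share of an observation window spent in GC pauses. */
final class GcOverheadSketch {

    /**
     * @param pauseMillis    GC pause durations in milliseconds (hypothetical input,
     *                       e.g. copied by hand from a verbose GC log)
     * @param intervalMillis length of the observation window in milliseconds
     * @return percentage of the window during which the JVM was paused for GC
     */
    static double overheadPercent(List<Double> pauseMillis, double intervalMillis) {
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("intervalMillis must be positive");
        }
        double totalPause = 0.0;
        for (double pause : pauseMillis) {
            totalPause += pause;
        }
        return 100.0 * totalPause / intervalMillis;
    }

    public static void main(String[] args) {
        // Illustrative values only: four pauses observed in a 10-second window.
        List<Double> pauses = List.of(350.0, 420.0, 390.0, 440.0);
        System.out.printf("GC overhead: %.1f%%%n", overheadPercent(pauses, 10_000.0));
    }
}
```

Tracking this percentage before and after a heap size change gives a simple way to check whether the added memory actually reduced the time lost to collection.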
TxMQ’s team began an analysis of the environment, including working with the client to collect baseline (i.e. normal processing day) analysis data. We observed the monitoring dashboard, checked WAS settings, and collected log files and Javacores with the verbose garbage collection option turned on. In addition, we collected the system topology documents.
The following day, TxMQ continued to monitor the well-performing production systems, analyzed the data collected the previous day, and met with team members about the Oracle database. TxMQ’s SME noted that the WAS database connection pools for Chordiant were using up to 40 of the 100 possible connections, which did not indicate a saturated database. The client explained that they used Quest monitoring tools and showed the current status of the production database. The database was running on a p770 and could take as many of the 22 CPUs as needed; they had seen up to 18 CPUs used on bad days. The client’s DBAs had a good relationship with the development group and monitored resource utilization regularly, sending daily reports to development listing the SQL statements consuming the most CPU. No long-running SQL statements were observed that day; most ran for less than 2-5 seconds.
Our SME then met with the client’s middleware group and communicated preliminary findings. In addition, he met with the development group, since they had insight into the Chordiant framework. Pegasystems purchased Chordiant in 2010 and then abandoned the product. The database calls were SQL, not stored procedures. The application code had a mix of Hibernate (10%), Chordiant (45%), and direct JDBC (45%) database accesses. Large HTTP session sizes were noticed, and the development group noted that the session size could likely be reduced greatly. The client’s application developers had not changed any Chordiant code; they were programming to the APIs. The developers used RAD but had not run the application profiler on their application.
Rational Rose modeler provided the application architecture (business objects, business services, worker objects, service layer). In addition, the application used JSPs but was enhanced with Café tags.
Applications worthy of code review or rewrite included the population of GL events into the shopping cart during POS transactions.
On the following day the slowdown event occurred; by 10:00 am all application servers were over 85% CPU usage and user response times were climbing over 10 seconds. At 10:30 am and again at 12:15 pm, database locks and some hung sessions were terminated by the DBAs. The customer history table was experiencing long IO wait times. One application server was restarted at the operations group’s request. The SIB queue filled up at 50,000 (due to CPU starvation). A full set of diagnostic logs and dumps was created for analysis. By 4:30 pm the situation had somewhat stabilized.
TxMQ’s SMEs observed that the short-term problem appeared to be a lack of application server processing power, and that the long-term problems could best be addressed after dealing with the short-term problem. They recommended an immediate processor upgrade, and plans were made to upgrade the processors using a backup box. Over the weekend the client moved their WebSphere application servers to a backup p770 server. A load similar to the problem load was experienced again several days later, and it was reported that the WAS instances ran at around 40% CPU load and user response times were markedly better than on Friday.
TxMQ presented a series of recommendations to the client’s executive board, including but not limited to:
The Chordiant framework and client application code should be the highest priority due to the number and type of errors.
The Chordiant framework has been made obsolete by Pegasystems and should be replaced with newer, SOA-friendly framework(s) such as Spring and Hibernate.
Review the application code. The error logs contain many errors that can be fixed in the code.
Reduce the size of the HTTP session. Under 20 KB is a good target.
WAS 6.1 should be upgraded to Version 7 or 8 before it goes out of support.
IBM extended support or third-party (TxMQ) support is available.
The upgrade may not be possible until the Chordiant framework is replaced.
Create a process for tracing issues to the root cause. An example would be the DB locks, which had to be terminated (and some user sessions terminated). These issues should be followed up to determine the cause and remedial actions.
Enhance system testing to emulate realistic user loads and workflows. This will allow more thorough application testing and give the administrators a chance to tweak configurations under production-like loads.
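As an illustration of the session size recommendation above, the following is a minimal sketch of how a development team might estimate the serialized size of session attributes and compare it with the 20 KB target. It uses a plain map as a stand-in for the session so that it runs without a servlet container; in a real application the attributes would be read from the HttpSession. The class name, method names and sample attributes are hypothetical, and serializing each attribute separately only approximates the size that session replication would actually transfer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: estimate serialized session size against a 20 KB target. */
final class SessionSizeSketch {

    /** Target taken from the recommendation above, interpreted as 20 * 1024 bytes. */
    static final int TARGET_BYTES = 20 * 1024;

    /** Serializes one value with standard Java serialization and returns the byte count. */
    static int serializedSize(Serializable value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    /** Sums the per-attribute sizes; an approximation, since each stream adds its own header. */
    static int totalSize(Map<String, ? extends Serializable> attributes) {
        int total = 0;
        for (Serializable value : attributes.values()) {
            total += serializedSize(value);
        }
        return total;
    }

    static boolean withinTarget(Map<String, ? extends Serializable> attributes) {
        return totalSize(attributes) <= TARGET_BYTES;
    }

    public static void main(String[] args) {
        // Hypothetical attributes standing in for real session contents.
        Map<String, Serializable> session = new LinkedHashMap<>();
        session.put("userId", "teller-42");
        session.put("cart", new byte[30 * 1024]);

        // Report each attribute so the largest contributors are easy to spot.
        for (Map.Entry<String, Serializable> entry : session.entrySet()) {
            System.out.println(entry.getKey() + ": " + serializedSize(entry.getValue()) + " bytes");
        }
        System.out.println("within target: " + withinTarget(session));
    }
}
```

A per-attribute report of this kind helps identify which objects to remove from the session or reload on demand before session replication is turned on.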
Photo courtesy of Flickr contributor “Espos”