Case Study: Client experiences WebSphere Business Integrator Outage

Project Description

A regional grocery chain (CLIENT) experienced an outage in its WebSphere Business Integrator (WBI) application. WBI is no longer supported, and the application's developer was no longer available.

The Situation

The CLIENT was using an older WebSphere® product, WebSphere Business Integrator (WBI), and had an application called Item Sync (developed by IBM Global). Item Sync was not working properly, and the CLIENT needed to take steps to correct it.
In the application flow, work is initiated by vendors and becomes visible in the Vendor Transaction List (VTL) screen. The WBI application is responsible for routing the work to the next step in the approval process. This routing was not occurring, which was one part of the overall problem.
The other side of the problem was that the sync application's MQ archive queue had filled up and, because it was full, was not accepting new messages. The initial corrective step taken by the customer was to purge the archive queue. They then rebooted the WBI server, and the MQ collectors were verified as active and in their correct state and sequence. New work was then visible in the VTL screen, but it was not being seen in the next step.

The Response

One week earlier the CLIENT had experienced problems when its archive queue became full and the workflow process stopped working. They purged the queue, deleting all of its messages. Because these messages were not persistent, they had not been logged and were therefore lost.
As part of the initial response, the CLIENT rebooted the WBI server, and the MQ collectors were verified as running in their correct state and sequence.
A conference call between TxMQ and the CLIENT was conducted. During the call, in addition to reviewing the issue, TxMQ's consultant recommended restoring the backed-up configuration to determine whether connectivity for the workflow processing would resume. The consultant also recommended enabling native MQ alerts, which provide an early warning in the event of problems such as a queue filling.
The same evening, after the conclusion of the conference call, the CLIENT restored the configuration, which included MQ, the connectors and WBI. The CLIENT then brought the environment back up and tested it; the environment came up and was operational.
The maximum depth on the archive queue had also been increased so that the archive queue would not fill up again.
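Taken together, the two preventive measures (a larger maximum depth and MQ's native depth alerts) can be sketched in MQSC. This is only a sketch: the queue manager name QM1 and queue name ITEM.SYNC.ARCHIVE are hypothetical, and the thresholds would be chosen per environment.

```shell
#!/bin/sh
# Sketch: enlarge an archive queue and enable MQ's native queue-depth
# events as an early-warning system. QM1 and ITEM.SYNC.ARCHIVE are
# hypothetical names; the depth values are illustrative only.
emit_mqsc() {
  cat <<'EOF'
ALTER QMGR PERFMEV(ENABLED)
ALTER QLOCAL(ITEM.SYNC.ARCHIVE) MAXDEPTH(100000) QDEPTHHI(80) QDPHIEV(ENABLED)
EOF
}

# Print the commands; on a host with MQ installed they would be applied
# with: emit_mqsc | runmqsc QM1
emit_mqsc
```

With PERFMEV(ENABLED) on the queue manager and QDPHIEV(ENABLED) on the queue, MQ publishes an event message to SYSTEM.ADMIN.PERFM.EVENT when the queue passes 80% of its maximum depth, which a monitoring tool can turn into an alert well before the queue actually fills.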
The next morning, the CLIENT and TxMQ reconvened. The situation was reviewed, and the remaining step was to recreate the messages deleted from the archive queue. The customer was looking to TxMQ to help recreate the lost application messages; unfortunately, since TxMQ was unfamiliar with the application schema and process, restoring the application messages was better left to the business owners.

The Results

In this scenario, queues should not be purged in an overflow condition. The correct action would have been to copy the messages to a backup queue or file system so they could be replayed later.
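A minimal sketch of that alternative, assuming a reasonably current IBM MQ that ships the dmpmqmsg utility (older installations used the equivalent qload SupportPac); the queue manager, queue and file names here are hypothetical:

```shell
#!/bin/sh
# Sketch: drain an overflowing queue to a file so the messages can be
# replayed later, instead of purging them. All names are hypothetical.
QMGR="QM1"
QUEUE="ITEM.SYNC.ARCHIVE"
BACKUP="/var/backups/archive.msg"

offload_cmds() {
  # Destructively move every message from the queue into the backup file.
  echo "dmpmqmsg -m $QMGR -i $QUEUE -f $BACKUP"
  # Later, replay the saved messages back onto the queue.
  echo "dmpmqmsg -m $QMGR -o $QUEUE -f $BACKUP"
}

# Print the commands; run them on a host where MQ is installed.
offload_cmds
```

The first command empties the queue (clearing the overflow) while preserving every message in the file; the second restores them once the downstream blockage is resolved.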
Before any partial fix was attempted, care should have been taken to ensure the plan constituted a complete fix that would work.

Case Study: Medical Mutual Reduces Fees By $500K Through Real-Time Processing, Security

by Natalie Miller | @natalieatWIS for IBM Insight Magazine

Project Description

Ohio healthcare provider Medical Mutual wanted to take on more trading partners and more easily align with government protocols, but lacked the robust, secure infrastructure needed to support the company’s operations. “We needed to set up trading partner software and a B2B infrastructure so we could move the data inside and outside the company,” says Eleanor Danser, EDI Manager, Medical Mutual of Ohio. “The parts that we were missing were the trading partner software and the communications piece to support all the real-time protocols that are required from the ACA, which is the Affordable Care Act.”
Medical Mutual already had IBM WebSphere MQ and IBM WebSphere Message Broker, as well as IBM WebSphere Transformation Extender (TX), in its arsenal to move the company’s hundreds of daily file transfer protocol (FTP) transactions. Healthcare providers are constantly moving data and setting up connections between different industry sectors, efforts that involve securing information from providers and employers and sending it out to clearinghouses and other partners.
“It’s constantly moving data back and forth between different entities—from claims data, membership data, eligibility and benefit information, claims status—all the transactions that the healthcare industry uses today,” says Danser.
However, as the healthcare industry evolves, so does its need for streamlined and easy communication. Medical Mutual also realized that its current infrastructure didn’t provide the necessary authentication and security. It needed a partner-gateway solution with batch and real-time processing that could meet the 20-second response window required to stay HIPAA compliant.
Medical Mutual sought a solution to aid with the communications piece of the transaction, or “the handshake of the data,” explains Danser. “You must build thorough and robust security and protocols around the authentication of a trading partner to be able to sign in and drop data off to our systems, or for us to be able to drop data off to their systems … It’s the authentication and security of the process that must take place in order to move the data.”
Without the proper in-house expertise for such a project, Medical Mutual called upon TxMQ, an IBM Premier business partner and provider of systems integration, implementation, consultation and training.

Choosing a solution and assembling a team

Since Medical Mutual already had an existing infrastructure in place using IBM software, choosing an IBM solution for the missing trading partner software and the communication piece was a practical decision.
“We went out and looked at various vendor options,” explains Danser. “If we went outside of IBM we would have had to change certain parts of our infrastructure, which we really didn’t want to do. So this solution allowed us to use our existing infrastructure and simply build around it and enhance it. It was very cost effective to do that.”
In December 2012, Danser and her team received approval to move forward with IBM WebSphere DataPower B2B Appliance XB62 — a solution widely used in the healthcare industry with the built-in trading partner setup and configurations Medical Mutual wanted to implement.

TxMQ’s experience and connections set Medical Mutual up for success

The project kicked off in early 2013 with the help of four experts from TxMQ, who worked alongside Danser’s team of four full-time staff members from project start through the September 2013 launch of the system.
“[TxMQ] possessed the expertise we needed to support what we were trying to do,” says Danser of the TxMQ team, which consisted of an IBM WebSphere DataPower project manager, an IBM WebSphere DataPower expert, an IBM WebSphere TX translator expert, and an IBM WebSphere Message Broker expert. “They helped us with the design of the infrastructure and the layout of the project.”
The design process wrapped up in April 2013, after which implementation began. According to Danser, the TxMQ project manager was onsite in the Ohio office once a week for the first few months, and the Message Broker expert was onsite for almost four months. Other experts, such as the IBM WebSphere DataPower specialist, held weekly meetings from offsite.

Overcoming Implementation Challenges

TxMQ stayed on until the project went live in September 2013, two-and-a-half months past Danser’s original delivery-date estimate. The biggest challenge that contributed to the delay was Medical Mutual’s limited experience with the technology, which required cross-training.
“We didn’t have any expertise in-house,” explains Danser, adding that the IBM WebSphere DataPower systems and the MQFTE were the steepest parts of the learning curve. “We relied a lot on the consultants to fill that gap for us until we were up to speed. We did bring in some of the MQ training from outside, but primarily it was learning on the job, so that slowed us down quite a bit. We knew how our old infrastructure worked and this was completely different.”
Another issue that contributed to delay was the need to search and identify system-platform ownership. “Laying out ownership of the pieces … took a while, given the resources and time required,” explains Danser. “It involved trying to lay out how the new infrastructure should work and then putting the processes we had in place into that new infrastructure. We knew what we wanted it to do—it was figuring out how to do that.”

“We also wanted to make sure that the solution would support us for years to come, not just a year or two. By the time we were done, we were pretty confident with the decision that we made. Overall we feel the solution was appropriate for Medical Mutual.”

– Eleanor Danser, EDI Manager, Medical Mutual of Ohio

And because Danser’s team wanted the system to work the same way as the existing infrastructure, heavy customization was also needed. “There was a lot of homegrown code that went into the process,” she adds.

Project realizes cost savings, increased efficiency

Since the implementation, Medical Mutual reports real cost savings and increased efficiency. As was the goal from the beginning, the company can now more easily take on trading partners. According to Danser, the use of IBM WebSphere DataPower creates an infrastructure that greatly reduces the time needed to set up those trading partner connections, including a recent connection with the Federal Exchange. Medical Mutual is now able to shorten testing with trading partners and move data more quickly.
“Before, it would take weeks to [take on a new partner], and now we are down to days,” says Danser.
“We’re not restricted to just the EDI transactions anymore,” she continues, explaining that Medical Mutual’s infrastructure is now not only more robust, but more flexible. “We can use XML [Management Interface] and tools like that to move data also.”
IBM WebSphere DataPower additionally moved Medical Mutual from batch processing into a real-time environment. The new system gives trading partners the ability to manage their own transactions and automates the process into a browser-based view for them, so onboarding new partners is now a faster, more scalable process.
Additionally, Medical Mutual has been able to significantly reduce transaction fees for claims data by going direct with clearinghouses or other providers. According to Danser, Medical Mutual expects an annual savings of $250,000 to $500,000 in transactional fees.

Case Study: Middleware HealthCheck

Project Description

An online invoicing and payment management company (client) requested a WebSphere Systems HealthCheck to ensure its systems were running optimally and prepared for an expected increase in volume. The project scope called for onsite current-state data collection and analysis, followed by offsite detailed data analysis and White Paper preparation and delivery. Finally, our consultant performed the recommended changes (WebSphere patches and version upgrades).
[callout type="left" title="Download Case Study" message="Middleware HealthCheck" button_text="Download Now!" href="https://txmq.com/wp-content/uploads/2014/05/MiddlewareHealthCheck.pdf"]

The Situation

The client described their systems as “running well,” but they were concerned they might have problems under the anticipated additional load (estimated at 700%); memory consumption was a particular concern. To complete the HealthCheck, TxMQ worked with client representatives for access to the production Linux boxes, web servers, application servers, portal servers, database servers and LDAP servers. Additional client representatives were available for application-specific questions.

The Response

Monitoring of the components had to be completed during a normal production day. The web servers, application servers, database server, directory server and Tomcat server all needed to be monitored for several hours. The normal production day typically showed a low volume of transactions, so when monitoring began the statistics were all very normal; resource usage on the boxes was very low. Log files were extracted from the web servers, directory server, database server, deployment manager, application servers and Tomcat (batch) server. Verbose garbage collection was enabled for one of the application servers, and a Javacore and a Heap Dump were generated on an application server to analyze threads and find potential memory leaks.
Monitoring and analysis tool options were also discussed. TxMQ recommended additional IBM tools and gave the client a tutorial on the WebSphere Performance Viewer (built into the WebSphere Admin Console). In addition, TxMQ’s consultant sent members of the client’s development team links to download IBM’s Heap Analyzer and Log Analyzer (very useful for analyzing WAS system logs and Heap Dumps). TxMQ’s consultants then met with the client’s development and QA staff to debrief them on the data gathered.
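As a rough illustration of the Javacore step described above: on an IBM JVM, sending SIGQUIT makes the JVM write a javacore (thread dump) to its working directory without stopping the server. The process-matching pattern below is hypothetical and would be the actual server name in practice.

```shell
#!/bin/sh
# Sketch: request a javacore from a running WebSphere JVM for thread and
# leak analysis. On an IBM JVM, SIGQUIT triggers a javacore file in the
# server's working directory. The pattern argument is hypothetical.
dump_app_server() {
  pattern="$1"                               # e.g. the server name "AppSrv01"
  # Find a matching JVM process, excluding this shell's own pid.
  pid=$(pgrep -f "$pattern" | grep -vw "$$" | head -n 1)
  if [ -z "$pid" ]; then
    echo "no JVM matching '$pattern' found" >&2
    return 1
  fi
  kill -QUIT "$pid"
  echo "requested javacore from pid $pid"
}
```

Heap dumps for tools like IBM's Heap Analyzer can be produced similarly when the JVM's heap-dump-on-signal option is enabled, or requested on demand through wsadmin.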

The Results

Overall, the architecture was sound and running well, but the WebSphere software had not been patched for several years and code-related errors filled the system logs.
There were many potential memory leaks, which could have caused serious response and stability problems as the application scaled to more users. Stress tests run by the QA team indicated that response times would degrade quickly as more users were added. Further, the WebSphere software and web-server plugin were at version 6.1.0, which is vulnerable to many known security risks.
The HTTP access and error logs had no unusual or excessive entries. The http_plugin logs were very large and were rotated – making it faster and easier to access the most recent activity.
One of the web servers was using much more memory than the other although it should have been configured exactly the same.
The production application servers were monitored over a three-day period and didn’t exhibit any outward signs of stress; the CPU was very low, memory was not maxed out, and the threads & pools were minimally used.
There were a few configuration errors and warnings to research but the Admin Console settings were all well within norms.
Items of concern:
1) a large number of application code related errors in the logs; and
2) the memory consumption grows dramatically during the day.
These conditions can be caused by unapplied software patches and code-related issues. In a 24-hour period, Portal node 3 experienced 66 errors and 227 warnings in the SystemOut log and 1,396 errors in the SystemErr log. These errors take system resources to process, cause unpredictable application behavior, and can lead to hung threads and memory leaks.
The production database server was not stressed; it had plenty of available CPU, memory and disk space. The DB2 diagnostic log, however, had recorded around 4,536 errors and 17,854 warnings in the previous few months. The Tivoli Directory Server was likewise not stressed, with plenty of available CPU, memory and disk space; its SystemOut log recorded 107 errors and 8 warnings in the previous year, many of which could be fixed by applying the latest Tivoli Directory Server patch (6.1.0.53). The Batch Job (Tomcat) server was not stressed either, but its catalina.out log file was 64 MB and contained many errors and warnings.
The HealthCheck written analysis was delivered to the client with recommended patches and application classes to investigate for errors and memory leaks. In addition, a long-term plan was outlined to upgrade to a newer version of WebSphere Application Server and migrate off WebSphere Portal Server (since its features were not needed).
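As a small illustration of the log triage involved, error and warning counts like those above can be tallied with standard tools. WAS SystemOut entries carry a one-letter severity column (E for error, W for warning); the sample entries below are fabricated for the example.

```shell
#!/bin/sh
# Sketch: tally errors and warnings in a WAS SystemOut.log, as done
# during the HealthCheck. The sample log entries are fabricated.
count_severity() {
  # $1 = log file, $2 = severity letter (E or W)
  grep -c " $2 " "$1"
}

# Fabricated sample log for illustration:
cat > /tmp/SystemOut.sample.log <<'EOF'
[4/1/14 10:00:01:000 EDT] 0000001a ServletWrappe E SRVE0068E: uncaught exception
[4/1/14 10:00:02:000 EDT] 0000001b WebContainer  W WSVR0605W: thread may be hung
[4/1/14 10:00:03:000 EDT] 0000001c ServletWrappe E SRVE0068E: uncaught exception
EOF

echo "errors:   $(count_severity /tmp/SystemOut.sample.log E)"
echo "warnings: $(count_severity /tmp/SystemOut.sample.log W)"
```

Against the fabricated sample this prints 2 errors and 1 warning; run daily against real logs, counts like these make it easy to spot nodes (such as Portal node 3 above) that need attention.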

Case Study: WAS Infrastructure Review

Project Description

A large financial services firm (client) grew and began experiencing debilitating outages and application slowdowns, which were blamed on its WebSphere Application Server and entire WebSphere infrastructure. The client and IBM called TxMQ to open an investigation, identify what was causing the problem, and determine how to put systems, technology and a solution in place to prevent the problems from recurring, while allowing the client to scale for continued planned growth.

The Situation

As transaction volume increased and more locations came online, the situation worsened, at times shutting down access completely at some sites. TxMQ proposed a one-week onsite current-state analysis, followed by one week of configuration-change testing in the QA region and then one week of document preparation and presentation.

The Challenge

The primary function of the application is to support the financial processes of around 550 company locations; the average number of terminals per location is around five, so more than 2,500 connections are possible. Our client suspected the HTTP sessions were large, interfering with their ability to turn on session replication. The code running on the multiple WebSphere Application Servers was Java/JEE with a Chordiant framework (an older version no longer in support). There were 48 external web services, including CNU and Veritec, mostly batch-oriented. The Oracle database ran on an IBM p770 and was heavily utilized during slowdowns. Slowdowns could not be simulated in the pre-production environment; transaction types and workflow testing had not been automated (a future project would do just that).

The Response

TxMQ’s team met with members of the client’s WebSphere production environment. There were two IBM HTTP web servers with NetScaler as the front-end IP sprayer; the web servers ran at 3-5% CPU and were not suspected to be a bottleneck. The web servers round-robined to multiple WebSphere Application Servers configured the same way, except that two servers had a few small additional applications. The application servers ran on IBM JS21 blade servers approximately 10 years old. Recent diagnostics had indicated a 60% session overhead (garbage collection), so more memory was added to the servers (a total of 12 GB per server) and the WebSphere JVM heap size was increased to 4 GB; some performance improvement was realized. The daily processing peak times were from 11 am to 1 pm and 4 to 5 pm, with Fridays the busiest.
Oracle 11g served as the database, with an instance for operational processing and a separate instance for DR and BI processing; the client drivers were version 10g. Our team met with client team members to discuss the AIX environment. The client was running multiple monitoring tools, so some data was available to analyze the situation at multiple levels. The blade servers were suspected to be underpowered for the application and future growth; our consultants learned of an initiative to upgrade the servers to IBM Power PS700 blades in the first quarter of the next year. The client also indicated that the HTTP sessions might be very large and that the database was experiencing a heavy load, possibly from untuned SQL or DB locks.
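In generic JVM terms, the memory tuning described above corresponds to options like the following. In WAS these are set per server through the admin console's Generic JVM arguments rather than on a command line; fixing the minimum heap equal to the maximum is a common choice here, not something stated in the engagement.

```shell
#!/bin/sh
# Sketch: generic JVM options matching the tuning described above: a
# 4 GB heap plus verbose garbage-collection logging for analysis.
# The 4 GB figure is from the engagement; the flag spellings are the
# generic Java forms, and pinning -Xms to -Xmx is an assumption.
JVM_ARGS="-Xms4g -Xmx4g -verbose:gc"
echo "$JVM_ARGS"
```

Verbose GC output is what tools like the ones mentioned in the HealthCheck engagements consume to estimate session overhead and leak behavior.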

The Results

TxMQ’s team began an analysis of the environment, including working with the client to collect baseline (i.e., normal processing day) analysis data. We observed the monitoring dashboard, checked WAS settings, and collected log files and Javacores with the verbose garbage collection option on. In addition, we collected the system topology documents. The following day, TxMQ continued to monitor the well-performing production systems, analyzed the data collected the previous day, and met with team members about the Oracle database.
TxMQ’s SME noted that the WAS database connection pools for Chordiant were using up to 40 of the 100 possible connections, which was not an indication of a saturated database. The client explained that they use Quest monitoring tools and showed the production database’s current status. The database was running on a p770 and could take as many of the 22 CPUs as needed; they had seen up to 18 CPUs used on bad days. The client’s DBAs have a good relationship with the development group and monitor resource utilization regularly; daily reports are sent to development listing the SQL statements consuming the most CPU. No long-running SQL was observed that day; most statements completed in two to five seconds or less.
Our SME then met with the client’s middleware group and communicated preliminary findings. In addition, he met with the development group, since they had insight into the Chordiant framework. Pegasystems purchased Chordiant in 2010 and then abandoned the product. The database calls were SQL, not stored procedures. The application code had a mix of Hibernate (10%), Chordiant (45%) and direct JDBC (45%) database accesses. Large HTTP session sizes were noticed, and the development group noted that the session size could likely be reduced greatly. The client’s application developers didn’t change any Chordiant code; they programmed to the APIs. The developers used RAD but had not run the application profiler on their application.
The application architecture (business objects, business services, worker objects, service layer) was modeled in Rational Rose. In addition, the application used JSPs enhanced with Café tags.

Applications worthy of code review or rewrite included the population of GL events into the shopping cart during POS transactions.
On the following day the slowdown event occurred: by 10:00 am all application servers were over 85% CPU usage and user response times were climbing past 10 seconds. At 10:30 am, and again at 12:15 pm, database locks and some hung sessions were terminated by the DBAs. The customer history table was seeing long IO wait times. One application server was restarted at the operations group’s request. The SIB queue filled up at 50,000 messages (due to CPU starvation). A full set of diagnostic logs and dumps was created for analysis. By 4:30 pm the situation had somewhat stabilized.
TxMQ’s SMEs observed that the short-term problem appeared to be a lack of application-server processing power, and that the long-term problems could best be addressed after dealing with the short-term problem. They recommended an immediate processor upgrade, and plans were made to upgrade the processors with a backup box. Over the weekend the client moved its WebSphere application servers to a backup p770 server. When a load similar to the problem load occurred again several days later, the WAS instances ran at around 40% CPU and user response times were markedly better than during the Friday slowdown.

Recommendations

TxMQ presented a series of recommendations to the client’s executive board, including but not limited to:

Chordiant framework and client application code should be the highest priority due to the number and type of errors.

The Chordiant framework has been made obsolete by Pegasystems and should be replaced with newer, SOA-friendly framework(s) such as Spring and Hibernate.

Review of application code. There are many errors in the error logs, which can be fixed in the code.

Reduce the size of the HTTP Session. Less than 20K is a good target.

WAS 6.1 should be upgraded to Version 7 or 8 before it goes out of support.

Extended support from IBM or a third party (TxMQ) is available.

The upgrade may not be possible until the Chordiant framework is replaced.

Create a process for tracing issues to the root cause. An example would be the DB locks, which had to be terminated (and some user sessions terminated). These issues should be followed up to determine the cause and remedial actions.

Enhance the system testing to emulate realistic user loads and workflows. This will allow more thorough application testing and give the administrators a chance to tweak configurations under production-like loads.
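As a toy sketch of the concurrency aspect of that recommendation, a shell wrapper can fan simulated users out in parallel; a real harness (JMeter, LoadRunner or similar) would replay recorded workflows instead. The command used here is a placeholder, not part of the client's environment.

```shell
#!/bin/sh
# Sketch: drive N concurrent simulated users through a workflow command.
# The command is a stand-in ("sleep"); in practice it would be a curl
# call or a scripted user workflow. All values are illustrative.
run_load() {
  users="$1"; requests="$2"; cmd="$3"
  # xargs -P fans the requests out across $users parallel workers.
  seq "$requests" | xargs -P "$users" -I{} sh -c "$cmd" >/dev/null 2>&1
  echo "completed $requests requests with $users concurrent users"
}

run_load 5 20 "sleep 0.05"
```

Even a crude driver like this, pointed at a QA region, lets administrators watch CPU, heap and thread-pool behavior under load before configuration changes reach production.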


Case Study: WAS ND 7 Migration

Project Description

A reputable national life insurance company (client) was preparing its back-end environment for the installation of the Temenos™ application on WebSphere Application Server (WAS). The client sought a provider to migrate from JBoss to WAS ND 7 on a Windows 2008 platform. With more than 1,000,000 insurance policies, customers rely on the company for life insurance, annuity and travel accident insurance.

The main issue within the migration was making JBoss classes visible in the EARs. Due to the difference in default binding patterns, the JNDI naming had to be modified. In addition, specific open-source libraries had to be included, either packaged within the application archive file or referenced as shared libraries. Adding to the challenge, the persistence.xml file within the app_EJB/META-INF folder had to be modified to work with WebSphere.

The successful migration began with a thorough analysis of JBoss deployment artifacts via IBM RAD. It was imperative that the TxMQ consultants understood the hardware runtime environment and that tuning was completed accordingly. One of the most important components was infrastructure management, which included logging, monitoring and deployment. TxMQ consultants also created several comparable sandbox environments in which to configure, deploy, and properly test and tune the migration. A security approach also had to be chosen; after careful consideration, SiteMinder with WebSphere LTPA was selected.

The project began with the migration of 16 JVMs across environments. TxMQ consultants completed performance testing to set JVM heaps and tuning parameters, using SoapUI toolsets for web services. The client chose Alfresco as the content management system, and TxMQ’s consultant wrote deployment scripts to manage content into web servers using Alfresco API calls. IBM ISA v4.0 was used for WebSphere monitoring, Javacore analysis and GC analysis, and WebSphere’s JVMs were configured for monitoring via SNMP.

Case Study: Developing New Claims Process

Project Description

A third-party insurance claims provider (client) needed to create a new proprietary claims process at the request of Aetna, with a hard HIPAA deadline four months out. At the onset of the partnership, our client was responsible for the submission, processing and payment of claims; pertinent information regarding each claim was then directed to the appropriate insurance company.

The Challenge

In May 2012, Aetna requested a change to this process, to be completed on or before January 1, 2013. Under the changed process, our client would still submit and process the claims, but the check to pay each claim would be issued through Aetna, with strict adherence to HIPAA regulations. A proprietary program needed to be written to support this new process flow.

The Response

Work on the project began in September 2012, after much back-and-forth over requirements and expectations, leaving a four-month window to complete a project that would be compliant with HIPAA standards. The client was using a homegrown PowerBuilder-based .NET system with data stored in an Oracle database. The data needed to be readjusted within the file-transfer and claims-processing flows.
TxMQ assembled a team of four, including three PowerBuilder specialists and one Oracle specialist, to create a new program that would allow the system to record claims, send them electronically, adjudicate each claim, and update the file to send back to Aetna for payment. Our PowerBuilder specialists wrote and rewrote the program several times to ensure its compatibility with Aetna’s systems, and our Oracle specialist continually updated the copy scripts as the project progressed.

The Result

The team was forced to request a deadline extension when it was discovered that Aetna was not yet running the most recent HIPAA transaction standards. However, within a month the adjustments to the new system had been implemented, and the system went live with no issues.

Case Study: WAS ND 6 Install & Configuration

Project Description

American grocery store (client) operating approximately 1,300 supermarkets in 11 states needed assistance with implementing WAS ND v6 configuration and installation.

The Situation

The client had several open problem tickets with IBM, along with an open issue in which the deployment manager was hanging. The client needed a Subject Matter Expert (SME) to oversee the project, determine the root cause, and provide recommendations and assistance with implementing changes in the WebSphere Application Server ND V6 configuration and installation. TxMQ was asked to provide project leadership and the SME.
The Project Manager was responsible for providing a system analysis of the WAS ND application and system logs to identify the problem’s origin and recommend a course of action to resolve the issue. The SME was expected to analyze the WAS ND security and Tivoli Directory setup, propose changes with technical explanations for potential solutions, and then oversee the implementation of those solutions. The expectation at the end of the project was that all open IBM tickets and issues would be resolved and that WAS ND would be installed and configured properly.

The Response & Methodology

The first step on a long list of to-dos was to meet with the client’s on-staff system administrators to discuss all installation procedures and pre-install requirements. This was an imperative step to ensure that procedures were in place for resolving problems via IBM PMRs for WebSphere issues. After analysis, a technical explanation was developed that outlined several possible solutions. Once the client evaluated the recommendations and chose a solution, it was implemented. Subsequent testing was necessary, as the system needed to remain in compliance with the client’s change management policies and procedures. Throughout the project, the TxMQ consultants facilitated technical meetings with onsite resources and provided status reports to all parties involved.

The Result

The TxMQ consultant ensured that the project was completed and that the configuration of WAS ND remained compliant with all of the client’s change management policies and procedures.

Case Study: Middleware Gap Analysis

Prepared By: Allan Bartleywood, TxMQ Subject Matter Expert, Senior Consultant and Architect, MQ

Project Description

“Regional Bank A” has a technical infrastructure supporting application integration through an Enterprise Service Bus (“ESB”) serving as mediator between application endpoints and other backend systems. This tier consists of several IBM WebSphere products, including WebSphere MQ, WTX and Message Broker.
Working together, these products provide data transformation and routing so that data exchange occurs in native application formats in near-real time. Data transformation occurs primarily in WebSphere Message Broker, with connectivity to WTX for EDI-format map transformation following prepackaged EDI standards. Message flows are created by “Regional Bank A” projects to define routing and data-delivery rules for new or changed applications.
This environment requires regular, ongoing development support as well as quarterly software maintenance to apply software patches related to the Linux and Microsoft Windows operating systems.

Overview of Findings

The recent reviews conducted by the performing consultant include the infrastructure components indicated in subsection 1. In a review of infrastructure best practices for the financial services industry, the following findings were noted:
2.1 Monitoring Optimization for Performance Management & Capacity Planning
Generally speaking, there is significant opportunity to improve the monitoring approach to attain monitoring and management objectives in a way that is considerably more cost-effective than what is presently being practiced.
2.2 Infrastructure Security Strategy Following Pre-Regulatory Standards
Of note for companies operating in the financial services industry, the security-regulatory environment has changed significantly in the past 10 years. The reported number of breaches in 2012 was astoundingly high at more than 1,000 occurrences. With hacking efforts so intensely focused on financial services companies, it is imperative that security vulnerabilities be addressed as a priority, and that the highest standards and practices be implemented to guard against such attacks.
The areas identified for improvement are reviewed in the Security subsection below. There are several major components that must be addressed for “Regional Bank A” in the very near future.
2.3 Standards & Best Practices
Within the WebSphere product portfolio, there are several IBM standards and recommendations for installation, configuration and performance tuning for the infrastructure stack. In particular, the standards around the middleware-messaging components (“MQ”) were found to be inconsistent and in need of configuration management. Additionally, Java applications brokered on WebSphere Application Server were found to be running on Java Virtual Machines (“JVM”) that were not configured according to best practices across the board.
This type of situation generally occurs when multiple people are involved with installation and configuration activities, without the guidance and oversight of a middleware architect who would generally ensure that such standards are applied and documented across the topology. More observations and recommendations are shared in the subsections below.
2.4 Software Distribution and Deployment Automation
A review of “Regional Bank A’s” application-release process – i.e., how changes are made to the middleware environment – found the current process to be very informal. Because the environment is small, implementing automation now will provide significant process improvement and thus position “Regional Bank A” for growth. Without this automation, the ongoing cost of development efforts will continue to increase without accompanying gains in development output, due to the increasing complexity of changes and the effort required to manage so many moving parts. This area has been identified as a strategic area of investment for the “Regional Bank A” organization and for application-growth enablement.

Monitoring Observations

For infrastructures that include an ESB, the standard monitoring approach should encompass the entire end-to-end view of the production technical components at both the base server level and the application level. This will capture end-to-end business transaction success or failure, providing the ability to identify where specific failures are occurring. The approach should also include the ability to capture relevant data used for capacity planning, to understand and characterize the behavior of the end-to-end system, and to provide information used for middleware performance tuning.
“Regional Bank A’s” monitoring was found to be largely component-focused, concentrated at the hardware level. Some stats are being captured at all levels, but not with consistent granularity or storage that would make the data useful for analysis.
Examples of what is being monitored today include:

  • Real-time usage by PID using the Unix top utility
  • Some collection of server stats in the O/S

The areas of suggested improvements include:

  • At Operating-System Level – Capture state and usage of each host (physical or virtual); if running virtually, it is critical that the state is known for the physical mapping to virtual.
  • At Application-Monitor Level – Availability information depends critically on knowing the state (up/down/hung) of the application stack.
  • At Transaction-Monitor Level – Service management is dependent on knowing three things:
    1. How many transactions completed within the SLA?
    2. How many failed?
    3. How many were delayed?
  • It is also useful to know the service-response times, and stats concerning known bottlenecks such as page-load time, JVM utilization and metrics such as user-response time and invocation stats.
  • Proactive Monitoring – The plan for capacity high/low thresholds needs to be defined and regularly evaluated in response to events and situations where thresholds are exceeded but before an outage has actually occurred.
  • Performance Management & Capacity Planning – For effective cost management of this infrastructure, the initial implementation for the environments may be a subset of full capacity, with the intent to add to the environment as application growth occurs. To accompany this strategy, monitoring data must be captured and stored (using a data warehouse) for trending, tuning, and capacity-planning purposes.
  • “Regional Bank A” is currently not storing monitoring data for any significant length of time. Additionally, a data-maintenance strategy and centralized group to analyze and review performance data on a regular basis should be incorporated into the growth strategy.
  • Security – With recent regulatory changes, all unauthorized access of data must be reported. To comply, IT must extend its logging strategy and security-event log retention into this tier of the infrastructure, where application messages pass through and security could be compromised.
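The proactive threshold monitoring described above can be sketched in a few lines of code. The sample below is a minimal illustration only, not part of the bank’s tooling: hypothetical queue-depth samples are checked against a high-water percentage so that an alert can fire before a queue actually fills and causes an outage.

```python
# Minimal sketch of proactive queue-depth threshold monitoring.
# Queue names, sampled depths and thresholds are hypothetical.

HIGH_WATER_PCT = 80  # alert when depth exceeds 80% of maximum depth


def check_thresholds(samples, max_depth, high_water_pct=HIGH_WATER_PCT):
    """Return the queues whose sampled depth breaches the high-water mark."""
    limit = max_depth * high_water_pct / 100.0
    return sorted(q for q, depth in samples.items() if depth > limit)


# Hypothetical samples: current depth per queue (maximum depth 5000).
samples = {"APP1.REQUEST": 4750, "APP2.REQUEST": 120, "ARCHIVE": 4990}
breached = check_thresholds(samples, max_depth=5000)
print(breached)  # queues needing attention before an outage occurs
```

Stored over time in a data warehouse, the same samples also feed the trending and capacity-planning analysis recommended above.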
Security Observations

The IT Security components involved with this particular infrastructure include:

  • SSL Certificate Management
  • Operating System Level Security
  • Message Security
  • Secure Connection Management
  • General Application Level Security
  • Period of Access.

4.1 SSL Certificate Management Observations
There does not appear to be a centralized authority to govern the way certificates are issued, installed and managed for “Regional Bank A.” A general certificate-management process includes: certificate issuance (i.e. purchase and download), installation/configuration by an administrator, tracking and renewal of expiring certs, and a secure re-issuance process to avoid multiple-use and/or counterfeit certs.
It was observed that SSL certificates were found in various directories on the server. Moving forward, the recommendation is that certificates be stored immediately upon receipt in a secure Key Store. Certificate files should then be deleted from all other locations and system files.
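As a sketch of that recommendation, the commands below show one common way to move a received certificate into a password-protected key store using the JDK’s keytool utility. The file names, alias and store password are assumptions for illustration only; on MQ servers the equivalent runmqckm/iKeyman tooling may be used instead.

```
# Import a received certificate into a secured key store
# (file names, alias and store password are illustrative only).
keytool -importcert -alias vendor-ca \
        -file /tmp/vendor-ca.pem \
        -keystore /var/mqm/ssl/key.jks \
        -storepass changeit -noprompt

# Then delete the loose certificate file from all other locations.
rm /tmp/vendor-ca.pem
```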

Message Security Observations
  • MQM group on UNIX should not contain any members other than system IDs
  • All application IDs and people-user IDs should be placed in other group IDs that are specifically configured for their access and usage alone
  • Root should never be a member of the MQM group
  • “Minimum privilege” groups should be created and used for “read” access and configured in MQ Security to the objects required for usage
  • In outsourced IT environments, support groups should have minimum access privileges to prevent outages related to accidental operational support activity
  • Best practice is to use an MQ Change Request ID to access the MQM ID via the Unix “sudo” command for applying any changes or maintenance to MQ objects. This approach is also commonly referred to as granting access using a “Firecall” ID for specific instances when access is actually required while fully logging all activities performed by the ID during the period of access.

4.2 Application Connectivity For Message Queuing
MQ Client connectivity gives applications running remotely (on the application servers) the ability to put and get messages from MQ queues. During the review, it was suggested that all consumers of the MQ environment use only a single Client Channel definition. This is not recommended and falls outside of best practice for the following reasons:

  • Lack of application association on each individual connect and disconnect.
  • Security Authorization Records become extremely difficult to manage (for example, identifying who had access when an actual breach occurred).
  • Operational support resolution will require longer and possibly multiple outages to identify root cause of connection issues (applications that are long running).
  • Heightened risk of outages to larger groups of users: When a single consumer encounters a connection issue, there is higher risk that all consumers will be “kicked off” while a channel bounce is done to resolve connectivity issues.
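The best-practice alternative is to give each consuming application its own client channel. The MQSC fragment below is a sketch of that practice; the channel names, user IDs and queue names are hypothetical:

```
* One SVRCONN channel per consuming application (names are illustrative),
* so connections can be identified, secured and bounced independently.
DEFINE CHANNEL('APP1.SVRCONN') CHLTYPE(SVRCONN) MCAUSER('app1user')
DEFINE CHANNEL('APP2.SVRCONN') CHLTYPE(SVRCONN) MCAUSER('app2user')

* Authorize each application ID only to the queues it actually uses.
SET AUTHREC PROFILE('APP1.REQUEST') OBJTYPE(QUEUE) PRINCIPAL('app1user') AUTHADD(PUT,GET)
```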

4.3 Application Server Management
Observations include:

  • WAS processes running on the servers using the Root ID – this is a major security violation in the financial-services industry.
  • A “wasadmin” non-expiring Unix ID should be used for running all WAS processes.
  • Access to the “wasadmin” ID should be managed operationally, granting a “firecall ID” in the same manner as outlined above for access to the MQM ID for changes and support.

4.4 Middleware Security Using “sudo”
In UNIX, the sudo command controls access via groups or user IDs. Sudo can be restricted to explicit commands and options, and full auditing should always be enabled to log user activity.
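As a sketch of that control, the sudoers fragment below grants a hypothetical support group access to the mqm ID for two explicit commands only, with full session logging enabled. The group name, command paths and log directory are assumptions, not the bank’s actual configuration:

```
# /etc/sudoers fragment (illustrative names and paths).
# Log full session input/output for audit of all sudo activity.
Defaults log_input, log_output
Defaults iolog_dir=/var/log/sudo-io

# The mqsupport group may run only these MQ commands as the mqm ID.
%mqsupport ALL = (mqm) /opt/mqm/bin/runmqsc, /opt/mqm/bin/dspmq
```

This pattern pairs naturally with the “Firecall” ID approach described above: access is granted only for a specific change window, and every command run under the ID is logged.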

Standards and Best Practices

Throughout all aspects of the review, there appeared to be a disconnect between the “Regional Bank A” teams and the managed-services provider teams that were implementing and providing first-level support for both WAS and MQ. This disconnect can be resolved by:
1. Defining a single set of Standards, Practices and Guidelines, issued by “Regional Bank A,” to which the MSP and internal teams alike must adhere;
2. Setting up regular reviews of such policies and standards on a quarterly or project-by-project basis.
Architecture standards should exist in an ESB Architecture Guide, including the security policies for connectivity and access.
Other concerns and best practice observations are as follows:
5.1 WebSphere MQ
The WebSphere MQ queue managers control all incoming and outgoing data for the ESB, making them the key resource managers in the environment. Queue Manager base definitions were not found to be consistent; they varied from default settings for what appear to be arbitrary reasons, with high levels of inconsistency across system and application-object configurations. These configurations require some cleanup and maintenance to bring the environment in line with best practices.
Use of NFS within the Linux/VM environment is a recurring risk to high availability. When all other attempts to resolve an NFS issue have failed, the last resort is to bounce the NAS server, which results in an immediate outage of all NAS services for all system consumers.
Instead, moving to a direct-storage product like Veritas™ Volume Manager is a cost-effective and reliable practice for ensuring high availability across clusters.
Also, consideration should be given to implementing MQ AMS (Advanced Message Security) to ensure compliance with PCI-DSS standards. This product is used to enforce encryption of messages at rest in the MQ queues to ensure that any and all access to queues will not provide access to readable message content. AMS in conjunction with MQ Security restriction of access will go far in preventing unauthorized access within this tier of the overall application architecture.
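An AMS protection policy along those lines might be sketched as follows. The queue manager name, queue name and distinguished names are hypothetical, and the exact setmqspl options should be verified against the installed MQ version:

```
# Encrypt messages at rest on a queue with an AMS privacy policy
# (queue manager, queue and DNs are illustrative only).
setmqspl -m QMGR1 -p APP1.REQUEST \
         -s SHA256 -e AES256 \
         -a "CN=app1,O=RegionalBankA" \
         -r "CN=app1svc,O=RegionalBankA"
```

With a policy like this in place, a user who gains read access to the queue sees only ciphertext; only the named recipient can decrypt the message content.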
5.2 WebSphere Application Server
Several concerns were noted with the WAS implementation supporting “Regional Bank A’s” Java applications:

  • Operating systems not tuned according to minimum IBM standards
  • JVMs not tuned
  • Environment variables not being set
  • A single application per JVM profile, used on the assumption that this secures segregation of application data
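As an illustration of the kind of JVM tuning involved, the generic IBM JVM arguments below show typical settings for a WAS application server. The values shown are placeholders only; actual heap sizes and GC policy must come from workload measurement, not from this sketch:

```
# Illustrative IBM JVM arguments for a WAS application server
# (heap sizes and policy are placeholders, to be sized from measurement).
-Xms2048m -Xmx2048m       # fixed heap to avoid resize pauses
-Xgcpolicy:gencon         # generational GC, common for transactional workloads
-verbose:gc               # GC logging for tuning and capacity analysis
```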
Software Configuration Management and Deployment Automation

When changes are introduced into the ESB for software maintenance, when new applications are introduced, or when changes are made to enable better performance or transaction growth, a key area of concern for problem reduction and ongoing stability is to look at how such changes are introduced, tested and validated prior to deployment into the production environment where business transactions are running – the environment where interruption may involve loss of revenue for “Regional Bank A.”
Improvements in the following areas could be explored further for future engagement scope:
6.1 Software Configuration Management
How the application code is stored and version controlled is critical in the practice of software-configuration management. In addition, how the code is migrated to production is an area of extreme scrutiny for most financial-services companies. PCI compliance generally requires the demonstration of secure and formal access control around all source code and code-migration activities to production systems to ensure against introduction of rogue code or malware on financial systems.
Generally speaking, this is an area where best practice is quite mature as related to CMM and pre-Y2K efforts to manage the deployment of massive amounts of code change without business interruption – more from a stability and availability-management perspective.
Since most problems are related to changes made within the environment, most financial-services IT organizations are quite strict and process-oriented, with significant automation around the software-development life cycle (“SDLC”) to ensure against business disruption due to release testing in an environment that is not managed and controlled with the same configuration as production.
At “Regional Bank A,” application deployments are a highly manual effort with some utilization of homegrown scripts, which are subject to human error, inconsistent configurations, and are time-consuming to manage and support.
Key concerns with the current software-distribution strategy include:

  • High degree of error and inconsistencies
  • High labor cost in deployment process
  • High risk of losing skills relating to custom-deployment process and administrator knowledge of each application’s configuration and deployment requirements

Use of automation tools should be considered where:

  • Application changes are packaged into deployment “bundle” with clear associations between the configuration changes and application release dates to each environment
  • Automation tracks all individual components that constitute a “bundle”
  • Fully automated backend process (including automated back-out of changes)
  • Provides push-button controlled/approved and self-service levels of deployment process
  • Logging of changes for configuration-management auditing
  • Maximizes access control for PCI-DSS compliance

6.2 Deployment Automation
In addition to managing the source code repository itself, management of deployment through deployment automation that encompasses both application changes as well as system changes is considered a best practice.
Though it is conceivable that scripts could be written to accommodate all of the various types of changes to all of the possible WebSphere products and components involved in the “Regional Bank A” ESB configuration, this is not recommended due to the high complexity and amount of time required, which adds to the overall cost of maintaining homegrown deployment scripts.
This reason alone is perhaps why “Regional Bank A” deployments continue to be manual in nature.
In evaluating the available tools and utilities for automated deployments for WebSphere, consider that deployments of ESB changes are generally of two types:

  • Application Changes – New message flows, new application queues, new .ear files or WTX maps, all of which are associated with each other for version control within the application bundle.
  • System Changes – The applying of hot fixes, fix packs and major software-version upgrades. System changes could also involve environment-configuration settings such as adding a new application ID, group access, a database driver, resource-connection pooling configuration and other parameters that enable better performance and throughput. Additionally, WebSphere and Java version levels are somewhat independent of each other, though they often have critical dependencies on each other in terms of application functionality, and thus require configuration management alongside application bundles.

As a result of the above, it is recommended that “Regional Bank A” consider packaged products that will automate systems as well as application deployments and manage technical dependencies without the use and ongoing maintenance of deployment scripts.
Because of the complexity of the ESB configuration, such products as Rational UDeploy in conjunction with Rational Team Concert are now considered a best-of-breed product combination for managing application configurations and software distribution for complex multi-product ESB customers.
In closing, the review and recommendations above should be considered for initiating infrastructure projects that will address and close the items of key concern. Additionally, the initiation of projects for addressing future automation, performance management and growth of the ESB should also be considered in both the near future and beyond for strategic reasons, as well as for ongoing compliance and growth supportability.

Case Study: Middleware Compliance For Government Agencies

Project Description

Government client required oversight for proper implementation and development of integral middleware products.

The Situation

Our government client required a project manager to oversee, provide guidance to and collaborate with the MIP team on MIP and non-MIP application and system issues, including the design, development and implementation of enterprise applications using a variety of middleware products.
It was imperative that the job run smoothly, as our client’s engineering services provide highly efficient program management, financial management, systems engineering, and Integrated Logistic Support (ILS) to boost Navy crew safety and streamline processes.

Project Description

TxMQ’s consultant was required to manage a diversely skilled group of developers, including IBM subcontractors, and to identify roles and responsibilities so the development team could correctly implement and develop the MIP Hosting Site Migration and System Upgrade.
The guidance for the integration, design and development incorporated advice on using the following middleware products:

  • WebSphere Process Server
  • WebSphere Portal Server
  • DB2 Content Manager
  • IBM HTTP Server
  • Tivoli Web Seal
  • Tivoli Access Manager
  • DB2
  • IBM Directory Server
  • Active Directory
The Response/Methodology

The initial step in the project included dividing and handing off development activities to members of the upgrade test team for MIP and Non-MIP applications.
Coordination and collaboration were imperative across internal cross-functional teams and external vendors. The project manager supervised these exchanges, ensuring that each party had the information needed to complete the task at hand. With the project manager overseeing the development team, managing expectations and providing direction, the process of code maintenance and technical deployment of new code and patches began.
Throughout the process our project manager updated the client via status meetings and written internal project reports.

Compliance

It was the responsibility of our project manager to ensure compliance with CMMI SW Level 3 processes for the work managed at ZAI.

Case Study: WMQ HealthCheck

Project Description

Large regional bank commissioned TxMQ to conduct a WebSphere MQ environment review to maximize use of resources and technology.

The Situation

TxMQ’s client, a large regional bank, requested a WebSphere MQ-environment HealthCheck. The Bank uses WMQ as a shared middleware infrastructure for all critical business applications to send and receive critical data between them.
In addition, the bank requested that TxMQ identify a solution configuration to meet the bank’s needs.

The Objective

The first key objective for the HealthCheck was to provide senior architecture consulting and a broad assessment and review. The assessment and review would be used to provide flow documents and possible recommendations including, but not limited to:

  • Software and system-configuration changes or upgrades/support packs needed
  • Technical impact and performance analysis
  • Response to solutions impact
  • Written recommendations based on meetings with stakeholders
  • Stakeholder concerns mapped to specific infrastructure objectives

A second key objective for the HealthCheck was to analyze the WMQ environment and configuration to eliminate any potential performance, availability and security exposure, and to ensure there were no roadblocks for future growth and new applications. This included an analysis of current staffing structure, number and size.
The final objective was to analyze the proposed capacity-increase solutions and the impact of each solution to support increases in business processes up to 100% of the existing transaction volumes. The analysis would help the client effectively plan an MQ-capacity management strategy, identify technical dependencies and assist with identifying costs and benefits.

The Response

TxMQ’s consultant – an MQ Subject Matter Expert – spent a little more than a week onsite with the MQ team at the bank. TxMQ dove into a diagnosis of the current OS and middleware environment, which included CICS V3.2, WebSphere MQ V7.01 and DB2 V8.
The client had more than 23 MQ queue managers spread over Mainframe, AIX and Windows. Some queue managers were used for development and testing and were standalone. For production, many of the applications went through more than one queue manager.
In conjunction with the bank’s internal teams, the TxMQ consultant was able to deep-dive into the environment and architecture and create a list of observations and recommendations.

Findings

The HealthCheck revealed no major or current issues. Our consultant’s general observations were as follows:

  • Sound architecture and topology
  • WMQ V6 is due for upgrade
  • Uniform procedures and proper validation for software promotion
  • Add additional load-test environments to facilitate updates
  • Excellent logic and implementation of naming conventions
  • Offsite z/OS image available for regular testing
  • Offsite z/OS image available for regular disaster-recovery exercises
  • Solid business-continuity plans
  • Infrastructure is ready, should disaster recovery be needed
  • Tivoli OmegaMon XA used for monitoring
  • No high-level application and architectural documents readily available
  • System is well-tuned
  • Security is solid with only a small number of areas that need to be addressed
  • Excellent IT skills team with good problem-determination skills
  • Excellent working group with positive attitude
  • Staffing is lean compared to similar organizations

In the last 13 months there were three system outages due to application issues. There is no concern with the number of instances, and it was determined that there is a well-designed and functioning system in place for tracking and logging instances.
Additional Recommendations

The TxMQ consultant provided the following additional recommendations to the bank:

  • Evaluate recent WMQ security features
  • Plan for WMQ migration to V7.1 or V7.5
  • Modify the current WMQ recovery topology to improve recovery time in the event of a mainframe outage
  • Implement shared queues when Sysplex is available
  • Evaluate additional ESB architecture
  • Review staffing
