How We Manage Amtrak's Challenging Infrastructure

April 25, 2008

Managing a data center is tough. Managing many data centers is tougher. But managing three data centers, along with servers and clients spread out across hundreds of remote locations, is almost impossible. Amtrak's Karen Shockley says the railroad does it all by trying to do one thing very well: standardization.

Amtrak has always been a combination of the new and the old. Formed by combining nine remaining passenger railroads, Amtrak brings with it decades of railroad lore. Although formed by the past, it was fashioned for the future.

Like the company, Amtrak's IT infrastructure embraces the new while retaining much of the old. As an overview, Amtrak's IT supports the following: TPF (Transaction Processing Facility), a mainframe operating system developed explicitly for the fast response time required by the reservation industry; z/OS (the operating system for IBM's z900 series of large mainframes) for business applications such as inventory management and time recording; Solaris servers for the e-commerce applications, the Work Management System and HR/Finance applications; and the ever-present Windows servers for applications ranging from the company's intranet to e-mail servers.

As Amtrak's IT has grown, the need to consolidate has introduced VMware servers for virtual systems, Citrix for virtual desktops, SAN (storage area network) and NAS (network-attached storage) for shared directories and file systems, and enterprise-wide database servers. To add to the complexity, Amtrak's IT equipment resides at three main data centers and hundreds of remote sites across the country.

The Secret Is Standardization

The "secret" to managing such diverse entities is, as you may have guessed, standardization. This means standardization of server builds, software components, hardware, backups and monitoring software, as well as policies and procedures.

Another way of thinking of this is the "boring is good" philosophy. If, every time you look at a SQL Server, you see that it has the same files on the same drives in the same directories, then you know you have reached the epitome of boredom. And, in the computer management arena, boring is good.
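One way to picture the "same files on the same drives in the same directories" test is as a comparison against a single baseline. The sketch below is only an illustration: the drive letters, directory names and the `drift` helper are all hypothetical and are not taken from Amtrak's actual build standard.

```python
# Hypothetical standard layout for a SQL Server build (illustrative paths only).
BASELINE = {
    "data": "D:\\SQLData",
    "logs": "E:\\SQLLogs",
    "backups": "F:\\SQLBackups",
}


def drift(server_layout):
    """Return {role: (expected, actual)} for each role that differs from the baseline."""
    return {
        role: (expected, server_layout.get(role))
        for role, expected in BASELINE.items()
        if server_layout.get(role) != expected
    }
```

A server that matches the baseline returns an empty dictionary, which is the "boring" result the standard is aiming for.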

Tools and capabilities may change from server to mainframe, but the procedure is always the same. Here are some examples:

Customer Interface for Support

Amtrak has one (and only one) Help Desk for all user concerns: desktop, applications and hardware. The same software tool captures the data, as well as the "who, what, when and where." (After we solve the problem, we try to determine the "why.")

If the Help Desk cannot resolve the issue, then the same process is followed for all issues. Critical issues are paged out to support personnel and certain customers. Each type of support required is sent to the appropriate "queue" of individuals, based on the application and location of hardware. As the issue is worked, the customer is contacted for problem details and to confirm resolution.
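The escalation rule described above (one process, with the queue chosen by application and hardware location, and critical issues paged out) could be sketched roughly as follows. The queue names, applications, locations and the `route` function are invented for illustration; the article does not describe the actual Help Desk tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    application: str
    location: str
    critical: bool


# Hypothetical routing table: (application, hardware location) -> support queue.
QUEUES = {
    ("reservations", "data-center-1"): "mainframe-ops",
    ("email", "data-center-2"): "windows-admins",
}
DEFAULT_QUEUE = "help-desk-escalation"  # assumed fallback when no queue matches


def route(ticket):
    """Return (queue, page_now) for a ticket the Help Desk could not resolve."""
    queue = QUEUES.get((ticket.application, ticket.location), DEFAULT_QUEUE)
    # Every ticket follows the same path; only critical ones trigger a page.
    return queue, ticket.critical
```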

Health of Systems

We use the same monitoring tool on every midrange server in the enterprise. The tool is configured to monitor various items, including system uptime, running services, available storage and CPU utilization. Should any of these cross its threshold, an alarm is sent to the main console, which is monitored 24x7 in one-hour shifts. An alarm is also sent via e-mail to the administrator of that system.

The goal here is to be proactive by addressing situations before they become problems. Using one tool across the enterprise allows for one picture of the state of the enterprise for the server environment. While the mainframes use different tools, the concept is the same: Operators are alerted before a problem occurs.
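As a rough illustration of the threshold-and-alarm idea in this section, the sketch below checks a few metrics and addresses each alarm to both the main console and the system's administrator. The threshold values, metric names and function names are assumptions made for the example; the article names the kinds of items monitored but not the tool or its limits.

```python
# Hypothetical limits; real values would come from the monitoring tool's configuration.
THRESHOLDS = {
    "cpu_percent": 90.0,        # alarm when CPU utilization rises above this
    "disk_free_percent": 10.0,  # alarm when available storage falls below this
}


def find_alarms(metrics):
    """Return a message for each monitored item that has crossed its threshold."""
    alarms = []
    if metrics["cpu_percent"] > THRESHOLDS["cpu_percent"]:
        alarms.append(f"CPU at {metrics['cpu_percent']}%")
    if metrics["disk_free_percent"] < THRESHOLDS["disk_free_percent"]:
        alarms.append(f"only {metrics['disk_free_percent']}% storage free")
    for service in metrics["stopped_services"]:
        alarms.append(f"service {service} is not running")
    return alarms


def notify(server, admin_email, metrics):
    """Address every alarm to the main console and to the server's administrator."""
    return [
        (recipient, f"{server}: {alarm}")
        for alarm in find_alarms(metrics)
        for recipient in ("main-console", admin_email)
    ]
```

Because the same check runs against every server, the console sees one consistent picture, and a healthy server simply produces no notifications.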

Backup Strategy

Amtrak again chose the same "boring" solution across the board for all midrange servers—whether they are Windows, VMware guest systems, Solaris, SQL DB (database) Servers or an Oracle Enterprise cluster. This way, we have one report on which to check for errors, one product on which to train people and one process for storing and restoring data.

Change Control

Once again, as you may have guessed, there is one change control process for all changes, whether they involve upgrading applications, installing new software, adding hardware or routing the network. A weekly change meeting is held with representatives from all the IT walks of life. Each record is reviewed for its impact not only on the requested program but also on others that may be affected. Also, while each record has its own list of required approvers and its own risk category, the process ensures that all items receive the required attention.

Another important note—we have kept our Change process very simple. There are only three risk categories:

1. Emergency (used to fix a critical problem)

2. Standard (changes planned two weeks in advance)

3. Administrative (tasks that don't change the environment itself, such as adding users or updating train station data, and that need only a three-day lead time)
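The three categories above map naturally onto a small lookup of lead times. In this sketch the two-week and three-day figures come from the list; the zero lead time for emergencies and the use of calendar days rather than business days are assumptions, and the names `Risk` and `earliest_start` are hypothetical.

```python
from datetime import date, timedelta
from enum import Enum


class Risk(Enum):
    EMERGENCY = timedelta(days=0)       # assumed: no waiting period for a critical fix
    STANDARD = timedelta(weeks=2)       # planned two weeks in advance
    ADMINISTRATIVE = timedelta(days=3)  # three-day lead time


def earliest_start(category, submitted):
    """Return the first date a change in this category could be carried out."""
    return submitted + category.value
```

For example, a Standard change submitted on April 25, 2008 could begin no earlier than May 9, 2008 under these assumptions.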

Share Information

Once again, our philosophy is, if everyone has access to the same information—whether they are a customer or manager or support person—then a problem will never have the opportunity to be "exciting." In other words, everyone's response will be more or less relegated to, "Yes, I know about the problem already; tell me something new."

To illustrate, we have status reports that are sent out six times a day—with updates on problems, as well as on the status of our critical systems. Additionally, our internal Web site carries the same status reports 24x7. While this doesn't guarantee that everything is received as routine, it does go a long way towards satisfying our customers. They know when we have a problem and that we are working on it.

We also use a VRU (Voice Response Unit) to share information. If a critical system is not working or a site is down, we provide a message that users hear when they call the Help Desk to report a problem. This is but another avenue for sharing the fact that we know there is a problem and that we are working on it. After all, there is almost nothing more frustrating than having a problem and not being able to tell someone about it. Just as frustrating is not knowing whether "anybody up there" in charge is doing anything about it.

Standardization

As I have alluded to previously, we have standardized where possible. We use only Windows and Solaris servers. They are built according to standards (as much as possible, depending upon the application). Also, we use only SQL Server for databases on Windows and Oracle for databases on Unix. This means that when a technician researches a problem, the same techniques can be used almost across the board.

Continuity of Operations

We have pretty much taken the "excitement" out of maintaining and operating Amtrak servers. We use the same customer procedures for support. We use the same tools for backup and monitoring. We use the same change-control procedures. We prevent excited rumors from developing by keeping customers informed. Plus, we have given ourselves a limited variety of hardware and software to troubleshoot.

Keeping Murphy's Law in Mind

However, Murphy being who he is, there is always the chance that a system will fail despite RAID, redundant power and redundant network connections. For those systems we consider critical, we have clustered solutions and different levels of automated failover, depending upon business requirements. That way, even though we still have to address the issue, we are doing it in a way that is transparent to the customer.

Yes, Amtrak still has legacy applications that are twenty years old. But we also have some of the newest technology, such as VMware, SANs and NAS. And, through the methods explained above, we all manage to happily coexist.

Karen Shockley has an extensive IT background that includes all phases of the systems engineering lifecycle. Currently Director of Amtrak's Enterprise Data Centers, she is responsible for the 24x7 operation of over 1,000 midrange servers and mainframes.

Shockley began her career in the U.S. Air Force, learning structured programming methodologies and concentrating on Quality Assurance. After nine years in the Air Force, she moved to SAIC, where she led an integration effort for a Corporate Executive Information System.

Shockley led Tiger Team efforts at Amtrak, was the mainframe liaison to IBM, managed the operational effort for an SAP implementation and has earned Project Management Professional, INCOSE and ITIL Foundation certifications.

Shockley has a degree in Physics from Miami University of Ohio and a Master of Science in Computers from the University of Oklahoma. She has published articles on Data Warehousing, Meta Data and Customer Relationship Management. She can be reached at shocklk@amtrak.com.
