Virtual case study: How an airline can find efficiency with Unix
How can an airline fly today? By going way off scale with Unix. The story of SafetyJet International
By: Paul Murphy
Jan. 4, 2002 12:00 AM
(LinuxWorld) -- Editor's note: The following scenario is that of a consultant brought in early in the business planning cycle for a new airline. Unlike the other articles in this series, this example does not reflect the author's real-world consulting experience. Murphy worked for an airline, but nothing on this scale. SafetyJet is a wholly imaginary company constructed solely to illustrate three outstanding Unix characteristics:
[Image: a 21-inch NCD900.] Companies like Sun, IBM, and Thinknic make others. Typical MTBF ratings are in the 300,000-hour range and there are neither moving parts nor user-accessible OS components to cause failures.
In this case, the problems are real, the proposed solutions arguable, and the consulting assignment described is not based on real events.
A group of investors has reacted to recent upheaval in the airline industry by commissioning development of a business plan for a new airline. Their terms of reference to the design team, of whom I represent the IT end of things, are:
On this basis, our team has agreed that:
My job is to design and approximate the cost of an information architecture to support their vision.
The airline business, perhaps more than any other industry, is about large numbers. Some examples:
There are several high-volume, limited-distance routes in Canada. The Edmonton to Calgary distance, for example, is about 180 miles downtown to downtown but only about 145 miles airport to airport. The local discount carrier's one-way ticket costs from $54 to about $95. Inclusive of parking and cab fares, a typical day trip costs from $160 to about $220 before the new tax adds about 20 percent to the ticket price and 15 percent to the lowest net cost. For most people, most of the time, the end-to-end trip takes about three hours -- but that same average person can, especially if he remembers the inevitable police speed traps around Red Deer, make that same end-to-end trip by car in about 3.5 hours.
At a reimbursement rate of 27 cents per kilometer and $20 for destination parking, driving costs about $177. This compares to $184 for the lowest net cost for air transportation after the new tax is added. Since most people feel safer and more in control of their own schedules in their cars than in an airliner, the effect of the new tax will be to push people onto the highway -- thereby penalizing the airline industry and raising the overall rate of death, injury, and property loss among people making the trip.
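The comparison above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the driving figure is a round trip over the 180-mile downtown-to-downtown distance and that the air figure is the $160 lowest day-trip cost plus the 15 percent net increase. The function names are illustrative, not from the article.

```python
KM_PER_MILE = 1.609344

def driving_cost(one_way_miles, rate_per_km=0.27, parking=20.0):
    """Round-trip mileage reimbursement plus destination parking, in dollars."""
    round_trip_km = 2 * one_way_miles * KM_PER_MILE
    return round_trip_km * rate_per_km + parking

def air_cost_after_tax(lowest_day_trip=160.0, net_increase=0.15):
    """Lowest day-trip cost by air after the new tax raises it by net_increase."""
    return lowest_day_trip * (1 + net_increase)

drive = driving_cost(180)      # assumed: 180 miles each way
fly = air_cost_after_tax()     # assumed: $160 base, 15 percent increase
print(f"driving: ${drive:.2f}  flying: ${fly:.2f}")
```

Under those assumptions driving comes out within a dollar of the $177 quoted, and flying matches the $184 figure.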
Airport authorities pose the biggest challenge to SafetyJet's business plan because they control access to their local markets and can invoke literally thousands of regulations to smother almost any change initiative. Our plans will arouse their hostility because bussing passengers around airport delays reduces both their revenues and their control. This, I'm assured by the experts responsible for the project, will be the primary regulatory battleground on which the airline's potential for success or failure is going to hang. Within this context the major IT challenges are:
None of this is very difficult when you have one or two small airplanes that zip back and forth on domestic routes of four hours or less. However, the resource-scheduling problem gets exponentially more complex as you scale up. By the time you get to eight airplanes, 45 daily flights, and 120 flight crew, the problem has exceeded human capacity. At 100 airplanes, inefficiencies in the solutions used can add up to 5 percent or more of total operating costs -- more than the bottom line -- and it just gets worse as you get larger.
When airlines started to experience this complexity, in the 1920s, computers had not been invented and people simply did the best they could with manual means. By 1970, airlines had invested heavily in the use of computers for reservations but computerization, outside of military logistics planning, had yet to make inroads into scheduling. By 1980 that had changed, with major airlines investing in Cray and other supercomputer gear to attack this problem. The operations research groups created within airlines to do this were not, of course, up to solving the entire problem and so concentrated on specific subsets where they hoped to have the greatest short-term impact on profitability. As a result, organizational structures and technical disciplines evolved around easily identifiable problem subsets.
These are mainly maintenance scheduling, flight scheduling, crew rostering, and pairings (matching inbound and outbound routes to bring crews back to home bases). The classic text in the field of optimization is Harvey Wagner's Principles of Operations Research (Prentice Hall, New Jersey, 1969) although many people will find Claude McMillan's Mathematical Programming (Wiley, New York; 1970) rather easier to follow.
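To make the pairing idea concrete, here is a minimal sketch that enumerates sequences of flight legs that leave a crew's home base and bring the crew back to it. The flight numbers, times, and the half-hour minimum turnaround are hypothetical; a production pairing generator would also handle duty-time rules, costs, and far larger leg sets.

```python
from typing import List, Tuple

# (flight id, origin, destination, departure hour, arrival hour) -- hypothetical legs
Leg = Tuple[str, str, str, float, float]

LEGS: List[Leg] = [
    ("SJ101", "YEG", "YYC", 7.0, 8.0),
    ("SJ102", "YYC", "YEG", 9.0, 10.0),
    ("SJ201", "YYC", "YWG", 9.5, 12.0),
    ("SJ202", "YWG", "YEG", 13.0, 15.5),
]

def pairings(home: str, legs: List[Leg], min_turn: float = 0.5) -> List[List[str]]:
    """Enumerate leg sequences that leave `home` and end back at `home`."""
    found: List[List[str]] = []

    def extend(chain: List[Leg]) -> None:
        last = chain[-1]
        if last[2] == home:
            # Crew is back at base: record the pairing and stop extending it.
            found.append([leg[0] for leg in chain])
            return
        for leg in legs:
            # Next leg must depart where the last one arrived, after a turnaround.
            if leg[1] == last[2] and leg[3] >= last[4] + min_turn:
                extend(chain + [leg])

    for leg in legs:
        if leg[1] == home:
            extend([leg])
    return found

result = pairings("YEG", LEGS)
print(result)
```

Even this toy case shows why the problem grows quickly: every extra leg out of an intermediate city multiplies the number of candidate pairings an optimizer has to choose among.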
There has been tremendous progress in both the theory and practice behind the computation of actual solutions to various scheduling models. A problem run on a 300-MHz Sun UltraSPARC IIi takes about 83 hours to solve using one of the best available 1990 "Solvers" (CPLEX 1). The same problem with the same hardware now completes in less than 3 minutes using the latest CPLEX release. For larger problems susceptible to the best modern algorithms, the improvement on identical hardware over the decade is approximately 4,000 times. Coupled with improvements in hardware, those algorithm gains make it possible to solve problems that were once considered unthinkably complex -- including the original integrated problem that, because it could not be solved, led to the segmented approach still deeply embedded in most airline organizations today.
The most fundamental problems to be addressed by the technology solution are:
In this context, security is an aspect of reliability, and completeness an aspect of all three main requirements: reliability, speed, and accuracy.
Initially, I expected to be able to break the IT components down into major sections, each of which could then be dealt with separately through purchase and deployment of one or more commercial packages. Those beliefs, however, turned out to be extremely naive. I had assumed, for example, that we could buy or license so-called "revenue cycle" software and resource allocation software. That's the way most airlines do it but, on closer review, the two sets of problems turn out to be complementary and so logically part of one system. The revenue cycle starts when a passenger makes a service request and ends when that service has been delivered and all consequent liabilities have been satisfied. The resource allocation process controls how that service is delivered.
Basically this is just a very-large-scale ERP problem, but breaking these things up into the widely separated islands of automation needed to address them with 1970s solutions guarantees that attempts to fit them back together produce unnecessary inefficiencies. What gradually dawned on me as we toured airline data centers and talked to both sellers and users of this software is the extent to which airline data processing traditions hold airline operators hostage. On the surface, the problems these guys deal with are huge and the solutions, experientially evolved over decades, hold together well enough. Look a bit deeper, however, and several things stand out: SafetyJet isn't in the business of fulfilling travel agent orders, nor do we need to route transcontinental passengers through a half dozen short hops. We work, instead, directly for the passenger and conceptualize the airline operation as nothing more than a fast link in a bus service between major downtown points of arrival and departure anywhere in Canada or the US. To SafetyJet, the fundamental service issue isn't fulfilling flight orders but getting passengers from where they are to where they want to go. As a result, we eventually decided to recommend custom development instead of licensing in order to: That decision wasn't easy to make and will be even harder to defend if, or when, the bosses get the regulatory agreements they need to proceed and all of this suddenly acquires reality. The key computing workloads to be addressed within the integrated database framework are: There are, in addition, a range of smaller, limited-purpose systems that interact with the database framework at one remove -- i.e., with a filtration step. These include: The most basic of the ideas underlying the software development project is that the passenger is part of the optimization equation. 
Flying an airplane from one place to another, and all the organizational complexity it takes to deliver that flight, is a means to an end, not an end in itself. The airline's business is about moving people, not airplanes. In effect, the optimizer will direct minute-by-minute operations in an attempt to minimize cost while picking up and delivering both people and freight according to a defined schedule. In that context, our fundamental passenger story is:
The airline's job is to make that happen with a minimum of risk, effort, or complexity within the period set by the customer. For example:
In a perfect world, this would be easy but, in the real world, complications can include:
In theory, the Real-Time Airline Operating System (RTAOS) design can mitigate the overall impact of third-party delay. For example, a 25-minute take-off delay in a United flight from New Orleans to Denver can intersect a 3-minute Denver landing window for a SafetyJet arriving from Winnipeg -- potentially delaying landing by 15 minutes and causing two crews, and possibly 384 passengers, to wait. This has obvious direct costs to the airline for things like fuel and maintenance but may also have the indirect effect of requiring that the incoming crew take a four-hour rest period -- because, without it, they would be over their allowed duty time on arrival in Winnipeg. That means a different crew has to be sent -- putting three crews off-schedule and two out of place. If alerted early enough, Operations can burn slightly more fuel to get that aircraft into United's nominal slot -- avoiding the problem and saving money for both airlines.
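The Denver example reduces to an interval-overlap check. The sketch below assumes hypothetical clock times for the two landing windows (the article gives only the 25-minute delay and the 3-minute window); `Slot` and its methods are illustrative names, not part of any RTAOS design.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Slot:
    start: int  # minutes after midnight
    end: int    # minutes after midnight, exclusive

    def shifted(self, minutes: int) -> "Slot":
        """Return this slot moved later by the given delay."""
        return Slot(self.start + minutes, self.end + minutes)

    def overlaps(self, other: "Slot") -> bool:
        """True when the two half-open intervals share any time."""
        return self.start < other.end and other.start < self.end

# Hypothetical times: United planned for 14:00-14:03, SafetyJet for 14:25-14:28.
united_nominal = Slot(14 * 60, 14 * 60 + 3)
safetyjet = Slot(14 * 60 + 25, 14 * 60 + 28)

united_delayed = united_nominal.shifted(25)        # the 25-minute take-off delay
conflict = united_delayed.overlaps(safetyjet)      # delayed flight lands in our window

# Mitigation described in the text: SafetyJet speeds up into United's vacated slot.
swap_is_clear = not united_delayed.overlaps(united_nominal)
print(conflict, swap_is_clear)
```

The point of the sketch is that the conflict is detectable as soon as the take-off delay is known, which is what gives Operations time to act.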
The obvious best solution, if technically feasible, to the cost trade-offs implicit in these processes is to integrate consideration of passenger (and freight) issues along with all other operating parameters in the dynamic programming model used for resource allocation. If:
Then, operations should be able to maintain an overall schedule optimized in terms of passenger needs while giving up a minimum in cycle time, fuel, or other operational penalties.
Overall optimization for passengers does not, of course, mean individual optimization. The occasional passenger may find herself temporarily re-routed to Alaska or stranded in Saskatoon but the system would continually adjust itself to produce the best possible result for the majority of passengers the majority of the time. As it happens, that's also the best possible result for the airline because minimizing passenger waiting times and ground travel distances is usually the same as keeping the fastest gear, i.e., the airplanes, busy earning money.
The critical design question is therefore clear: can the scheduling problem be formulated in such a way as to be sufficiently inclusive to generate usable results and yet solve in near real-time, particularly if we define the latter as "generally less than one minute"?
Formulating the scheduling problem requires considerable expertise and an immense amount of data -- both of which should be available. Since there is no compelling reason to believe that the problem cannot be properly formulated, we'll assume that it can be and concentrate on options for solving it. The actual problem size is difficult to predict, but inclusion of passenger concerns will add less complexity than might be expected because many passenger constraints are linearly dependent -- meaning that a full linear program might have 100 million rows and 200 million columns but the subset of interest will usually be at least an order of magnitude smaller on each dimension. There are some givens in solving this.
For example, the use of the Informix database with Tuxedo is a given in view of the reliability requirements for the transactions environment and the consequent need to keep the two data centers fully synchronized. Since this requirement also amounts to a Solaris specification for the primary transactions processing and database hosting jobs, the real architectural issue for the Solver lies between:
It is clear that the single machine approach would work with today's computing gear for a ten-airplane operation -- what SafetyJet will be at start-up. The real question, however, is what happens eight months or a year later, when second-round financing enables explosive growth to the 100-airplane level. It would be suicidal to build the greatest piece of airline software ever, only to have to abandon it as unworkable if growth drives the problem size past hardware capabilities.
What I need to predict, therefore, is whether or not the problem can be run, with a target solution time of one minute or less, on hardware available about two years from now. The best guide to that is, of course, performance today but I don't have good information on the distribution of solution times as a function of problem complexity on this scale. It is relatively easy to predict how long it will take, for any given set of hardware, to load the problem and to run pre-solve (collapsing redundant row and column information) preparatory to invoking barrier or other algorithms to produce the optimal solution. Beyond that, however, the actual time-to-solution depends far more on the applicability of the algorithms used to the specific data set attempted than on the hardware.
The easy approach to scaling up is to add machines. There are several research projects aimed at making use of thousands of individual machines and even an off-the-shelf product such as Sun's Grid Engine can be used to push the problem out across a network of cooperating machines.
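As an aside on the pre-solve step mentioned above, the sketch below shows the simplest form of "collapsing redundant row and column information": dropping exact duplicate constraint rows and columns whose coefficients are all zero. It is a toy under stated assumptions, not how CPLEX or any other commercial solver implements pre-solve, and the `presolve` function and row layout are invented for illustration.

```python
from typing import List, Tuple

Row = Tuple[Tuple[float, ...], float]  # (coefficients, right-hand side)

def presolve(rows: List[Row]) -> Tuple[List[Row], List[int]]:
    """Drop duplicate rows and all-zero columns.

    Returns the reduced rows and the indices of the columns that were kept.
    """
    seen = set()
    unique: List[Row] = []
    for row in rows:
        if row not in seen:          # exact duplicates add no information
            seen.add(row)
            unique.append(row)

    n_cols = len(unique[0][0]) if unique else 0
    # A column with only zero coefficients does not appear in any constraint.
    kept = [j for j in range(n_cols)
            if any(coeffs[j] != 0 for coeffs, _ in unique)]
    reduced = [(tuple(coeffs[j] for j in kept), rhs) for coeffs, rhs in unique]
    return reduced, kept

rows = [
    ((1.0, 0.0, 2.0), 10.0),
    ((1.0, 0.0, 2.0), 10.0),   # duplicate of the first row
    ((0.0, 0.0, 1.0), 4.0),
]
reduced, kept = presolve(rows)
print(reduced, kept)
```

Because this pass is a straightforward scan of the data, its running time scales predictably with problem size, which is consistent with the observation that load and pre-solve times are the easy part to forecast.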
Our requirement for near-real-time answers means, however, that large-scale, Internet-based compute sharing isn't going to be viable because:
For us, therefore, a distributed approach means a Linux or Solaris/Intel compute farm with racks of dedicated processors. Although the actual compute time under either hardware scenario depends more on relationships in the data for a specific problem than on its size, we know that the primary predictor of relative efficiency between the two solutions is the number of iterations that require information from outside the local compute block to proceed. The more linkage has to be accounted for, the better the single-machine approach will look. Unfortunately, however, experience with smaller problems, on the order of a million rows, does not translate directly to problems with 40 million or more rows and so we won't know the answer to this until we run actual trials.
A benchmark that may describe a best case uses a real-world fluid dynamics program. This has reasonable complexity and scale along with known high separability between components and should, therefore, mark the upper limit on the effectiveness of a distributed processing solution. The benchmark data posted at Fluid.com may, therefore, provide a useful guide for our decision. Although the numbers as presented need some interpretation, the results show that, for the largest problem benchmarked:
In the time since this particular test was run, two significant changes have been made to the Starfire:
Since both of these address critical components of our problem, we can reasonably speculate that a newly configured Starfire 15K with all 106 possible CPUs installed should score around 65. This is about what you'd get if you replaced each of the gigahertz P3 boxes in the IBM X-series with a dual-CPU Xeon chasing a gigabyte of RAM at about 1.7 GHz.
The worst case would occur if separability is minimized.
A benchmark that tracks that is provided by the Transaction Processing Performance Council's analytical processing test. Results are not linearly comparable across database sizes but reasonable estimates suggest that the workload increases by about 3.3 times per transaction as the database grows from 300 GB to 1 TB. On that basis, an eight-processor Proliant with Microsoft Windows 2000 and SQL Server, which achieved 1,506 QphH on the 300-GB test, could be expected to produce about 456 QphH (1,506 divided by 3.3) on the 1-TB test. Sun didn't report a 300-GB test, but posts a QphH score of 4,735 for a 24-processor 6800 on the 1-TB test -- making that machine seem about 10 times faster than the Proliant. On a CPU basis alone this doesn't make sense because the Proliant offers 31 percent of the raw cycles of the Sun 6800. What made the difference was the 9.6-GB-per-second data exchange rate for the UltraSPARC CPUs in the 6800 versus the 2.4-GB-per-second for the Proliant.
There's a hidden danger to the airline in a big-machine solution. The Starfire choice invites downstream mismanagement because it makes it easier for the board to eventually choose a completely unsuitable CIO who will then destroy operational efficiency by doing all the things experience has taught him to do -- like hierarchical staffing, stove-pipe decision making, isolation of users from technical staff, and the imposition of rigid operational controls. These methods are appropriate to an MVS/XA environment but wholly counterproductive in a Unix one and will, if forcefully applied, first raise costs, then freeze adaptation to external change, and, eventually, kill the airline's ability to compete. That risk comes about because machines like the Starfire 15K are key pieces in Sun's metamorphosis from Unix guerilla to data center gorilla.
To get the dollars available from mainframe managers to whom a $5 million machine looks cheap, but who demand that it replicate all of their favorite VM/370 facilities, Sun has added things that use resources pointlessly but enable these people to treat the machines as if they were cheaper, faster mainframes. As a result, the board will eventually be looking at resumes from people who are clueless about Unix, user relationships, or making money, but claim expertise with the Starfire along with 25 years or more of "progressively more senior experience" in airline data centers.
This is a much bigger issue than most people believe. Management methods are not independent of technology; the right organizational posture for an MVS-based operation is radically different from the right structure for a Unix-based solution. Resource-limited environments require careful management and control of user access to services. Unix-based infrastructures simply don't face those limits and so benefit from staffing strategies, leadership, and working relationships with the user community that are anathema in traditional data shops. Please see my Unix Guide to Defenestration for a detailed discussion of what this means and how it works.
The core Solver application and its relationship to both the revenue cycle and operational systems constitute the largest single component of the real-time airline operating system we're considering, but it isn't the only critical piece. SafetyJet will be sold in part on its merit as a safe airline. Part of the safety factor applies to passenger apprehension about hijacks and other malicious action affecting operations. The security systems are, therefore, both extremely sensitive and mission-critical.
The crew security plan functionally requires Solaris -- it's one reason each operating office will have a Sun V880 (or 280R) local computer. That's because Solaris lets us use Java cards with SunRays to make user identification both easier and more certain. That capability means that each local center will have locally booting SunRays that run their X environments on the local processor. That machine, in turn, will connect those X-servers to the client end running in the data centers. It is the need to make this process foolproof (and secure against man-in-the-middle attacks originating within SafetyJet) that makes the strongest argument for using the simplest possible processing architecture.
One of the things that this makes easy to implement, for example, is crew vouching and verification. The enabling SunRay feature here is that the user's card identifies a session history and is independent of terminal or location. Thus, a driver can pull his card from a SunRay in Omaha and resume exactly the same session when he plugs in again in Denver an hour later. Add biometric smarts to the card (in this case, a temperature-sensitive embedded fingerprint reader is envisaged) to uniquely identify its owner and the system becomes both hassle-free to its users and reasonably secure.
"Reasonably secure" is not, of course, quite good enough for people who are being given charge of 192 passengers and 80,000 pounds of jet fuel, so all team members with flight line responsibilities follow check-in procedures that require them to vouch for the identity of the other people checking in with them. The security application used for this is one of only two applications in the airline that are not automatically mirrored between the two data centers. In both cases, the front-end machines used to run the SunRays randomly switch sessions between the two data centers. As a result it is essentially impossible to capture a SafetyJet by stealth -- a well-designed ground assault can succeed, but the operations control center will be instantly alerted to the problem and the jet won't get off the ground even if the hijackers have trained people on board -- because the external wheel locks require that the crew chief get unlock codes from the operations center.
The other redundant application involves the passenger identification side of security and the protection is as much against official (and internal) misuse of data as it is against coordinated digital and physical attacks. Each workstation used for pre-departure passenger check-in has both document and portrait cameras. When travel papers are required (e.g., identification for transborder flights) the departure clerk requests the document to copy key information from it into an on-screen form and, to facilitate that, puts it face up in a pre-determined position. As soon as the first field is entered on screen, the computer grabs images of both the document and the passenger for transmission to the customs service in the destination country. Both images are also stored in our database and matched to passenger information, with database access cross-reported to audit services in the two data centers as a control on misuse of the information.