The Practical Nomad: Edward Hasbrouck's blog

Tuesday, 13 July 2010

Google buys ITA Software (Part 2: What does ITA Software do?)

Yesterday in Part 1 I told some of the history of ITA Software, the air travel pricing and reservation software company bought this month by Google for US$700 million.

Today I’ll discuss ITA Software’s strengths, weaknesses, and strategic approach to the airfare puzzle (Part 2, below). Tomorrow I’ll finish up by describing how ITA Software’s acquisition by Google might affect travellers (Part 3).

You may wonder why I go into such detail about what ITA Software does, or why Google is buying them for US$700 million. But as I said yesterday, “Google’s purchase of ITA Software is likely to be a bad thing for travellers”, and the technical background below is necessary to understand why:

Airline ticket prices are determined by the “fares” in a published “tariff”. A fare is not a price tag on a seat, but a price associated with a set of rules. The rules of a fare always include rules about what route(s) on what airline(s), reserved in what “booking classes”, qualify for the fare, and usually include a variety of other rules. Any reservation or ticket that satisfies that set of rules is eligible for that price.

Rather than allocating seats by price, airlines allocate the “availability” of reservation confirmations by “booking class”, each designated by a letter. (Strictly speaking, what is allocated is confirmations rather than seats: airlines overbook, so there isn’t a 1-to-1 correspondence between how many reservations an airline is willing to confirm on a flight and how many seats there are on the plane.) So as of a particular moment, for a particular flight on a particular date several weeks in the future between New York JFK and Chicago O’Hare, an airline may be willing to confirm up to 9 seats in “Y” class, up to 3 seats in “Q” class, and none at all in “Z” class. There isn’t a 1-to-1 correspondence between booking class and price, either: a Q seat JFK-ORD may be ticketed as part of a through one-way fare JFK-ORD-PDG, as part of a JFK-ORD-JFK round-trip, as part of the return leg of a DSM-ORD-JFK-ORD-DSM trip, or as part of millions (or orders of magnitude more) of other possible journeys at different prices specified by those different fares.

There is no database of availability, either. Airlines determine the availability of confirmations in particular booking classes on particular flights in real time, in response to queries transmitted through computerized reservations systems (CRS’s) from reservation offices and call centers and travel agents.

So the price for a specific ticket is a function of both the fares (prices and associated sets of routing and other rules) in currently published tariffs and the real-time willingness of the airline(s) to confirm reservations in specific booking classes on specific flights.
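As a rough illustration of that last point, here is a minimal Python sketch of price as a function of both fares and availability. Everything in it is an assumption made for illustration: the `Fare` class, the function names, and the prices are hypothetical, and a real fare carries many more rules than a route and a booking class. Only the Y/Q/Z seat counts are taken from the JFK-ORD example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fare:
    """A price plus a (drastically simplified) set of rules."""
    price_usd: int        # hypothetical price, not a real fare
    route: tuple          # airports in order, e.g. ("JFK", "ORD")
    booking_class: str    # class every leg must be confirmed in


def legs(route):
    """Split a route into consecutive (origin, destination) legs."""
    return list(zip(route, route[1:]))


def qualifies(fare, availability, seats=1):
    """True if every leg can be confirmed in the fare's booking class."""
    return all(
        availability.get(leg, {}).get(fare.booking_class, 0) >= seats
        for leg in legs(fare.route)
    )


def lowest_price(fares, availability, seats=1):
    """Lowest price among fares whose rules and availability are both met."""
    eligible = [f for f in fares if qualifies(f, availability, seats)]
    return min((f.price_usd for f in eligible), default=None)


# Availability as of one moment: 9 in Y, 3 in Q, none in Z.
AVAILABILITY = {("JFK", "ORD"): {"Y": 9, "Q": 3, "Z": 0}}

# Hypothetical tariff: three fares on the same route, one per class.
FARES = [
    Fare(450, ("JFK", "ORD"), "Y"),
    Fare(180, ("JFK", "ORD"), "Q"),
    Fare(120, ("JFK", "ORD"), "Z"),
]

one_seat = lowest_price(FARES, AVAILABILITY, seats=1)
four_seats = lowest_price(FARES, AVAILABILITY, seats=4)
ten_seats = lowest_price(FARES, AVAILABILITY, seats=10)
```

With these made-up numbers, one traveller is priced at the Q fare (180), a party of four falls through to the Y fare (450) because only 3 Q confirmations are available, and a party of ten gets no price at all. The tariff did not change between those three calls; only the availability constraint did.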

To a skilled human travel agent, this looks like a heuristic sequential query problem, not a database search problem. If such an agent is being paid enough to make their best effort, they look first at the tariff of published fares (typically accessed through a CRS using a complex query language with, at least from the command line, many categories of modifiers and qualifiers). They pick the lowest of the fares in the tariff that will be applicable and for which they think (based on knowledge, experience, and practiced intuition) that there will be availability for flights (airlines, dates, times, route, etc.) acceptable to the traveller, and then search for availability on those flights in the booking class(es) required by that fare. If they can’t confirm reservations that qualify for that fare on an acceptable schedule, they go back to the fares (adjusting their expectations based on what they have found), and search for availability for the next higher potentially acceptable fare for which they hope to find qualifying seats available.
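The procedure described above can be sketched as a loop that walks up the tariff and makes one live availability query at a time. This is only an illustrative Python sketch under stated assumptions: `cheapest_confirmable`, `fake_query`, the flight number `XX100`, and the prices are all hypothetical, and the `looks_plausible` callback is a crude stand-in for the agent's knowledge, experience, and intuition.

```python
def cheapest_confirmable(fares, query_live, seats=1,
                         looks_plausible=lambda fare: True):
    """Try fares from lowest to highest, querying availability one at a time.

    Returns (fare, number_of_live_queries_made), or (None, n) if nothing
    could be confirmed.
    """
    queries_made = 0
    for fare in sorted(fares, key=lambda f: f["price_usd"]):
        # The agent skips fares they judge unlikely to have seats.
        if not looks_plausible(fare):
            continue
        queries_made += 1
        if query_live(fare["flight"], fare["booking_class"]) >= seats:
            return fare, queries_made
    return None, queries_made


# Stand-in for a real-time availability response (hypothetical data).
LIVE = {("XX100", "Y"): 9, ("XX100", "Q"): 3, ("XX100", "Z"): 0}


def fake_query(flight, booking_class):
    """Pretend to ask the airline how many confirmations it will allow."""
    return LIVE.get((flight, booking_class), 0)


FARES = [
    {"price_usd": 450, "flight": "XX100", "booking_class": "Y"},
    {"price_usd": 120, "flight": "XX100", "booking_class": "Z"},
    {"price_usd": 180, "flight": "XX100", "booking_class": "Q"},
]

# A novice tries every fare in price order.
naive_fare, naive_queries = cheapest_confirmable(FARES, fake_query)

# An experienced agent "knows" Z is sold out and never asks about it.
expert_fare, expert_queries = cheapest_confirmable(
    FARES, fake_query,
    looks_plausible=lambda fare: fare["booking_class"] != "Z",
)
```

Both searches end at the same Q fare, but the one guided by a good guess gets there with one live query instead of two. On a real tariff with many fares and many candidate flights, that difference is the whole value of the heuristic.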

(In practice, travel agents less and less often actually go through this process, for a variety of reasons including the elimination of commissions paid by airlines to travel agents, the reluctance of travellers to pay travel agents fees commensurate with the required skills, the degradation of the tools and training made available to travel agents by the CRS’s, and the replacement of command-line travel agent CRS interfaces with easy-to-learn but functionally crippled GUI’s. But that’s another story.)

Central to the break-up of ITA Software’s founding partnership, as I discussed yesterday in Part 1, was the decision to abandon any effort to replicate this methodology, and instead to seek a “brute force” solution to airline ticket pricing.

For what it’s worth, this didn’t have to be a binary choice. Just as some chess-playing programs combine heuristic and brute-force components, or work in partnership with human chess players, the other major recent independent developer of airline ticket pricing software and systems, Airtreks.com — where I used to work and with whom I am still affiliated — uses an intermediate “travel consultant cyborg” approach in which some functions are performed by human experts and some by robots, in a complex symbiosis. Airtreks.com has invested almost as much effort in developing proprietary software tools to enhance and extend the abilities of its human experts as in its purely robotic first-order price-estimation software.

But supposing that you want to take an entirely brute force rather than heuristic approach to airline ticket pricing, how do you go about it in the absence of a database of price tags for seats? ITA Software’s “solution” was to use a series of availability queries to create a cached database of pseudo-price tags for pseudo-seats. Once that was done, the problem remained difficult mainly because of its scale and the number of permutations to be considered (again, as with “look-ahead” brute-force chess analysis), but amenable to ITA Software’s signature “cleverness” in algorithms and software implementation.

So the essence of ITA Software’s system (all the elements of which are visible in their patents and patent applications) is:

A ‘bot that queries airlines for availability, flight by flight, mainly through CRS’s although in some cases through direct connections to airlines’ in-house reservation systems, to compile a cache of availability information. This process has been described to me by ITA Software CEO Jeremy Wertheimer, and in ITA Software’s patents and pending patent applications. (I see Wertheimer each year at the PhoCusWright conference, and I’ve pressed him on how often a new query is made to update the cache for each flight. Wertheimer won’t say, but it appears to be measured in hours for most flights, probably less for some of the flights of greatest interest in the next few days or weeks, and perhaps as infrequently as daily or less for some flights in other parts of the world of little interest to ITA Software’s core customer base in the USA.)
A database and index of the cached responses to these availability queries.
A search module that responds to user queries with guesses about current availability made on the basis of that index and cache, without the need to query any external data sources unless and until the user tries to confirm reservations on specific flights on the basis of an option offered from the cache.

Through the clever kludge of the CRS crawler and availability cache, ITA Software transforms a real-time problem of third-party queries into a simpler search of a locally resident and already indexed database.
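The three elements listed above (crawler, cache, and search module) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not ITA Software's implementation: the class and method names, the flight number, and the seat counts are all hypothetical, and a dictionary stands in for the airline's real-time answers.

```python
import time


class AvailabilityCache:
    """Toy crawler + cache + search module (all names are hypothetical)."""

    def __init__(self, query_airline, clock=time.time):
        self._query = query_airline   # the only path to live data
        self._clock = clock
        self._cache = {}              # (flight, class) -> (seats, fetched_at)
        self.external_queries = 0     # each of these would cost a fee

    def _fetch(self, key):
        self.external_queries += 1
        return self._query(*key)

    def crawl(self, keys):
        """The 'bot: query flight by flight and store the responses."""
        for key in keys:
            self._cache[key] = (self._fetch(key), self._clock())

    def search(self, flight, booking_class, seats=1):
        """Guess from the cache alone; makes no external query."""
        entry = self._cache.get((flight, booking_class))
        return entry is not None and entry[0] >= seats

    def confirm(self, flight, booking_class, seats=1):
        """Only at booking time is the airline actually asked."""
        key = (flight, booking_class)
        live = self._fetch(key)
        self._cache[key] = (live, self._clock())
        return live >= seats


# Stand-in for the airline's real-time answers (hypothetical data).
live = {("XX100", "Q"): 3, ("XX100", "Y"): 9}
cache = AvailabilityCache(lambda flight, cls: live.get((flight, cls), 0))

cache.crawl(list(live))                 # two external queries
live[("XX100", "Q")] = 0                # Q sells out after the crawl

offered = cache.search("XX100", "Q")    # stale guess from the cache
booked = cache.confirm("XX100", "Q")    # live query at booking time
after = cache.search("XX100", "Q")      # cache has now been corrected
```

In this run the search module still offers the Q seat after it has sold out, and the traveller finds out only when the confirmation attempt fails. Any number of searches cost no external queries; only the crawl and the confirmation do.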

[This description is, of necessity, somewhat simplified, but I fear that greater detail would render it incomprehensible to anyone outside the industry.]

What’s perhaps most obvious about this methodology is how closely analogous it is to Google’s approach to the problem of “searching” constantly-changing Web sites not stored on Google’s servers. Rather than try to query potentially responsive Web pages in real time in response to user search requests, Google conducts a periodic “crawl” of HTTP queries of third-party pages, constructs a “cached” database of responses, indexes that cache database, and searches the index — not the cache and certainly not the Web itself — in response to each user query. Only when you click through the search results to the Web site do you see the current page content or find out if it is still the same. It’s clear why the approach adopted by ITA Software would seem particularly logical and appropriate to Google’s engineers. ITA Software’s key problem is also like Google’s: How do you index dynamically-generated or personalized Web pages, or real-time dynamic responses to availability queries?

Eliminating real-time availability queries except for customers who have already agreed to a price estimate for specific flights, and are ready to make reservations, saves money for ITA Software and its customers, who are airlines and travel agents (travellers aren’t its customers). Travel agents, including ITA Software’s online travel agency customers, are charged a fraction of a cent for each query or command they execute from the command line. Human travel agents can’t execute commands fast enough for the charges to justify fundamental changes in their procedures, but they can be prohibitive for an online travel agency with a high “look to book” ratio making rapid-fire robotic queries on behalf of comparison shoppers only a small percentage of whom complete purchases.
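The arithmetic behind that contrast is easy to sketch. Every number below is an assumption chosen only to show the shape of the calculation: the text says only that the fee is “a fraction of a cent”, so the half-cent fee, the queries per search, and both look-to-book ratios are hypothetical.

```python
from decimal import Decimal


def query_cost_per_booking(fee_per_query, queries_per_search,
                           searches_per_booking):
    """CRS query fees incurred for each ticket actually sold."""
    return fee_per_query * queries_per_search * searches_per_booking


FEE = Decimal("0.005")   # assumed: half a cent per query

# Assumed: a human agent runs about 30 commands, and the customer books.
human_agent = query_cost_per_booking(FEE, 30, 1)

# Assumed: a robot fires 20 live queries per search, and only 1 search
# in 500 ends in a purchase (a 500:1 look-to-book ratio).
online_agency = query_cost_per_booking(FEE, 20, 500)
```

Under these assumed figures the human agent's fees come to 15 cents per ticket, while the online agency's come to 50 dollars per ticket, which would exceed what most agencies could hope to earn on the sale. The real figures may differ widely; the point is that the cost scales with the look-to-book ratio, not with the number of tickets sold.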

ITA Software doesn’t (yet) host any of the airlines’ reservation databases or operate their availability-decision systems. [Update: Several readers have pointed out that this may no longer be entirely true, depending on the manner in which ITA Software’s “Dynamic Availability Calculating System” (DACS) has been deployed and is being used as a replacement for, rather than merely an emulator or mirror of, airlines’ “legacy” availability management systems.] Reservation database hosting and availability management is either outsourced to CRS’s (by most airlines) or handled in-house. Like other CRS users, ITA Software pays per-query fees to compile or update its cached pseudo-availability database. That has several consequences:

There are enormous economies of scale and barriers to entry for a would-be competitor using the same methodology, since the same number and cost of queries is required to build the pseudo-availability cache regardless of how many people are using the system. It’s unclear if the whole concept would have been commercially viable without a launch customer for ITA Software with the sales volume of Orbitz.com.
ITA Software has a substantial financial incentive to query availability as infrequently as it thinks it can get away with, exploring the limits of consumers’ willingness to put up with seemingly “bait and switch” results when what appears to be an offer to sell a ticket turns out to be only an estimate based on an outdated availability cache or incorrect availability projections from responses to past queries. (As an aside, one of the things the USA Department of Transportation has yet to address in its failure to enforce truth-in-advertising law in the sale of airline tickets is the misleading labeling of price and availability estimates as though they were firm offers to sell at a specific price.)
ITA Software has an even greater financial incentive to eliminate these CRS and airline query fees entirely by developing an airline reservations hosting and availability decision-making (“revenue management”) capability of its own, and wooing airlines away from existing CRS’s or airlines’ in-house systems. Currently, ITA Software uses a bombardment of individual queries to try to assemble an inevitably-imperfect copy, constantly being rendered out-of-date, of each airline’s willingness to confirm reservations on each flight in each possible booking class. If ITA Software were hosting or operating that system itself for a particular airline, that entire process would be unnecessary.
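The first two consequences in the list above come down to simple arithmetic, sketched here in Python. All of the figures are hypothetical (the number of cached flights, the fee, and the search volumes are assumptions for illustration only); what matters is that the cost of keeping the cache fresh does not depend on how many people search it.

```python
from decimal import Decimal


def daily_crawl_cost(flights_cached, refreshes_per_day, fee_per_query):
    """Fees to keep the cache fresh; independent of how many people search."""
    return flights_cached * refreshes_per_day * fee_per_query


def crawl_cost_per_search(daily_cost, searches_per_day):
    """The same fixed cost, spread over however many searches are served."""
    return daily_cost / searches_per_day


FEE = Decimal("0.005")   # assumed: half a cent per query
FLIGHTS = 1_000_000      # assumed number of flight-dates kept in the cache

every_4_hours = daily_crawl_cost(FLIGHTS, 6, FEE)
once_a_day = daily_crawl_cost(FLIGHTS, 1, FEE)

# The same cache costs a small entrant far more per search than a big one.
small_site = crawl_cost_per_search(every_4_hours, 10_000)
large_site = crawl_cost_per_search(every_4_hours, 10_000_000)
```

With these assumed inputs, refreshing every four hours costs 30,000 dollars a day against 5,000 for a daily refresh, which is the incentive to query as rarely as possible. That same 30,000 dollars works out to 3 dollars per search for a site serving ten thousand searches a day, but three-tenths of a cent for one serving ten million, which is the barrier to entry.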

This last point is perhaps the most significant: ITA Software’s technical approach to the airline ticket pricing problem has created a particular compulsion — independent of any interest in the CRS, airline hosting, or revenue management problems (not that they aren’t all interesting and hard) or belief that they could build a better CRS or hosting platform — for vertical integration with airline hosting and revenue/availability management.

While ITA Software might see its availability caching and prediction systems as “clever”, a critic might see them as a kludge to adapt database search techniques to an information ecosystem in which ITA Software has only indirect query-based access to (constantly changing) third-party databases. But both ITA Software and its critics would likely agree that the ultimate solution for ITA Software lies in vertical integration with hosting/CRS functionality, to give ITA Software direct (non query-based) access to real-time availability for hosted airlines. The analogy for a search provider like Google would be Web sites hosted by Google, which Google doesn’t need to query and “crawl” because it already has them on its servers.

When ITA Software got US$100 million in 2006 in its last round of venture capital investment before the sale to Google, its main use of the money was to try to develop its own airline hosting system to eliminate the need for pseudo-availability caching. ITA Software claims it now has an airline reservation hosting system ready to launch. But Air Canada, who was to be the launch customer, backed out, and ITA Software hasn’t yet found any other airline willing to risk its operations and revenue stream to beta test a new provider for the most critical component of its IT infrastructure.

Now Google has stepped in, US$700 million in cash in hand, to buy ITA Software. What will happen next? And what will this deal mean for travellers? Stay tuned for Part 3 and my conclusions tomorrow.

[Update: There’s more background on how airline ticket pricing works in my books, and more comments on Slashdot.]

Link | Posted by Edward on Tuesday, 13 July 2010, 15:15 ( 3:15 PM)

Comments

You do not understand how ITA gathers data.

Posted by: Mario Z, 18 July 2010, 10:12 (10:12 AM)

Mario Z: What is it that you feel I don't understand?

The description I gave was based on my interviews with ITA Software principals, including CEO Jeremy Wertheimer, as well as the process descriptions in ITA Software's patents and patent applications. But if you feel I've gotten something wrong, I'd be very interested to know what that is, how you think they do gather data, and your basis for your belief that they do things differently than I have described.

Posted by: Edward Hasbrouck, 18 July 2010, 10:53 (10:53 AM)

Edward, this is a fascinating read. Thanks for the articles, and for providing an overview of ITA. Kudos to you, sir!

Warm regards;

Posted by: Timothy Taylor, 20 July 2010, 11:31 (11:31 AM)

I worked in the travel industry in IT, and I know what GDS/CRS do and the workings of "fares".

I once found a PDF with an interesting description of ITA's decision-making engine, which really does more than simply brute-force the fares data... from what I remember there is little brute force; instead, the data is arranged in a way that makes the search feasible...

Posted by: francesco, 21 July 2010, 02:09 ( 2:09 AM)

It is unlikely that Google will try to force its implementation of web crawling onto the ITA pseudo-availability cache database. What they are likely after is access to the airline query systems with which ITA already has agreements. The pseudo-availability cache idea is a sound business plan if you have a large enough potential paying user base, which Google has. Google could even write off the cost of constantly re-querying the airlines' databases against its revenue from ad sales on its search engine. Google's current business plans seem to involve providing services for free or almost free via its search engine portal, so that the user is also confronted with ads, from which Google gets its main revenue. Google will likely also be interested in the commission that it can get for booking the flight, but that is likely to be a different business unit.

Posted by: Chris M, 21 July 2010, 04:00 ( 4:00 AM)

Edward,

Thanks for posting this series of articles about ITA; it's certainly the most comprehensive analysis that I've seen published anywhere so far.

Regarding ITA and availability, I would find it very surprising if they are not making heavy use of the NAVS protocol. (In a nutshell, this replaces the periodic robotic polling of availability data with a real-time "push" notification whenever the availability data for a particular booking class changes.) From what I can gather, this is how the majority of airlines are connected to the GDS -- thus the GDS always knows the state of the airline inventory.

If the airlines already provide NAVS feeds to the GDS, it would seem that there is nothing stopping ITA from also getting access to these feeds. There's no need for ITA to host the availability themselves in this case.

You would seem to have a huge wealth of information about ITA, and I'd be very interested to hear your thoughts. However, since it appears that you haven't been directly involved in the industry for a few years now, is it possible that things have changed at ITA in the intervening time?

Regards,
-Andrew

Posted by: Andrew Tipton, 21 July 2010, 07:51 ( 7:51 AM)

Some responses to comments:

@francesco: I linked to the presentation you mention by Carl de Marcken:

http://carl.demarcken.org/papers/ITA-software-travel-complexity/ITA-software-travel-complexity.html

It's more interesting as an indication of how ITA Software conceptualizes the problem than as a treatise in how airfares work, although it does quite well at the latter considering that de Marcken is a programmer, not an air tariff expert. I agree with his note that there is no good publicly available introductory treatise on this subject -- something he no doubt wished he had while developing ITA Software's early systems, especially after the departure from the company of Richard Aiken.

The descriptions in my articles are, of necessity, simplified to keep the series of articles to manageable length and to render them at least minimally comprehensible to lay readers. What ITA Software actually queries and caches is availability. Converting availability to prices for specific sets of flights still requires quite sophisticated routing and other rule parsing systems. But there is still a fundamental difference between real-time responses to availability queries and a local pseudo-availability cache.

@Andrew Tipton: Yes, ITA Software may have (and I would speculate, as you do, that they probably have) used NAVS data feeds to reduce (not eliminate) their "CRS-crawling" as a source of data for their pseudo-availability cache. They still need a local cache in order to enable their "searches" to run in real time. And NAVS data is not, and cannot be, a complete substitute for availability queries, since NAVS data isn't fully real-time and can itself indicate that availability on certain low-availability or volatile-availability flights is provided only on request (i.e., in response to a real-time query), just as CRS's have from their earliest days had an option to show availability on a "request" basis (typically by "R" rather than a number in the available-seat count for a booking class).

See, for example, this description of Travelport's NAVS and YAVS:

http://www.travelport.com/us/customer_solutions/products.aspx?productid=%7B1C7C1462-D487-42E2-8BF9-0DFA6355E035%7D

"Yield Availability Status (YAVS) ... information from the airline instructs the Galileo system as to when and where to retrieve real-time availability status from the airline's host system."

Posted by: Edward Hasbrouck, 21 July 2010, 08:33 ( 8:33 AM)