TravelSky Technology (China Airline): Predicting Airfare Prices with an SVM
We were faced with a large dataset with no explicit links between records, which made it very challenging to analyze the price changes of an individual round trip.
It was much more practical to develop a model that generalizes the properties of all records in the dataset, and to train an SVM as a binary pricing classifier that distinguishes between "expensive" and "cheap" tickets (transaction records).
Part I. General Introduction
TravelSky Technology is one of the largest global distribution systems in the travel/tourism industry: it sells tickets for all airlines (and hotels) and processes millions of billable transactions per month.
Project Goals
1. Construct and train a general classifier that can distinguish between expensive and cheap tickets.
2. Use this classifier to predict the prices of future tickets.
3. Determine which factors have the greatest impact on price by analyzing the trained classifier.
Exploratory data analysis
Extent of the dataset: on the order of a billion records, 132.2 GiB (uncompressed); hundreds of departure airports, hundreds of destinations, hundreds of routes, hundreds of airlines.
Each record has many fields:
"Buy" date: when was this price current?
"Fly" date: when does the flight leave?
Price.
Cabin class: Economy/Business/First (98% economy tickets).
Booking class: A-Z …
Airline: the airline selling the ticket.
Some of the data looks like a time series, with tickets linked over time.
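As a rough sketch, one transaction record could be modeled as follows; the class and field names are illustrative assumptions, not the actual schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class TicketRecord:
    """One transaction record (field names are assumptions, not the real schema)."""
    buy_date: date       # when this price was current
    fly_date: date       # when the flight leaves
    price: float         # quoted fare
    cabin_class: str     # "Economy" / "Business" / "First"
    booking_class: str   # a single letter, A-Z
    airline: str         # the airline selling the ticket
    route: str           # e.g. an origin-destination pair such as "PEK-SHA"

# Example record (values invented for illustration)
r = TicketRecord(date(2013, 3, 1), date(2013, 4, 15), 412.0,
                 "Economy", "K", "CA", "PEK-SHA")
```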
Classification & Prediction methods
Implemented two different classifiers: a support vector machine (SVM) and L1-regularized logistic regression. Both lead to convex minimization problems that can be solved online by employing the stochastic gradient descent (SGD) method.
SVM: a binary linear classifier. Goal: find the maximum-margin hyperplane that divides the points with label y = +1 from those with label y = -1.
Training: generate the training label yi for the i-th data point xi, then choose the hyperplane parameters so that the margin is maximal and the training data is still correctly classified.
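A minimal sketch of the classifier and of one training step on the regularized hinge loss; the learning rate and regularization constant are assumptions for illustration:

```python
import numpy as np

def predict(w, x):
    """Binary linear classifier: the sign of the dot product."""
    return 1 if np.dot(w, x) >= 0 else -1

def hinge_sgd_step(w, x, y, eta=0.01, lam=1e-4):
    """One SGD step on the regularized hinge loss
    L(w) = lam/2 * ||w||^2 + max(0, 1 - y * <w, x>)."""
    if y * np.dot(w, x) < 1:      # inside the margin or misclassified
        grad = lam * w - y * x
    else:                         # correctly classified with margin
        grad = lam * w
    return w - eta * grad
```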
Preprocessing
For each route r, calculate the arithmetic mean (and standard deviation) of the price over all tickets.
Assign labels: label +1: "above mean price for this route"; label -1: "below mean price for this route". Only store the mean/std-dev, do not actually store the labels.
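A sketch of this step, assuming the tickets are available as a pandas DataFrame df with "route" and "price" columns (both names are assumptions):

```python
import pandas as pd

# One pass over all tickets: per-route mean and standard deviation.
# Only these statistics are stored, not the labels themselves.
route_stats = df.groupby("route")["price"].agg(["mean", "std"])

def label_for(record):
    """Derive the label on the fly: +1 ("expensive") if the price lies
    above the mean price for the record's route, -1 ("cheap") otherwise."""
    mean = route_stats.loc[record["route"], "mean"]
    return 1 if record["price"] > mean else -1
```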
Feature Selection
Extract features from the plaintext records (x). Each plaintext record is transformed into a 990-dimensional vector. Each dimension contains a numerical value corresponding to a feature such as: the number of days between the "Buy" and "Fly" dates, the day of week (for all dates), whether the day falls on a weekend (for all dates), and date flags such as isMonday, isWeekend, isWinter, weekOfYear, …
Each dimension is normalized to zero mean and unit variance (per route r).
Part II. More Detailed Descriptions of Our Model
Classification methods
In order to identify which records represent cheap tickets and which records have traits identifying them as expensive tickets, a classifier able to distinguish between "expensive" and "cheap" records is necessary.
It should be possible to train such a classifier on all records at once, identifying the features that make a record cheaper or more expensive than other records. As some routes are more expensive than others, it does not make sense to include the route as a feature; instead, prices are normalized per route. This enables a comparison of prices across all routes without simply marking all records of a particular route as expensive. Each record is then labeled according to its normalized price.
In short, a record for a particular route is labeled as "expensive" (+1) if its price is higher than the average price of all records for that specific route. Otherwise it is labeled as "cheap" (-1).
After training, the classifier should be able to predict a label for a new record with an unknown price and assign this label to it. As the route of the new record is known, a numerical minimal or maximal price (the aforementioned average price per route) can be directly inferred from the predicted label.
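A sketch of this inference step, reusing the predict function and route_stats table from the earlier sketches; the features helper, which maps a record to its feature vector, is hypothetical at this point:

```python
def price_bound(record, w):
    """Predict the label for a new record and turn it into a price bound:
    the per-route mean acts as a minimal (+1) or maximal (-1) price."""
    y_hat = predict(w, features(record))            # +1 or -1
    mean = route_stats.loc[record["route"], "mean"]
    if y_hat == 1:
        return "expensive", mean   # mean is a lower bound on the price
    return "cheap", mean           # mean is an upper bound on the price
```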
Additionally, the model parameters of the trained classifier should contain information on how much each feature contributes to a record being cheap or expensive.
Online algorithms for classification
Due to the large amount of data, algorithms using more than a constant amount of memory are not suitable. Two algorithms were implemented, one for online support vector machines and the other for online L1-regularized logistic regression. This allows efficient training of the classifier on a parallel system with limited memory.
Some definitions and terminology as they are used in the following sections:
xi: Each data point xi represents the features of a single record Ri and is also called the feature vector for record Ri. Each component contains information about a single aspect of the record Ri; its contents, described previously, are derived from the fields of Ri.
yi: Each label yi represents the label (classification) of a single record Ri and has only two possible values: "expensive" (+1) and "cheap" (-1). A record Ri always consists of a pair (xi, yi); both values are known for the training dataset. For new data points xi, the value of the label yi is initially unknown and is the result of the classification/prediction.
w: The weight vector w is the model parameter of the classifier to be estimated and is initially unknown. In both classifiers discussed below, w has the same number of dimensions as a data point xi and determines the effect each value in xi has on the classification result.
Feature vector generation
For each record, a feature vector consisting of 990 features was created. Before normalization, each entry was set to either 1.0 (boolean true), 0.0 (boolean false), or a value associated with a numerical field in the record. The feature vector thus represents each record as a 990-dimensional vector.
Some examples of features, and the record fields utilized:
Dates: Request Date, Departure Date, Return Date
Date differences: Return - Departure Date, Departure - Request Date
Categorical values: Passenger Type, Airline
Numerical values: Number of passengers, Number of hops
Sequences of categorical values: Cabin Classes, Booking Classes, Availabilities
Sequences of numerical values: Flight numbers
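A sketch of how a few of these 990 features could be computed from a record; it builds on the hypothetical TicketRecord fields from the earlier sketch, could serve as the features helper assumed above, and is heavily truncated:

```python
import numpy as np

def features(r):
    """Map a record to a (truncated) feature vector: booleans become
    1.0/0.0, numerical fields are used as-is (990 dimensions in total)."""
    return np.array([
        (r.fly_date - r.buy_date).days,              # days between "Buy" and "Fly"
        float(r.buy_date.weekday()),                 # day of week of the buy date
        1.0 if r.buy_date.weekday() >= 5 else 0.0,   # isWeekend (buy date)
        1.0 if r.fly_date.weekday() == 0 else 0.0,   # isMonday (fly date)
        float(r.fly_date.isocalendar()[1]),          # weekOfYear (fly date)
        # ... categorical fields (airline, booking class) one-hot encoded,
        # ... sequences (flight numbers, availabilities) expanded per hop
    ])
```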
Feature vector normalization
As with price normalization, each of the 990 features fm was normalized in two steps. In a first MapReduce job, the arithmetic means µfm and standard deviations σfm were calculated using the same methods as for price normalization and subsequently stored to disk.
All following MapReduce jobs loaded the 990 means and standard deviations from disk and calculated the normalized feature vector x'i on the fly by taking the standard score of each feature fm:

f'm = (fm − µfm) / σfm,  m ∈ {1, . . . , 990}
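A sketch of the two-step normalization; here X stands for the full feature matrix, although in practice the statistics would be accumulated in a streaming MapReduce pass rather than held in memory:

```python
import numpy as np

# First job: compute and store the per-feature mean and standard deviation.
mu = X.mean(axis=0)       # shape (990,)
sigma = X.std(axis=0)     # shape (990,)
np.savez("feature_stats.npz", mu=mu, sigma=sigma)

# All following jobs: load the statistics and standardize on the fly.
stats = np.load("feature_stats.npz")

def normalize(x):
    """Standard score per feature: f'_m = (f_m - mu_m) / sigma_m."""
    return (x - stats["mu"]) / stats["sigma"]
```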
Stochastic gradient descent (SGD)
Given a convex set S and a convex function f, we can estimate the parameter w in min_{w ∈ S} f(w). Usually, f is of the form

f(w) = ∑_{t=1}^{n} ft(w)

where each summand ft represents the loss function for a single observed data point from the dataset. Finding w is done iteratively, by using one random sample data point from the dataset per iteration. For regularization, w ∈ S needs to be ensured, thus a projection onto S is necessary.
Let w0 ∈ S be the starting value. Then each iteration t consists of the update step

wt+1 = ProjS(wt − ηt ∇ft(wt))

where ProjS is a projection onto the set S, ηt is the current step size (learning rate), and ∇ft is the gradient of f approximated at the sample data point for iteration t.
It is possible to use only a subsample of the full dataset if the data points used for training are picked at random from the dataset. Training can then be halted either after a fixed number of iterations or as soon as sufficient accuracy is achieved.
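Putting the pieces together, a minimal online training loop under these definitions; the sampling function, the step-size schedule, and the projection radius are assumptions:

```python
import numpy as np

def train_sgd(sample_record, n_iters=100_000, radius=10.0):
    """Online SVM training with projected SGD. sample_record() returns one
    (x, y) pair picked at random from the (possibly subsampled) dataset."""
    w = np.zeros(990)                      # w0, inside S
    for t in range(1, n_iters + 1):
        x, y = sample_record()
        eta = 1.0 / np.sqrt(t)             # decreasing step size (assumption)
        if y * np.dot(w, x) < 1:           # hinge-loss subgradient step
            w = w + eta * y * x
        norm = np.linalg.norm(w)
        if norm > radius:                  # ProjS: project onto the ball ||w|| <= radius
            w *= radius / norm
    return w
```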