Saturday, October 22, 2016

TravelSky Technology(China Airline): Predicting Airfare Price by SVM

TravelSky Technology(China Airline):  Predicting Airfare Price by SVM


We were faced with the large dataset with no explicit links between records, making it a very challenging task to analyze price changes of an individual round-trip.
It was much more practical to develop a model that generalizes the properties of all records in the dataset, and to train a SVM as a binary pricing classifier to distinguish between ”expensive” and ”cheap” of all tickets (transaction records) processed. 

Part I. General Introduction
Travelsky technologies is one of the largest global distribution system in the travel/tourism industry: it sells tickets for all airlines (also hotels) and processes millions billable transactions per month.

Project Goals

1 Construct and train a general classifier so that it can distinguish between expensive and cheap tickets.

2. Use this classifier to predict the prices of future tickets.

3. Determine which factors have the greatest impact on price by analyzing the trained classifier.

Exploratory data analysis
Extent of the dataset:  billion records, 132.2 GiB (uncompressed) , hundreds  departure airports, hundreds destinations, hundreds routes, hundreds airlines.

Lots of fields:  “Buy” date: When was this price current? “Fly” date: When does the flight leave? Price.  Cabin class Economy/Business/First (98% economy tickets) . Booking class A-Z  … Airline The airline selling the ticket. some data looks like a time series, tickets are linked over time

Classification & Prediction methods

Implemented two different classifiers:  Support vector machine (SVM),  L1- regularized linear regression.  Both are convex minimization problems that can be solved online by employing the stochastic gradient descent (SGD) method.

 SVM: binary linear classifier. Goal: Find maximum-margin hyperplane that divides the points with label y “+1” from those with label y “-1”. 
Training: Generate training label yi for i-th data point xi.  Choose hyperplane parameters so the margin is maximal and the training data is still correctly classified.

Preprocessing
For each route r, calculate the arithmetic mean (and standard deviation) of the price over all tickets. 
Assign labels: Label +: “Above mean price for this route”.  Label -: “Below mean price for this route” .Only store mean/std-dev, do not actually store labels.

Feature Selection
Extract features from plaintext records (x).
Each plaintext record is transformed into a 990-dimensional vector.
Each dimension contains a numerical value corresponding to a feature such as:  Number of days between “Buy” and “Fly” dates, Week of day (for all dates) , Is the day on a weekend (for all dates), Dates isMonday, isWeekend, isWinter, weekOfYear, . . . .
Each dimension is normalized to zero mean and unit variance (per route r).

Part II. More Detailed Descriptions of Our Model

Classification methods


In order to identify which records represent cheap tickets and which records have traits identifying them as expensive tickets, a classifier able to distinguish between ”expensive” and ”cheap” records is necessary.

It should be possible to train such a classifier on all records at once, identifying the features making a record cheaper or more expensive than other records. As some routes are more expensive than others, it does not make sense to include the route as a feature, but rather normalize prices per route. This enables comparison of prices across all routes without simply marking all records of a particular route as expensive. Each record is then labeled according to the normalized price.
In short, a record for a particular route is labeled as ”expensive” (+1) if its price is higher than the average price of all records for that specific route. Otherwise it is labeled as ”cheap” (-1).
After training the classifier, it should be able to predict a label from  and assign this label to a new record with an unknown price. As the route of the new record is known, a numerical minimal or maximal price (the afore-mentioned average price per route) can be directly inferred from the predicted label.
Additionally, the model parameters of the trained classifier should contain information on how much each feature contributes to a record being cheap or expensive.

Online algorithms for classification

Due to the large amount of data, algorithms using more than a constant amount of memory are not suitable. Two algorithms were implemented, one for online support vector machines and the other for online-regularized logistic regression. This allows efficient training of the classifier on a parallel system with limited memory.

Some definitions and terminology as they are used in the following sections:

Xi
Each data point Xi  represents the features of a single record Ri and is also called the feature
vector for record Ri
Each component contains information about a single aspect of the record Ri . The contents are described previously are derived from the fields of Ri


Yi

Each label yi represents the label (classification) of a single record Ri, with two values: ”expensive” (+1), “cheap” (-1).

Record Ri always consists of a pair (Xi Yi) , both values are known for trading dataset.
For new data points Xi,  the value of the labels Yi is initially unknown,  and is the result of the classification/prediction. A label has only two possible values: −1 and 1.

W

The weight vector w is the model parameter of the classifier to be estimated and is initially

unknown. In both classifiers discussed below, w has the same number of dimensions as a data

point Xi and determines the effect each value in Xi has on the classification result.

Feature vector generation
For each record, a feature vector consisting of 990 features was created. Before normalization, each entry was set to either 1.0 (boolean true), 0 (boolean false) or a value associated with a numerical field in the record.
The feature vector represents each record as a 990-dimensional vector.
Some examples of features, and record fields utilized:
Dates Request Date, Departure Date, Return Date
Date differences Return-Departure Date, Departure-Request Date
Categorical values Passenger Type, Airline
Numerical values Number of passengers, Number of hops
Sequences of categorical values Cabin Classes, Booking Classes, Availabilities
Sequences of numerical values Flight numbers
Feature vector normalization
As with price normalization, each of the 990 features fm was normalized in two steps. In a first MapReduce job, the arithmetic means µfm and standard deviations σfm were calculated using the same methods as for price normalization and subsequently stored to disk.
All following MapReduce jobs loaded the 990 means and standard deviations from disk and calculated the normalized feature vector x 0 i on-the-fly by calculating the standard score of each feature fm: f 0 m = fm − µfm σfm , m 1, . . . , 990

Stochastic gradient descent (SGD)

Given a convex set S and a convex function f, we can estimate the parameter w in min

f(w) is of the form f(w) = Pn

a single observed data point from the dataset. Finding w is done iteratively, by using one random

sample data point from the dataset per iteration. For regularization, w S needs to be ensured,

thus a projection onto S is necessary.

Let w0 S be the starting value. Then each iteration t consists of the update step
t=1 ft(w). Usually, each summand ft represents the loss function for

wt+1 = P rojS(wt − ηtft(wt))

where P rojS is a projection onto the set S, ηt is the current step size (learning rate), and ft is the

gradient of f approximated at the sample data point for iteration t.

It is possible to only use a subsample of the full dataset if the data points used for training are picked


at random from the dataset. Training can then either be halted after a fixed number of iterations or as soon as sufficient accuracy is achieved.