Tuesday, January 3, 2017

David Colquhoun: The problem with p-values


Academic psychology and medical testing are both dogged by unreliability. The reason is clear: we got probability wrong

False positives; numerous microscopic cancerous and non-cancerous human tissue samples. Photo courtesy Wellcome Images
David Colquhoun
is a professor of pharmacology at University College London and a Fellow of the Royal Society. He is the author of Lectures on Biostatistics (1971) and blogs at DC’s Improbable Science.
2,400 words
Edited by Sally Davies
The aim of science is to establish facts, as accurately as possible. It is therefore crucially important to determine whether an observed phenomenon is real, or whether it’s the result of pure chance. If you declare that you’ve discovered something when in fact it’s just random, that’s called a false discovery or a false positive. And false positives are alarmingly common in some areas of medical science.
In 2005, the epidemiologist John Ioannidis at Stanford caused a storm when he wrote the paper ‘Why Most Published Research Findings Are False’, focusing on results in certain areas of biomedicine. He’s been vindicated by subsequent investigations. For example, a recent article found that repeating 100 different results in experimental psychology confirmed the original conclusions in only 38 per cent of cases. It’s probably at least as bad for brain-imaging studies and cognitive neuroscience. How can this happen?
The problem of how to distinguish a genuine observation from random chance is a very old one. It’s been debated for centuries by philosophers and, more fruitfully, by statisticians. It turns on the distinction between induction and deduction. Science is an exercise in inductive reasoning: we are making observations and trying to infer general rules from them. Induction can never be certain. In contrast, deductive reasoning is easier: you deduce what you would expect to observe if some general rule were true and then compare it with what you actually see. The problem is that, for a scientist, deductive arguments don’t directly answer the question that you want to ask.
What matters to a scientific observer is how often you’ll be wrong if you claim that an effect is real, rather than being merely random. That’s a question of induction, so it’s hard. In the early 20th century, it became the custom to avoid induction, by changing the question into one that used only deductive reasoning. In the 1920s, the statistician Ronald Fisher did this by advocating tests of statistical significance. These are wholly deductive and so sidestep the philosophical problems of induction.
Tests of statistical significance proceed by calculating the probability of making our observations (or the more extreme ones) if there were no real effect. This isn’t an assertion that there is no real effect, but rather a calculation of what would be expected if there were no real effect. The postulate that there is no real effect is called the null hypothesis, and the probability is called the p-value. Clearly the smaller the p-value, the less plausible the null hypothesis, so the more likely it is that there is, in fact, a real effect. All you have to do is to decide how small the p-value must be before you declare that you’ve made a discovery. But that turns out to be very difficult.
The problem is that the p-value gives the right answer to the wrong question. What we really want to know is not the probability of the observations given a hypothesis about the existence of a real effect, but rather the probability that there is a real effect – that the hypothesis is true – given the observations. And that is a problem of induction.
Confusion between these two quite different probabilities lies at the heart of why p-values are so often misinterpreted. It’s called the error of the transposed conditional. Even quite respectable sources will tell you that the p-value is the probability that your observations occurred by chance. And that is plain wrong.
Suppose, for example, that you give a pill to each of 10 people. You measure some response (such as their blood pressure). Each person will give a different response. And you give a different pill to 10 other people, and again get 10 different responses. How do you tell whether the two pills are really different?
The conventional procedure would be to follow Fisher and calculate the probability of making the observations (or the more extreme ones) if there were no true difference between the two pills. That’s the p-value, based on deductive reasoning. P-values of less than 5 per cent have come to be called ‘statistically significant’, a term that’s ubiquitous in the biomedical literature, and is now used to suggest that an effect is real, not just chance.
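Fisher's deductive recipe can be sketched with a small permutation test. This is a minimal illustration with made-up blood-pressure numbers for the two groups of 10 (not data from the article): we shuffle the 20 responses many times to build the distribution of differences expected if the pills were identical, then ask how often a difference at least as large as the observed one turns up.

```python
import random

# Hypothetical blood-pressure responses for two groups of 10 people
# (illustrative numbers only, not data from the article).
group_a = [128, 131, 125, 136, 129, 133, 127, 130, 134, 126]
group_b = [124, 129, 122, 131, 125, 128, 121, 127, 130, 123]

observed = abs(sum(group_a) / 10 - sum(group_b) / 10)

# Null distribution by shuffling: if the pills were identical, every
# assignment of the 20 responses to two groups of 10 is equally likely.
random.seed(42)
pooled = group_a + group_b
n_perm = 10000
extreme = 0
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = abs(sum(pooled[:10]) / 10 - sum(pooled[10:]) / 10)
    if diff >= observed:
        extreme += 1

p_value = extreme / n_perm  # chance of data this extreme under the null
print(round(p_value, 3))
```

Note that this number answers only the deductive question: how surprising the data would be if there were no difference between the pills.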
But the dichotomy between ‘significant’ and ‘not significant’ is absurd. There’s obviously very little difference between the implication of a p-value of 4.7 per cent and of 5.3 per cent, yet the former has come to be regarded as success and the latter as failure. And ‘success’ will get your work published, even in the most prestigious journals. That’s bad enough, but the real killer is that, if you observe a ‘just significant’ result, say P = 0.047 (4.7 per cent) in a single test, and claim to have made a discovery, the chance that you are wrong is at least 26 per cent, and could easily be more than 80 per cent. How can this be so?
For one, it’s of little use to say that your observations would be rare if there were no real difference between the pills (which is what the p-value tells you), unless you can say whether or not the observations would also be rare when there is a true difference between the pills. Which brings us back to induction.
The problem of induction was solved, in principle, by the Reverend Thomas Bayes in the middle of the 18th century. He showed how to convert the probability of the observations given a hypothesis (the deductive problem) to what we actually want, the probability that the hypothesis is true given some observations (the inductive problem). But how to use his famous theorem in practice has been the subject of heated debate ever since.
Take the proposition that the Earth goes round the Sun. It either does or it doesn’t, so it’s hard to see how we could pick a probability for this statement. Furthermore, the Bayesian conversion involves assigning a value to the probability that your hypothesis is right before any observations have been made (the ‘prior probability’). Bayes’s theorem allows that prior probability to be converted to what we want, the probability that the hypothesis is true given some relevant observations, which is known as the ‘posterior probability’.
These intangible probabilities persuaded Fisher that Bayes’s approach wasn’t feasible. Instead, he proposed the wholly deductive process of null hypothesis significance testing. The realisation that this method, as it is commonly used, gives alarmingly large numbers of false positive results has spurred several recent attempts to bridge the gap.
There is one uncontroversial application of Bayes’s theorem: diagnostic screening, the tests that doctors give healthy people to detect warning signs of disease. They’re a good way to understand the perils of the deductive approach.
In theory, picking up on the early signs of illness is obviously good. But in practice there are usually so many false positive diagnoses that it just doesn’t work very well. Take dementia. Roughly 1 per cent of the population suffer from mild cognitive impairment, which might, but doesn’t always, lead to dementia. Suppose that the test is quite a good one, in the sense that 95 per cent of the time it gives the right (negative) answer for people who are free of the condition. That means that 5 per cent of the people who don’t have cognitive impairment will test, falsely, as positive. That doesn’t sound bad. It’s directly analogous to tests of significance, which give 5 per cent false positives when there is no real effect, if we use a p-value of less than 5 per cent to mean ‘statistically significant’.
But in fact the screening test is not good: it’s actually appallingly bad, because 86 per cent, not 5 per cent, of all positive tests are false positives. So only 14 per cent of positive tests are correct. This happens because most people don’t have the condition, and so the false positives from these people (5 per cent of 99 per cent of the people) outweigh the number of true positives that arise from the much smaller number of people who have the condition (80 per cent of 1 per cent of the people, if we assume 80 per cent of people with the disease are detected successfully). There’s a YouTube video of my attempt to explain this principle, or you can read my recent paper on the subject.
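The arithmetic behind that 86 per cent figure is short enough to check directly; a sketch using the numbers given in the text (1 per cent prevalence, 95 per cent specificity, 80 per cent sensitivity):

```python
# Checking the dementia-screening numbers from the text.
prevalence  = 0.01  # 1% of people have mild cognitive impairment
specificity = 0.95  # 95% of unaffected people correctly test negative
sensitivity = 0.80  # 80% of affected people are detected

false_pos = (1 - specificity) * (1 - prevalence)  # 5% of the 99% unaffected
true_pos  = sensitivity * prevalence              # 80% of the 1% affected

share_false = false_pos / (false_pos + true_pos)
print(f"{share_false:.0%} of positive tests are false positives")  # 86%
```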
Notice, though, that it’s possible to calculate the disastrous false-positive rate for screening tests only because we have estimates for the prevalence of the condition in the whole population being tested. This is the prior probability that we need in order to use Bayes’s theorem. If we return to the problem of tests of significance, it’s not so easy. The analogue of the prevalence of disease in the population becomes, in the case of significance tests, the probability that there is a real difference between the pills before the experiment is done – the prior probability that there’s a real effect. And it’s usually impossible to make a good guess at the value of this figure.
An example should make the idea more concrete. Imagine testing 1,000 different drugs, one at a time, to sort out which works and which doesn’t. You’d be lucky if 10 per cent of them were effective, so let’s proceed by assuming a prevalence or prior probability of 10 per cent. Say we observe a ‘just significant’ result, for example, a P = 0.047 in a single test, and declare that this is evidence that we have made a discovery. That claim will be wrong, not in 5 per cent of cases, as is commonly believed, but in 76 per cent of cases. That is disastrously high. Just as in screening tests, the reason for this large number of mistakes is that the number of false positives in the tests where there is no real effect outweighs the number of true positives that arise from the cases in which there is a real effect.
In general, though, we don’t know the real prevalence of true effects. So, although we can calculate the p-value, we can’t calculate the number of false positives. But what we can do is give a minimum value for the false positive rate. To do this, we need only assume that it’s not legitimate to say, before the observations are made, that the odds that an effect is real are any higher than 50:50. To do so would be to assume you’re more likely than not to be right before the experiment even begins.
If we repeat the drug calculations using a prevalence of 50 per cent rather than 10 per cent, we get a false positive rate of 26 per cent, still much bigger than 5 per cent. Any lower prevalence will result in an even higher false positive rate.
The upshot is that, if a scientist observes a ‘just significant’ result in a single test, say P = 0.047, and declares that she’s made a discovery, that claim will be wrong at least 26 per cent of the time, and probably more. No wonder then that there are problems with reproducibility in areas of science that rely on tests of significance.
What is to be done? For a start, it’s high time that we abandoned the well-worn term ‘statistically significant’. The cut-off of P < 0.05 that’s almost universal in biomedical sciences is entirely arbitrary – and, as we’ve seen, it’s quite inadequate as evidence for a real effect. Although it’s common to blame Fisher for the magic value of 0.05, in fact Fisher said, in 1926, that P = 0.05 was a ‘low standard of significance’ and that a scientific fact should be regarded as experimentally established only if repeating the experiment ‘rarely fails to give this level of significance’.
The ‘rarely fails’ bit, emphasised by Fisher 90 years ago, has been forgotten. A single experiment that gives P = 0.045 will get a ‘discovery’ published in the most glamorous journals. So it’s not fair to blame Fisher, but nonetheless there’s an uncomfortable amount of truth in what the physicist Robert Matthews at Aston University in Birmingham had to say in 1998: ‘The plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug.’
The underlying problem is that universities around the world press their staff to write whether or not they have anything to say. This amounts to pressure to cut corners, to value quantity rather than quality, to exaggerate the consequences of their work and, occasionally, to cheat. People are under such pressure to produce papers that they have neither the time nor the motivation to learn about statistics, or to replicate experiments. Until something is done about these perverse incentives, biomedical science will be distrusted by the public, and rightly so. Senior scientists, vice-chancellors and politicians have set a very bad example to young researchers. As the zoologist Peter Lawrence at the University of Cambridge put it in 2007:
hype your work, slice the findings up as much as possible (four papers good, two papers bad), compress the results (most top journals have little space, a typical Nature letter now has the density of a black hole), simplify your conclusions but complexify the material (more difficult for reviewers to fault it!)
But there is good news too. Most of the problems occur only in certain areas of medicine and psychology. And despite the statistical mishaps, there have been enormous advances in biomedicine. The reproducibility crisis is being tackled. All we need to do now is to stop vice-chancellors and grant-giving agencies imposing incentives for researchers to behave badly.

Wednesday, December 28, 2016

An introduction to the statistical principles behind A/B testing

Seeing is believing: if a computer cannot see, then it has to guess.



"Why is Computer Vision challenging?

What about the animal vision system?

 • perspective projection (from 3D to 2D)
 • object reflectance
 • visual features"


Part I. what is A/B testing?

"A/B testing is comparing two versions of a web page to see which one performs better. You compare two web pages by showing the two variants (let's call them A and B) to similar visitors at the same time. The one that gives better conversion rate, wins!"

For example, we are doing A/B testing for the Holberton School webpage. In one version (version A, the old one), we have the 'click to apply immediately' button on the menu; in the other (version B, the new one), the button is right next to the picture of "cisfun$" (coding your own shell). 2,500 views are tested, and the application rate is 12.5% in version A and 14% in version B.

Now, Julien asks us, "Hi, guys, how do we know if version B is really better?" 


click on link #1: https://www.optimizely.com/ab-testing/
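One conventional way to answer Julien's question is a two-proportion z-test. The sketch below assumes 2,500 views for each version (the post doesn't say whether 2,500 is per version or in total); it is an illustration of the mechanics, not the post's own calculation:

```python
import math

# Two-proportion z-test; assumes 2,500 views for each version
# (the post does not say whether 2,500 is per version or in total).
p_a, p_b = 0.125, 0.140
n_a, n_b = 2500, 2500

p_pool = (p_a * n_a + p_b * n_b) / (n_a + n_b)   # pooled rate under Ho
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se

# Two-tailed p-value from the standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With these illustrative numbers z is about 1.56 and p about 0.12, so version B's apparent lead would not yet be statistically significant at the usual 5% level. The concepts behind this calculation are developed in the parts that follow.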

Part II. statistics as we know it is all about the null distribution

An A/B test is a comparative test: during the experiment, we draw samples from the population, collect data, and then estimate the population parameters. Statistical principles are the scientific basis that lets us derive valid conclusions from the experimental data.

1. statistical hypothesis testing and significance testing: basic concepts

In order to answer Julien's question, we need to set up two hypotheses and then test them.

Null Hypothesis (Ho): There is no difference between the two webpages. We hope to overturn this hypothesis with our test results.
Alternative Hypothesis (Ha): There is a difference between the two webpages. We wish to validate this hypothesis with our test results.

 "Statistical Methodology - General A) Null hypotheses - Ho 1) In most sciences, we are faced with: a) NOT whether something is true or false (Popperian decision) b) BUT rather the degree to which an effect exists (if at all) - a statistical decision. 2) Therefore 2 competing statistical hypotheses are posed: a) Ha: there is a difference in effect between (usually posed as < or >) b) Ho: there is no difference in effect between"

click on link#2, page 13:

2. The most important concept behind is null distribution 

Page 8:

"Almost all ordinary statistics are based on a null distribution • If you understand a null distribution and what the correct null distribution is then statistical inference is straight-forward. • If you don’t, ordinary statistical inference is bewildering • A null distribution is the distribution of events that could occur if the null hypothesis is true"


3. We, of course, do hope to reject the null hypothesis Ho with our test results

But how do we do that statistically?

Logic: The null hypothesis and the alternative hypothesis are mutually exclusive and together cover all possibilities. In a hypothesis test, exactly one of Ho and Ha holds; if we reject one, we must accept the other. So it is either Ho true or Ha true.

In the A/B test, the purpose of our experiment is to demonstrate that the new version and the old version are statistically significantly different in terms of application rate.

Again, the null hypothesis Ho in this scenario is that there is no difference between the old and new versions of the webpage. Yes, there may be some differences in the application data we collected, but under Ho this difference is due to random fluctuations in a null distribution: the webpage viewers, as a population, show "statistical and random fluctuations" in their visits and in filling out applications. These fluctuations are bidirectional, or two-tailed: sometimes more positive than negative, other times more negative than positive.
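These two-tailed fluctuations can be simulated directly: give both versions the same true application rate (12.5% here, an assumption for illustration), and the observed difference still wanders above and below zero. A minimal sketch:

```python
import random

# Simulate the null distribution: both versions share the same true
# application rate (12.5% is an assumption), so every observed
# difference is pure random fluctuation.
random.seed(0)
true_rate, n, runs = 0.125, 2500, 1000

diffs = []
for _ in range(runs):
    rate_a = sum(random.random() < true_rate for _ in range(n)) / n
    rate_b = sum(random.random() < true_rate for _ in range(n)) / n
    diffs.append(rate_b - rate_a)

# Two-tailed behaviour: roughly half the fluctuations are positive.
share_positive = sum(d > 0 for d in diffs) / runs
print(round(share_positive, 2))
```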

Part III. Type I error is critical

Now we have a much better understanding of A/B testing: we want to run an A/B test to reject the null hypothesis Ho of no difference, concluding that Ho is false and therefore that the alternative hypothesis Ha is true.

1. What is Type I error?

Type I error: The null hypothesis is rejected when the null hypothesis is true

This is saying that we ran our A/B test and, in our report, rejected Ho, concluding that Ho is false and that the alternative hypothesis (Ha) is true.

But there is a probability that Ho was in fact true and our result arose by chance (for example, through sampling error). The probability of making this kind of error is called α.

"An analogy that some people find helpful (but others don't) in understanding the two types of error is to consider a defendant in a trial. The null hypothesis is "defendant is not guilty;" the alternate is "defendant is guilty."A Type I error would correspond to convicting an innocent person; a Type II error would correspond to setting a guilty person free." 

click on link #3:

Again, the logic is: either Ho is true or Ha is true. They are mutually exclusive and cover all possibilities, so rejecting one means accepting the other.

2.   Type I error and  p-value

A standard value of 0.05 is the generally accepted probability of committing a Type I error.

This 5% probability, or significance level, is called α in statistics, and it also determines the confidence level of our test results. If α is 0.05, the confidence level is 0.95: by running the test at α = 0.05, we keep the chance of a Type I error below 5%, so when we reject Ho we can be reasonably confident, at the 95% level, that the positive feedback we get is due to our newly improved webpage design rather than chance.

If we set α to 0.01, then we have a much tougher job proving that our new webpage actually made a difference.

α is set by the industry standard, and we compare our own p-value, computed from the sample data, against it.

3. Definition of p-value.

For a significance level (α) of 0.05, one expects to obtain sample means in the critical region 5% of the time when the null hypothesis is true.

Or in other words: for every 100 independent, random tests we conduct when Ho is actually true, about 5 will land in the critical region purely by chance. Those 5 "positive" results are false positives; they are not evidence that lets us reject the null hypothesis Ho.
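This 5-in-100 behaviour can be checked by simulation: run many A/B tests in which Ho really is true and count how often a z-test (the normal-approximation test sketched here as an illustration) reports p ≤ 0.05:

```python
import math
import random

# Count false positives: run many simulated A/B tests in which Ho is
# true (both versions convert at 12.5%) and see how often a
# normal-approximation z-test reports p <= 0.05.
random.seed(1)
n, rate, alpha, trials = 2500, 0.125, 0.05, 1000

def two_tailed_p(k_a, k_b, n):
    p_a, p_b = k_a / n, k_b / n
    p_pool = (k_a + k_b) / (2 * n)
    se = math.sqrt(p_pool * (1 - p_pool) * (2 / n))
    if se == 0:
        return 1.0
    return math.erfc(abs(p_b - p_a) / se / math.sqrt(2))

false_positives = 0
for _ in range(trials):
    k_a = sum(random.random() < rate for _ in range(n))
    k_b = sum(random.random() < rate for _ in range(n))
    if two_tailed_p(k_a, k_b, n) <= alpha:
        false_positives += 1

print(false_positives / trials)  # close to 0.05
```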

If the p-value calculated from our sample data is <= α (set by the industry standard), we can say that our tests have yielded the statistically significant positive results we hoped for, and we can therefore reject the null hypothesis Ho and accept the alternative hypothesis Ha.

4. P-value calculation.

click on link #4: page 17, page 20, page 23, page 26.


Part IV. Type II Error: Ho is false, but we have accepted it, and therefore wrongly rejected Ha

There is a real difference between the two versions of the Holberton School webpage, but we mistakenly believe that there is no real difference, attributing any observed difference to the general viewer population's "random fluctuations". The probability of committing a Type II error, or β, can be relatively large, with industry standards of 10% to 20%, meaning we are fairly likely to underestimate the probability that we can actually redesign and improve our webpage.

click on link #5: page 24


As with the significance level for Type I error, in order to guard against Type II error we compute another quantity that gives us a reference: the statistical power, defined as 1 − β.

Suppose that the two versions of the webpage really do differ. Statistical power is the probability that we correctly reject the null hypothesis and obtain a statistically significant result; a power of 80% to 90% is typically targeted.

We calculate statistical power from the sample size, the variance, α, and the minimum detectable effect.
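A minimal sketch of such a power calculation for a two-proportion test, using the normal approximation; the baseline rate, minimum detectable effect, and sample size are assumptions borrowed from the Holberton example, not an industry formula:

```python
from statistics import NormalDist

# Approximate power of a two-tailed, two-proportion z-test.
# Baseline rate, minimum detectable effect and sample size are
# assumptions borrowed from the Holberton example.
def power(p_base, mde, n, alpha=0.05):
    norm = NormalDist()
    p_alt = p_base + mde
    p_pool = (p_base + p_alt) / 2
    se_null = (2 * p_pool * (1 - p_pool) / n) ** 0.5
    se_alt = (p_base * (1 - p_base) / n + p_alt * (1 - p_alt) / n) ** 0.5
    z_crit = norm.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    return norm.cdf((mde - z_crit * se_null) / se_alt)

print(f"power = {power(0.125, 0.015, 2500):.2f}")
```

With 2,500 views per arm, this design has only about 35% power to detect a 1.5-point lift, well below the 80% to 90% target, which is why power calculations are usually done before choosing the sample size.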

click on link #6: page 27


Now, we have finally accomplished the A/B test, and our test results indicate that our newly improved webpage has significantly more application clicks from viewers, with a 95% confidence level and 80%-90% statistical power.


1. Calculating p-values


Here we look at some examples of calculating p-values, for both normal and t distributions. We assume that you can enter data and know the commands associated with basic probability. We first show how to do the calculations the hard way, then demonstrate an easier way using the t.test command.

2. Standard deviations and standard errors

Saturday, October 22, 2016

TravelSky Technology (China Airline): Predicting Airfare Prices by SVM


We were faced with a large dataset with no explicit links between records, making it very challenging to analyze the price changes of an individual round trip.
It was much more practical to develop a model that generalizes the properties of all records in the dataset, and to train an SVM as a binary pricing classifier to distinguish between ”expensive” and ”cheap” tickets across all transaction records processed.

Part I. General Introduction
TravelSky Technology is one of the largest global distribution systems in the travel/tourism industry: it sells tickets for all airlines (also hotels) and processes millions of billable transactions per month.

Project Goals

1 Construct and train a general classifier so that it can distinguish between expensive and cheap tickets.

2. Use this classifier to predict the prices of future tickets.

3. Determine which factors have the greatest impact on price by analyzing the trained classifier.

Exploratory data analysis
Extent of the dataset: billions of records, 132.2 GiB (uncompressed), hundreds of departure airports, hundreds of destinations, hundreds of routes, hundreds of airlines.

Lots of fields: “Buy” date (when was this price current?); “Fly” date (when does the flight leave?); price; cabin class (Economy/Business/First; 98% economy tickets); booking class (A-Z); airline (the airline selling the ticket); … Some of the data looks like a time series: tickets are linked over time.

Classification & Prediction methods

Implemented two different classifiers: a support vector machine (SVM) and L1-regularized logistic regression. Both are convex minimization problems that can be solved online by employing the stochastic gradient descent (SGD) method.

SVM: a binary linear classifier. Goal: find the maximum-margin hyperplane that divides the points with label y = +1 from those with label y = -1.
Training: generate a training label yi for the i-th data point xi. Choose the hyperplane parameters so that the margin is maximal and the training data is still correctly classified.

For each route r, calculate the arithmetic mean (and standard deviation) of the price over all tickets.
Assign labels: label +: “above mean price for this route”; label -: “below mean price for this route”. Only the mean/std-dev are stored, not the labels themselves.
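The per-route labelling step can be sketched as follows, with toy (route, price) records standing in for the real transaction data:

```python
from statistics import mean, stdev

# Toy (route, price) records standing in for the real transactions.
records = [
    ("PEK-SHA", 420.0), ("PEK-SHA", 515.0), ("PEK-SHA", 610.0),
    ("PEK-CAN", 900.0), ("PEK-CAN", 980.0), ("PEK-CAN", 760.0),
]

# Per-route mean and standard deviation; only these are stored,
# not the labels themselves.
stats = {}
for route in {r for r, _ in records}:
    prices = [p for r, p in records if r == route]
    stats[route] = (mean(prices), stdev(prices))

# Label +1 for "above mean price for this route", -1 otherwise.
labels = [+1 if price > stats[route][0] else -1 for route, price in records]
print(labels)  # [-1, -1, 1, 1, 1, -1]
```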

Feature Selection
Extract features from plaintext records (x).
Each plaintext record is transformed into a 990-dimensional vector.
Each dimension contains a numerical value corresponding to a feature such as: number of days between “Buy” and “Fly” dates; day of week (for all dates); whether the day falls on a weekend (for all dates); date flags such as isMonday, isWeekend, isWinter, weekOfYear, . . . .
Each dimension is normalized to zero mean and unit variance (per route r).

Part II. More Detailed Descriptions of Our Model

Classification methods

In order to identify which records represent cheap tickets and which records have traits identifying them as expensive tickets, a classifier able to distinguish between ”expensive” and ”cheap” records is necessary.

It should be possible to train such a classifier on all records at once, identifying the features making a record cheaper or more expensive than other records. As some routes are more expensive than others, it does not make sense to include the route as a feature, but rather normalize prices per route. This enables comparison of prices across all routes without simply marking all records of a particular route as expensive. Each record is then labeled according to the normalized price.
In short, a record for a particular route is labeled as ”expensive” (+1) if its price is higher than the average price of all records for that specific route. Otherwise it is labeled as ”cheap” (-1).
After training the classifier, it should be able to predict a label for a new record with an unknown price and assign that label to it. As the route of the new record is known, a numerical minimal or maximal price (the afore-mentioned average price per route) can be directly inferred from the predicted label.
Additionally, the model parameters of the trained classifier should contain information on how much each feature contributes to a record being cheap or expensive.

Online algorithms for classification

Due to the large amount of data, algorithms using more than a constant amount of memory are not suitable. Two algorithms were implemented: one for online support vector machines and the other for online L1-regularized logistic regression. This allows efficient training of the classifier on a parallel system with limited memory.

Some definitions and terminology as they are used in the following sections:

Each data point Xi represents the features of a single record Ri and is also called the feature vector for record Ri. Each component contains information about a single aspect of the record Ri; the contents, described previously, are derived from the fields of Ri.

Each label yi represents the label (classification) of a single record Ri, with two possible values: ”expensive” (+1) and ”cheap” (-1).

A record Ri always consists of a pair (Xi, yi); both values are known for the training dataset. For new data points Xi, the value of the label yi is initially unknown and is the result of the classification/prediction.


The weight vector w is the model parameter of the classifier to be estimated and is initially unknown. In both classifiers discussed below, w has the same number of dimensions as a data point Xi and determines the effect each value in Xi has on the classification result.

Feature vector generation
For each record, a feature vector consisting of 990 features was created. Before normalization, each entry was set to either 1.0 (boolean true), 0 (boolean false) or a value associated with a numerical field in the record.
The feature vector represents each record as a 990-dimensional vector.
Some examples of features, and record fields utilized:
Dates Request Date, Departure Date, Return Date
Date differences Return-Departure Date, Departure-Request Date
Categorical values Passenger Type, Airline
Numerical values Number of passengers, Number of hops
Sequences of categorical values Cabin Classes, Booking Classes, Availabilities
Sequences of numerical values Flight numbers
Feature vector normalization
As with price normalization, each of the 990 features fm was normalized in two steps. In a first MapReduce job, the arithmetic means µfm and standard deviations σfm were calculated using the same methods as for price normalization and subsequently stored to disk.
All following MapReduce jobs loaded the 990 means and standard deviations from disk and calculated the normalized feature vector x'_i on the fly by computing the standard score of each feature f_m:

f'_m = (f_m − µ_fm) / σ_fm,  for m = 1, . . . , 990
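The standard-score step for a single feature column can be sketched as follows (toy values, not real features):

```python
from statistics import mean, pstdev

# Standard-score normalization of one feature column f_m (toy values).
feature = [3.0, 7.0, 5.0, 9.0, 1.0]

mu, sigma = mean(feature), pstdev(feature)
normalized = [(f - mu) / sigma for f in feature]

# The normalized column has zero mean and unit variance.
print([round(v, 2) for v in normalized])
```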

Stochastic gradient descent (SGD)

Given a convex set S and a convex function f, we can estimate the parameter w in min_{w ∈ S} f(w), where f is of the form f(w) = Σ_{t=1}^{n} f_t(w). Usually, each summand f_t represents the loss function for a single observed data point from the dataset. Finding w is done iteratively, using one random sample data point from the dataset per iteration. For regularization, w ∈ S needs to be ensured, thus a projection onto S is necessary.

Let w_0 ∈ S be the starting value. Then each iteration t consists of the update step

w_{t+1} = Proj_S(w_t − η_t ∇f_t(w_t))

where Proj_S is a projection onto the set S, η_t is the current step size (learning rate), and ∇f_t is the gradient of f approximated at the sample data point for iteration t.

It is possible to use only a subsample of the full dataset if the data points used for training are picked at random from the dataset. Training can then be halted either after a fixed number of iterations or as soon as sufficient accuracy is achieved.
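A minimal sketch of this SGD scheme for a hinge-loss (SVM) objective on toy 2-D data; the feasible set S is taken here to be an L2 ball of radius 10, an assumption for illustration rather than the project's actual choice:

```python
import random

# SGD on the hinge loss for a linear SVM, following the update rule
# above; Proj_S clips w to an L2 ball of radius 10 (an assumed
# feasible set, for illustration).
random.seed(3)

# Toy separable data in 2-D: label +1 if x0 + x1 > 0, else -1.
data = []
for _ in range(400):
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    data.append((x, 1 if x[0] + x[1] > 0 else -1))

w = [0.0, 0.0]
radius = 10.0

for t in range(1, 5001):
    (x0, x1), y = random.choice(data)        # one random point per iteration
    eta = 1.0 / t ** 0.5                     # decreasing learning rate
    if y * (w[0] * x0 + w[1] * x1) < 1:      # hinge-loss subgradient step
        w[0] += eta * y * x0
        w[1] += eta * y * x1
    norm = (w[0] ** 2 + w[1] ** 2) ** 0.5    # projection onto the ball S
    if norm > radius:
        w = [radius * w[0] / norm, radius * w[1] / norm]

accuracy = sum((w[0] * x0 + w[1] * x1) * y > 0
               for (x0, x1), y in data) / len(data)
print(f"training accuracy: {accuracy:.2f}")
```

Sampling one random point per update is what makes the method usable on a dataset too large to hold in memory.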