Wednesday, December 28, 2016

An introduction to the basic statistics principles behind A/B testing


Part I. What is A/B testing?

"A/B testing is comparing two versions of a web page to see which one performs better. You compare two web pages by showing the two variants (let's call them A and B) to similar visitors at the same time. The one that gives better conversion rate, wins!"

For example, suppose we are doing A/B testing for the Holberton School webpage. In one version (version A, the old one), the 'click to apply immediately' button is on the menu; in the other version (version B, the new one), the button is right next to the picture of "cisfun$" (coding your own shell). 2,500 views are tested, and the application rate is 12.5% in version A and 14% in version B.
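To keep these numbers handy for later, here is a minimal sketch of the example in R. The raw counts are hypothetical (chosen only to match the quoted rates), and I am assuming the 2,500 views apply to each version:

# Hypothetical data for the Holberton School A/B test
views_A        <- 2500
applications_A <- 312     # ~12.5% of 2,500 (hypothetical count)
views_B        <- 2500
applications_B <- 350     # 14% of 2,500 (hypothetical count)

rate_A <- applications_A / views_A
rate_B <- applications_B / views_B
c(rate_A = rate_A, rate_B = rate_B)   # roughly 0.125 and 0.140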


Now, Julien asks us, "Hi, guys, how do we know if version B is really better?" 


https://www.holbertonschool.com/


click on link #1: https://www.optimizely.com/ab-testing/

Part II. Statistics as we know it is all about the null distribution

An A/B test is a comparative test: in the course of the experiment, we draw samples from the population, collect data, and then estimate the parameters of the overall population. The scientific basis that lets us derive valid conclusions from the experimental data rests on statistical principles.

1. Statistical hypothesis testing and significance testing: basic concepts

In order to answer Julien's question, we need to set up two hypotheses and then test them.


Null hypothesis (Ho): there is no difference between the two webpages. We hope to overturn this hypothesis with our test results.
Alternative hypothesis (Ha): there is a difference between the two webpages. We hope to validate this hypothesis with our test results.

 "Statistical Methodology - General A) Null hypotheses - Ho 1) In most sciences, we are faced with: a) NOT whether something is true or false (Popperian decision) b) BUT rather the degree to which an effect exists (if at all) - a statistical decision. 2) Therefore 2 competing statistical hypotheses are posed: a) Ha: there is a difference in effect between (usually posed as < or >) b) Ho: there is no difference in effect between"

click on link#2, page 13:
http://courses.pbsci.ucsc.edu/eeb/bioe286/Lecture%20Handouts/The%20linkage%20between%20Popperian%20Science%20and%20Statistical%20Analysis%20-%202016.pdf

2. The most important concept behind it is the null distribution

Page 8:

"Almost all ordinary statistics are based on a null distribution • If you understand a null distribution and what the correct null distribution is then statistical inference is straight-forward. • If you don’t, ordinary statistical inference is bewildering • A null distribution is the distribution of events that could occur if the null hypothesis is true"

https://www.google.com/search?q=oak+seedings+controlled+studies&biw=1600&bih=794&tbm=isch&source=lnms&sa=X&ved=0ahUKEwirp6eglZrRAhWG2yYKHeI0CpoQ_AUICCgD&dpr=1#imgrc=aczLFFR0vh9JsM%3A
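To make the idea of a null distribution concrete, here is a small simulation sketch in R (my own illustration, not from the lecture): assume both versions of the webpage share the same true application rate of 12.5%, and watch how much the observed difference in rates fluctuates by pure chance.

# Null distribution of the difference in application rates, assuming Ho is
# true: both pages share the same underlying rate (12.5%, assumed).
set.seed(42)
n_views   <- 2500      # views per version, as in our example
true_rate <- 0.125     # common rate under Ho
n_sims    <- 10000

diff_under_null <- replicate(n_sims, {
  rate_a <- rbinom(1, n_views, true_rate) / n_views
  rate_b <- rbinom(1, n_views, true_rate) / n_views
  rate_b - rate_a
})

hist(diff_under_null, breaks = 50,
     main = "Null distribution of the rate difference",
     xlab = "rate(B) - rate(A) when Ho is true")

# How often does chance alone produce a gap at least as large as the
# 1.5 percentage points we observed?
mean(abs(diff_under_null) >= 0.015)

Every test statistic in the rest of this post is really just a way of asking where our observed difference falls within a null distribution like this one.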


3. We, of course, hope to reject the null hypothesis Ho with our test results

But how do we do that statistically?

Logic: the null hypothesis and the alternative hypothesis form a complete set of events, and they are opposites of each other. In a hypothesis test, either the null hypothesis Ho or the alternative hypothesis Ha must hold; if one does not hold, we must accept the other one unconditionally. So either Ho is true, or Ha is true.

In the A/B test, the purpose of our experiment is to demonstrate that the new version and the old version differ, in a statistically significant way, in terms of application rate.

Again, the null hypothesis Ho in this scenario is that there is no difference between the old version and the new version of the webpage. Yes, there may be some differences in the application data we collected, but those differences are due to random fluctuations in a null distribution: the webpage viewers, as a population, show certain statistical and random fluctuations in how they visit the page and fill out applications, and these fluctuations are bidirectional, or two-tailed: sometimes more positive than negative, other times more negative than positive.
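If we approximate the null distribution with a standard normal curve, the two tails are easy to see: at α = 0.05 the chance cutoffs sit symmetrically on the negative and positive sides.

# Two-tailed cutoffs of a standard normal null distribution at alpha = 0.05:
# chance fluctuations fall below the lower cutoff 2.5% of the time and
# above the upper cutoff 2.5% of the time.
alpha <- 0.05
qnorm(c(alpha / 2, 1 - alpha / 2))   # roughly -1.96 and +1.96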

Part III. Type I error is critical

Now we have a much better understanding of A/B testing: we want to run an A/B test to reject the null hypothesis Ho of no difference, showing that Ho is false and therefore that the alternative hypothesis Ha is true.

1. What is Type I error?


Type I error: the null hypothesis is rejected when the null hypothesis is in fact true.

This is saying that we did all the A/B testing and, in our report, rejected Ho, concluding that Ho is false and therefore that the alternative hypothesis Ha is true.

But there is a probability that random sampling error led us to this conclusion even though Ho is actually true, and the probability of making this kind of error is called α.

"An analogy that some people find helpful (but others don't) in understanding the two types of error is to consider a defendant in a trial. The null hypothesis is "defendant is not guilty;" the alternate is "defendant is guilty."A Type I error would correspond to convicting an innocent person; a Type II error would correspond to setting a guilty person free." 

click on link #3:
https://www.ma.utexas.edu/users/mks/statmistakes/errortypes.html


Again, the logic is the same: the null and alternative hypotheses are a complete set of events and are opposites of each other, so either Ho is true, or Ha is true.


2. Type I error and p-value

A standard value of 0.05 is the generally accepted probability of committing a Type I error.

This 5% probability, or significance level, is called α in statistics, and it also determines the confidence level of our test results. If α is 0.05, then the confidence level is 0.95; that is, if our A/B test shows that the chance of a Type I error is < 5%, then we are confident, with > 95% probability, that the positive feedback we get from the new webpage is due to our newly improved webpage design.

If we set α to 0.01, then we have a much tougher job proving that our new webpage actually made a difference.

α is set by industry standard, and we compare it against the p-value calculated from our own sample data.
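One way to see what "a 5% chance of a Type I error" means is a small A/A simulation sketch (my own illustration, assuming a true application rate of 12.5% and 2,500 views per version): both "versions" are identical, so every rejection is a false alarm, and with α = 0.05 roughly 5% of the tests should reject.

# A/A simulation: both versions share the same true rate, so every
# rejection is a Type I error. Expect roughly 5% at alpha = 0.05.
set.seed(1)
alpha     <- 0.05
true_rate <- 0.125
n_views   <- 2500
n_tests   <- 2000

false_positive <- replicate(n_tests, {
  a <- rbinom(1, n_views, true_rate)
  b <- rbinom(1, n_views, true_rate)
  prop.test(c(a, b), c(n_views, n_views))$p.value <= alpha
})
mean(false_positive)   # should land close to 0.05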

3. Definition of p-value.

For a p-value (significance level) of 0.05, one expects to obtain sample means in the critical region 5% of the time when the null hypothesis is true.

In other words, if Ho is true, then out of every 100 independent, random tests we conduct, about 5 will show the positive results we hope for purely by chance, and those 5 positive results are not valid grounds for rejecting the null hypothesis Ho.





If the p-value calculated from our sample data is <= α (set by industry standard), we can say that our test has yielded the statistically significant positive result we hoped for; we can therefore reject the null hypothesis Ho and accept the alternative hypothesis Ha.
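The decision rule itself is one line of R; the p_value below is just a hypothetical placeholder for whatever our test produces.

# Decision rule: compare the p-value from our sample data to alpha.
alpha   <- 0.05
p_value <- 0.03   # hypothetical p-value from our test
if (p_value <= alpha) {
  "Reject Ho: the difference is statistically significant"
} else {
  "Fail to reject Ho: the difference could be random fluctuation"
}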


4. P-value calculation.


click on link #4: page 17, page 20, page 23, page 26.

http://courses.pbsci.ucsc.edu/eeb/bioe286/Lecture%20Handouts/The%20linkage%20between%20Popperian%20Science%20and%20Statistical%20Analysis%20-%202016.pdf
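The linked slides walk through the general recipe. As a concrete sketch with the hypothetical counts from Part I (312 and 350 applications out of an assumed 2,500 views per version), here is the two-proportion calculation in R, once "by hand" with a z statistic and once with the built-in prop.test:

# p-value sketch for the Holberton example (hypothetical counts)
x <- c(312, 350)      # applications for versions A and B
n <- c(2500, 2500)    # views for versions A and B

# By hand: two-proportion z-test under Ho (no difference), pooled rate
p_pool <- sum(x) / sum(n)
se     <- sqrt(p_pool * (1 - p_pool) * (1 / n[1] + 1 / n[2]))
z      <- (x[2] / n[2] - x[1] / n[1]) / se
2 * pnorm(-abs(z))    # two-tailed p-value

# The built-in equivalent (chi-squared test with continuity correction)
prop.test(x, n)$p.value

The exact p-value depends on the real counts and on how many views each version actually received; a larger sample makes the same 1.5-point lift easier to detect, which is exactly where statistical power (Part IV) comes in.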

Part IV. Type II error: Ho is false, but we have accepted it, and therefore wrongly rejected Ha

There is a real difference between the two versions of the Holberton School webpage, but we mistakenly believe there is no real difference, attributing any observed difference to the viewer population's "random fluctuations". The probability of committing a Type II error, β, can be relatively large; industry standards of 10% and 20% are common, meaning we are fairly likely to underestimate the probability that we can actually redesign and improve our webpage.

click on link #5: page 24

http://courses.pbsci.ucsc.edu/eeb/bioe286/Lecture%20Handouts/The%20linkage%20between%20Popperian%20Science%20and%20Statistical%20Analysis%20-%202016.pdf

As with the significance level for Type I error, in order to guard against Type II error we need a reference parameter computed from β: statistical power. Similar to deriving the confidence level from α, the statistical power to be calculated is 1 - β.


Suppose the two versions of the webpage really do differ; the statistical power is then the probability that we correctly reject the null hypothesis and obtain a statistically significant result, and it is typically targeted at 80 to 90 percent.

We calculate statistical power from the sample size, the variance, α, and the minimum difference we want to be able to detect (see the R sketch below).

click on link #6: page 27

http://courses.pbsci.ucsc.edu/eeb/bioe286/Lecture%20Handouts/The%20linkage%20between%20Popperian%20Science%20and%20Statistical%20Analysis%20-%202016.pdf
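As a sketch of how such a calculation looks in practice, R's standard power.prop.test function can work in both directions, again assuming the 12.5% versus 14% rates from our example and 2,500 views per version:

# Power sketch for detecting a lift from a 12.5% to a 14% application rate.
# 1) Power achieved with 2,500 views per version at alpha = 0.05:
power.prop.test(n = 2500, p1 = 0.125, p2 = 0.14, sig.level = 0.05)

# 2) Views per version needed to reach 80% power for the same lift:
power.prop.test(p1 = 0.125, p2 = 0.14, sig.level = 0.05, power = 0.80)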


Now we have finally completed the A/B test, and our test results indicate that our newly improved webpage gets significantly more application clicks from viewers, with a 95% confidence level and 80% to 90% statistical power.

Appendix

1. Calculating p-values


 http://www.cyclismo.org/tutorial/R/pValues.html

Here we look at some examples of calculating p-values, for both the normal and t distributions. We assume that you can enter data and know the commands associated with basic probability. The tutorial first shows how to do the calculations the hard way; the last method makes use of the t.test command and demonstrates an easier way to calculate a p-value.
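In the spirit of that tutorial, here is a minimal sketch with made-up data (ten hypothetical time-on-page measurements, compared against an assumed baseline mean of 60 seconds), computing the p-value both the hard way and with t.test:

# Hypothetical sample: time-on-page (seconds) for 10 visitors.
x   <- c(63, 58, 71, 64, 59, 66, 70, 61, 65, 68)
mu0 <- 60   # assumed baseline mean under Ho

# The hard way: build the t statistic, then look up the two-tailed p-value.
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
2 * pt(-abs(t_stat), df = length(x) - 1)

# The easy way: let t.test do the same work.
t.test(x, mu = mu0)$p.value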

2. Standard deviations and standard errors