Part I. What is A/B testing?
"A/B testing is comparing two versions of a web page to see which one performs better. You compare two web pages by showing the two variants (let's call them A and B) to similar visitors at the same time. The one that gives a better conversion rate, wins!"
For example, suppose we are A/B testing the Holberton School webpage. In one version (version A, the old one), the "click to apply immediately" button is in the menu; in the other version (version B, the new one), the button is right next to the picture of "cisfun$" (coding your own shell). 2,500 views are tested, and the application rate is 12.5% in version A and 14% in version B.
Now, Julien asks us, "Hi, guys, how do we know if version B is really better?"
https://www.holbertonschool.com/
click on link #1: https://www.optimizely.com/ab-
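Before we get into the statistics, here is a minimal sketch in R (the language used in the appendix tutorial) of how Julien's question can eventually be answered with a standard two-proportion test. The counts are assumptions for illustration: we pretend each version was shown to 2,500 visitors, so roughly 313 and 350 applications correspond to the 12.5% and 14% rates above.

    # Hypothetical setup: assume 2,500 views for EACH version of the page
    applications <- c(313, 350)      # about 12.5% and 14% of 2,500
    views        <- c(2500, 2500)

    # Two-proportion test: does version B's application rate differ from A's?
    prop.test(applications, views)   # prints a p-value and a confidence interval

Whether the printed p-value counts as "small enough" to declare B the winner is exactly what the rest of this article explains.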
Part II. Statistics as we know it is all about the null distribution
An A/B test is a comparative test: during the experiment, we draw samples from the population, collect data, and then estimate the parameters of the overall population. Statistical principles are the scientific basis that lets us draw valid conclusions from this experimental data.
1. Statistical hypothesis testing and significance testing: basic concepts
In order to answer Julien's question, we need to set up two hypotheses and then test them.
Null Hypothesis (Ho): There is no difference between the two webpages. We hope to overturn this hypothesis with our test results.
Alternative Hypothesis (Ha): There is a difference between the two webpages (the new version performs differently). We hope to validate this hypothesis with our test results.
"Statistical Methodology - General A) Null hypotheses - Ho 1) In most sciences, we are faced with: a) NOT whether something is true or false (Popperian decision) b) BUT rather the degree to which an effect exists (if at all) - a statistical decision. 2) Therefore 2 competing statistical hypotheses are posed: a) Ha: there is a difference in effect between (usually posed as < or >) b) Ho: there is no difference in effect between"
click on link #2, page 13:
http://courses.pbsci.ucsc.edu/eeb/bioe286/Lecture%20Handouts/The%20linkage%20between%20Popperian%20Science%20and%20Statistical%20Analysis%20-%202016.pdf
2. The most important concept behind this is the null distribution
Page 8:
"Almost all ordinary statistics are based on a null distribution • If you understand a null distribution and what the correct null distribution is then statistical inference is straight-forward. • If you don’t, ordinary statistical inference is bewildering • A null distribution is the distribution of events that could occur if the null hypothesis is true"
https://www.google.com/search?q=oak+seedings+controlled+studies&biw=1600&bih=794&tbm=isch&source=lnms&sa=X&ved=0ahUKEwirp6eglZrRAhWG2yYKHeI0CpoQ_AUICCgD&dpr=1#imgrc=aczLFFR0vh9JsM%3A
3. We, of course, do hope to reject the null hypothesis Ho with our test results
But how do we do that statistically?
Logic: the null hypothesis and the alternative hypothesis form a complete set of events, and they are opposites of each other. In a hypothesis test, either the null hypothesis Ho or the alternative hypothesis Ha must hold, and if one does not hold, we must accept the other unconditionally. So it is either Ho true, or Ha true.
In our A/B test, the purpose of the experiment is to demonstrate that the new version and the old version are statistically significantly different in terms of application rate.
Again, the null hypothesis Ho in this scenario is that there is no difference between the old and new versions of the webpage. Yes, the application data we collected may show some difference, but under Ho that difference is due to random fluctuations in a null distribution: the population of webpage viewers has natural "statistical and random fluctuations" in how they visit and fill out applications, and these fluctuations are bidirectional, or two-tailed; sometimes they push the numbers up, other times they push them down.
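To make the idea of a null distribution concrete, here is a small R simulation, a sketch under assumed numbers (2,500 views per version and a common true application rate of 13%). It shows that even when Ho is true, i.e. both versions share exactly the same underlying rate, the observed difference between A and B fluctuates around zero in both directions.

    # Simulate many A/B experiments in which Ho is TRUE:
    # both versions share the same true application rate (13%, an assumed value)
    set.seed(42)
    n_views   <- 2500
    true_rate <- 0.13

    diff_under_null <- replicate(10000, {
      rate_A <- rbinom(1, n_views, true_rate) / n_views
      rate_B <- rbinom(1, n_views, true_rate) / n_views
      rate_B - rate_A                # observed difference due to chance alone
    })

    hist(diff_under_null, breaks = 50,
         main = "Null distribution of the observed difference (B - A)")
    quantile(diff_under_null, c(0.025, 0.975))   # typical range of pure random fluctuation

The histogram is roughly symmetric around zero: sometimes B looks better purely by chance, sometimes A does, which is exactly the two-tailed fluctuation described above.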
Part III. Type I error is critical
Now we have a much better understanding of A/B testing: we want to run an A/B test to reject the null hypothesis Ho of no difference, proving Ho is false, and therefore proving that the alternative hypothesis Ha is true.
1. What is Type I error?
Type I error: The null hypothesis is rejected when the null hypothesis is true
This is saying that we did all the A/B testing, and in our report we rejected Ho, concluded it was false, and therefore accepted the alternative hypothesis (Ha) as true.
But there is a probability that Ho was actually true and the difference we saw came from sampling fluctuation alone; the probability of making this kind of error, rejecting a true Ho, is called α.
"An analogy that some people find helpful (but others don't) in understanding the two types of error is to consider a defendant in a trial. The null hypothesis is "defendant is not guilty;" the alternate is "defendant is guilty."4 A Type I error would correspond to convicting an innocent person; a Type II error would correspond to setting a guilty person free."
click on link #3:
https://www.ma.utexas.edu/users/mks/statmistakes/errortypes.html
Again, remember the logic: the null and alternative hypotheses are a complete set of events and opposites of each other, so it is either Ho true or Ha true; rejecting one means accepting the other.
2. Type I error and p-value
A standard value of 0.05 is the generally accepted probability of committing a Type I error.
This 5% probability, or significance level, is called α in statistics, and it determines the confidence level of our test results: if α is 0.05, the confidence level is 0.95. That is, if our A/B test shows that the chance of a Type I error is less than 5%, then we can be more than 95% confident that the positive results we get from our new webpage are due to the improved design rather than to chance.
If we set α to 0.01, then we have a much tougher job proving that our new webpage actually made a difference.
α is set by industry standard, and we compare it against the p-value we compute from our own sample data.
3. Definition of p-value.
For a significance level (α) of 0.05, one expects to obtain sample means in the critical region 5% of the time when the null hypothesis is true. The p-value itself is the probability, assuming Ho is true, of obtaining a result at least as extreme as the one we actually observed.
In other words, if Ho is actually true, then out of every 100 independent, random tests we conduct, about 5 will still produce the kind of "positive" result we hope for purely by chance; those 5 results are false alarms, not evidence of a real difference.
If the p-value calculated from our sample data is <= α (set by industry standard), we can say that our test has yielded the statistically significant positive result we hoped for, and we can therefore reject the null hypothesis Ho and accept the alternative hypothesis Ha.
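As a tiny R sketch of this decision rule (reusing the hypothetical counts from Part I):

    alpha  <- 0.05                                    # significance level set in advance
    result <- prop.test(c(313, 350), c(2500, 2500))   # hypothetical data from Part I

    if (result$p.value <= alpha) {
      cat("Reject Ho: the difference in application rate is statistically significant\n")
    } else {
      cat("Fail to reject Ho: the difference could be random fluctuation\n")
    }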
4. P-value calculation.
click on link #4: page 17, page 20, page 23, page 26.
http://courses.pbsci.ucsc.edu/eeb/bioe286/Lecture%20Handouts/The%20linkage%20between%20Popperian%20Science%20and%20Statistical%20Analysis%20-%202016.pdf
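The pages linked above walk through the calculation; here is a step-by-step sketch in R of the standard two-proportion z-test, using the same hypothetical counts as before (313 and 350 applications out of an assumed 2,500 views each). It illustrates the textbook formula rather than reproducing the linked slides exactly.

    # Observed data (hypothetical, from Part I)
    x_A <- 313;  n_A <- 2500        # version A: applications, views
    x_B <- 350;  n_B <- 2500        # version B: applications, views

    p_A <- x_A / n_A                # observed rate for A
    p_B <- x_B / n_B                # observed rate for B

    # Pooled proportion under Ho (no difference between A and B)
    p_pool <- (x_A + x_B) / (n_A + n_B)

    # z statistic for the difference in proportions
    se <- sqrt(p_pool * (1 - p_pool) * (1 / n_A + 1 / n_B))
    z  <- (p_B - p_A) / se

    # Two-tailed p-value from the standard normal null distribution
    p_value <- 2 * pnorm(-abs(z))
    p_value                         # compare this to alpha = 0.05

Note that whether this p-value lands below 0.05 depends on the real sample sizes; the 2,500-views-per-version figure used here is only an assumption, and prop.test from the earlier sketch gives essentially the same answer (with a continuity correction).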
Part IV. Type II error: Ho is false, but we have accepted it, and therefore wrongly rejected Ha
A Type II error means there is a real difference between the two versions of the Holberton School webpage, but we mistakenly conclude that there is no real difference, attributing whatever difference we see to the viewer population's "random fluctuations". The probability of committing a Type II error, called β, can be relatively large, with industry standards of 10% to 20%, meaning we are fairly likely to underestimate our ability to actually improve the webpage with a redesign.
click on link #5: page 24
http://courses.pbsci.ucsc.edu/eeb/bioe286/Lecture%20Handouts/The%20linkage%20between%20Popperian%20Science%20and%20Statistical%20Analysis%20-%202016.pdf
As with the significance level for Type I error, in order to guard against Type II error we compute another reference parameter alongside β: statistical power. Similar to the confidence level (1 - α), the statistical power is 1 - β.
Statistical power is the probability that, supposing the two versions of the webpage really do differ, we correctly reject the null hypothesis and obtain a statistically significant result; we typically aim for a power of 80 to 90 percent.
We calculate statistical power from the sample size, the variance, α, and the minimum difference (effect size) we want to be able to detect, as in the sketch below.
click on link #6: page 27
http://courses.pbsci.ucsc.edu/eeb/bioe286/Lecture%20Handouts/The%20linkage%20between%20Popperian%20Science%20and%20Statistical%20Analysis%20-%202016.pdf
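Base R has a ready-made function for exactly this calculation with two proportions, power.prop.test. A sketch with our hypothetical rates (12.5% vs. 14%) and the conventional α = 0.05:

    # How many views per version are needed to detect 12.5% -> 14%
    # with 80% power at alpha = 0.05?  (rates are the hypothetical ones from Part I)
    power.prop.test(p1 = 0.125, p2 = 0.14, sig.level = 0.05, power = 0.80)

    # Conversely: with an assumed 2,500 views per version, how much power do we have?
    power.prop.test(n = 2500, p1 = 0.125, p2 = 0.14, sig.level = 0.05)

The first call estimates the sample size needed to reach 80% power, and the second reports how much power a given sample size actually buys us.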
Now we have finally accomplished the A/B test, and our test results indicate that our newly improved webpage gets significantly more application clicks from viewers, at a 95% confidence level and with 80-90% statistical power.
Appendix
1. Calculating p-values
http://www.cyclismo.org/tutorial/R/pValues.html
Here we look at some examples of calculating p-values, for both normal and t distributions. We assume that you can enter data and know the commands associated with basic probability. We first show how to do the calculations the hard way, then show easier ways to do them. The last method makes use of the t.test command and demonstrates an easier way to calculate a p-value.
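As a minimal illustration of that last, easier method (a generic made-up example, not one of the tutorial's own datasets):

    # Two small samples drawn from hypothetical normal distributions
    set.seed(1)
    x <- rnorm(20, mean = 10, sd = 2)
    y <- rnorm(20, mean = 11, sd = 2)

    t.test(x, y)     # Welch two-sample t-test; the output includes the p-value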