Thursday, March 31, 2016

sas "naive bayes classification linearization" https://web.stanford.edu/class/cs124/lec/naivebayes.pdf

Naive Bayes and Logistic Regression

https://www.cs.cmu.edu/.../NBayesLogReg.p...

Carnegie Mellon University
The Naive Bayes algorithm is a classification algorithm based on Bayes rule and a ... Equation (2) is the fundamental equation for the Naive Bayes classifier.



http://www.wsbookshow.com/uploads/bookfile/201101/9787508481517_1.pdf


Discriminative training of Bayesian Chow-Liu multinet ...

ieeexplore.ieee.org/.../abs...

Institute of Electrical and Electronics Engineers
by K. Huang, 2003 (cited by 18)
Discriminative classifiers such as support vector machines directly learn a discriminant function or a posterior probability model to perform classification. On the ...

[PDF]Discriminative Naive Bayesian Classifiers - CiteSeerX

citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.5175...

by K. Huang, 2003 (cited by 6)
Apr 22, 2003 - discriminant function or a posterior probability model to perform classification. On the other hand, generative classifiers often learn a joint ...

Improving Naive Bayesian Classifier by Discriminative ...

https://www.researchgate.net/.../253730079_Improving_Naiv...
ResearchGate
Dec 16, 2013 - Discriminative classifiers such as Support Vector Machines (SVM) directly learn a discriminant function or a posterior probability model to ...

Discriminative Training of Bayesian Chow-Liu Multinet ...

www.researchgate.net/.../2836322_Discriminative_Training_o...
ResearchGate
Nov 20, 2014 - Discriminative classifiers such as Support Vector Machines directly learn a discriminant function or a posterior probability model to perform ...

The basic theory of nonlinear programming, solved with mathematical software ...

A simple explanation of Naive Bayes Classification


http://stackoverflow.com/questions/10059594/a-simple-explanation-of-naive-bayes-classification
I am finding it hard to understand the process of Naive Bayes, and I was wondering if someone could explain it with a simple step-by-step process in English. I understand it takes comparisons by times occurred as a probability, but I have no idea how the training data is related to the actual dataset.
Please give me an explanation of what role the training set plays. I am giving a very simple example for fruits here, like banana, for example:
training set---
round-red
round-orange
oblong-yellow
round-red

dataset----
round-red
round-orange
round-red
round-orange
oblong-yellow
round-red
round-orange
oblong-yellow
oblong-yellow
round-red
Comments:
It's quite easy if you understand Bayes' Theorem. If you haven't read up on Bayes' theorem, try this link: yudkowsky.net/rational/bayes. – Pinch
Here is a nice blog entry about How To Build a Naive Bayes Classifier. – tobigue
NOTE: The accepted answer below is not a traditional example of Naive Bayes. It's mostly a k-Nearest-Neighbor implementation. Read accordingly. – chmullig
@Jaggerjack: Ram Narasimhan's answer is better explained than the accepted answer. – Unmesha SreeVeni

4 Answers


Accepted answer:
Your question, as I understand it, has two parts: one is that you need a better understanding of the Naive Bayes classifier, and the second is the confusion surrounding the training set.
In general, all machine learning algorithms need to be trained for supervised learning tasks like classification and prediction, or for unsupervised learning tasks like clustering.
Training means fitting them on particular inputs so that later we can test them on unknown inputs (which they have never seen before), which they then classify or predict (in the case of supervised learning) based on what they learned. This is what most machine learning techniques like neural networks, SVMs, and Bayesian methods are based upon.
So in a typical machine learning project you basically divide your input set into a development set (training set + dev-test set) and a test set (or evaluation set). Remember that your basic objective is for your system to learn and classify new inputs it has never seen before in either the dev set or the test set.
The test set typically has the same format as the training set. However, it is very important that the test set be distinct from the training corpus: if we simply reused the training set as the test set, then a model that simply memorized its input, without learning how to generalize to new examples, would receive misleadingly high scores.
As a rule of thumb, about 70% of the cases can go into the training set. Also remember to partition the original set into the training and test sets randomly.
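As a rough sketch of such a random split (the tiny feature/label pairs below are made up for illustration, not the OP's data):

import random

# made-up (feature, label) pairs just to illustrate the split
examples = [("round-red", "apple"), ("round-orange", "orange"),
            ("oblong-yellow", "banana"), ("round-red", "apple"),
            ("oblong-yellow", "banana"), ("round-orange", "orange")]

random.shuffle(examples)               # randomize before splitting
cut = int(0.7 * len(examples))         # ~70% of cases for training
train_set, test_set = examples[:cut], examples[cut:]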
Now I come to your other question about Naive Bayes.
To demonstrate the concept of Naïve Bayes Classification, consider the example given below:
[Figure: a set of GREEN and RED objects scattered in a two-dimensional plane.]
As indicated, the objects can be classified as either GREEN or RED. Our task is to classify new cases as they arrive, i.e., decide to which class label they belong, based on the currently existing objects.
Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which hasn't been observed yet) is twice as likely to have membership GREEN rather than RED. In the Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and often used to predict outcomes before they actually happen.
Thus, we can write:
Prior Probability of GREEN = number of GREEN objects / total number of objects
Prior Probability of RED = number of RED objects / total number of objects
Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities for class membership are:
Prior Probability for GREEN = 40 / 60
Prior Probability for RED = 20 / 60
Having formulated our prior probability, we are now ready to classify a new object (the WHITE circle in the diagram below). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED) objects in the vicinity of X, the more likely it is that the new case belongs to that particular color. To measure this likelihood, we draw a circle around X which encompasses a number (to be chosen a priori) of points irrespective of their class labels. Then we calculate the number of points in the circle belonging to each class label. From this we calculate the likelihood:
Likelihood of X given GREEN = number of GREEN in the vicinity of X / total number of GREEN cases
Likelihood of X given RED = number of RED in the vicinity of X / total number of RED cases
From the illustration above, it is clear that Likelihood of X given GREEN is smaller than Likelihood of X given RED, since the circle encompasses 1 GREEN object and 3 RED ones. Thus:
Likelihood of X given GREEN = 1 / 40
Likelihood of X given RED = 3 / 20
Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many GREEN compared to RED), the likelihood indicates otherwise: the class membership of X is RED (given that there are more RED objects in the vicinity of X than GREEN). In the Bayesian analysis, the final classification is produced by combining both sources of information, i.e., the prior and the likelihood, to form a posterior probability using the so-called Bayes' rule (named after Rev. Thomas Bayes, 1702-1761).
Posterior probability of X being GREEN = prior probability of GREEN x likelihood of X given GREEN = 40/60 x 1/40 = 1/60
Posterior probability of X being RED = prior probability of RED x likelihood of X given RED = 20/60 x 3/20 = 1/20
Finally, we classify X as RED since its class membership achieves the largest posterior probability.
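A minimal sketch of the calculation above in Python (note, as the comments point out, that estimating the likelihood from the neighbours inside the circle is the kNN-flavoured part of this answer):

# Counts taken from the example: 60 objects, 40 GREEN and 20 RED;
# the circle drawn around X contains 1 GREEN and 3 RED objects.
total_green, total_red = 40, 20
green_near_x, red_near_x = 1, 3

prior_green = total_green / 60.0                  # 40/60
prior_red   = total_red   / 60.0                  # 20/60

like_green = green_near_x / float(total_green)    # 1/40
like_red   = red_near_x   / float(total_red)      # 3/20

post_green = prior_green * like_green             # 1/60 ~ 0.017
post_red   = prior_red   * like_red               # 1/20 = 0.05

print("RED" if post_red > post_green else "GREEN")   # -> RED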
Comments:
Isn't this algorithm above more like k-nearest neighbors? – Renaud
This answer is confusing - it mixes kNN (k nearest neighbours) and naive Bayes. – Michal Illich
The answer was proceeding nicely till the likelihood came up. @Yavar has used k-nearest neighbours for calculating the likelihood. How correct is that? If it is, what are some other methods to calculate the likelihood? – wrahool

I realize that this is an old question with an established answer. The reason I'm posting is that the accepted answer has many elements of kNN (k nearest neighbor), a different algorithm.
Both kNN and NaiveBayes are classification algorithms. Conceptually, kNN uses the idea of "nearness" to classify new entities. In kNN 'nearness' is modeled with ideas such as Euclidean Distance or Cosine Distance. By contrast, in NaiveBayes, the concept of 'probability' is used to classify new entities.
Since the question is about Naive Bayes, here's how I'd describe the ideas and steps to someone. I'll try to do it with as few equations as possible, in plain English.

First, Conditional Probability & Bayes' Rule

Before someone can understand and appreciate the nuances of Naive Bayes', they need to know a couple of related concepts first, namely, the idea of Conditional Probability, and Bayes' Rule. (If you are familiar with these concepts, skip to the section titled Getting to Naive Bayes')
Conditional Probability in plain English: What is the probability that something will happen, given that something else has already happened.
Let's say that there is some Outcome O. And some Evidence E. From the way these probabilities are defined: The Probability of having both the Outcome O and Evidence E is: (Probability of O occurring) multiplied by the (Prob of E given that O happened)
One Example to understand Conditional Probability:
Let's say we have a collection of US Senators. Senators can be Democrats or Republicans. They are also either male or female.
If we select one senator completely randomly, what is the probability that this person is a female Democrat? Conditional Probability can help us answer that.
Probability of (Democrat and Female Senator)= Prob(Senator is Democrat) multiplied by Conditional Probability of Being Female given that they are a Democrat.
  P(Democrat & Female) = P(Democrat) x P(Female | Democrat) 
We could compute the exact same thing, the reverse way:
  P(Democrat & Female) = P(Female) x P(Democrat | Female) 
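A tiny numerical check of these two factorizations, with invented counts (say 100 senators, 48 Democrats, 20 women, 12 of whom are Democrats):

p_dem = 48 / 100.0
p_fem = 20 / 100.0
p_fem_given_dem = 12 / 48.0
p_dem_given_fem = 12 / 20.0

print(p_dem * p_fem_given_dem)   # 0.12
print(p_fem * p_dem_given_fem)   # 0.12 -- the same joint probability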

Understanding Bayes Rule

Conceptually, this is a way to go from P(Evidence | Known Outcome) to P(Outcome | Known Evidence). Often we know how frequently some particular evidence is observed given a known outcome. We have to use this known fact to compute the reverse: the chance of that outcome happening, given the evidence.
P(Outcome given that we know some Evidence) = P(Evidence given that we know the Outcome) times Prob(Outcome), scaled by the P(Evidence)
The classic example to understand Bayes' Rule:
Probability of Disease D given Test-positive = 

     Prob(Test is positive|Disease) *P(Disease) 
     _______________________________________________________
     (scaled by) Prob(Testing Positive, with or without the disease)
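With made-up numbers (1% prevalence, a test that comes back positive for 95% of sick people and 5% of healthy people), the rule looks like this in code:

p_disease     = 0.01    # prior: 1% of people have the disease (assumed)
p_pos_given_d = 0.95    # test sensitivity (assumed)
p_pos_given_h = 0.05    # false-positive rate (assumed)

# P(testing positive, with or without the disease) -- the scaling term
p_pos = p_pos_given_d * p_disease + p_pos_given_h * (1 - p_disease)

p_d_given_pos = p_pos_given_d * p_disease / p_pos
print(round(p_d_given_pos, 3))   # ~0.161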
Now, all this was just preamble, to get to Naive Bayes.

Getting to Naive Bayes'

So far, we have talked only about one piece of evidence. In reality, we have to predict an outcome given multiple pieces of evidence. In that case, the math gets very complicated. To get around that complication, one approach is to 'uncouple' the multiple pieces of evidence and treat each piece of evidence as independent. This approach is why this is called naive Bayes.
P(Outcome|Multiple Evidence) = 
P(Evidence1|Outcome) x P(Evidence2|outcome) x ... x P(EvidenceN|outcome) x P(Outcome)
scaled by P(Multiple Evidence)
Many people choose to remember this as:
P(outcome|evidence) = P(Likelihood of Evidence) x Prior prob of outcome
                      ___________________________________________
                           P(Evidence)
Notice a few things about this equation:
  • If the Prob(evidence|outcome) is 1, then we are just multiplying by 1.
  • If the Prob(some particular evidence|outcome) is 0, then the whole prob. becomes 0. If you see contradicting evidence, we can rule out that outcome.
  • Since we divide everything by P(Evidence), we can even get away without calculating it.
  • The intuition behind multiplying by the prior is so that we give high probability to more common outcomes, and low probabilities to unlikely outcomes. These are also called base rates and they are a way to scale our predicted probabilities.
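Here is a small sketch of the formula (with hypothetical priors and likelihoods, and the common P(Evidence) denominator simply skipped, as the third point above allows):

def naive_bayes_score(prior, likelihoods):
    # prior * product of P(evidence_i | outcome); P(evidence) is omitted
    score = prior
    for p in likelihoods:
        score *= p
    return score

scores = {
    "outcome A": naive_bayes_score(0.6, [0.9, 0.5, 0.8]),   # 0.216
    "outcome B": naive_bayes_score(0.4, [0.2, 0.7, 0.3]),   # 0.0168
}
print(max(scores, key=scores.get))   # -> outcome A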

How to Apply NaiveBayes to Predict an Outcome?

Just run the formula above for each possible outcome. Since we are trying to classify, each outcome is called a class and it has a class label. Our job is to look at the evidence, to consider how likely it is to be this class or that class, and assign a label to each entity. Again, we take a very simple approach: The class that has the highest probability is declared the "winner" and that class label gets assigned to that combination of evidences.

Fruit Example

Let's try it out on an example to increase our understanding: The OP asked for a 'fruit' identification example.
Let's say that we have data on 1000 pieces of fruit. They happen to be Banana, Orange, or some Other Fruit. We know 3 characteristics about each fruit:
  1. Whether it is Long
  2. Whether it is Sweet and
  3. If its color is Yellow.
This is our 'training set.' We will use this to predict the type of any new fruit we encounter.
Type           Long | Not Long || Sweet | Not Sweet || Yellow |Not Yellow|Total
             ___________________________________________________________________
Banana      |  400  |    100   || 350   |    150    ||  450   |  50      |  500
Orange      |    0  |    300   || 150   |    150    ||  300   |   0      |  300
Other Fruit |  100  |    100   || 150   |     50    ||   50   | 150      |  200
            ____________________________________________________________________
Total       |  500  |    500   || 650   |    350    ||  800   | 200      | 1000
             ___________________________________________________________________
We can pre-compute a lot of things about our fruit collection.
The so-called "Prior" probabilities. (If we didn't know any of the fruit attributes, this would be our guess.) These are our base rates.
 P(Banana)  = 0.5 (500/1000)
 P(Orange)  = 0.3
 P(Other Fruit) = 0.2
Probability of "Evidence"
p(Long)  = 0.5
P(Sweet)  = 0.65
P(Yellow) = 0.8
Probability of "Likelihood"
P(Long|Banana) = 0.8
P(Long|Orange) = 0  [Oranges are never long in all the fruit we have seen.]
 ....

P(Yellow|Other Fruit) =  50/200 = 0.25
P(Not Yellow|Other Fruit)  = 0.75

Given a Fruit, how to classify it?

Let's say that we are given the properties of an unknown fruit, and asked to classify it. We are told that the fruit is Long, Sweet and Yellow. Is it a Banana? Is it an Orange? Or Is it some Other Fruit?
We can simply run the numbers for each of the 3 outcomes, one by one. Then we choose the highest probability and 'classify' our unknown fruit as belonging to the class that had the highest probability based on our prior evidence (our 1000 fruit training set):
P(Banana|Long, Sweet and Yellow) = P(Long|Banana) x P(Sweet|Banana) x P(Yellow|Banana) x P(Banana)
                                            __________________________________________________
                                                   P(Long) x P(Sweet) x P(Yellow)

                                    0.8 x 0.7 x 0.9 x 0.5
                               =    ______________________ 
                                      P(evidence)

                           = 0.252 / P(evidence)

P(Orange|Long, Sweet and Yellow) = 0

P(Other Fruit|Long, Sweet and Yellow) = P(Long|Other fruit) x P(Sweet|Other fruit) x P(Yellow|Other fruit) x P(Other Fruit)
                                      = (100/200 x 150/200 x 50/200 x 200/1000) / P(evidence)
                                      = 0.01875 / P(evidence)
By an overwhelming margin (0.252 >> 0.01875), we classify this Sweet/Long/Yellow fruit as likely to be a Banana.
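The whole fruit calculation can be reproduced in a few lines of Python directly from the table (a sketch, leaving out the common P(evidence) denominator):

# total, Long, Sweet, Yellow counts per fruit, copied from the table
counts = {
    "Banana":      (500, 400, 350, 450),
    "Orange":      (300,   0, 150, 300),
    "Other Fruit": (200, 100, 150,  50),
}

scores = {}
for fruit, (total, long_, sweet, yellow) in counts.items():
    prior = total / 1000.0
    # P(Long|fruit) * P(Sweet|fruit) * P(Yellow|fruit) * P(fruit)
    scores[fruit] = (long_ / float(total)) * (sweet / float(total)) \
                    * (yellow / float(total)) * prior

print(scores)                         # Banana ~0.252, Orange 0.0, Other ~0.01875
print(max(scores, key=scores.get))    # -> Banana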

Why is Bayes Classifier so popular?

Look at what it eventually comes down to. Just some counting and multiplication. We can pre-compute all these terms, and so classifying becomes easy, quick and efficient.
Let P(evidence) = z. Now we quickly compute the following three quantities.
P(Banana|evidence) = 1/z * Prob(Banana) x Prob(Evidence1|Banana).Prob(Evidence2|Banana)...

P(Orange|Evidence) = 1/z * Prob(Orange) x Prob(Evidence1|Orange).Prob(Evidence2|Orange)...

P(Other Fruit|Evidence) = 1/z * Prob(Other) x Prob(Evidence1|Other).Prob(Evidence2|Other)...
Assign the class label of whichever is the highest number, and you are done.
Despite the name, NaiveBayes turns out to be excellent in certain applications. Text classification is one area where it really shines.
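For instance, a hedged sketch of text classification using scikit-learn's MultinomialNB (the toy messages and labels below are invented, purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "meeting moved to friday",
            "free money win now", "lunch on friday?"]
labels   = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(messages)   # word counts are the "evidence"

clf = MultinomialNB()             # multinomial Naive Bayes with smoothing
clf.fit(X, labels)

print(clf.predict(vec.transform(["free prize friday"])))   # likely ['spam']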
Hope that helps in understanding the concepts behind the Naive Bayes algorithm.
Comments:
Since each P(outcome|evidence) is multiplied by 1/z = 1/P(evidence) (which in the fruit case means each is essentially the probability based solely on previous evidence), would it be correct to say that z doesn't matter at all for Naive Bayes? Which would thus mean that if, say, one ran into a long/sweet/yellow fruit that wasn't a banana, it'd be classified incorrectly. – covariance

Let me try to explain Bayes' rule.
Suppose that you know that 10% of people are smokers. You also know that 90% of smokers are men and 80% of them are above 20 years old.
Now you see someone who is a man and 15 years old. You want to know the chance that he is a smoker:
 X = smoker | he is a man and under 20
As you know, 10% of people are smokers, so your initial guess is 10% (the prior probability, without knowing anything about the person), but the other pieces of evidence (that he is a man and that he is 15) can change this probability.
Each piece of evidence may increase or decrease this chance. For example, the fact that he is a man may increase the chance that he is a smoker, provided that the percentage of men in the whole population is lower than among smokers, for example 40%. In other words, being a man must be a better indicator of being a smoker than of being a non-smoker.
Put differently, you should compare how common a feature is among smokers with how common it is overall. If men make up 90% of the population, then knowing that 90% of smokers are men doesn't affect our prior probability (10% * (90% / 90%) = 10%). But if men make up only 40% of the population, this evidence (that 90% of smokers are men) increases the chance of X (10% * (90% / 40%) = 22.5%), and if men make up 95% of the population, then despite the fact that the percentage of smokers who are men is high (90%), this evidence decreases the chance of X (10% * (90% / 95%) ≈ 9.5%).
So we have (as the overall commonness of a feature gets higher, the chance gets lower):
P(X) = 
P(smoker)* 
(P(being a man | smoker)/P(being a man))*
(P(under 20 | smoker)/ P(under 20))
Notice that in this formula we assumed that being a man and being under 20 are independent features, so we multiplied them; it means that knowing someone is under 20 has no effect on guessing whether they are a man or a woman. But this may not be true; for example, maybe most adolescents in a society are men...
To use this formula in a classifier
The classifier is given some features (man and under 20) and must decide whether he is a smoker or not. It uses the above formula to do that. To obtain the needed probabilities (90%, 10%, 80%, ...) it uses the training set. For example, it counts the people in the training set who are smokers and finds that they make up 10% of the sample. Then, among the smokers, it checks how many of them are men or women, and how many are above 20 or under 20...
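A small sketch of this calculation (P(man) = 40% is the hypothetical figure used above; the overall share of under-20s is not given in the text, so the 30% here is just an assumed number):

p_smoker               = 0.10   # prior: 10% of people smoke
p_man_given_smoker     = 0.90   # 90% of smokers are men
p_man                  = 0.40   # hypothetical share of men overall
p_under20_given_smoker = 0.20   # 80% of smokers are above 20
p_under20              = 0.30   # assumed share of under-20s overall

p_x = (p_smoker
       * (p_man_given_smoker / p_man)
       * (p_under20_given_smoker / p_under20))
print(round(p_x, 3))   # ~0.15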

Ram Narasimhan explained the concept very nicely above; below is an alternative explanation through a code example of Naive Bayes in action.
It uses an example problem from this book on page 351
This is the data set that we will be using (it is reproduced as a CSV file below).
Given this dataset, if we supply the hypothesis = {"Age":'<=30', "Income":"medium", "Student":'yes' , "Creadit_Rating":'fair'}, what is the probability that this person will buy a computer or not?
The code below exactly answers that question.
Just create a file named new_dataset.csv and paste in the following content.
Age,Income,Student,Creadit_Rating,Buys_Computer
<=30,high,no,fair,no
<=30,high,no,excellent,no
31-40,high,no,fair,yes
>40,medium,no,fair,yes
>40,low,yes,fair,yes
>40,low,yes,excellent,no
31-40,low,yes,excellent,yes
<=30,medium,no,fair,no
<=30,low,yes,fair,yes
>40,medium,yes,fair,yes
<=30,medium,yes,excellent,yes
31-40,medium,no,excellent,yes
31-40,high,yes,fair,yes
>40,medium,no,excellent,no
Here is the code (Python 2; note the print statements and the built-in reduce); the comments explain everything we are doing here.
import pandas as pd 
import pprint 

class Classifier():
    data = None
    class_attr = None
    priori = {}
    cp = {}
    hypothesis = None


    def __init__(self,filename=None, class_attr=None ):
        self.data = pd.read_csv(filename, sep=',', header =(0))
        self.class_attr = class_attr

    '''
        probability(class) =    How many times it appears in the column
                             __________________________________________
                                  count of all class attribute values
    '''
    def calculate_priori(self):
        class_values = list(set(self.data[self.class_attr]))
        class_data =  list(self.data[self.class_attr])
        for i in class_values:
            self.priori[i]  = class_data.count(i)/float(len(class_data))
        print "Priori Values: ", self.priori

    '''
        Here we calculate the individual conditional probabilities
        P(evidence | outcome), which feed into:
        P(outcome|evidence) =   P(Likelihood of Evidence) x Prior prob of outcome
                               ___________________________________________
                                                    P(Evidence)
    '''
    def get_cp(self, attr, attr_type, class_value):
        data_attr = list(self.data[attr])
        class_data = list(self.data[self.class_attr])
        # start the count at 1: a crude smoothing so that no
        # conditional probability comes out exactly zero
        total = 1
        for i in range(0, len(data_attr)):
            if class_data[i] == class_value and data_attr[i] == attr_type:
                total += 1
        return total/float(class_data.count(class_value))

    '''
        Here we calculate the likelihood of the evidence and multiply all the
        individual probabilities by the prior:
        P(Outcome|Multiple Evidence) = P(Evidence1|Outcome) x P(Evidence2|Outcome) x ... x P(EvidenceN|Outcome) x P(Outcome)
        scaled by P(Multiple Evidence)
    '''
    def calculate_conditional_probabilities(self, hypothesis):
        for i in self.priori:
            self.cp[i] = {}
            for j in hypothesis:
                self.cp[i].update({ hypothesis[j]: self.get_cp(j, hypothesis[j], i)})
        print "\nCalculated Conditional Probabilities: \n"
        pprint.pprint(self.cp)

    def classify(self):
        print "Result: "
        for i in self.cp:
            print i, " ==> ", reduce(lambda x, y: x*y, self.cp[i].values())*self.priori[i]

if __name__ == "__main__":
    c = Classifier(filename="new_dataset.csv", class_attr="Buys_Computer" )
    c.calculate_priori()
    c.hypothesis = {"Age":'<=30', "Income":"medium", "Student":'yes' , "Creadit_Rating":'fair'}

    c.calculate_conditional_probabilities(c.hypothesis)
    c.classify()
output:
Priori Values:  {'yes': 0.6428571428571429, 'no': 0.35714285714285715}

Calculated Conditional Probabilities: 

{
 'no': {
        '<=30': 0.8,
        'fair': 0.6, 
        'medium': 0.6, 
        'yes': 0.4
        },
'yes': {
        '<=30': 0.3333333333333333,
        'fair': 0.7777777777777778,
        'medium': 0.5555555555555556,
        'yes': 0.7777777777777778
      }
}

Result: 
yes  ==>  0.0720164609053
no  ==>  0.0411428571429
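Note that the two scores above are unnormalized posteriors (the common P(evidence) denominator was skipped, as in the formula earlier). If you want actual probabilities, just normalize them:

yes_score, no_score = 0.0720164609053, 0.0411428571429
z = yes_score + no_score          # plays the role of P(evidence)
print(round(yes_score / z, 3))    # ~0.636 -> classify as "yes"
print(round(no_score / z, 3))     # ~0.364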
Hope it helps in better understanding the problem.
Peace.



How to compute posterior model probabilities ... and why ...

www.math.rug.nl/stat/models/files/green.pdf
University of Groningen
Mar 15, 2011 - the product of posterior model probabilities and model-specific parameter posteriors. – very often the basis for reporting the inference, and in ...

[PDF]Posterior probability

https://astro.uni-bonn.de/.../Lecture3_2012.pdf
Hoher List Observatory
model parameters but it is not a probability density for θ). P(θ|x): old name “inverse probability” modern name “posterior probability”. Starting from observed ...

[PDF]a comparison of the information and posterior probability ...

https://www.princeton.edu/~erp/.../M253.pdf
Princeton University
by G. C. Chow, 1979 (cited by 102)
POSTERIOR PROBABILITY CRITERIA FOR MODEL SELECTION. Gregory C. ... for model selection based on the posterior probability criterion, and points out.

What are posterior probabilities and prior probabilities ...

support.minitab.com/.../modeling.../what-are-posterior-and-prior-probab...
A posterior probability is the probability of assigning observations to groups given the data. A prior probability is the probability that an observation will fall into a ...

[PDF]Posterior Model Probabilities via Path-based Pairwise Priors ...

https://www2.stat.duke.edu/~berger/papers/pathwise.pdf
Duke University
by J. O. Berger (cited by 41)
Posterior Model Probabilities via Path-based. Pairwise Priors. James O. Berger1. Duke University and Statistical and Applied Mathematical Sciences Institute,.

Bayesian model choice based on Monte Carlo estimates of ...

www.sciencedirect.com/science/.../S0167947304002464
ScienceDirect
by P. Congdon, 2006 (cited by 56)
An approach is outlined here that produces posterior model probabilities and hence Bayes factor estimates but not marginal likelihoods. It uses a Monte Carlo ...

Posterior probabilities for choosing a regression model

www.jstor.org/stable/2335274
JSTOR
by A. C. Atkinson, 1978 (cited by 142)
Biometrika (1978), 65, 1, pp. 39-48. With 4 text-figures. Printed in Great Britain. Posterior probabilities for choosing a regression model. By A. C. Atkinson.

Posterior probabilities of components - MATLAB - MathWorks

www.mathworks.com › ... › Gaussian Mixture Models
MathWorks
P = posterior(obj,X) returns the posterior probabilities of each of the k components in the Gaussian mixture ... Cluster Analysis · Gaussian Mixture Models.

[PDF]Comparing and Combining Generative and Posterior ...

www.aclweb.org/.../W04-3209...
Association for Computational Linguistics
by Y. Liu (cited by 35)
Comparing and Combining Generative and Posterior Probability Models: Some Advances in Sentence Boundary Detection in Speech. Yang Liu. ICSI and .
