Thursday, March 31, 2016

discriminative and Cluster Analysis Gaussian Mixture Models

On Discriminative vs. Generative classifiers: A comparison ...

ai.stanford.edu/~ang/.../nips01-discriminativegenerative....
Stanford AI Lab
by AY Ng - ‎Cited by 1324 - ‎Related articles
On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes. Andrew Y. Ng. Computer Science Division. University of ...

What is the difference between a Generative and ...

stackoverflow.com/.../what-is-the-difference-between-a-generative-and-d...
May 18, 2009 - This paper is a very popular reference on the subject of discriminative vs. generative classifiers, but it's pretty heavy going. The overall gist is ...

bayesian inference

Comparing and Combining Generative and Posterior ...

www.aclweb.org/.../W04-3209...

Association for Computational Linguistics
by Y Liu - ‎Cited by 35 - ‎Related articles
Comparing and Combining Generative and Posterior Probability Models: Some Advances in Sentence Boundary Detection in Speech. Yang Liu. ICSI and .


http://www.aclweb.org/anthology/W04-3209.pdf


If k-means clustering is a form of Gaussian mixture modeling, can it be used when the data are not normal?

http://stats.stackexchange.com/questions/69424/if-k-means-clustering-is-a-form-of-gaussian-mixture-modeling-can-it-be-used-whe
I'm reading Bishop on the EM algorithm for GMMs and the relationship between GMMs and k-means.
The book says that k-means is a hard-assignment version of GMM. I'm wondering: does that imply that if the data I'm trying to cluster are not Gaussian, I can't use k-means (or at least that it's not suitable)? For example, what if the data are images of handwritten digits, consisting of 8*8 pixels each with value 0 or 1 (and assume the pixels are independent, so it should be a mixture of Bernoullis)?
I'm a little bit confused about this and will appreciate any thoughts.
If you are asking whether it is valid to perform k-means clustering on non-normal data, the answer is yes, provided the data are assumed to be continuous. Binary data aren't continuous. Some people do k-means on such data anyway, which is heuristically permissible but theoretically invalid. – ttnphns Sep 7 '13 at 11:12
   
There's no probability model for k-means so there's no normality assumption to invalidate. (doesn't mean it will work well though) – conjectures Sep 8 '13 at 19:34
   
@conjectures Hmm... But k-means is equivalent to GMM, and GMM assumes normality. – eddie.xie Sep 9 '13 at 1:54
   
@ttnphns Thanks for your answer! So I guess if I use TF-IDF to transform text into scores and make the data continuous, then I can apply k-means and it's valid? – eddie.xie Sep 9 '13 at 1:56
   
I suddenly realize that a GMM is a mixture (sum) of a few Gaussians, and it should be able to express whatever distribution, given enough components. Thus, even if GMM and k-means are equivalent, that does not mean k-means can't be used on non-normal data, because a GMM can express whatever distribution. Is that correct? – eddie.xie Sep 9 '13 at 3:29
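Before the answer, a hedged illustration of the point being debated (none of this code is from the thread; the data and library choices are mine): scikit-learn's KMeans happily partitions clearly non-Gaussian data into hard clusters, while a fitted GaussianMixture returns soft posterior responsibilities. The question is whether the implicit model is defensible, not whether the code runs.

# Illustrative only: non-Gaussian (uniform) blobs, hard vs. soft assignments.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two uniform, i.e. clearly non-Gaussian, blobs in 2-D.
X = np.vstack([rng.uniform(0, 1, size=(200, 2)),
               rng.uniform(3, 4, size=(200, 2))])

hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft = gmm.predict_proba(X)          # posterior responsibilities, rows sum to 1

print(np.bincount(hard))             # hard partition sizes from k-means
print(soft[:3].round(3))             # soft assignments from the mixture model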

1 Answer

In typical EM GMM situations, one does take variance and covariance into account. This is not done in k-means.
But indeed, one of the popular heuristics for k-means (note: k-means is a problem, not an algorithm) - the Lloyd algorithm - is essentially an EM algorithm, using a centroid model (without variance) and hard assignments.
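As a minimal NumPy sketch of the Lloyd heuristic just described (an EM-style loop with a centroid-only model and hard assignments), assuming the function name, initialization, and stopping rule are my own choices rather than the answerer's code:

import numpy as np

def lloyd(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # "E-step": hard-assign each point to its nearest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # "M-step": recompute each centroid as the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids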
When doing k-means-style clustering (i.e. variance minimization), you
  • coincidentally minimize squared Euclidean distance, because the WCSS (within-cluster sum of squares) variance contribution equals the squared Euclidean distance
  • coincidentally assign objects to the nearest cluster by Euclidean distance, because the sqrt function is monotone (note that the mean does not optimize Euclidean distances, but the WCSS objective)
  • represent clusters using a centroid only
  • get Voronoi-cell-shaped clusters, i.e. polygons
  • get a method that works best with spherical clusters
The k-means objective function can be formalized as

$$\operatorname*{arg\,min}_{S} \; \sum_{i=1}^{k} \sum_{x_j \in S_i} \sum_{d=1}^{D} \left( x_{jd} - \mu_{id} \right)^2$$

where $S = \{S_1, \dots, S_k\}$ ranges over all possible partitionings of the data set into $k$ partitions, $D$ is the data set dimensionality, $x_{jd}$ is the coordinate of the $j$-th instance in dimension $d$, and $\mu_{id}$ is the corresponding coordinate of the centroid of cluster $S_i$.
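A small sketch of this objective, assuming NumPy arrays (the helper name is mine): it computes the within-cluster sum of squares (WCSS) for a given partition, with the same indexing as the formula.

import numpy as np

def wcss(X, labels, k):
    # Sum, over clusters, of squared deviations from the cluster centroid.
    total = 0.0
    for i in range(k):
        cluster = X[labels == i]
        if len(cluster) == 0:
            continue
        mu_i = cluster.mean(axis=0)
        total += ((cluster - mu_i) ** 2).sum()   # sum over points and dimensions
    return total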
It is commonly said that k-means assumes spherical clusters. It is also commonly acknowledged that k-means clusters are Voronoi cells, i.e. not spherical. Both are correct, and both are wrong. First of all, the clusters are not complete Voronoi cells, but only the known objects therein. There is no need to consider the dead space in between the clusters to be part of either cluster, as having an object there would have affected the algorithm's result. But it is not much better to call them "spherical" either, just because the Euclidean distance is spherical. K-means doesn't care about Euclidean distance. All it is is a heuristic to minimize the variances. And that is actually what you should consider k-means to be: variance minimization.
   
Let me suggest you refine some of your expressions a bit, for more accuracy. For instance, what does "minimize squared Euclidean distance" or "minimize the variances" mean? There should be words like "sum of" or "pooled" in there, because we have 2+ clusters, shouldn't there? – ttnphns Sep 8 '13 at 9:26
   
BTW, since k-means minimizes the pooled within-cluster sum of d^2 divided by the number of objects in the respective cluster, your point "coincidentally minimize Euclidean distance, because the sqrt function is monotone" is, to be precise, not correct. – ttnphns Sep 8 '13 at 9:47
   
The proper objective function, for which you can prove convergence, is WCSS, the within-cluster sum of squares. And indeed, it doesn't minimize Euclidean distances, but the nearest-centroid-by-Euclidean-distance assignment is also the WCSS-optimal assignment. – Anony-Mousse Sep 8 '13 at 14:40
   
Your wording remains unfortunately dubious. What does the phrase "minimize squared Euclidean distance, because WCSS variance contribution = squared euclidean distance" mean? Are you saying "squared d's between the objects in clusters get minimized because the WCSS of deviations gets minimized", or just "the WCSS of deviations gets minimized, which - the deviations - are Euclidean distances by nature"? Or something else? – ttnphns Sep 8 '13 at 16:18
   
If you want the math part, use the equation. If you want the intuition, you have to live with ambiguity and figure out the details yourself. I have yet to see a mathematically 100% correct term (including all preliminaries) that is intuitive. – Anony-Mousse Sep 8 '13 at 18:31



How to compute posterior model probabilities ... and why ...

www.math.rug.nl/stat/models/files/green.pdf

University of Groningen
Mar 15, 2011 - the product of posterior model probabilities and model-specific parameter posteriors. – very often the basis for reporting the inference, and in ...

[PDF]Posterior probability

https://astro.uni-bonn.de/.../Lecture3_2012.pdf

Hoher List Observatory
model parameters but it is not a probability density for θ). P(θ|x): old name “inverse probability” modern name “posterior probability”. Starting from observed ...

[PDF]a comparison of the information and posterior probability ...

https://www.princeton.edu/~erp/.../M253.pdf

Princeton University
by GC Chow - ‎1979 - ‎Cited by 102 - ‎Related articles
POSTERIOR PROBABILITY CRITERIA FOR MODEL SELECTION. Gregory C. ... for model selection based on the posterior probability criterion, and points out.

What are posterior probabilities and prior probabilities ...

support.minitab.com/.../modeling.../what-are-posterior-and-prior-probab...

A posterior probability is the probability of assigning observations to groups given the data. A prior probability is the probability that an observation will fall into a ...
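To make that prior-to-posterior update concrete, a tiny sketch with made-up numbers (two groups, one observed data point; the values are purely illustrative, not from the Minitab page):

# Hypothetical numbers, purely to illustrate Bayes' rule:
prior = {"A": 0.7, "B": 0.3}          # P(group) before seeing the data
likelihood = {"A": 0.2, "B": 0.6}     # P(data | group), assumed values

evidence = sum(prior[g] * likelihood[g] for g in prior)
posterior = {g: prior[g] * likelihood[g] / evidence for g in prior}
print(posterior)   # {'A': 0.4375, 'B': 0.5625}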

[PDF]Posterior Model Probabilities via Path-based Pairwise Priors ...

https://www2.stat.duke.edu/~berger/papers/pathwise.pdf

Duke University
by JO Berger - ‎Cited by 41 - ‎Related articles
Posterior Model Probabilities via Path-based. Pairwise Priors. James O. Berger1. Duke University and Statistical and Applied Mathematical Sciences Institute,.

Bayesian model choice based on Monte Carlo estimates of ...

www.sciencedirect.com/science/.../S0167947304002464

ScienceDirect
by P Congdon - ‎2006 - ‎Cited by 56 - ‎Related articles
An approach is outlined here that produces posterior model probabilities and hence Bayes factor estimates but not marginal likelihoods. It uses a Monte Carlo ...

Posterior probabilities for choosing a regression model

www.jstor.org/stable/2335274

JSTOR
by AC Atkinson - ‎1978 - ‎Cited by 142 - ‎Related articles
Biometrika (1978), 65, 1, pp. 39-48. 39. With 4 text-figures. Printed in Great Britain. Posterior probabilities for choosing a regression model. BY A. C. ATKINSON.

Posterior probabilities of components - MATLAB - MathWorks

www.mathworks.com › ... › Gaussian Mixture Models

MathWorks
P = posterior(obj,X) returns the posterior probabilities of each of the k components in the Gaussian mixture ... Cluster Analysis · Gaussian Mixture Models.
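For readers not using MATLAB, a rough analogue (my library choice, not part of the MathWorks documentation): scikit-learn's GaussianMixture exposes the corresponding quantity through predict_proba, which returns an n-by-k matrix of component posterior probabilities. A minimal sketch with synthetic data:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Two well-separated Gaussian blobs in 2-D.
X = np.vstack([rng.normal(0, 1, size=(150, 2)),
               rng.normal(5, 1, size=(150, 2))])

gm = GaussianMixture(n_components=2, random_state=1).fit(X)
P = gm.predict_proba(X)   # posterior probability of each component, per point
print(P[:3].round(3))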

