Friday, April 29, 2016

The Equivalence of Logistic Regression and Maximum Entropy Modeling

The Equivalence of Logistic Regression and Maximum Entropy Modeling

http://web.engr.illinois.edu/~mqian2/upload/research/notes/The%20Equivalence%20of%20Logistic%20Regression%20and%20Maximum%20Entropy%20Modeling.pdf

B. Dual Problem The dual problem of maximum entropy modeling is an unconstrained optimization problem


The Equivalence of Logistic Regression and Maximum Entropy Modeling

Posterior probability

From Wikipedia, the free encyclopedia
In Bayesian statistics, the posterior probability of a random event or an uncertain proposition is the conditional probability that is assigned after the relevant evidence or background is taken into account. Similarly, the posterior probability distribution is the probability distribution of an unknown quantity, treated as a random variable,conditional on the evidence obtained from an experiment or survey. "Posterior", in this context, means after taking into account the relevant evidence related to the particular case being examined.

Definition[edit]

The posterior probability is the probability of the parameters \theta given the evidence Xp(\theta|X).
It contrasts with the likelihood function, which is the probability of the evidence given the parameters: p(X|\theta).
The two are related as follows:
Let us have a prior belief that the probability distribution function is p(\theta) and observations x with the likelihood p(x|\theta), then the posterior probability is defined as
p(\theta|x) = \frac{p(x|\theta)p(\theta)}{p(x)}.[1]
The posterior probability can be written in the memorable form as
\text{Posterior probability} \propto \text{Likelihood} \times \text{Prior probability}

The posterior probability is the probability of the parameters \theta given the evidence X: p(\theta|X). the likelihood function, which is the probability of the evidence given the parameters: p(X|\theta).

The Equivalence of Logistic Regression and Maximum Entropy Modeling

http://web.engr.illinois.edu/~mqian2/upload/research/notes/The%20Equivalence%20of%20Logistic%20Regression%20and%20Maximum%20Entropy%20Modeling.pdf

B. Dual Problem The dual problem of maximum entropy modeling is an unconstrained optimization problem


The Equivalence of Logistic Regression and Maximum Entropy Modeling

Posterior probability

From Wikipedia, the free encyclopedia
In Bayesian statistics, the posterior probability of a random event or an uncertain proposition is the conditional probability that is assigned after the relevant evidence or background is taken into account. Similarly, the posterior probability distribution is the probability distribution of an unknown quantity, treated as a random variable,conditional on the evidence obtained from an experiment or survey. "Posterior", in this context, means after taking into account the relevant evidence related to the particular case being examined.

Definition[edit]

The posterior probability is the probability of the parameters \theta given the evidence Xp(\theta|X).
It contrasts with the likelihood function, which is the probability of the evidence given the parameters: p(X|\theta).
The two are related as follows:
Let us have a prior belief that the probability distribution function is p(\theta) and observations x with the likelihood p(x|\theta), then the posterior probability is defined as
p(\theta|x) = \frac{p(x|\theta)p(\theta)}{p(x)}.[1]
The posterior probability can be written in the memorable form as
\text{Posterior probability} \propto \text{Likelihood} \times \text{Prior probability}

Wednesday, April 27, 2016

Programming Collective Intelligence

https://kastnerkyle.github.io/posts/bad-speech-synthesis-made-simple/
Programming Collective Intelligence
Building Smart Web 2.0 Applications
Publisher: O'Reilly Media
Final Release Date: August 2007
Pages: 362


Want to tap the power behind search rankings, product recommendations, social bookmarking, and online matchmaking? This fascinating book demonstrates how you can build Web 2.0 applications to mine the enormous amount of data created by people on the Internet. With the sophisticated algorithms in this book, you can write smart programs to access interesting datasets from other web sites, collect data from users of your own applications, and analyze and understand the data once you've found it.

Programming Collective Intelligence takes you into the world of machine learning and statistics, and explains how to draw conclusions about user experience, marketing, personal tastes, and human behavior in general -- all from information that you and others collect every day. Each algorithm is described clearly and concisely with code that can immediately be used on your web site, blog, Wiki, or specialized application. This book explains:
  • Collaborative filtering techniques that enable online retailers to recommend products or media
  • Methods of clustering to detect groups of similar items in a large dataset
  • Search engine features -- crawlers, indexers, query engines, and the PageRank algorithm
  • Optimization algorithms that search millions of possible solutions to a problem and choose the best one
  • Bayesian filtering, used in spam filters for classifying documents based on word types and other features
  • Using decision trees not only to make predictions, but to model the way decisions are made
  • Predicting numerical values rather than classifications to build price models
  • Support vector machines to match people in online dating sites
  • Non-negative matrix factorization to find the independent features in a dataset
  • Evolving intelligence for problem solving -- how a computer develops its skill by improving its own code the more it plays a game
Each chapter includes exercises for extending the algorithms to make them more powerful. Go beyond simple database-backed applications and put the wealth of Internet data to work for you.

"Bravo! I cannot think of a better way for a developer to first learn these algorithms and methods, nor can I think of a better way for me (an old AI dog) to reinvigorate my knowledge of the details."
-- Dan Russell, Google

"Toby's book does a great job of breaking down the complex subject matter of machine-learning algorithms into practical, easy-to-understand examples that can be directly applied to analysis of social interaction across the Web today. If I had this book two years ago, it would have saved precious time going down some fruitless paths."
-- Tim Wolters, CTO, Collective Intellect

gradient regression.

Optimal learning rates for Kernel Conjugate Gradient regression

papers.nips.cc/.../4077-optimal-learning-rates-for-kernel-conjugate-gradi...

by G Blanchard - ‎2010 - ‎Cited by 11 - ‎Related articles
Optimal learning rates for Kernel Conjugate Gradient regression. Gilles Blanchard. Mathematics Institute, University of Potsdam. Am neuen Palais 10, 14469 ...

Optimal learning rates for Kernel Conjugate Gradient regression

arxiv.org › math

arXiv
by G Blanchard - ‎2010 - ‎Cited by 11 - ‎Related articles
Title: Optimal learning rates for Kernel Conjugate Gradient regression. Authors: Gilles Blanchard, Nicole Kraemer. (Submitted on 29 Sep 2010). Abstract: We ...

A GRADiENT--REGRESSION SEARCH PROCEDURE FOR ...

dl.acm.org/ft_gateway.cfm?id...

Association for Computing Machinery
by WE Biles - ‎1974 - ‎Cited by 41 - ‎Related articles
GRADiENT--REGRESSION SEARCH PROCEDURE. FOR SIMULATION EXPERIMENTATION. William Ernest Biles. University of Notre Darne. ABST RAG T.

A gradient—regression search procedure for simulation experimentation

https://www.researchgate.net/.../234831321_A_gradient-regre...
ResearchGate
gradient—regression search procedure for simulation experimentation on ResearchGate, the professional network for scientists.

[PDF]13.7 The Outer-Product-of-the-Gradient Regression 477 equal to the ...

qed.econ.queensu.ca/dm-book/second/pdf/pg477.pdf

Queen's University
13.7 The Outer-Product-of-the-Gradient Regression. 477 equal to the classical Wald test statistic and share with it the property of being based exclusively on the ...

A gradient---regression search procedure for simulation ...

www.academia.edu/.../A_gradient---regression_search_proce...
Academia.edu
By William Biles in Constrained System and Boolean Satisfiability. This paper examines a gradient search procedure for simulation experimentation with ...

Optimal learning rates for Kernel Conjugate Gradient regression

citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.231.648
by G Blanchard - ‎Cited by 11 - ‎Related articles
CiteSeerX - Document Details (Isaac Councill, Lee Giles, Pradeep Teregowda): We prove rates of convergence in the statistical sense for kernel-based least ...

[PDF]Face Recognition using Image Gradient Regression Approach

ijcsi.org/papers/IJCSI-10-2-3-178-183.pdf

Image Gradient Regression Approach. Dr. Shafiqul Abidin1, Ujjawal Jain2 , Saurabh Gupta3 and Prof. (Dr.) S. C. Gupta4. 1 Associate Professor, Department of ...

(Processing) Gradient descent algorithm won't work..(Linear regression)

stackoverflow.com/.../processing-gradient-descent-algorithm-wont-work...

Dec 23, 2014 - ... if (keyCode==UP) { alpha+=0.00001; } if (keyCode==DOWN) { alpha-=0.00001; } }. processing gradient regression linear gradient-descent ...

Linear regression | Garbage In Garbage Out - Kyle Kastner

https://kastnerkyle.github.io/posts/linear-regression/

Aug 4, 2013 - Let's define the previous statements in code, and perform gradient regression. The linear_cost and linear_cost_grad functions look slightl


Linear regression


When presented with an unknown dataset, it is very common to attempt to find trends or patterns. The most basic form of this is visual inspection - how is the data trending? Does it repeat in cycles? Can we predict future data given some past events? The mathematical approach to this "trend finding" is called regression.
First, the import statements.
In [1]:
import numpy as np                                                              
import matplotlib.pyplot as plt
%matplotlib inline
For this example problem, we will need some data. One of the best tools I have found for generating datasets is the sklearn.datasets toolbox, which features built-in datasets, dataset downloaders, and tools for generating datasets. In this case, I will use the make_regression function to generate 1000 samples, with noise and bias.
In [2]:
from sklearn import datasets 
X, y = datasets.make_regression(n_samples=1000,n_features=1,
                                n_informative=1,noise=15,
                                bias=1000,random_state=0)

_ = plt.scatter(X, y, color='darkred', marker='o', edgecolors='none')
_ = plt.title('Sample dataset')
As we can see, this data is distributed in a fairly linear fashion. But how can we discover the underlying mathematical function, without human intervention? One simple method for this is linear regression. In this case, we model Y with a function:
Y=θ0+θ1x1
In [3]:
#Add bias terms, so that the first column of X is all ones.
#This should make the data match the model.
X = np.hstack((np.ones((X.shape[0], 1)), X))

#Make y into a 2D array to match X
y = y[:, np.newaxis]
To fit the generic function described above to our particular dataset, we need to find values for theta that minimize some cost function, such as mean squared error.
An optimization technique known as gradient descent can be used in order to find values for theta. The easiest way to do this is to define a cost, the gradient of the cost with respect to each theta, and an update rule.

Math. Not Even Once.

To perform optimization by gradient descent, we will need the following formulas. In all of these formulas, h(x) is the prediction of x by the linear regressor.
Cost (m is the length of X):
12mi=1m(hθ(xi)yi)2
Gradient of cost with respect to theta (m is the length of X):
1m(hθ(x)y)
Update rule (for every j in X, m is the length of X):
θj=θj+αi=1m(yihθ(xi))xji

The Thicke of It

Let's define the previous statements in code, and perform gradient regression. The linear_cost and linear_cost_grad functions look slightly different - this is because the code versions are written in matrix form, but are otherwise equivalent to the versions defined above.
In [4]:
#Initialize theta to zeros                                                      
learning_rate = alpha = 0.01                                                                    
iters = 1000      

def linear_cost(theta,X,y):                                                     
    m = y.shape[0]                                                              
    return 1. / (2. * m) * np.sum((np.dot(X, theta) - y) ** 2.)                            
                                                                                
def linear_cost_grad(theta,X,y):                                                
    m = y.shape[0]                                                              
    return 1. / m * np.dot(X.T, (y - np.dot(X,theta)))                                 
                                                                                
def gradient_descent(X,y,alpha,iters):                                    
    m = y.shape[0]                                                              
    all_cost = []
    #Initialize theta to zeros
    all_theta = [np.zeros((X.shape[1],1))]
    for i in range(iters):                                                      
        all_cost.append(linear_cost(all_theta[-1], X, y))
        all_theta.append(all_theta[-1] + float(alpha) * linear_cost_grad(all_theta[-1], X, y))                       
    return all_theta,all_cost                                                       
                                                                             
#Perform linear regression via gradient descent                                
all_theta, all_cost = gradient_descent(X, y, alpha, iters)

Are We There Yet?

With code written, how can we tell that it is functioning correctly? One way is the look at the cost over each iteration of the gradient descent algorithm. If this value is decreasing as iterations increase, the values of theta are approaching their optimimum values, as defined by minimizing the cost function. This type of optimization is known as convex optimization, which is an entire field of study to its own. Gradient regression, however, doesn't require much study.
We have all seen gradient descent in action - water always flows downhill! At the top of the hill, the water begins to trickle downward along the slope of the hill, always following the path of steepest descent. To match the analogy, at each point, we calculate the cost and the gradient of the cost, trending down slope along the miniumum gradient. Since we computer users cannot use continuous methods, like mother nature can, we take steps, calculating the cost and cost gradient each time. This step size is set by a parameter called the learning_rate.
More advanced methods of optimization attempt to avoid local minima, using additional parameters like momentum, or different optimization methods which look at different factors, such as gradients-of-gradients. An example of local minima would be a lake on top of a mountain. The lake is the lowest nearby point when compared to the peak, but the bottom of the mountain is much lower than the lake itself. We want to reach the bottom of the mountain, rather than being stuck in the lake.
Now that funtime is over, let's look at a plot of cost vs. number of iterations. We can see this plot decrease, asymptotically approaching zero. The differences between stopping at iteration 400 and stopping at iteration 1000 are fairly minimal, and many algorithms use minimal decrease settings to stop iterating when the cost stops decreasing noticeably.
In [5]:
plt.plot(all_cost, color='steelblue')
plt.title('Cost over iterations')
Out[5]:
<matplotlib.text.Text at 0x107d2b978>
There are some other ways to visualize the information seen above, such as a 2D contour plot with an overlay of the gradient path. To plot both of this, we must calculate a cost mesh across a combination of (theta0, theta1) values.
In [6]:
theta0_vals = np.linspace(-1200, 1200, 100)
theta1_vals = np.linspace(-150, 150, 100)
contours = np.array([linear_cost([theta0,theta1],X,y)
                     for theta0 in theta0_vals 
                     for theta1 in theta1_vals]).reshape((theta0_vals.shape[0],
                                                          theta1_vals.shape[0]))
The 2D plot includes gradient descent - what path was taken to arrive at the local minima?
In [7]:
fig, ax = plt.subplots()
best_thetas = all_theta[-1]
#Adding best_thetas[1] to get contour plot in the correct spot
ax.contourf(theta1_vals + best_thetas[1], theta0_vals, contours, 
            np.logspace(5, 7, 50), cmap=plt.cm.Spectral_r)
ax.set_xlabel('theta1')
ax.set_ylabel('theta0')
ax.autoscale(False)
ax.set_ylim([850, 1150])
ax.set_xlim([-50, 200])
t = np.array(all_theta)
#Subsample by 20 and plot scatter points on top of contours
ax.scatter(t[::20, 1], t[::20, 0], zorder=2, color='steelblue', marker="^")
#Plot X for final theta values
ax.scatter(best_thetas[1], best_thetas[0], zorder=3, color='black', marker='x', s=75)
plt.title('Path taken for gradient descent')
Out[7]:
<matplotlib.text.Text at 0x107d97e10>

Blurred Lines

With all of this in mind, let's take a look at how closely the final regression values represent the original dataset.
In [8]:
#Use only the second term - remember that we added a bias term earlier!
plt.scatter(X[:, 1], y, color='darkred', edgecolor='none',)
plt.plot(X[:, 1], np.dot(X, best_thetas), color='steelblue')
plt.title('Gradient regression')
Out[8]:
<matplotlib.text.Text at 0x107c4e5f8>
As seen above, gradient descent has accurately estimated the generating function for this dataset. This is a toy example, but many more advanced methods also employ gradient descent as part of the learning process. Though this example used simple, low dimensional data, gradient descent regression can also work on higher dimensional data. As a bonus, we can also solve linear regression in closed form with one line. See the Wikipedia entry under "Least-Squares Estimators and Related Techniques".
In [9]:
closed_form_best_thetas = np.linalg.pinv(np.dot(X.T, X)).dot(X.T).dot(y)
plt.scatter(X[:, 1], y, color='darkred', edgecolor='none')
plt.plot(X[:, 1], np.dot(X, closed_form_best_thetas), color='steelblue')
plt.title('Gradient regression')
Out[9]:
<matplotlib.text.Text at 0x108f62518>
Linear regression is an excellent first approach for many problems - simple to implement, fast, and often good enough to get the job done! For an example of using linear regression in sklearn, follow this link