Giorgos Zacharia | LinkedIn
https://www.linkedin.com/in/giorgoszacharia
Greater Boston Area - CTO at KAYAK - KAYAK
Join LinkedIn and access Giorgos’s full profile.Giorgos Zacharia Appointed Chief Technology Officer - Kayak
https://www.kayak.com/.../giorgos-zacharia-appointed-chief-techn...
Jan 22, 2014 - We are happy to announce that Chief Product Officer Giorgos Zachariahas been appointed Chief Technology Officer (CTO), replacing KAYAK ...
Kayak
REGULARIZED ALGORITHMS FOR RANKING, AND MANIFOLD LEARNING FOR RELATED TASKS
By Giorgos Zacharia SB Mathematics (1998) SB Computer Science, Minor in Economics (1998) SM Media Arts and Sciences (1999) Massachusetts Institute of Technology Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY February 2009 © Giorgos Zacharia 2009. All rights reserved The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created. Signature of Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Department of Electrical Engineering and Computer Science January 30, 2009 Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tomaso Poggio Eugene McDermott Professor of Brain and Cognitive Sciences Thesis Supervisor Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Terry P. Orlando Chairman, Departmental Committee on Graduate Students
‐ 2 ‐
REGULARIZED ALGORITHMS FOR RANKING, AND MANIFOLD LEARNING FOR RELATED TASKS
By Giorgos Zacharia Submitted to the Department of Electrical Engineering and Computer Science, on January 30, 2009, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science ABSTRACT
This thesis describes an investigation of regularized algorithms for ranking problems for user preferences and information retrieval problems. We utilize regularized manifold algorithms to appropriately incorporate data from related tasks. This investigation was inspired by personalization challenges in both user preference and information retrieval ranking problems. We formulate the ranking problem of related tasks as a special case of semi‐supervised learning. We examine how to incorporate instances from related tasks, with the appropriate penalty in the loss function to optimize performance on the hold out sets. We present a regularized manifold approach that allows us to learn a distance metric for the different instances directly from the data. This approach allows incorporation of information from related task examples, without prior estimation of cross‐task coefficient covariances. We also present applications of ranking problems in two text analysis problems: a) Supervise content‐word learning, and b) Company Entity matching for record linkage problems. Thesis Supervisor: Tomaso Poggio Title: Eugene McDermott Professor of Brain and Cognitive Sciences
‐ 3 ‐
AKNOWLEDGEMENTS
I would like to thank my Thesis supervisor, Professor Tommy Poggio, and my Thesis committee members, Professors Patrick Henry Winston, and Tommi Jaakkola. Their guidance and advice were invaluable in bringing this thesis to completion.
I would like to thank Tommy for providing the stimulating environment of the CBCL, and all the CBCL members with which I interacted, or exchanged insightful ideas throughout my sporadic PhD life at MIT. I also have to thank Gadi for the Socratic discussions, and for keeping the espresso machine, and other functions at the CBCL operational.
I would like to acknowledge the support of my Open Ratings, Emporics, and Kayak friends, but this thesis would not be complete without specifically acknowledging the help of Olga Simek, Costas Boussios, Alexandros Kyriakides, and Theodoros Evgeniou.
I dedicate my thesis to my family: my parents Maro and Christakis, my siblings Dr Margarita Zachariou, and soon to be Dr Yiannis Zachariou, and my wife Snejina, and daughter Maria.
‐ 4 ‐
TABLE OF CONTENTS
Abstract ....................................................................................................... ‐ 2 ‐
Table of Contents ........................................................................................ ‐ 4 ‐
Motivation ............................................................................................. ‐ 7 ‐
Ranking as an approach to Information Overload ................................ ‐ 8 ‐
Outline of the Thesis ............................................................................. ‐ 9 ‐
Contributions of the Thesis ................................................................. ‐ 11 ‐
Chapter 1: Regularized Ranking for User Preference Modeling ............... ‐ 13 ‐
Introduction ........................................................................................ ‐ 13 ‐
Related Work ...................................................................................... ‐ 15 ‐
A Theoretical Framework for Modeling Preferences ......................... ‐ 20 ‐
Generalization to Nonlinear models ................................................... ‐ 29 ‐
Estimating the regularization penalty ................................................. ‐ 33 ‐
Adding Positivity Constraints .............................................................. ‐ 34 ‐
Handling Heterogeneity ...................................................................... ‐ 38 ‐
Experiments ........................................................................................ ‐ 40 ‐
Design of Simulations .......................................................................... ‐ 40 ‐
Experimental Results .......................................................................... ‐ 42 ‐
Estimation of Nonlinear Models ......................................................... ‐ 46 ‐
Summary and Contributions ............................................................... ‐ 51 ‐
‐ 5 ‐
Appendix ............................................................................................. ‐ 56 ‐
Chapter 2: Regularized Manifolds for Learning from Related Tasks ........ ‐ 60 ‐
Introduction ........................................................................................ ‐ 60 ‐
Related Work ...................................................................................... ‐ 61 ‐
Semi‐Supervised Learning Intuition .................................................... ‐ 61 ‐
Penalized Regularized Least Squares (PRLSC) for Related Tasks ........ ‐ 62 ‐
Penalized Laplacian RLSC (PLapRLS), for Related Tasks ...................... ‐ 64 ‐
Evaluation ........................................................................................... ‐ 67 ‐
Theoretical similarities with HB, and multitask kernel methods ........ ‐ 75 ‐
Summary and Contributions ............................................................... ‐ 77 ‐
Chapter 3: Supervised Content Word Extraction ..................................... ‐ 78 ‐
Introduction ........................................................................................ ‐ 78 ‐
Overview ............................................................................................. ‐ 79 ‐
WordEx ................................................................................................ ‐ 80 ‐
The feature vector .............................................................................. ‐ 82 ‐
SentEx .................................................................................................. ‐ 85 ‐
Evaluation of Applications .................................................................. ‐ 85 ‐
Summary and Contributions ............................................................... ‐ 92 ‐
Chapter 4: Company Entity Matching ....................................................... ‐ 94 ‐
Introduction ........................................................................................ ‐ 94 ‐
The Company Record Matching problem ........................................... ‐ 94 ‐
Related Work ...................................................................................... ‐ 97 ‐
‐ 6 ‐
Information Retrieval for Company Record Matching ....................... ‐ 98 ‐
Implementation details ....................................................................... ‐ 99 ‐
Supervised ranking learning for matching ........................................ ‐ 101 ‐
Experiments ...................................................................................... ‐ 102 ‐
Summary and Contributions ............................................................. ‐ 103 ‐
Summary and Thesis Contributions ........................................................ ‐ 106 ‐
Appendix I: Ranking Metrics and Loss functions .................................... ‐ 108 ‐
Kendall’s τ coefficient ....................................................................... ‐ 108 ‐
Spearman’s rank correlation ............................................................. ‐ 108 ‐
Mean Average Precision ................................................................... ‐ 109 ‐
Appendix II: String Similarity Metrics ..................................................... ‐ 112 ‐
Levenshtein Distance ........................................................................ ‐ 112 ‐
Jaro‐Winkler string similarity ............................................................ ‐ 112 ‐
Soundex ............................................................................................. ‐ 113 ‐
Metaphone ....................................................................................... ‐ 114 ‐
Lucene Similarity ............................................................................... ‐ 115 ‐
Information Gain ............................................................................... ‐ 117 ‐
TFIDF ................................................................................................. ‐ 118 ‐
Bibliography ............................................................................................ ‐ 120 ‐
‐ 7 ‐
INTRODUCTION
Motivation The continuous explosive availability of computers and the internet has changed the way humans interact with information. The excess amount of information has made a difficult and time consuming task the human processing and filtering through the available information to find what human users are looking for. When users search for information, they would like to receive the most likely results to satisfy their query first.
Internet search engines alleviate the problem for users when searching for unstructured keyword matches by trying to present the most reputable or through personalization, the most appropriate results for that user first. The motivation for this thesis is to alleviate the information overload problem when the users search for structured, multi‐attribute products, documents or other artifacts of information. When the user searches for the ideal multi‐attribute product, like a car configuration, a restaurant with ratings, an LCD screen or other electronics, with multiple specifications, and other complex purchasing decisions, where price is not the only driving factor, the users expect to receive the candidate products in the order of their preference. Also, when a user reads through a long article on their smart phone, they prefer to read a summary in the smaller screen, rather than a long multi‐page article. When the user expects to read a summarized article, they again expect to receive the important information to them first. There are many similar problems such as these ones, where the information overload problem can be alleviated with intelligent ranking algorithms, so that the human user gets to the right information first.
No comments:
Post a Comment