BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:A fast algorithm with minimax optimal guarantees for topic models 
 with an unknown number of topics - Flori Bunea (Cornell University)
DTSTART:20180625T150000Z
DTEND:20180625T154500Z
UID:TALK107371@talks.cam.ac.uk
CONTACT:INI IT
DESCRIPTION:Topic models have become popular for the analysis of data that
  consists in a collection of n independent multinomial observations. The m
 odel links all cell probabilities\, collected in a  p x n matrix\, via the
  assumption that this matrix can be factorized as the product of  two non-
 negative matrices A and W. Topic models have been originally developed in 
 text mining\, when one browses through n documents\, based on a dictionary
  of p words\, and covering K topics. In this terminology\, the matrix A is
  called the word-topic matrix\, and is the main target of estimation. It c
 an be viewed as a matrix of conditional probabilities\, and it is uniquely
  defined\, under appropriate separability assumptions\, discussed in this 
 talk. Notably\, the unique A is required to satisfy what is commonly known
  as the anchor word assumption\, under which A has an unknown  number of r
 ows respectively proportional to K-dimensional  canonical basis vectors.  
 The indices of such rows are referred to as anchor words. Recent computati
 onally feasible algorithms\, with theoretical guarantees\,  utilize constr
 uctively this assumption by linking the estimation of the set of anchor wo
 rds with that of estimating the K  vertices of a simplex. This crucial ste
 p in the estimation of A requires K to be known\, and  cannot be easily ex
 tended to the more realistic set-up when K is unknown.   		 		In this talk
 \, we  take a different view on anchor word estimation\, and on the estima
 tion of the word-topic matrix A. We propose a new method of estimation in 
 topic models\, that is not a variation on the existing simplex finding alg
 orithms\, and that estimates K from the observed data. We derive new finit
 e sample minimax lower bounds for the estimation of A\, as well as new upp
 er bounds for our proposed estimator. We describe the scenarios where our 
 estimator is minimax adaptive. Our finite sample analysis is valid for any
  n\,  p\, K and document length N\,  and both p and K are allowed to incre
 ase with n\, a situation not handled well by previous analyses.   We illus
 trate\, via simulations\, that the new algorithm is faster and more accura
 te than the current ones\, although  we start out with a computational and
  theoretical disadvantage of not knowing  the  correct number of topics K\
 ,  while we provide the competing methods with the correct value in our si
 mulations.
LOCATION:Seminar Room 1\, Newton Institute
END:VEVENT
END:VCALENDAR