BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Deep Web Data: Analysis\, Extraction\, and Modelling - Prof. Pierr
 e Senellart - Telecom Paris Tech
DTSTART:20100426T100000Z
DTEND:20100426T110000Z
UID:TALK24675@talks.cam.ac.uk
CONTACT:Microsoft Research Cambridge Talks Admins
DESCRIPTION:*Abstract:* The traditional way for Web search engines to retr
 ieve and index data from the Web has been to crawl its hyperlink structure
 . This approach cannot capture data of the deep Web (also known as hidden 
 Web or invisible Web)\, the huge amount of content available on the Web th
 at lies behind Web forms or Web services. The focus of this talk is to dis
 cuss automatic and unsupervised methods for analyzing\, extracting\, and m
 odelling Web data\, given some initial domain of interest. Strong emphasis
  will be placed on presenting applied and theoretical open problems\, solu
 tions to which would greatly help in understanding data of the deep Web.
  We first introduce classical methods for matching Web forms with
  concepts from an ontology\, and investigate how static analysis of JavaSc
 ript programs could be used to improve the understanding of an HTML form.
  We next present an unsupervised approach to information extr
 action over deep Web result pages and highlight its limitations\, emphasiz
 ing in particular the need for a probabilistic representation of the extr
 acted data. This leads us to consider models for probabilistic trees. Afte
 r a quick survey of the literature on probabilistic XML\, we will discuss 
 interesting questions in verification aspects\, in particular connecting t
 he notion of probabilistic database with that of probabilistic schema.\n\n
 *Biography:* Dr. Pierre Senellart is an Associate Professor in the Compute
 r Science and Networking department at Télécom ParisTech\, the leading
  French engineering school specializing in information technology. He
  is an alumnus of the École normale supérieure and obtained his M.Sc.
  (2003) and his Ph.D. (2007) in Computer Science from Université
  Paris-Sud\, studying
  under the supervision of Serge Abiteboul. Pierre Senellart has published 
 articles in internationally renowned conferences and journals (PODS\, AAAI
 \, VLDB Journal\, Journal of the ACM\, etc.). He has been a member of the
  program committees of ECML/PKDD\, WWW\, VLDB\, and ICDE\, a member of
  the repeat
 ability committee of SIGMOD\, and the organizer of the SIGMOD 2010 program
 ming contest. He is also the Information Director of the Journal of the AC
 M. His research interests focus on theoretical aspects of database
  management systems and the World Wide Web\, and more specifically on the
  intensional indexing of the deep Web\, probabilistic XML databases\, and
  graph
  mining. He also has an interest in natural language processing\, and has 
 been collaborating with SYSTRAN\, the leading machine translation company.
LOCATION:Lecture-room large (126 seats) Microsoft Research Ltd\, Roger Nee
 dham Building\, 7 J J Thomson Avenue (Off Madingley Road)\, CB3 0FB
END:VEVENT
END:VCALENDAR
