BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Design decisions in web corpus construction and their impact on di
 stributional semantic models - Felix Bildhauer\, Freie Universität Berlin
DTSTART:20140328T120000Z
DTEND:20140328T130000Z
UID:TALK50794@talks.cam.ac.uk
CONTACT:Tamara Polajnar
DESCRIPTION:The World Wide Web has become increasingly popular as a source
  of\nlinguistic data\, not only within the NLP communities\, but also with
 \ntheoretical linguists facing problems of data sparseness or data\ndivers
 ity. However\, for a variety of reasons\, commercial search\nengines are n
 ot normally suitable when it comes to collecting data for\nlinguistic rese
 arch (see\, for example\, Kilgarriff 2007). An obvious\nalternative to 'Go
 ogleology' consists in building static corpora from\nweb documents\, possi
 bly adding layers of linguistic annotation\, and\nquerying these corpora w
 ith tools geared towards the needs of\nlinguists. However\, constructing a
  relatively 'clean' corpus of web\ntexts from HTML documents usually invol
 ves all kinds of design\ndecisions (e.g.\, concerning sampling strategy\, 
 filtering\,\nde-duplication\, normalization). The impact of such decisions
  on the\ncharacteristics of the final corpus has received relatively littl
 e\nattention so far.\nThis talk focuses on the processing steps that have 
 been applied in\nbuilding most of the large web corpora available today\, 
 such as the\nWaCky corpora (Baroni et al. 2009) and the COW corpora (Schä
 fer and\nBildhauer 2012). I will discuss to what extent these steps involv
 e\narbitrary decisions and show how some of these can be avoided (or at\nl
 east\, shifted from the corpus builders to the corpus users). Finally\,\nI
  tentatively explore the impact of such decisions on distributional\nseman
 tic models based on the resulting corpora.\n\n\nReferences\n\nBaroni\, M.
 \, Bernardini\, S.\, Ferraresi\, A.\, and Zanchetta\, E. 2009. The\nWaCky
  Wide Web: A Collection of Very Large Linguistically
  Processed\nWeb-Crawled Corp
 ora. Language Resources and Evaluation 43(3)\, 209-226.\n\nKilgarriff\, A.
  2007. Googleology is Bad Science. Computational\nLinguistics 33(1)\, 147-
 151.\n\nSchäfer\, R. and Bildhauer\, F. 2012. Building Large Corpora from
  the\nWeb Using a New Efficient Tool Chain. In Nicoletta Calzolari et\nal.
 (eds.)\, Proceedings of the Eighth International Conference on\nLanguage R
 esources and Evaluation (LREC'12). European Language\nResources Associatio
 n\, 486–493.\n
LOCATION:FW11\, Computer Laboratory
END:VEVENT
END:VCALENDAR
