Design decisions in web corpus construction and their impact on distributional semantic models
- 👤 Speaker: Felix Bildhauer, Freie Universität Berlin
- 📅 Date & Time: Friday 28 March 2014, 12:00 - 13:00
- 📍 Venue: FW11, Computer Laboratory
Abstract
The World Wide Web has become increasingly popular as a source of linguistic data, not only within the NLP communities, but also with theoretical linguists facing problems of data sparseness or data diversity. However, for a variety of reasons, commercial search engines are not normally suitable when it comes to collecting data for linguistic research (see, for example, Kilgarriff 2007). An obvious alternative to ‘Googleology’ consists in building static corpora from web documents, possibly adding layers of linguistic annotation, and querying these corpora with tools geared towards the needs of linguists. However, constructing a relatively ‘clean’ corpus of web texts from HTML documents usually involves all kinds of design decisions (e.g., concerning sampling strategy, filtering, de-duplication, normalization). The impact of such decisions on the characteristics of the final corpus has received relatively little attention so far. This talk focuses on the processing steps that have been applied in building most of the large web corpora available today, such as the WaCky corpora (Baroni et al. 2009) and the COW corpora (Schäfer and Bildhauer 2012). I will discuss to what extent these steps involve arbitrary decisions and show how some of these can be avoided (or at least shifted from the corpus builders to the corpus users). Finally, I tentatively explore the impact of such decisions on distributional semantic models based on the resulting corpora.
References
Baroni, M., Bernardini, S., Ferraresi, A., and Zanchetta, E. 2009. The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation 43(3), 209–226.
Kilgarriff, A. 2007. Googleology is Bad Science. Computational Linguistics 33(1), 147–151.
Schäfer, R. and Bildhauer, F. 2012. Building Large Corpora from the Web Using a New Efficient Tool Chain. In Nicoletta Calzolari et al. (eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). European Language Resources Association, 486–493.
This talk is part of the NLIP Seminar Series.