BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Multiword Expressions: Evaluation of Extraction Methods and their 
 Impact on Grammar Engineering - Valia Kordoni (LT-Lab DFKI GmbH and Dept. 
 of Computational Linguistics\, Saarland University\, Germany)
DTSTART:20080606T110000Z
DTEND:20080606T120000Z
UID:TALK12413@talks.cam.ac.uk
CONTACT:Johanna Geiss
DESCRIPTION:In the first part of the talk I focus on the linguistic proper
 ties of Multiword Expressions (MWEs)\, taking a closer look at their lexic
 al\, syntactic\, as well as semantic characteristics. The term Multiword E
 xpressions has been used to describe expressions for which the syntactic o
 r semantic properties of the whole expression cannot be derived from its p
 arts (cf.\, Sag et al.\, 2002)\, including a large number of related but d
 istinct phenomena\, such as phrasal verbs (e.g.\, "come along")\, nominal 
 compounds (e.g.\, "frying pan")\, institutionalised phrases (e.g.\, "bread
  and butter")\, and many others. Jackendoff (1997) estimates the number of
  MWEs in a speaker's lexicon to be comparable to the number of single word
 s. However\, due to their heterogeneous characteristics\, MWEs present a t
 ough challenge for both linguistic and computational work (cf.\, Sag et al
 .\, 2002). For instance\, some MWEs are fixed\, and do not present interna
 l variation\, such as "ad hoc"\, while others allow different degrees of i
 nternal variability and modification\, such as "spill beans" ("spill sever
 al/musical/mountains of beans").\n\nIn the second part of the talk I focus
  on methods for the automatic acquisition of MWEs for robust grammar engin
 eering. First I investigate the hypothesis that MWEs can be detected by th
 e distinct statistical properties of their component words\, regardless of
  their type\, comparing various statistical measures\, a procedure which l
 eads to extremely interesting conclusions. I then investigate the influenc
 e of the size and quality of different corpora\, using the BNC and the Web
  search engines Google and Yahoo. I conclude that\, in terms of language u
 sage\, web-generated corpora are fairly similar to more carefully built co
 rpora\, like the BNC\, indicating that the lack of control and balance
  in these corpora is probably compensated for by their size.\n\nFinally\,
  I present a qualitative evaluation of the results of automatically
  adding extracted MWEs to existing linguistic resources. To this end\, I
  first discuss tw
 o main approaches commonly employed in NLP for treating MWEs: the
  words-with-spaces approach\, which models an MWE as a single lexical
  entry and can adequately capture fixed MWEs like "by and large"\, and
  compositional approaches\, which treat MWEs by general and
  compositional methods of linguistic analysis and can capture more
  syntactically flexible MWEs\, like "rock boat". Such flexible MWEs
  cannot be satisfactorily captured by a words-with-spaces approach\,
  since this would require lexical entries to be added for a
 ll the possible variations of an MWE (e.g.\, "rock/rocks/rocking this/that
 /his... boat"). On this basis\, I argue that the automatic addition of
  extracted MWEs to existing linguistic resources improves qualitatively
  when a more compositional approach to automated grammar/lexicon
  extension is adopted.
LOCATION:SW01 Computer Laboratory
END:VEVENT
END:VCALENDAR
