Detecting Text Reuse in Large Historical Corpora and Authorship Attribution of Premodern Documents
- Speaker: Aleksi Vesanto (University of Turku)
- Date & Time: Tuesday 03 October 2017, 11:00 - 12:00
- Venue: SR-24, English Faculty Building, 9 West Road (Sidgwick Site)
Abstract
This presentation covers two projects: a method for detecting text reuse that withstands extreme OCR noise, and real-world applications of machine learning to authorship attribution. Detecting text reuse in historical documents can shed light on questions such as how particular news stories spread or whether authors plagiarized others. Finding these repeated passages can be hard, because the documents are generally OCR-transcribed and can contain noise so extreme that the text borders on unreadable. Authorship attribution is by no means a new field, yet machine learning has so far received only limited attention in real-world applications. The presentation highlights a case where machine learning provides new information that contradicts older manual attributions, and a method for attributing a document with multiple possible authors using very little training data.
Series
This talk is part of the Language Technology Lab Seminars series.