Detecting Text Reuse in Large Historical Corpora and Authorship Attribution of Premodern Documents

Aleksi Vesanto (University of Turku)
Tuesday 03 October 2017, 11:00-12:00
SR-24, English Faculty Building, 9 West Road (Sidgwick Site).

If you have a question about this talk, please contact Dimitri Kartsaklis.

This presentation covers two projects: a method for detecting text reuse that can withstand extreme OCR noise, and real-world applications of machine learning in authorship attribution. Detecting text reuse in historical documents can shed light on many questions, such as how certain news spread or whether authors plagiarized others. Finding these repeated passages can be fairly hard, because the documents are generally OCR-transcribed and can contain extreme noise, to the point where the text borders on unreadable. Authorship attribution is by no means a new field, yet machine learning has received only limited attention in real-world applications. The presentation highlights a case where machine learning provides new information that contradicts older manual attributions, and a method for attributing a document with multiple possible authors using very little training data.

This talk is part of the Language Technology Lab Seminars series.
