This week, I heard David Smith talk about "Inferring and Exploiting Relational Structure in Large Text Collections." It is interesting that digitized books in the public domain are becoming testbeds for this kind of research. He is also using translated texts (e.g., books that have been translated into several languages) to discern the words used to describe specific concepts across languages.
I am so used to thinking about the digitization effort that I rarely consider all of the ways these newly digitized texts can be used. That is one of the reasons I found Smith's talk so interesting.
Abstract: The digitization of knowledge and concerted retrospective scanning projects are making overwhelming amounts of text in diverse domains, genres, and languages available to readers and researchers. To make this data useful, our group is working on improving OCR, language modeling, syntactic analysis, information extraction, and information retrieval. I will focus in particular on problems of inferring the relational structure latent in large collections of documents, such as books, web pages, patent applications, grant proposals, and social media postings. Which books or passages quote, translate, paraphrase, and cite each other? This research requires improvements in modeling translation and other forms of similarity, as well as improvements in efficiently comparing large numbers of passages. Finally, I will discuss how passage similarity relations can be used to improve tasks such as named-entity recognition and syntactic parsing.
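The abstract asks which passages quote or paraphrase one another, which comes down to measuring passage similarity at scale. As a toy illustration only (not Smith's actual method), one common starting point is to break each passage into overlapping word n-grams ("shingles") and compare the resulting sets with Jaccard similarity; the passages and the shingle length `n=3` below are made-up examples:

```python
def shingles(text, n=5):
    """Split text into a set of overlapping word n-grams (shingles)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Two hypothetical passages, one partially quoting the other.
passage1 = "it was the best of times it was the worst of times"
passage2 = "it was the best of times it was the age of wisdom"

sim = jaccard(shingles(passage1, n=3), shingles(passage2, n=3))
print(f"similarity: {sim:.2f}")  # prints "similarity: 0.50"
```

Comparing every pair of passages this way is quadratic, which is why the abstract points to the need for more efficient comparison; techniques such as MinHash and locality-sensitive hashing are typical ways to approximate these set overlaps without exhaustive pairwise comparison.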