NovelTM Datasets for English-Language Fiction, 1700-2009 | hc:26955 | Humanities CORE
This report describes a collection of 210,305 volumes of fiction that researchers are encouraged to borrow for their own work. Alternatively, readers can simply browse the report as a description of English-language fiction in the HathiTrust Digital Library. For instance, how does the proportion of fiction written by British authors, or by women, change across time? We also divide nineteenth- and twentieth-century fiction into seven subsets with different emphases (for instance, one where men and women are represented equally, and one composed of only the most prominent and widely-held books). Comparing the pictures produced by these different samples allows us to assess the fragility of recent quantitative arguments about literary history. Preprint version of an article to appear in the Journal of Cultural Analytics.
digital-humanities  open-data  natural-language-processing  rather-interesting  to-use  digitization 
2 days ago by Vaguery
Digital Content, Storytelling and Journalism: A Genuine Museum Experience – MW19 | Boston
Within digital engagement, the digital content team were given responsibility for a website and social channels that are Wellcome Collection, rather than channels that merely promote it: creating original content that speaks to Wellcome Collection’s overall purpose, to challenge how we think and feel about health.
museum  digital-humanities 
8 days ago by historyshack
[1701.07396] LAREX - A semi-automatic open-source Tool for Layout Analysis and Region Extraction on Early Printed Books
A semi-automatic open-source tool for layout analysis on early printed books is presented. LAREX uses a rule-based connected-components approach that is very fast, easily comprehensible for the user, and allows intuitive manual correction where necessary. The PageXML format is used to support integration into existing OCR workflows. Evaluations showed that LAREX provides an efficient and flexible way to segment pages of early printed books.
OCR  digitization  machine-learning  algorithms  image-processing  image-segmentation  digital-humanities  to-understand  to-write-about  to-simulate 
5 weeks ago by Vaguery
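(The connected-components step the abstract mentions can be sketched in a few lines. This toy version labels 4-connected foreground pixels in a binary page image, where each resulting label is a candidate layout region; LAREX's actual rules and PageXML handling are not shown, and the grid below is an invented example.)

```python
from collections import deque

def connected_components(binary):
    """Label 4-connected foreground regions in a binary grid.

    Each label roughly corresponds to one candidate layout region
    (a contiguous block of ink) on the scanned page.
    """
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    n_regions = 0
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] and not labels[r][c]:
                n_regions += 1
                queue = deque([(r, c)])
                labels[r][c] = n_regions
                while queue:          # flood-fill this region
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = n_regions
                            queue.append((ny, nx))
    return labels, n_regions

# Two separate "ink" blocks on a tiny 3x4 page
page = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
_, n = connected_components(page)  # n == 2
```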
Analyzing Documents with TF-IDF | Programming Historian
"This lesson focuses on a foundational natural language processing and information retrieval method called Term Frequency - Inverse Document Frequency (tf-idf). This lesson explores the foundations of tf-idf, and will also introduce you to some of the questions and concepts of computationally oriented text analysis."
text-mining  python  digital-humanities  text-analysis  tutorial 
7 weeks ago by tsuomela
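(The weighting the lesson introduces can be sketched directly. This is one common tf-idf variant, raw term count times log(N/df); the lesson also covers smoothed forms such as scikit-learn's. The three-document corpus below is an invented example.)

```python
import math

def tf_idf(corpus):
    """Score each term in each document by tf * idf.

    tf  = raw count of the term in the document
    idf = log(N / df), where N is the number of documents and
          df the number of documents containing the term.
    """
    n_docs = len(corpus)
    df = {}
    for doc in corpus:
        for term in set(doc.split()):
            df[term] = df.get(term, 0) + 1
    scores = []
    for doc in corpus:
        counts = {}
        for term in doc.split():
            counts[term] = counts.get(term, 0) + 1
        scores.append({t: c * math.log(n_docs / df[t])
                       for t, c in counts.items()})
    return scores

docs = ["the whale the sea", "the captain", "sea voyage"]
weights = tf_idf(docs)
# "whale" (rare) outscores "the" (common) in the first document
```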
[1608.02153] OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus
This article describes the results of a case study that applies Neural Network-based Optical Character Recognition (OCR) to scanned images of books printed between 1487 and 1870 by training the OCR engine OCRopus [@breuel2013high] on the RIDGES herbal text corpus [@OdebrechtEtAlSubmitted]. Training specific OCR models was possible because the necessary *ground truth* is available as error-corrected diplomatic transcriptions. The OCR results have been evaluated for accuracy against the ground truth of unseen test sets. Character and word accuracies (percentage of correctly recognized items) for the resulting machine-readable texts of individual documents range from 94% to more than 99% (character level) and from 76% to 97% (word level). This includes the earliest printed books, which were thought to be inaccessible by OCR methods until recently.

Furthermore, OCR models trained on one part of the corpus consisting of books with different printing dates and different typesets *(mixed models)* have been tested for their predictive power on the books from the other part containing yet other fonts, mostly yielding character accuracies well above 90%. It therefore seems possible to construct generalized models trained on a range of fonts that can be applied to a wide variety of historical printings while still giving good results. A moderate postcorrection effort of some pages will then enable the training of individual models with even better accuracies. Using this method, diachronic corpora including early printings can be constructed much faster and cheaper than by manual transcription. The OCR methods reported here open up the possibility of transforming our printed textual cultural heritage into electronic text by largely automatic means, which is a prerequisite for the mass conversion of scanned books.
OCR  image-processing  natural-language-processing  algorithms  machine-learning  rather-interesting  commodity-software  digital-humanities  to-write-about  consider:swarms  consider:stochastic-resonance 
10 weeks ago by Vaguery
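(The character-accuracy figures quoted above are conventionally computed as 1 minus the character error rate, i.e. edit distance from the ground truth divided by ground-truth length. The abstract does not spell out its exact metric, so this is a sketch of the standard definition; the sample strings are invented.)

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def char_accuracy(ocr_output, ground_truth):
    """1 - CER, with CER = edit distance / ground-truth length."""
    return 1 - edit_distance(ocr_output, ground_truth) / len(ground_truth)

# One substitution ("o" for "c") in an 11-character word
acc = char_accuracy("Kräuterbuoh", "Kräuterbuch")
```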
HS 7310A: Public History in the Digital Age - Spring 2019 Syllabus
Students will learn to use digital media and computational analysis to further historical practice, presentation, analysis and research primarily for online audiences. Students will use technologies including blogs and social media, online publishing platforms, and mapping tools to create and share historical content with public audiences.
museum  digital-history  digital-humanities  digital-pedagogy  design  public-history  teaching  syllabus 
may 2019 by historyshack
Connections in Sound: Irish traditional music at AFC
Patrick Egan is a scholar and musician from Ireland, currently serving as Kluge Fellow in Digital Studies at the Kluge Center. He has recently submitted his PhD in digital humanities with ethnomusicology at University College Cork. Patrick’s interests in recent years have focused on ways to creatively use descriptive data in order to re-imagine how research is conducted with archival collections.
LoC  Irish  music  linked-data  digital-humanities 
may 2019 by Psammead