tesseract-ocr/tesseract: Tesseract Open Source OCR Engine (main repository)
This package contains an OCR engine, libtesseract, and a command-line program, tesseract.

The lead developer is Ray Smith. The maintainer is Zdenko Podobny. For a list of contributors see AUTHORS and GitHub's log of contributors.

Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box".

Tesseract supports various output formats: plain text, hOCR (HTML), PDF, TSV, and invisible-text-only PDF.
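
As a rough illustration of how these output formats are usually selected, the sketch below builds `tesseract` command lines without running them. It assumes the common invocation `tesseract IMAGE OUTPUTBASE [options] [configfile]`, where the config names `hocr`, `pdf` and `tsv` pick the format and the `textonly_pdf` variable requests the invisible-text-only PDF. The helper name `build_tesseract_command` and the file names are hypothetical, and the exact behaviour may differ between Tesseract versions.

```python
import shlex

# Trailing arguments assumed to select each output format.
# Plain text is the default, so it needs no extra config name.
FORMAT_ARGS = {
    "txt": [],
    "hocr": ["hocr"],
    "pdf": ["pdf"],
    "tsv": ["tsv"],
    # Assumption: the invisible-text-only PDF is the pdf config
    # combined with the textonly_pdf variable.
    "textonly_pdf": ["-c", "textonly_pdf=1", "pdf"],
}


def build_tesseract_command(image_path, output_base, fmt="txt", lang="eng"):
    """Return the argument list for one tesseract run (hypothetical helper)."""
    if fmt not in FORMAT_ARGS:
        raise ValueError(f"unknown output format: {fmt!r}")
    return ["tesseract", image_path, output_base, "-l", lang, *FORMAT_ARGS[fmt]]


if __name__ == "__main__":
    # Print one command per format; pass the list to subprocess.run to execute it.
    for fmt in FORMAT_ARGS:
        print(shlex.join(build_tesseract_command("scan.png", "out", fmt)))
```

Returning an argument list rather than a shell string keeps the sketch safe to hand to `subprocess.run` without quoting problems.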

Note that in many cases you will need to improve the quality of the image you give Tesseract in order to get better OCR results.

This project does not include a GUI application. If you need one, please see the 3rdParty wiki page.

Tesseract can be trained to recognize other languages. See Tesseract Training for more information.
OCR  text  opensource 
2 days ago by horshacktest
Examining the Impact of Artificial Intelligence in Museums – MW17: Museums and the Web 2017
Artificial Intelligence. It’s a concept that holds lots of promise, generates endless buzz, and is starting to make its way into everyday life. In 2015, artificial intelligence went mainstream, and undoubtedly, in 2016, we will begin to see an increase in experimentation within the cultural space. In this presentation, we’ll explore some of AI’s most powerful uses related to machine learning and its impact on galleries, libraries, archives, and museums in the areas of collections, ticketing, and attendance data. We’ll also examine machine vision: a computer’s ability to understand what it is seeing. Machine vision can be used to inspect and analyze images. Imagine being able to classify all of your visual objects with the flip of a switch (actually, a few lines of code). We’ll explore real examples of machine learning on the following topics: identifying subject matter, extracting color composition, sentiment analysis, text/character recognition, recognizing similarity and patterns, and art authentication. Machine learning and vision are very powerful tools and are more accessible than ever before. In the hands of museums, these technologies will inevitably lead to interesting discoveries, rich data, and new paths into your collection.
mw2017  artificial_intelligence  image_recognition  text_mining  ocr 
5 days ago by stacker
pdf ocr for table and data
pdf  ocr 
7 days ago by lenciel
skylander86/lambda-text-extractor: AWS Lambda functions to extract text from various binary formats.
aws  lambda  pdf  text  extraction  ocr  tesseract 
13 days ago by floehopper
Creating a Modern OCR Pipeline Using Computer Vision and Deep Learning | Dropbox Tech Blog
“To address this shortcoming, we eventually tracked down a font vendor in China who could provide us with representative ancient thermal printer fonts.” - via Ben Hammersley
china  mechanicalturk  dropbox  ocr  font  thermalprinter 
19 days ago by danhon

