Tesserocr

Environment setup

Verifying the installation

Open a terminal and run tesseract -v; it should print the version information.
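The same check can be scripted from Python. The following is only a sketch: tesseract_version is a hypothetical helper name, and it assumes the executable is called tesseract and is reachable through PATH. Standard error is merged into standard output because some Tesseract releases print the version banner there.

```python
import shutil
import subprocess


def tesseract_version(executable="tesseract"):
    """Return the first line printed by `<executable> -v`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    # Merge stderr into stdout so the banner is captured wherever it is written.
    result = subprocess.run(
        [path, "-v"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print(tesseract_version() or "tesseract was not found on PATH")
```

A None result usually means the Tesseract installation directory has not been added to PATH.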

Hands-on practice

We test with both tesseract and tesserocr.
First, download the test image: https://raw.githubusercontent.com/Python3WebSpider/Testtess/master/image.png
Save it as image.png, then run the tesseract command and the tesserocr library against it in turn.
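If you prefer to fetch the image from a script, a minimal sketch using only the standard library could look like this. The helper name download is hypothetical; the URL and the target file name image.png are the ones given above.

```python
from urllib.request import urlretrieve

IMAGE_URL = "https://raw.githubusercontent.com/Python3WebSpider/Testtess/master/image.png"


def download(url=IMAGE_URL, dest="image.png"):
    """Fetch url, write it to dest, and return the path of the saved file."""
    saved_path, _headers = urlretrieve(url, dest)
    return saved_path
```

Calling download() with no arguments saves the test image as image.png in the current working directory.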

tesseract command:

tesseract image.png result -l eng && cat result.txt
Tesseract Open Source OCR Engine v3.05.02 with Leptonica
Python3WebSpider

tesserocr (Python library):

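import tesserocr
from PIL import Image

# Option 1: open the image with Pillow and pass the Image object
image = Image.open('D:/tesserocr_testdata/image.png')
print(tesserocr.image_to_text(image))

# Option 2: pass the file path and let tesserocr read the file itself
print(tesserocr.file_to_text("D:/tesserocr_testdata/image.png"))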

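Possible problem 1

Calling tesserocr.image_to_text("path") raises: RuntimeError: Failed to init API, possibly an invalid tessdata path: D:\
Cause:
API initialization failed, possibly because the tessdata under D:\ is invalid. In other words, tesserocr could not find a usable tessdata directory on drive D.
Solution:
Copy the entire tessdata folder from the Tesseract-OCR installation directory into the Anaconda3 root directory, so that it ends up at "D:\Anaconda3\tessdata".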
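Before retrying, it can help to confirm that the copied folder really contains the English language data. The sketch below is illustrative only: has_language_data is a hypothetical helper, and it assumes language files follow the usual <lang>.traineddata naming. tesserocr's PyTessBaseAPI also accepts a path argument, so the directory can be passed explicitly instead of relying on the default lookup.

```python
from pathlib import Path


def has_language_data(tessdata_dir, lang="eng"):
    """Return True if tessdata_dir contains <lang>.traineddata."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


if __name__ == "__main__":
    candidate = r"D:\Anaconda3\tessdata"  # the path from the solution above
    if has_language_data(candidate):
        import tesserocr  # imported only once the data directory looks usable

        # Pass the tessdata directory explicitly rather than relying on the default.
        with tesserocr.PyTessBaseAPI(path=candidate, lang="eng") as api:
            api.SetImageFile("image.png")
            print(api.GetUTF8Text())
    else:
        print("eng.traineddata not found in", candidate)
```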

Possible problem 2

pip install tesserocr
Collecting tesserocr
  Using cached tesserocr-2.1.3.tar.gz
Building wheels for collected packages: tesserocr
  Running setup.py bdist_wheel for tesserocr ... error
  Complete output from command "C:\Program Files\Anaconda3\python.exe" -u -c 
"import setuptools, 
tokenize;__file__='C:\\Users\\hp\\AppData\\Local\\Temp\\pip-build-
klj3zdup\\tesserocr\\setup.py';f=getattr(tokenize, 'open', open)
(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, 
__file__, 'exec'))" bdist_wheel -d 
C:\Users\hp\AppData\Local\Temp\tmpoyt9eh40pip-wheel- --python-tag cp35:
  running bdist_wheel
  running build
  running build_ext
  Failed to extract tesseract version from executable: [WinError 2] The 
system cannot find the file specified
  Supporting tesseract v3.04.00
  Building with configs: {'libraries': ['tesseract', 'lept'], 
'cython_compile_time_env': {'TESSERACT_VERSION': 197632}}
  cythoning tesserocr.pyx to tesserocr.cpp
  building 'tesserocr' extension
  error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

Cause: the build fails because Microsoft Visual C++ 14.0 is missing.
Solution 1:
The simplest fix is to install the Visual C++ components needed to run C++ applications. Download: https://go.microsoft.com/fwlink/?LinkId=615460
Solution 2:
Install tesserocr from a prebuilt .whl file, which avoids the compile step altogether. Download page: https://github.com/simonflueckiger/tesserocr-windows_build/releases/tag/tesserocr-v2.2.2-tesseract-4.0.0-master . Choose tesserocr-2.2.2-cp36-cp36m-win_amd64.whl (built for 64-bit CPython 3.6), then run pip install ...\tesserocr-2.2.2-cp36-cp36m-win_amd64.whl on the command line to install it.
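The wheel has to match the Python interpreter that will import it. As a rough aid, the sketch below prints the tags to look for in the file name. wheel_tags is a hypothetical helper, and it assumes the Windows naming seen in the file above (win_amd64 for 64-bit, win32 for 32-bit); it does not query pip itself.

```python
import struct
import sys


def wheel_tags():
    """Return (python_tag, platform_tag) for the running interpreter, assuming Windows naming."""
    python_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
    # Pointer size tells us whether this is a 32-bit or 64-bit Python build.
    bits = struct.calcsize("P") * 8
    platform_tag = "win_amd64" if bits == 64 else "win32"
    return python_tag, platform_tag


if __name__ == "__main__":
    py_tag, plat_tag = wheel_tags()
    print(f"Look for a wheel whose name contains {py_tag} and {plat_tag}")
```

On 64-bit CPython 3.6 this reports cp36 and win_amd64, matching the file chosen above.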

Possible problem 3

On Windows 10, running pip install tesserocr directly produces:
tesserocr.cpp(596): fatal error C1083: Cannot open include file: 'leptonica/allheaders.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe' failed with exit status 2
Solution:

Conda
You can use the channel simonflueckiger to install from Conda:
conda install -c simonflueckiger tesserocr
or, to get tesserocr compiled with tesseract 4.0.0:
conda install -c simonflueckiger/label/tesseract-4.0.0-master tesserocr
pip
Download the wheel file corresponding to your Windows platform and Python installation from simonflueckiger/tesserocr-windows_build/releases and install it via:
pip install <package_name>.whl
Usage
Initialize and re-use the tesseract API instance to score multiple images:
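A minimal sketch of that pattern, assuming tesserocr and the eng language data are installed. ocr_many is a hypothetical helper name; PyTessBaseAPI, SetImageFile, GetUTF8Text and AllWordConfidences are tesserocr's own names.

```python
def ocr_many(paths, lang="eng"):
    """Return {path: (text, word_confidences)} using a single shared API instance."""
    if not paths:
        return {}  # nothing to do, so skip starting the engine

    # Imported here so the module can be loaded even where tesserocr is absent.
    from tesserocr import PyTessBaseAPI

    results = {}
    # One engine initialization is shared by every image in the loop.
    with PyTessBaseAPI(lang=lang) as api:
        for path in paths:
            api.SetImageFile(path)
            results[path] = (api.GetUTF8Text(), api.AllWordConfidences())
    return results
```

For example, ocr_many(["image.png"]) returns the recognized text of the test image together with a confidence value for each word. Creating the API object once avoids reloading the language data for every file.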

Related links

tesserocr GitHub: https://github.com/sirfz/tesserocr
tesserocr PyPI: https://pypi.python.org/pypi/tesserocr
tesseract downloads: https://digi.bib.uni-mannheim.de/tesseract/
tesseract GitHub: https://github.com/tesseract-ocr/tesseract
tesseract language data: https://github.com/tesseract-ocr/tessdata
tesseract documentation: https://github.com/tesseract-ocr/tesseract/wiki/Documentation

