[v2,2/2] tesseract-ocr: new package

Message ID 1489910873-8450-3-git-send-email-gilles.talis@gmail.com
State Accepted

Commit Message

Gilles Talis March 19, 2017, 8:07 a.m. UTC
Signed-off-by: Gilles Talis <gilles.talis@gmail.com>
---
Changes in v2 (following review by Thomas P.):
- Added language data files support inside main package instead of
specific package for each of them
- Explicitly selected PNG, JPEG and TIFF libraries as dependencies
- Added DEVELOPERS file change
- Fixed indentation issues
- Added extra comments
- Added limitations found using test-pkg script
---
 DEVELOPERS                               |  1 +
 package/Config.in                        |  1 +
 package/tesseract-ocr/Config.in          | 44 ++++++++++++++++++++
 package/tesseract-ocr/tesseract-ocr.hash |  8 ++++
 package/tesseract-ocr/tesseract-ocr.mk   | 69 ++++++++++++++++++++++++++++++++
 5 files changed, 123 insertions(+)
 create mode 100644 package/tesseract-ocr/Config.in
 create mode 100644 package/tesseract-ocr/tesseract-ocr.hash
 create mode 100644 package/tesseract-ocr/tesseract-ocr.mk

Comments

Thomas Petazzoni March 19, 2017, 1:54 p.m. UTC | #1
Hello,

On Sun, 19 Mar 2017 09:07:53 +0100, Gilles Talis wrote:
> diff --git a/package/tesseract-ocr/Config.in b/package/tesseract-ocr/Config.in
> new file mode 100644
> index 0000000..4fd0668
> --- /dev/null
> +++ b/package/tesseract-ocr/Config.in
> @@ -0,0 +1,44 @@
> +comment "tesseract-ocr needs a toolchain w/ threads, C++, gcc >= 4.8 & dynamic library"
> +	depends on BR2_USE_MMU
> +	depends on !BR2_INSTALL_LIBSTDCPP || !BR2_TOOLCHAIN_HAS_THREADS || \
> +        !BR2_TOOLCHAIN_GCC_AT_LEAST_4_8 || BR2_STATIC_LIBS

Indentation of this last line should have been two tabs.

> +menuconfig BR2_PACKAGE_TESSERACT_OCR
> +	bool "tesseract-ocr"
> +	depends on BR2_INSTALL_LIBSTDCPP
> +	depends on BR2_TOOLCHAIN_HAS_THREADS
> +	depends on BR2_TOOLCHAIN_GCC_AT_LEAST_4_8 # C++11
> +	depends on BR2_USE_MMU # fork()
> +	depends on !BR2_STATIC_LIBS
> +	select BR2_PACKAGE_JPEG
> +	select BR2_PACKAGE_LEPTONICA
> +	select BR2_PACKAGE_LIBPNG
> +	select BR2_PACKAGE_TIFF

I don't see where jpeg, libpng and tiff are mandatory. In fact, I don't
see them being used by tesseract-ocr, so I've dropped those
dependencies for now.


> +TESSERACT_OCR_VERSION = 3.05.00
> +TESSERACT_OCR_DATA_VERSION = 3.04.00
> +TESSERACT_OCR_SITE = $(call github,tesseract-ocr,tesseract,$(TESSERACT_OCR_VERSION))
> +TESSERACT_OCR_LICENSE = Apache-2.0
> +TESSERACT_OCR_LICENSE_FILES = COPYING
> +
> +# Source from github, no configure script provided
> +TESSERACT_OCR_AUTORECONF = YES
> +
> +TESSERACT_OCR_DEPENDENCIES += leptonica jpeg libpng tiff

I've dropped jpeg, libpng and tiff. Instead, I've added host-pkgconf
which is really needed since configure.ac uses PKG_CHECK_MODULES().

I've also passed --disable-opencl since your package hasn't added
explicit support for OpenCL.
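
As a rough illustration, the changes described above might look like this in
tesseract-ocr.mk. This is a sketch based only on the description in this
review, not the committed file verbatim:

```make
# Sketch of the reviewer's fixes (exact committed lines may differ).
# host-pkgconf is needed because configure.ac uses PKG_CHECK_MODULES();
# jpeg, libpng and tiff are dropped.
TESSERACT_OCR_DEPENDENCIES = leptonica host-pkgconf

# The package adds no explicit OpenCL support, so disable it at configure time.
TESSERACT_OCR_CONF_OPTS = --disable-opencl
```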

> +# Language data files download
> +ifeq ($(BR2_PACKAGE_TESSERACT_OCR_LANG_ENG),y)
> +TESSERACT_OCR_DATA_FILES += eng.traineddata
> +endif
> +
> +ifeq ($(BR2_PACKAGE_TESSERACT_OCR_LANG_FRA),y)
> +TESSERACT_OCR_DATA_FILES += fra.traineddata
> +endif
> +
> +ifeq ($(BR2_PACKAGE_TESSERACT_OCR_LANG_DEU),y)
> +TESSERACT_OCR_DATA_FILES += deu.traineddata
> +endif
> +
> +ifeq ($(BR2_PACKAGE_TESSERACT_OCR_LANG_SPA),y)
> +TESSERACT_OCR_DATA_FILES += spa.traineddata
> +endif
> +
> +ifeq ($(BR2_PACKAGE_TESSERACT_OCR_LANG_CHI_SIM),y)
> +TESSERACT_OCR_DATA_FILES += chi_sim.traineddata
> +endif
> +
> +ifeq ($(BR2_PACKAGE_TESSERACT_OCR_LANG_CHI_TRA),y)
> +TESSERACT_OCR_DATA_FILES += chi_tra.traineddata
> +endif

Regarding the language files, I'm not entirely happy with the current
solution, but I couldn't come up with something better. I looked at the
two following options:

 * Creating a separate package for the tessdata repository
   https://github.com/tesseract-ocr/tessdata/, but this repository is
   3.4GB in size, which is admittedly a bit annoying to download when
   you just want a single language.

 * Since the list of languages is quite long, having an explicit option
   for each of them is a bit annoying. So I looked into turning your
   one-option-per-language idea into a single option with a space
   separated list of languages. Except that we anyway need to have the
   hash file for each language in tesseract-ocr.hash.
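
The single-option idea from the second bullet could be sketched as follows.
BR2_PACKAGE_TESSERACT_OCR_LANGS is a hypothetical string option that is not
part of this patch, and every resulting file would still need its own entry
in tesseract-ocr.hash:

```make
# Hypothetical Kconfig string option holding e.g. "eng fra deu"
# (assumption: this option does not exist in the patch under review).
TESSERACT_OCR_LANGS = $(call qstrip,$(BR2_PACKAGE_TESSERACT_OCR_LANGS))

# Map each language code to its data file name: eng -> eng.traineddata
TESSERACT_OCR_DATA_FILES = $(addsuffix .traineddata,$(TESSERACT_OCR_LANGS))
```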

So in the end, I kept it as-is. We'll see if other folks have a better
idea.

In the meantime, I've applied the patch with the fixes described above.

Thanks!

Thomas

Arnout Vandecappelle March 19, 2017, 11 p.m. UTC | #2
On 19-03-17 14:54, Thomas Petazzoni wrote:
> Regarding the language files, I'm not entirely happy with the current
> solution, but I couldn't come up with something better. I looked at the
> two following options:
> 
>  * Creating a separate package for the tessdata repository
>    https://github.com/tesseract-ocr/tessdata/, but this repository is
>    3.4GB in size, which is admittedly a bit annoying to download when
>    you just want a single language.
> 
>  * Since the list of languages is quite long, having an explicit option
>    for each of them is a bit annoying. So I looked into turning your
>    one-option-per-language idea into a single option with a space
>    separated list of languages. Except that we anyway need to have the
>    hash file for each language in tesseract-ocr.hash.

 That's why we have BR_NO_CHECK_HASH_FOR, no?

 Regards,
 Arnout

> 
> So in the end, I kept it as-is. We'll see if other folks have a better
> idea.

Thomas Petazzoni March 19, 2017, 11:03 p.m. UTC | #3
Hello,

On Mon, 20 Mar 2017 00:00:27 +0100, Arnout Vandecappelle wrote:

> >  * Since the list of languages is quite long, having an explicit option
> >    for each of them is a bit annoying. So I looked into turning your
> >    one-option-per-language idea into a single option with a space
> >    separated list of languages. Except that we anyway need to have the
> >    hash file for each language in tesseract-ocr.hash.  
> 
>  That's why we have BR_NO_CHECK_HASH_FOR, no?

True. But then we don't check hashes for stuff downloaded through
Github, which potentially could change (hence the reason why I'm also
suggesting to have a package that downloads all of the tessdata
package, but it's huge).
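
For reference, using BR_NO_CHECK_HASH_FOR for the language files might look
like the sketch below. It assumes the TESSERACT_OCR_DATA_FILES variable from
the patch, and it carries the trade-off discussed here: changes to the
downloaded files would go undetected.

```make
# Skip hash verification for the language data files only.
# The main tarball is still checked against tesseract-ocr.hash.
BR_NO_CHECK_HASH_FOR += $(TESSERACT_OCR_DATA_FILES)
```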

Thomas

Gilles Talis March 20, 2017, 8:10 a.m. UTC | #4
Hi Thomas, Arnout,

>>  That's why we have BR_NO_CHECK_HASH_FOR, no?
>
> True. But then we don't check hashes for stuff downloaded through
> Github, which potentially could change (hence the reason why I'm also
> suggesting to have a package that downloads all of the tessdata
> package, but it's huge).

First of all, thanks a lot for the corrections to my patch and for
committing it.
Regarding the language pack, my first intention was to create a
package for the entire tessdata.
But just like you, I found out this was not a viable option.

I am open to all suggestions to make this support better though.

Thanks again,
Gilles.
Patch

diff --git a/DEVELOPERS b/DEVELOPERS
index 8802fc7..bdc93d9 100644
--- a/DEVELOPERS
+++ b/DEVELOPERS
@@ -589,6 +589,7 @@  F:	package/httping/
 F:	package/iozone/
 F:	package/leptonica/
 F:	package/ocrad/
+F:	package/tesseract-ocr/
 F:	package/webp/
 
 N:	Gregory Dymarek <gregd72002@gmail.com>
diff --git a/package/Config.in b/package/Config.in
index ed48058..66c87d5 100644
--- a/package/Config.in
+++ b/package/Config.in
@@ -244,6 +244,7 @@  comment "Graphic applications"
 	source "package/mesa3d-demos/Config.in"
 	source "package/qt5cinex/Config.in"
 	source "package/rrdtool/Config.in"
+	source "package/tesseract-ocr/Config.in"
 
 comment "Graphic libraries"
 	source "package/cegui06/Config.in"
diff --git a/package/tesseract-ocr/Config.in b/package/tesseract-ocr/Config.in
new file mode 100644
index 0000000..4fd0668
--- /dev/null
+++ b/package/tesseract-ocr/Config.in
@@ -0,0 +1,44 @@ 
+comment "tesseract-ocr needs a toolchain w/ threads, C++, gcc >= 4.8 & dynamic library"
+	depends on BR2_USE_MMU
+	depends on !BR2_INSTALL_LIBSTDCPP || !BR2_TOOLCHAIN_HAS_THREADS || \
+		!BR2_TOOLCHAIN_GCC_AT_LEAST_4_8 || BR2_STATIC_LIBS
+
+menuconfig BR2_PACKAGE_TESSERACT_OCR
+	bool "tesseract-ocr"
+	depends on BR2_INSTALL_LIBSTDCPP
+	depends on BR2_TOOLCHAIN_HAS_THREADS
+	depends on BR2_TOOLCHAIN_GCC_AT_LEAST_4_8 # C++11
+	depends on BR2_USE_MMU # fork()
+	depends on !BR2_STATIC_LIBS
+	select BR2_PACKAGE_JPEG
+	select BR2_PACKAGE_LEPTONICA
+	select BR2_PACKAGE_LIBPNG
+	select BR2_PACKAGE_TIFF
+	help
+	  Tesseract is an OCR (Optical Character Recognition) engine.
+	  It can be used directly, or (for programmers) using an API.
+	  It supports a wide variety of languages.
+
+	  https://github.com/tesseract-ocr/tesseract
+
+if BR2_PACKAGE_TESSERACT_OCR
+comment "tesseract-ocr languages support"
+
+config BR2_PACKAGE_TESSERACT_OCR_LANG_ENG
+	bool "English"
+
+config BR2_PACKAGE_TESSERACT_OCR_LANG_FRA
+	bool "French"
+
+config BR2_PACKAGE_TESSERACT_OCR_LANG_DEU
+	bool "German"
+
+config BR2_PACKAGE_TESSERACT_OCR_LANG_SPA
+	bool "Spanish"
+
+config BR2_PACKAGE_TESSERACT_OCR_LANG_CHI_SIM
+	bool "Simplified Chinese"
+
+config BR2_PACKAGE_TESSERACT_OCR_LANG_CHI_TRA
+	bool "Traditional Chinese"
+endif
diff --git a/package/tesseract-ocr/tesseract-ocr.hash b/package/tesseract-ocr/tesseract-ocr.hash
new file mode 100644
index 0000000..9bb5b52
--- /dev/null
+++ b/package/tesseract-ocr/tesseract-ocr.hash
@@ -0,0 +1,8 @@ 
+# locally computed
+sha256  3fe83e06d0f73b39f6e92ed9fc7ccba3ef734877b76aa5ddaaa778fac095d996  tesseract-ocr-3.05.00.tar.gz
+sha256  c0515c9f1e0c79e1069fcc05c2b2f6a6841fb5e1082d695db160333c1154f06d  eng.traineddata
+sha256  86afb23ad146467f263e8ade56fd3951b1cc28f8c4eebc34f993d3c02d88a7ab  fra.traineddata
+sha256  cb7eb42a7e972cec7ef904fe81825d7b547c46df684c814fdb11a930b13bca3a  deu.traineddata
+sha256  f23985996bbcfe2b57864ccb082783c1c74c87429f04411a04a6ba4d3da2efda  spa.traineddata
+sha256  323ae74d4a2ff49e932dbb4d6282fe0e67ddfafda075ec85803ecd077207454c  chi_sim.traineddata
+sha256  774d566bd0b36e4b6c07415dfa5b6b57feb2575b1f5f231d7fe01a52dac5dd0e  chi_tra.traineddata
diff --git a/package/tesseract-ocr/tesseract-ocr.mk b/package/tesseract-ocr/tesseract-ocr.mk
new file mode 100644
index 0000000..5ddacda
--- /dev/null
+++ b/package/tesseract-ocr/tesseract-ocr.mk
@@ -0,0 +1,69 @@ 
+################################################################################
+#
+# tesseract-ocr
+#
+################################################################################
+
+TESSERACT_OCR_VERSION = 3.05.00
+TESSERACT_OCR_DATA_VERSION = 3.04.00
+TESSERACT_OCR_SITE = $(call github,tesseract-ocr,tesseract,$(TESSERACT_OCR_VERSION))
+TESSERACT_OCR_LICENSE = Apache-2.0
+TESSERACT_OCR_LICENSE_FILES = COPYING
+
+# Source from github, no configure script provided
+TESSERACT_OCR_AUTORECONF = YES
+
+TESSERACT_OCR_DEPENDENCIES += leptonica jpeg libpng tiff
+
+TESSERACT_OCR_INSTALL_STAGING = YES
+
+TESSERACT_OCR_CONF_ENV += \
+	LIBLEPT_HEADERSDIR=$(STAGING_DIR)/usr/include/leptonica
+
+# Language data files download
+ifeq ($(BR2_PACKAGE_TESSERACT_OCR_LANG_ENG),y)
+TESSERACT_OCR_DATA_FILES += eng.traineddata
+endif
+
+ifeq ($(BR2_PACKAGE_TESSERACT_OCR_LANG_FRA),y)
+TESSERACT_OCR_DATA_FILES += fra.traineddata
+endif
+
+ifeq ($(BR2_PACKAGE_TESSERACT_OCR_LANG_DEU),y)
+TESSERACT_OCR_DATA_FILES += deu.traineddata
+endif
+
+ifeq ($(BR2_PACKAGE_TESSERACT_OCR_LANG_SPA),y)
+TESSERACT_OCR_DATA_FILES += spa.traineddata
+endif
+
+ifeq ($(BR2_PACKAGE_TESSERACT_OCR_LANG_CHI_SIM),y)
+TESSERACT_OCR_DATA_FILES += chi_sim.traineddata
+endif
+
+ifeq ($(BR2_PACKAGE_TESSERACT_OCR_LANG_CHI_TRA),y)
+TESSERACT_OCR_DATA_FILES += chi_tra.traineddata
+endif
+
+TESSERACT_OCR_EXTRA_DOWNLOADS = \
+	$(addprefix https://github.com/tesseract-ocr/tessdata/raw/$(TESSERACT_OCR_DATA_VERSION)/,\
+		$(TESSERACT_OCR_DATA_FILES))
+
+define TESSERACT_OCR_PRECONFIGURE
+	# Autoreconf step fails due to missing m4 directory
+	mkdir -p $(@D)/m4
+endef
+
+TESSERACT_OCR_PRE_CONFIGURE_HOOKS += TESSERACT_OCR_PRECONFIGURE
+
+# Language data files installation
+define TESSERACT_OCR_INSTALL_LANG_DATA
+	$(foreach langfile,$(TESSERACT_OCR_DATA_FILES), \
+		$(INSTALL) -D -m 0644 $(DL_DIR)/$(langfile) \
+			$(TARGET_DIR)/usr/share/tessdata/$(langfile)
+	)
+endef
+
+TESSERACT_OCR_POST_INSTALL_TARGET_HOOKS += TESSERACT_OCR_INSTALL_LANG_DATA
+
+$(eval $(autotools-package))