Set the OCR agent

Task

You want to specify the type of agent that you use when performing optical character recognition (OCR) on files, such as images and PDFs.

This task works only with the Unstructured open source library.

Approach

Set the environment variable named OCR_AGENT to one of the following supported values:

unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract to use Tesseract OCR. This is the default if not otherwise specified.
unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle to use Paddle OCR.
unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision to use Google Cloud Vision OCR.

Install the corresponding OCR agent and its dependencies, if you have not already done so:

For Tesseract OCR, see the dependency list.
For Paddle OCR, see the dependency list.
For Google Cloud Vision OCR, see the dependency list.
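As a rough sketch of the approach above, the environment variable can also be set from Python before any partitioning call runs. The helper name `set_ocr_agent` and the short keys (`tesseract`, `paddle`, `google_vision`) are hypothetical conveniences for this illustration and are not part of the Unstructured library; only the fully qualified class names come from the list above. The sketch assumes the library reads `OCR_AGENT` from the process environment, so setting it before importing `unstructured` is the conservative choice.

```python
import os

# Fully qualified agent names, copied from the supported values listed above.
# The short keys are hypothetical labels used only in this sketch.
OCR_AGENTS = {
    "tesseract": "unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract",
    "paddle": "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
    "google_vision": "unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision",
}


def set_ocr_agent(name: str) -> str:
    """Set OCR_AGENT for the current process and return the value that was set."""
    try:
        agent = OCR_AGENTS[name]
    except KeyError:
        supported = ", ".join(sorted(OCR_AGENTS))
        raise ValueError(f"Unknown OCR agent {name!r}; expected one of: {supported}") from None
    os.environ["OCR_AGENT"] = agent
    return agent


# Assumption: call this before importing or calling anything from unstructured.
set_ocr_agent("paddle")
```

Setting the variable in the shell or in your deployment configuration has the same effect. The in-process form is mainly useful for notebooks and short scripts.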

Example code

This example uses Tesseract OCR to process a PNG file that contains a combination of English and Korean text.

Language codes will differ depending on the OCR agent you use:

For Tesseract OCR, see the language codes list.
For Paddle OCR, see the language codes list and language names list.
For Google Cloud Vision OCR, see the language codes list.

Python

import json

from unstructured.partition.image import partition_image

# Source: https://github.com/Unstructured-IO/unstructured-ingest/blob/main/example-docs/img/english-and-korean.png
# Path to the local file to process, relative to this .py file.
filename = "local-ingest-png/english-and-korean.png"

elements = partition_image(
    filename=filename,
    strategy="ocr_only",
    languages=["eng", "kor"],  # Language codes differ by the OCR agent used.
)

# Convert the list of returned elements into a list of dictionaries for printing or saving.
element_dicts = [element.to_dict() for element in elements]

# Print the list.
print(json.dumps(element_dicts, indent=2))

# Or, save the list locally:
#
# output_path = "local-ingest-output/english-and-korean.json"
#
# with open(output_path, "w") as f:
#     json.dump(element_dicts, f, indent=2)
