Overview

The Unstructured open source library is designed as a starting point for quick prototyping and has limits. For production scenarios, use the Unstructured user interface (UI) or the Unstructured API instead.

To start using the Unstructured open source library right away, skip ahead to the quickstart.

The Unstructured open source library (GitHub, PyPI) is a toolkit for ingesting and pre-processing diverse data formats, including images and text-based documents such as PDFs, HTML files, and Word documents. With a focus on optimizing data workflows for large language models (LLMs), the library provides modular functions and connectors that work together to transform unstructured data into structured formats, while remaining adaptable to various platforms and use cases.

Key functionality

Precise document extraction: Unstructured extracts elements and metadata from documents, covering a variety of document element types. Learn more about Document elements and metadata.
Robust file support: The library supports a wide array of file types, including PDFs, images, HTML, and many more. For details, see the list of supported file types.
Robust core functionality: Unstructured provides a suite of core functionalities critical for efficient data processing. This includes:
- Partitioning: The partitioning functions in Unstructured enable the extraction of structured content from raw, unstructured documents. This feature is crucial for transforming unorganized data into usable formats, aiding in efficient data processing and analysis.
- Cleaning: Data preparation for NLP models often requires cleaning to ensure quality. The Unstructured library includes cleaning functions that assist in sanitizing output, removing unwanted content, and improving the performance of NLP models. This step is essential for maintaining the integrity of data before it is passed to downstream applications.
- Extracting: This functionality allows for the extraction of specific entities within documents. It is designed to identify and isolate relevant pieces of information, making it easier for users to focus on the most pertinent data in their documents.
- Staging: Staging functions help prepare your data for ingestion into downstream systems. Please note that this functionality is being deprecated in favor of destination connectors in the Unstructured Ingest CLI and Unstructured Ingest Python library.
- Chunking: The chunking process in Unstructured is distinct from conventional methods. Instead of relying solely on text-based features to form chunks, Unstructured uses a deep understanding of document formats to partition documents into semantic units (document elements).

Common use cases

Pretraining models
Fine-tuning models
Retrieval Augmented Generation (RAG)
Traditional ETL

GPU usage is not supported for the Unstructured open source library.

Limits

The Unstructured open source library has the following limits as compared to the Unstructured UI and the Unstructured API:

Not designed for production scenarios.
Significantly decreased performance on document and table extraction.
No access to Unstructured’s latest vision language model (VLM) offerings.
No access to Unstructured’s fine-tuned OCR models.
No access to Unstructured’s by-page and by-similarity chunking strategies.
No support for generating embeddings in the core Unstructured open source offering. (However, you can generate embeddings manually as a separate step. Learn how. There is also built-in support for generating embeddings in the open source Unstructured Ingest CLI and Unstructured Ingest Python library. Learn more.)
No support for Unstructured’s enrichment types such as image descriptions, table descriptions, and named entity recognition (NER).
Lack of support for SOC2 Type 2, HIPAA, and GDPR compliance.
No authentication or identity management in the core open source offering for local document processing.
No incremental data loading.
No ETL job scheduling or monitoring.
No image extraction from documents.
Less sophisticated document hierarchy detection.
You must manage many of your own code dependencies, for instance, libraries such as Poppler and Tesseract.
For local document processing, you must manage your own infrastructure, including parallelization and other performance optimizations.

Pricing

Calls to the Unstructured open source library that are routed to Unstructured’s software-as-a-service (SaaS) for processing (for example, by calling the partition_via_api or partition_multiple_via_api functions with an Unstructured API key and an Unstructured SaaS URL) require an Unstructured account for billing purposes.

Unstructured offers three account pricing plans:

SaaS Cloud-hosted - Processing happens on Unstructured’s software-as-a-service (SaaS) cloud infrastructure in a multi-tenant environment.
Private SaaS - Processing also happens on Unstructured’s SaaS cloud infrastructure, but your data stays protected in a dedicated cloud environment, maintaining strict data privacy.
VPC - Sometimes referred to as self-hosted, an instance of the Unstructured SaaS is deployed into your own virtual private cloud (VPC), providing complete data ownership and infrastructure control, full customization, and dedicated technical support.

For more details, see the Unstructured Pricing page.

Some of these plans are billed on a per-page basis.

Unstructured calculates a page as follows:

For .pdf, .pptx, and .tiff files, a page is one page, slide, or image.
For .docx files that have page metadata, Unstructured calculates the number of pages based on that metadata.
For all other file types, Unstructured calculates the number of pages as the file’s size divided by 100 KB.
For non-file data, Unstructured calculates a page as 100 KB of incoming data to be processed.
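
The page rules above can be sketched as a small helper for estimating page counts. This is an illustrative approximation, not Unstructured's billing code. The function name `estimate_pages` and its parameters are hypothetical, and three details are assumptions that the rules do not state: 1 KB is taken as 1,024 bytes, fractional pages are rounded up, and a .docx file without page metadata is treated like "all other file types".

```python
import math
from typing import Optional

# Assumption: 1 KB = 1,024 bytes. The rules above only say "100 KB".
PAGE_SIZE_BYTES = 100 * 1024

# File types where a page is a literal page, slide, or image.
COUNTED_TYPES = {".pdf", ".pptx", ".tiff"}


def estimate_pages(
    extension: Optional[str],
    size_bytes: int,
    counted_pages: Optional[int] = None,
) -> int:
    """Estimate the page count for a file or for non-file data.

    extension     -- file extension such as ".pdf", or None for non-file data
    size_bytes    -- size of the file or of the incoming data
    counted_pages -- pages, slides, or images in the file, or the page count
                     from .docx page metadata, when known
    """
    ext = extension.lower() if extension else None

    if ext in COUNTED_TYPES:
        if counted_pages is None:
            raise ValueError(f"{ext} files need an explicit page count")
        return counted_pages

    if ext == ".docx" and counted_pages is not None:
        # .docx files with page metadata use that metadata.
        return counted_pages

    # All other file types and non-file data: size divided by 100 KB.
    # Assumption: round up to a whole page.
    return math.ceil(size_bytes / PAGE_SIZE_BYTES)


print(estimate_pages(".pdf", 5_000_000, counted_pages=12))
print(estimate_pages(".txt", 250 * 1024))
print(estimate_pages(None, 250 * 1024))
```

Under these assumptions, a 250 KB text file is estimated at three pages, while a PDF is always counted by its actual pages regardless of file size.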
