To view the workflows dashboard, on the sidebar, click Workflows.
A workflow in Unstructured is a defined sequence of processes that automates data handling from source to destination. It lets you configure how and when data is ingested, processed, and stored.
Workflows are crucial for establishing a systematic approach to managing data flows within the platform, ensuring consistency, efficiency, and adherence to specific data processing requirements.
Unstructured provides two types of workflow builders:
You must first have an existing source connector and destination connector to add to the workflow.
You cannot create an automatic workflow that uses a local source connector.
If you do not have an existing remote connector for either your target source (input) or destination (output) location, create the source connector, create the destination connector, and then return here.
To see your existing connectors, on the sidebar, click Connectors, and then click Sources or Destinations.
To create an automatic workflow:
On the sidebar, click Workflows.
Click New Workflow.
Next to Build it for Me, click Create Workflow.
For Workflow Name, enter a unique name for this workflow.
In the Sources dropdown list, select your source location.
In the Destinations dropdown list, select your destination location.
Click Continue.
The Reprocess All box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:
Checking this box reprocesses all documents in the source location on every workflow run.
Unchecking this box causes future runs to process only documents that have been added to, or updated in, the source location since the last workflow run. Previously processed documents are not processed again. However:
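The incremental behavior described above amounts to comparing each file's last-modified timestamp against what was recorded on the previous run. This is a conceptual sketch in plain Python, not the platform's implementation:

```python
def select_files_to_process(files, last_run_state, reprocess_all=False):
    """Pick which files a run should process.

    files: dict mapping file path -> current last-modified timestamp.
    last_run_state: dict mapping file path -> timestamp recorded on the previous run.
    """
    if reprocess_all:
        return sorted(files)  # every document, every run
    # Only files that are new, or whose timestamp changed since the last run.
    return sorted(
        path for path, mtime in files.items()
        if path not in last_run_state or mtime > last_run_state[path]
    )
```

For example, if `b.pdf` was updated and `c.pdf` is new since the last run, only those two are selected unless `reprocess_all=True`.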
Click Continue.
If you want this workflow to run on a schedule, in the Repeat Run dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select Don’t repeat.
Click Complete.
By default, this workflow partitions, chunks, and generates embeddings as follows:
Partitioner: Auto strategy
Unstructured automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:
Chunker: Chunk by Title strategy
Embedder:
Enrichments:
This workflow contains no enrichments.
After this workflow is created, you can change any or all of its settings if you want to. This includes the workflow’s source connector, destination connector, partitioning, chunking, and embedding settings. You can also add enrichments to the workflow if you want to.
Image summary descriptions, table summary descriptions, and table-to-HTML output are generated only when the Partitioner node in a workflow is set to use the High Res partitioning strategy and the workflow also contains an image description, table description, or table-to-HTML enrichment node.
Setting the Partitioner node to use Auto, VLM, or Fast in a workflow that also contains an image description, table description, or table-to-HTML enrichment node will not generate any image summary descriptions, table summary descriptions, or table-to-HTML output, and it could also cause the workflow to stop running or produce unexpected results.
To change the workflow’s default settings or to add enrichments:
If you did not previously set the workflow to run on a schedule, you can run the workflow now.
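Conceptually, the automatic workflow created above is a single source-to-destination pipeline with the default nodes listed earlier. The following sketch shows what its definition amounts to; the field names here are illustrative assumptions for explanation, not the platform's documented API schema:

```python
# Illustrative only: these field names are assumptions, not the Unstructured API schema.
default_workflow = {
    "name": "my-automatic-workflow",
    "source_id": "<source-connector-id>",       # placeholder
    "destination_id": "<destination-connector-id>",  # placeholder
    "schedule": None,  # corresponds to "Don't repeat"
    "nodes": [
        {"type": "partitioner", "strategy": "auto"},
        {"type": "chunker", "strategy": "chunk_by_title"},
        {"type": "embedder"},
        # No enrichment nodes by default.
    ],
}

node_types = [n["type"] for n in default_workflow["nodes"]]
```

Changing the workflow's settings later, as described next, corresponds to editing or adding nodes in this pipeline.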
If you already have an existing workflow that you want to change, do the following:
You must first have an existing source connector and destination connector to add to the workflow.
You can create a custom workflow that uses a local source connector, but you cannot save the workflow.
If you do not have an existing connector for either your target source (input) or destination (output) location, create the source connector, create the destination connector, and then return here.
To see your existing connectors, on the sidebar, click Connectors, and then click Sources or Destinations.
On the sidebar, click Workflows.
Click New Workflow.
Click the Build it Myself option, and then click Continue.
In the This workflow pane, click the Details button.
Next to Name, click the pencil icon, enter a unique name for this workflow, and then click the check mark icon.
If you want this workflow to run on a schedule, click the Schedule button. In the Repeat Run dropdown list, select one of the scheduling options, and fill in the scheduling settings.
To overwrite any previously processed files, or to retry any documents that fail to process, click the Settings button, and check either or both of the boxes.
The Reprocess All Files box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:
Checking this box reprocesses all documents in the source location on every workflow run.
Unchecking this box causes future runs to process only documents that have been added to, or updated in, the source location since the last workflow run. Previously processed documents are not processed again. However:
The workflow begins with the following layout:
The following workflow layouts are also valid:
In the pipeline designer, click the Source node. In the Source pane, select the source location. Then click Save.
To use a local source location, do not choose a source connector.
If the workflow uses a local source location, in the Source node, drag or click to specify a local file, and then click Test. The workflow’s results are displayed on-screen.
A workflow that uses a local source location has the following limitations:
Click the Destination node. In the Destination pane, select the destination location. Then click Save.
As needed, add more nodes by clicking the plus icon (recommended) or Add Node button:
Click Connect to add another Source or Destination node. You can add multiple source and destination locations. Files will be ingested from all of the source locations, and the processed data will be delivered to all of the destination locations. Learn more.
Click Enrich to add a Chunker or Enrichment node. Learn more.
Image summary descriptions, table summary descriptions, and table-to-HTML output are generated only when the Partitioner node in a workflow is set to use the High Res partitioning strategy and the workflow also contains an image description, table description, or table-to-HTML enrichment node.
Setting the Partitioner node to use Auto, VLM, or Fast in a workflow that also contains an image description, table description, or table-to-HTML enrichment node will not generate any image summary descriptions, table summary descriptions, or table-to-HTML output, and it could also cause the workflow to stop running or produce unexpected results.
Click Transform to add a Partitioner or Embedder node. Learn more.
Make sure to add nodes in the correct order. If you are unsure, see the usage hints in the blue note that appears in the node’s settings pane.
To edit a node, click that node, and then change its settings.
To delete a node, click that node, and then click the trash can icon above it.
Click Save.
If you did not set the workflow to run on a schedule, you can run the workflow now.
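The multiple-source, multiple-destination behavior described above (files are ingested from all source locations, and processed data is delivered to all destination locations) can be sketched in plain Python; the connector names and the processing stand-in are illustrative:

```python
def run_workflow(sources, process, destinations):
    """Ingest from every source, process each file, deliver results everywhere.

    sources: dict mapping source name -> list of file names.
    process: callable applied to each file (a stand-in for partition/chunk/embed).
    destinations: dict mapping destination name -> list collecting delivered results.
    """
    for source_name, files in sources.items():
        for file in files:
            result = process(file)
            # Processed data is delivered to ALL destinations, not just one.
            for sink in destinations.values():
                sink.append(result)
    return destinations
```

So with two sources and two destinations, every destination receives the processed output of every source file.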
Partitioner node
Choose one of the four available partitioning strategies.
Unstructured recommends that you choose the Auto partitioning strategy in most cases. With Auto, Unstructured does all of the heavy lifting, optimizing at runtime, page by page, for the highest quality at the lowest cost.
Consider the following strategies only if you are sure that all of your documents are of the same type. Each of the following strategies is best suited for specific situations. Choosing a strategy other than Auto for a set of documents of mixed types could produce undesirable results, including a reduction in transformation quality.
.bmp, .gif, .heic, .jpeg, .jpg, .pdf, .png, .tiff, and .webp.
For VLM, you must also choose a VLM provider and model. Available choices include:
Anthropic:
OpenAI:
Amazon Bedrock:
Vertex AI:
When you use the VLM strategy with embeddings for PDF files of 200 or more pages, you might notice some errors when these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
Chunker node
For Chunkers, select one of the following:
Chunk by title: Preserve section boundaries, and optionally page boundaries as well. A single chunk will never contain text that occurred in two different sections. When a new section starts, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk. Also, specify the following:
Whether to output the elements that were used to form the chunk into the chunk's metadata field's orig_elements field. By default, this box is unchecked.
Chunk by character (also known as basic chunking): Combine sequential elements to maximally fill each chunk. Also, specify the following:
Whether to output the elements that were used to form the chunk into the chunk's metadata field's orig_elements field. By default, this box is unchecked.
Chunk by page: Preserve page boundaries. When a new page is detected, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk. Also, specify the following:
Whether to output the elements that were used to form the chunk into the chunk's metadata field's orig_elements field. By default, this box is unchecked.
Chunk by similarity: Use the sentence-transformers/multi-qa-mpnet-base-dot-v1 embedding model to identify topically similar sequential elements and combine them into chunks. Also, specify the following:
Whether to output the elements that were used to form the chunk into the chunk's metadata field's orig_elements field. By default, this box is unchecked.
Learn more:
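The chunking strategies above share one idea: accumulate elements into a chunk until a boundary (a section title, a page break, a size limit, or a topic shift) forces a split. Here is a simplified sketch of the Chunk by Title rule, not the platform's actual implementation:

```python
def chunk_by_title(elements, max_characters=500):
    """Toy illustration of title-based chunking: a chunk never spans sections.

    elements: list of (category, text) tuples, e.g. ("Title", "Intro").
    """
    chunks, current = [], []
    for category, text in elements:
        # A new Title starts a new section: close the current chunk even if
        # the next element would still have fit in it.
        if category == "Title" and current:
            chunks.append(" ".join(current))
            current = []
        # Also close the chunk when adding the element would exceed the limit.
        if current and len(" ".join(current + [text])) > max_characters:
            chunks.append(" ".join(current))
            current = []
        current.append(text)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Chunk by page works the same way with a page-number change as the boundary, and Chunk by similarity replaces the boundary test with an embedding-similarity comparison.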
Enrichment node
Choose one of the following:
Image summary descriptions, table summary descriptions, and table-to-HTML output are generated only when the Partitioner node in a workflow is set to use the High Res partitioning strategy and the workflow also contains an image description, table description, or table-to-HTML enrichment node.
Setting the Partitioner node to use Auto, VLM, or Fast in a workflow that also contains an image description, table description, or table-to-HTML enrichment node will not generate any image summary descriptions, table summary descriptions, or table-to-HTML output, and it could also cause the workflow to stop running or produce unexpected results.
Image to summarize images. Also select one of the following provider (and model) combinations to use:
Table to summarize tables. Also select one of the following provider (and model) combinations to use:
After you choose the provider and model, make sure that Table Description is displayed. If both Table Description and Table to HTML are displayed, be sure to select Table Description.
Table to convert tables to HTML. Also select one of the following provider (and model) combinations to use:
After you choose the provider and model, make sure that Table to HTML is selected.
Text to generate a list of recognized entities and their relationships by using a technique called named entity recognition (NER). Also select one of the following provider (and model) combinations to use:
You can also customize the prompt used to add or remove entities and relationships. In the Details tab, under Prompt, click Edit. Click Run Prompt in the Edit & Test Prompt section to test the prompt.
Embedder node
For Select Embedding Model, select one of the following:
Azure OpenAI: Use Azure OpenAI to generate embeddings with one of the following models:
text-embedding-ada-002, with 1536 dimensions.
Amazon Bedrock: Use Amazon Bedrock to generate embeddings with one of the following models:
TogetherAI: Use TogetherAI to generate embeddings with one of the following models:
Voyage AI: Use Voyage AI to generate embeddings with one of the following models:
Learn more:
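Whichever embedding provider you choose, use the same model for all documents that will be searched together: similarity search assumes every vector has the same number of dimensions. A minimal sketch of the downstream comparison that vector databases perform:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

Mixing models (for example, 1536-dimension vectors with 1024-dimension vectors) in one destination makes this comparison impossible, which is why changing the Embedder model generally requires reprocessing the source documents.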
To run a workflow once, manually:
For each of the workflows on the Workflows list page, the following actions are available by clicking the ellipses (the three dots) in the row for the respective workflow:
To stop running a workflow that is set to run on a repeating schedule:
Turning off the Status toggle also disables the workflow’s Run button, which prevents that workflow from being run manually as well.
To resume running the workflow on its original repeating schedule, as well as enable the workflow to be run manually as needed, turn on the workflow’s Status toggle.
To duplicate (copy or clone) a workflow:
On the sidebar, click Workflows.
In the list of workflows, click the ellipses (the three dots) in the row for the workflow that you want to duplicate.
Click Duplicate.
A duplicate of the workflow is created with the same configuration as the original workflow. The duplicate workflow has the same display name as the original workflow but with (Copy) at the end.