DocETL: Semantic Steering for Large-Scale Document Processing

Exploring DocETL - a powerful framework for extracting insights from unstructured documents using LLM operations with built-in optimization and feedback loops
Author

Alex Kelly

Published

August 30, 2025

Introduction

In the world of data processing, we’ve long had tools like Pandas for working with structured, tabular data. But what happens when you need to extract meaningful insights from thousands of unstructured documents - emails, PDFs, customer service transcripts, or research papers? This is where semantic steering comes into play, and DocETL represents an approach to this challenge.

DocETL is a framework that lets you process documents (JSON, CSV, and PDFs) through a series of LLM operations to extract exactly the information you need. Think of it as “Pandas for LLM operations” - but with something crucial that traditional data processing lacks: a graphical interface covering the complete outer loop, from the start of the process to the end, with built-in optimization and human-in-the-loop feedback, all of your comments stored in situ.

What Makes DocETL Different

The Complete Processing Loop

What sets DocETL apart from ad-hoc LLM processing is its comprehensive approach:

  1. Document Input: Supports JSON, CSV, and PDF formats
  2. LLM Operations: Chain together map, reduce, filter operations
  3. Visual Feedback Interface: See results in a graphical format
  4. In-Situ Comments: Add comments directly against specific data points
  5. Automated Optimization: LLM-as-a-judge automatically improves prompts and operations

The final step is arguably the most important - it lets you view the information being processed through LLM operations in a graphical format and write comments directly on the data. This isn’t just about convenience; it’s about creating a feedback loop that makes your entire pipeline smarter.

Human-in-the-Loop Optimization

When you add comments to specific pieces of data, DocETL stores that information against those exact data points. This becomes incredibly powerful because:

  • Prompt Improvement: The system can update prompts based on your comments plus the original data
  • Iterative Refinement: Comments help the LLM understand what you actually want versus what you asked for
  • Automatic Optimization: The LLM-as-a-judge feature automatically optimizes the entire process after each iteration

Real-World Example: Airline Customer Service Analysis

Let me walk through a practical example using a dataset of airline customer service chat transcripts - a small sample of 250 rows of customer interactions that need to be analyzed for sentiment, complaints, and actionable insights.

The Pipeline Structure

The demonstration uses a two-step pipeline:

  1. Map Operation: Analyze each individual chat to identify complaints, issues, and customer frustration
  2. Reduce Operation: Summarize findings across all chats to identify common themes
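The two-step pattern can be sketched in plain Python. This is an illustrative sketch, not DocETL's actual API - `call_llm` is a hypothetical stand-in for whatever LLM client you use:

```python
def call_llm(prompt: str) -> str:
    # Stand-in: a real implementation would call an LLM API here.
    return "low"

def map_step(transcripts: list[str]) -> list[dict]:
    """Map operation: analyze each chat individually."""
    results = []
    for text in transcripts:
        frustration = call_llm(
            f"Rate the customer's frustration (high/medium/low):\n{text}"
        )
        results.append({"transcript": text, "frustration": frustration})
    return results

def reduce_step(analyses: list[dict]) -> str:
    """Reduce operation: summarize findings across all chats."""
    joined = "\n".join(a["frustration"] for a in analyses)
    return call_llm(f"Summarize common themes from these ratings:\n{joined}")

analyses = map_step(["My flight was cancelled twice!", "Great service, thanks."])
summary = reduce_step(analyses)
```

The key structural point: the map step produces one structured record per transcript, and the reduce step consumes all of those records at once.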

Map Operation: Individual Chat Analysis

The first operation processes each customer service transcript with a detailed prompt that extracts:

  • Chief Complaint: Direct quotes capturing the customer’s exact words
  • Complaint Category: Pricing, tech issues, customer service, or animal assistance
  • Frustration Level: High, medium, or low based on tone analysis
  • User Status: Frequent flyer member status and loyalty indicators
  • Negative Comments: Specific issues mentioned
  • Positive Comments: Any praise or positive experiences

Here’s what makes this powerful - the prompt includes example outputs to guide the LLM’s response format, ensuring consistent, structured results across thousands of records.
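The example-output technique might look like the sketch below. The field names and sample values here are illustrative assumptions, not DocETL's schema - the point is that embedding a sample JSON object in the prompt steers the model toward a consistent structure:

```python
import json

# Hypothetical example output embedded in the prompt to enforce a
# consistent schema across thousands of records.
EXAMPLE_OUTPUT = {
    "chief_complaint": "flight was delayed 6 hours with no updates",
    "complaint_category": "customer service",
    "frustration_level": "high",
    "negative_comments": ["no updates", "6 hour delay"],
    "positive_comments": [],
}

def build_prompt(transcript: str) -> str:
    """Build an extraction prompt that includes the example output."""
    return (
        "Analyze this customer service transcript. Respond with JSON "
        "matching this example's structure exactly:\n"
        + json.dumps(EXAMPLE_OUTPUT, indent=2)
        + "\n\nTranscript:\n"
        + transcript
    )
```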

The Visual Interface in Action

One of DocETL’s strengths is that you can immediately see the results. The interface shows:

  • Original text data on one side
  • Newly extracted columns with structured information
  • Ability to add comments directly on specific records

For instance, looking at one customer complaint: “Absolutely terrible nightmarish experience… not worth the headache… embarrassing that this is a Canadian company.”

The system extracted this into structured data while maintaining the emotional context and specific details. But here’s where it gets interesting - you can add comments like:

  • “Very good description, but needs more details”
  • “Make this shorter”
  • “Add more specific examples”

Automated Prompt Improvement

When you click “Improve Prompt,” DocETL takes your comments along with the associated data records and automatically suggests improvements to the prompt. The LLM understands both what you’re asking for and what your feedback indicates, then updates the prompt to be clearer and more effective.

This creates a powerful feedback loop where the system gets better with each iteration, learning from your domain expertise.
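Conceptually, the improvement step combines your comments with the records they were attached to into a meta-prompt. This is a hedged sketch of the idea, not DocETL's internals - `revise_with_llm` is a hypothetical stand-in for an LLM call:

```python
def revise_with_llm(meta_prompt: str) -> str:
    # Stand-in: a real call would return a rewritten prompt from the LLM.
    return meta_prompt

def improve_prompt(current_prompt: str, feedback: list[dict]) -> str:
    """Build a meta-prompt from in-situ comments and their data records."""
    notes = "\n".join(
        f"- Record: {f['record']!r}\n  Comment: {f['comment']}" for f in feedback
    )
    meta_prompt = (
        "You are improving an extraction prompt.\n"
        f"Current prompt:\n{current_prompt}\n\n"
        f"User feedback on specific outputs:\n{notes}\n\n"
        "Rewrite the prompt to address the feedback. Return only the new prompt."
    )
    return revise_with_llm(meta_prompt)
```

Because the comments stay attached to exact records, the revising LLM sees both the instruction and the concrete output it failed on.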

Reduce Operation: Executive Summary

The second operation takes all the individual chat analyses and creates an executive summary. In our airline example, this reduced 250 individual records into a single comprehensive report that executives could use to understand:

  • Common complaint patterns across all customer interactions
  • Areas needing immediate attention
  • Overall sentiment trends
  • Specific issues affecting customer satisfaction

This type of analysis could run daily, automatically processing new customer service tickets and providing executives with up-to-date insights into customer satisfaction and operational issues.

Key Features and Capabilities

1. Multi-Format Document Support

  • JSON and CSV for structured data
  • PDF processing for unstructured documents
  • Seamless integration across formats

2. Flexible LLM Operations

  • Map: Process each record individually (like sentiment analysis)
  • Reduce: Aggregate and summarize across records
  • Filter: Remove irrelevant data based on criteria
  • Composition: Chain operations together for complex workflows
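Composition is simple if you think of every operation as a function from a list of records to a list of records. This minimal sketch (again, not DocETL's actual syntax) shows why map, filter, and reduce steps chain freely:

```python
def run_pipeline(records: list, steps: list) -> list:
    """Apply each operation in order, feeding its output into the next."""
    for step in steps:
        records = step(records)
    return records

# Example: a filter step dropping non-complaints, then a map step tagging them.
keep_complaints = lambda rs: [r for r in rs if "complaint" in r]
tag = lambda rs: [{"text": r, "flagged": True} for r in rs]
```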

3. Model Flexibility

  • Support for a range of models, from GPT-5 down to cost-effective options like GPT-5 nano
  • Strategic model selection: use smaller models for simple tasks, larger models for complex analysis
  • Cost optimization through intelligent model routing
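A cost-aware router can be as simple as a heuristic on task type and input size. The model names and thresholds below are illustrative assumptions, not DocETL behavior:

```python
def pick_model(task: str, text: str) -> str:
    """Route simple/short work to a cheap model, complex work to a large one."""
    complex_tasks = {"multi-field extraction", "executive summary"}
    if task in complex_tasks or len(text) > 4000:
        return "large-model"   # placeholder name for a frontier model
    return "small-model"       # placeholder name for a cheap model
```

In practice you would calibrate the threshold and task list against observed failure rates of the cheaper model.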

4. Built-in Optimization

  • LLM-as-a-judge automatically evaluates outputs
  • Iterative improvement of prompts and operations
  • Self-optimizing pipelines that get better over time

5. Visual Data Exploration

  • Graphical interface for viewing processed data
  • In-situ commenting system for providing feedback
  • Real-time visualization of results and statistics

Why This Matters Now

We’re at an interesting inflection point with LLMs. While some argue we’re approaching the limits of model improvement (which I don’t think is true), even so there’s tremendous untapped value in how we use these models. DocETL represents exactly this opportunity - not just throwing documents at an LLM, but creating systematic, optimizable workflows that extract maximum value from unstructured data.

The Efficiency Advantage

Even if we had unlimited context windows, DocETL’s approach offers several advantages:

  • Cost Management: Process only what’s needed with appropriate model sizes
  • Quality Control: Human feedback ensures outputs meet domain requirements
  • Scalability: Systematic operations handle datasets of any size
  • Reproducibility: Documented workflows can be rerun and modified

Real-World Applications

This approach opens up possibilities across industries:

  • Customer Support: Daily analysis of support tickets for trend identification
  • Research: Processing academic papers to extract specific findings
  • Legal: Contract analysis and compliance checking
  • Healthcare: Medical record analysis for pattern recognition
  • Business Intelligence: Extracting insights from unstructured business documents

Getting Started

DocETL provides both a web interface and programmatic access. You can access the playground here. As of writing, the creator recommends datasets below 100MB; for anything larger, use the Python version, see here. If you’re not comfortable with coding, get the playground working with a subset of your data and then use the Python version to process the rest of your data.

The playground allows you to:

  1. Upload your dataset
  2. View data statistics and structure
  3. Design your processing pipeline
  4. Test operations with immediate feedback
  5. Optimize and iterate based on results

The visual interface makes it accessible to domain experts who understand the data but may not be LLM engineering specialists.

Key Takeaways

DocETL represents a significant step forward in making LLM-powered document processing both accessible and systematic. The key insights from exploring this tool:

The Power of the Complete Loop

Unlike ad-hoc LLM processing, DocETL provides the full workflow from data input to optimized output, with built-in feedback mechanisms that make the entire process smarter over time.

Human-AI Collaboration

The in-situ commenting system creates a genuine partnership between human domain expertise and LLM processing power, resulting in outputs that neither could achieve alone.

Systematic Optimization

Rather than manual prompt engineering, DocETL automates the improvement process while still leveraging human feedback, creating workflows that continuously get better.

Practical Scalability

The framework handles real-world constraints like cost management, model selection, and processing large datasets systematically rather than through one-off solutions.

Democratizing Advanced AI

By providing a visual interface and systematic approach, DocETL makes sophisticated document processing accessible to domain experts across industries.

Limitations and Challenges

While DocETL is a powerful framework, my exploration revealed several important limitations that users should be aware of:

1. High Dependency on User Skill and LLM Knowledge

The effectiveness of DocETL relies heavily on the user’s expertise with prompt engineering and understanding of LLM capabilities.

Prompt Engineering Requirements: Success depends on writing effective prompts that can extract data accurately and consistently, especially when targeting structured formats like JSON. This isn’t trivial - it requires understanding how to phrase requests in ways that guide the LLM toward the desired output format.

Model Selection Complexity: Users need to understand the capabilities of different LLMs. For example, a cheaper model like GPT-4o mini might fail to extract multiple distinct pieces of information from a single complex prompt, requiring you to break the task into several smaller operations. A more advanced model might handle it in one operation but at higher cost.

Data Extraction Experience: The process is much smoother for users who already have experience extracting data from unstructured text into structured formats. Novices might struggle with defining output schemas and writing prompts that reliably enforce them.

2. Linear and Inflexible Workflow

One of the most significant challenges I encountered was the rigid, linear nature of the processing pipeline, which can hinder exploratory data analysis.

The “Point of No Return”: Once you perform a reduce operation - which summarizes extracted columns into a final report - you cannot add further extractions or create additional summaries afterwards.

Hindrance to Experimentation: Data analysis is inherently iterative. You might extract some data, discover an interesting pattern, and want to extract additional fields. DocETL’s current workflow makes it difficult to “fork” the pipeline to explore different paths without starting over, which can slow down discovery and experimentation. One workaround we discussed in our study group is copying the repo and changing the ports for both the FastAPI backend and the frontend so that multiple instances of the app can run.

3. Project and Environment Management Issues

Managing different projects or versions within DocETL can be cumbersome.

No Easy Project Switching: The tool lacks built-in features to save project states (including uploaded data, prompts, notes in situ and outputs) and easily switch between different projects.

Manual Workarounds Required: To work on different tasks or revert to previous states, you must manually export the current pipeline as a YAML file, then re-import it later. This export only saves the files, the operations, and the outputs you need to rerun the pipeline - it does not preserve the in-situ notes and outputs. As before, you can copy the repo and modify the ports for both the FastAPI backend and the frontend to run multiple instances of DocWrangler, each keeping its full state.

4. Challenges with Data Export and Formatting

Even after successfully extracting the data, I had some difficulty using the output in other standard data analysis tools.

  • Formatting Issues: When exporting the extracted data for use in programs like Excel or with libraries like Pandas, the format was not always clean or directly usable. I was unsure whether this was due to inconsistencies in the LLM-generated output or my own “skill issue” in parsing it - I suspect a bit of both. With a bit more experimentation I expect to have this solved soon.
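One workaround I have found useful for messy LLM output is to tolerate code fences and stray text around the JSON, then flatten list-valued fields before CSV export. This is a stdlib-only sketch under those assumptions, not a DocETL feature:

```python
import csv
import io
import json

def parse_llm_json(raw: str) -> dict:
    """Extract the first {...} object from a possibly noisy LLM response."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found")
    return json.loads(raw[start : end + 1])

def to_csv(rows: list[dict]) -> str:
    """Flatten list values (e.g. negative_comments) so Excel/Pandas can read them."""
    flat = [
        {k: "; ".join(v) if isinstance(v, list) else v for k, v in r.items()}
        for r in rows
    ]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(flat[0]))
    writer.writeheader()
    writer.writerows(flat)
    return buf.getvalue()
```

This won't rescue genuinely malformed JSON, but it handles the common case of a valid object wrapped in markdown fences or explanatory text.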

Conclusion

DocETL isn’t just another tool for processing documents with LLMs - it’s a framework for systematic extraction of insights from unstructured data at scale. The combination of flexible operations, visual feedback, and automated optimization creates a powerful platform for turning document chaos into actionable intelligence.

As we continue to generate more unstructured data across every industry, tools like DocETL become essential infrastructure for organizations that want to extract value from their information assets. The future isn’t just about having more powerful models - it’s about having better systems for putting those models to work on real problems.

Whether you’re analyzing customer feedback, processing research literature, or extracting insights from business documents, DocETL provides a systematic approach that scales from small experiments to production workflows. The semantic steering capabilities mean you can guide the process toward exactly the insights you need, while the optimization features ensure your workflows get better over time.

This represents the kind of practical AI application that delivers immediate value while pointing toward a future where sophisticated document processing is as routine as running a spreadsheet calculation.