The GenAI automation potential of data extraction (on Xebia.com)

Introduction

GenAI has been around for a little while and its capabilities are expanding quickly. We are experiencing peak-hype levels for AI and expectations are sky-high. Still, many companies fail to generate actual value with GenAI. Why is that? Why are we promised so much, yet get so little? What is still fantasy, and what concrete potential exists? What should be automated, and what should not? In this blog post, we explore the GenAI automation potential that exists today for data extraction.

Together, we will learn about:

  1. Why GenAI data extraction
  2. The automation levels
  3. The automation potential

Let’s start!

1. Why GenAI data extraction

Why exactly should we care about GenAI data extraction? Let us motivate this by looking at four example use cases in different domains, covering various data types: text, images, documents, and audio.

  • Insurance claims (text)
  • Menu cards (images)
  • Annual reports (PDF documents)
  • Customer service calls (audio)

Insurance claims (text)

Imagine you are running an insurance company and your employees are tasked with processing insurance claims. Claims are often provided in a free-form format: email, phone call, chat, in short: unstructured data. Your employees will then need to process this information to handle the claim: update the internal databases, find similar claim cases, etc. Let’s take the case where we receive this information in free-form text like emails.

LLMs can extract structured data from free-form text like an insurance claim, saving employees time doing it manually. See Code.
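To make the pattern concrete, here is a minimal sketch (the real implementation is in the linked Code): instruct the model to reply in JSON, then validate the reply against a fixed schema. The field names below are hypothetical, not taken from the actual solution.

```python
import json
from dataclasses import dataclass

# Hypothetical target schema for an insurance claim
# (field names are illustrative).
@dataclass
class Claim:
    claimant_name: str
    policy_number: str
    incident_date: str
    damage_description: str

def parse_claim(llm_json: str) -> Claim:
    """Validate the LLM's structured output against the Claim schema."""
    data = json.loads(llm_json)
    # Fails loudly if the model omitted a required field.
    return Claim(**{f: data[f] for f in Claim.__dataclass_fields__})

# Simulated model response; in practice this string would come from the
# LLM, prompted to reply with JSON matching the schema above.
response = ('{"claimant_name": "J. Doe", "policy_number": "P-123", '
            '"incident_date": "2024-05-01", '
            '"damage_description": "Water damage in kitchen"}')
claim = parse_claim(response)
```

Validating the model output against an explicit schema is what makes the extracted data safe to transform and load downstream.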

We instruct the LLM to give its answers in a structured format [ref], so we can easily work with and transform its output. Clearing this entire mailbox of some 25 emails took Gemini 2.0 Flash about 15 seconds when run sequentially; run in parallel, the same amount of data can be processed in a matter of seconds. If this were scaled to a mailbox of 100k emails, we would spend about $1.52. Not an awful lot, given the serious time we can save here. We can also halve this cost by using Gemini’s Batch API instead. How long would it take a human to process 100k emails? What is their time worth?

Extracting client information from this email message took Gemini 2.0 Flash about 0.5 seconds and cost $0.000015.
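As a back-of-the-envelope check, the scaling can be computed directly from these per-email figures (the numbers are the ones from this example, not general Gemini pricing):

```python
# Per-email figures measured in this example.
COST_PER_EMAIL = 0.000015     # dollars
SECONDS_PER_EMAIL = 0.5

def estimate(n_emails: int, batch_api: bool = False) -> dict:
    """Rough cost/time projection for a mailbox of n_emails."""
    cost = n_emails * COST_PER_EMAIL
    if batch_api:  # the Batch API roughly halves the cost
        cost /= 2
    return {
        "cost_usd": round(cost, 2),
        "sequential_hours": round(n_emails * SECONDS_PER_EMAIL / 3600, 1),
    }

print(estimate(100_000))                  # ~$1.50, ~13.9 h if sequential
print(estimate(100_000, batch_api=True))  # ~$0.75 with the Batch API
```

Parallel calls shrink the wall-clock time dramatically; the dollar cost stays the same either way.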

Menu cards (images)

Different use case. Imagine now you are running an online platform for food delivery. You want to digitise menu cards so the offering can be ingested into your own platform in a standardised manner. You staff a number of employees to help with this task. The menu cards arrive in image format. Can we speed up the menu conversion process?

LLMs can read menu cards and convert them directly into a desired structured format, cutting development costs for custom OCR and extraction solutions and potentially providing superior extraction performance. See Code.

Yes: we can speed this up. LLMs consider the image as a whole and can also accurately process text details within it. Because we are using a generic model, images in many contexts can be processed by the same model without fine-tuning for specific situations.

Converting this menu card took Gemini 2.0 Flash about 10 seconds and cost $0.0005. How long would a human take?
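Downstream, the model’s nested menu output can be flattened into rows ready for platform ingestion. A minimal sketch, with a hypothetical menu schema:

```python
import json

# Hypothetical nested menu structure the model could be asked to return.
menu_json = """
{
  "restaurant": "Trattoria Roma",
  "sections": [
    {"name": "Pizza", "items": [{"name": "Margherita", "price": 9.5},
                                {"name": "Diavola", "price": 11.0}]},
    {"name": "Drinks", "items": [{"name": "Espresso", "price": 2.0}]}
  ]
}
"""

def menu_to_rows(raw: str) -> list[dict]:
    """Flatten nested menu JSON into one row per menu item."""
    menu = json.loads(raw)
    return [
        {"restaurant": menu["restaurant"], "section": s["name"],
         "item": i["name"], "price": i["price"]}
        for s in menu["sections"] for i in s["items"]
    ]

rows = menu_to_rows(menu_json)
```

The same flattening works regardless of how the menu photo was laid out, which is exactly the point of letting the model do the reading.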

Annual reports (PDF documents)

Imagine you are tasked with answering questions based on financial documents. You go through each question and reason about the answer based on information provided in the documents. Take annual statements, for example. Companies worldwide are obliged to publish these statements and to have their finances audited. Going through such documents is time-consuming. Take an annual statement in PDF format. Can we help employees formulate answers based on this document faster?

An entire annual report of 382 pages fits in the model context at once, allowing the model to answer questions considering the whole document. Reasoning-type questions, such as audit questionnaires, can now be answered by LLMs, citing where in the document the answer is located. The human gets a head start and only has to check and extend the answers, saving time. See Code.

Yes. The document fits in the model context all at once, allowing the model to take a comprehensive view of the document and formulate sensible answers. In fact, LLMs like Google’s Gemini models take up to 2 million tokens in context at once [ref]. This allows for up to 3,000 PDF pages to be inserted into the prompt. PDF pages are processed as images by Gemini [ref].
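A quick sanity check on that page count, assuming roughly 650 tokens per PDF page when pages are processed as images (an assumed average, chosen to be consistent with the figures above):

```python
# Rough context-capacity check. TOKENS_PER_PDF_PAGE is an assumption,
# not an official figure; actual token counts vary per page.
CONTEXT_TOKENS = 2_000_000
TOKENS_PER_PDF_PAGE = 650

max_pages = CONTEXT_TOKENS // TOKENS_PER_PDF_PAGE
print(max_pages)  # 3076 pages fit under these assumptions

report_pages = 382
print(report_pages <= max_pages)  # True: the whole report fits at once
```

Because the entire report fits, no retrieval or chunking step is needed before asking questions.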

Answering these questions took Gemini 2.0 Flash just under 1 minute and cost $0.015. How long would a human take?

Customer service calls (audio)

Imagine you have a customer service department and are processing many calls daily. The customer service agents do their best to document and record useful information during the call, but cannot possibly capture all valuable information. Calls are temporarily recorded for analytical purposes. Recordings could be listened back to, to figure out the customer’s problem, the root cause, the solution, and the customer sentiment. This information is valuable because it is the feedback needed to improve customer service in a targeted way. What issues were we most often unable to resolve? In what situations are our customers left unhappy?

The insights are valuable, but hard to get to. The data is hidden in an audio format.

LLMs can listen to audio quicker than any human. Customer service calls can be analysed, and previously inaccessible but valuable information can be extracted and put to use. Insights can be gathered in an automated fashion, improving business processes in a targeted way. See Code.

Gemini 2.0 Flash processed this 5-minute customer service call in 3 seconds and spent $0.0083. That is faster than any human can listen to and process the call. LLMs can also accurately process audio without the need for a transcription step first. Because the model is exposed to the raw audio file, more than just the spoken words can be taken into account. In what way is this customer talking? Is this customer happy or unhappy? Off-the-shelf LLMs can be used without modification, just by prompting, to extract structured data from audio [ref].
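Once each call is summarised into structured fields, the questions above can be answered with simple aggregations. A sketch with a hypothetical per-call schema:

```python
from collections import Counter

# Hypothetical per-call summaries as the LLM might return them
# (schema and values are illustrative).
calls = [
    {"issue": "late delivery", "resolved": False, "sentiment": "negative"},
    {"issue": "wrong item", "resolved": True, "sentiment": "neutral"},
    {"issue": "late delivery", "resolved": False, "sentiment": "negative"},
    {"issue": "refund request", "resolved": True, "sentiment": "positive"},
]

# "What issues were we most often unable to resolve?"
unresolved = Counter(c["issue"] for c in calls if not c["resolved"])
print(unresolved.most_common(1))  # [('late delivery', 2)]

# "In what situations are our customers left unhappy?"
sentiment = Counter(c["sentiment"] for c in calls)
print(sentiment)
```

This is the payoff of Structured Data Extraction: once the audio is reduced to fields, ordinary analytics answer the business questions.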

Concluding on the four use case examples, we learned the following.

  • LLMs for data extraction: LLMs can be used to extract structured data from free-form formats like text, images, PDFs, and audio.
  • Larger contexts: Growing LLM context windows allow processing of larger documents at once. This strengthens the applicability for data extraction using LLMs without the need for extra retrieval steps.
  • Competitive pricing: Lower inference costs make more data extraction use cases feasible.
  • GenAI data extraction ROI: LLMs can complete data extraction tasks faster than humans and at lower cost. Albeit not perfect, the goal is to create a system that is good enough to assist humans in their work and provide business value.

That is really cool. LLMs are becoming more capable and cheaper, making more use cases possible than before. So what exactly is this data extraction? How can we use it to our advantage to automate business processes? What are the automation levels for data extraction?

2. The automation levels

To discover what is possible with GenAI data extraction, we will distinguish 4 levels of increasing automation.

  • Level 0: Manual
  • Level 1: Assisted
  • Level 2: Autopilot
  • Level 3: Autonomous

Level 0: Manual work

No automation is applied. Human labor is required to extract data in the desired format. The extracted data is then consumed by a human user. The data extraction, the evaluation of data quality, and any actions taken with the data are all manual human processes.

Without automation, manual work is required to extract data in a structured format. After extraction, humans decide what to do with the data.

We can do better. LLMs can be used to automate part of this process, helping the human in the Assisted automation level.

Level 1: Assisted

In the assisted workflow, an LLM is used to extract useful data. This process we like to call Structured Data Extraction.

LLMs can be used for structured data extraction: extracting useful data from free-form documents in structured format.

That can already save a lot of time. Previously, we had to manually work through the document to formulate answers in the target format, a tedious process. At this level of automation, though, the output is still targeted at a human user. The human user is responsible for evaluating all of the LLM output and deciding what to do with the data. This gives the human control, but also costs extra time. We can do better in Autopilot.

Level 2: Autopilot

In the autopilot workflow we go a step further. The extracted data is directly ingested into a Data Warehouse, opening up new automation possibilities. With the data in the data warehouse, it can be used more efficiently: we can now feed dashboards and insights, as well as Machine Learning use cases.

When extracted data is ingested into a Data Warehouse, it can be used more efficiently. Dashboards can reveal new insights and ML use cases can benefit from richer available data.

Ingesting this data directly into a Data Warehouse can indeed be beneficial. But with the introduction of this automated ingestion, we also need to be careful. Data Warehouse consumers are farther away from the extraction process and are not aware of the data quality. We need to always ask ourselves: is this data reliable?

Data Warehouse consumers need to always be aware of data quality and reliability.

Bad data quality can lead to misleading insights and bad decisions further down the line. This is not what we want! Quantified metrics informing us about the data’s reliability are required. Those with expertise in the dataset and those involved in the extraction process will need to help out: typically an AI Engineer and a Subject Matter Expert. We need to introduce an evaluation step.

A Subject Matter Expert is required to label samples of data, informing Data Warehouse consumers and engineers on the data reliability. The Subject Matter Expert is to be enabled to also experiment with the data extraction process, lowering iteration times.

More steps are introduced, keeping the human in the loop. The steps are necessary, though. The Subject Matter Expert plays a key role in ensuring the quality and reliability of the data. This gives Data Warehouse consumers the trust they are looking for and at the same time enables the AI Engineer to improve the Structured Data Extraction system more systematically. Additionally, enabling the Subject Matter Expert to be part of the prompting process themselves lowers iteration times and reduces context being lost in translation between AI Engineer and Subject Matter Expert. Win-win.
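The evaluation step itself can start very simple. A sketch: compare LLM extractions against SME-labelled samples and report per-field accuracy (field names and data are illustrative):

```python
def field_accuracy(extracted: list[dict], labelled: list[dict]) -> dict:
    """Fraction of samples where the LLM output matches the SME label,
    computed per field. Assumes both lists are aligned by sample."""
    fields = labelled[0].keys()
    n = len(labelled)
    return {f: sum(e[f] == l[f] for e, l in zip(extracted, labelled)) / n
            for f in fields}

# SME-labelled ground truth vs. what the extraction pipeline produced.
labelled  = [{"name": "J. Doe", "amount": 120},
             {"name": "A. Smith", "amount": 75}]
extracted = [{"name": "J. Doe", "amount": 120},
             {"name": "A. Smith", "amount": 80}]  # amount extracted wrongly

scores = field_accuracy(extracted, labelled)
print(scores)  # {'name': 1.0, 'amount': 0.5}
```

Publishing these per-field scores alongside the data gives Data Warehouse consumers the reliability signal they need, and tells the AI Engineer exactly which fields to improve.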

We can go more automated, even. Let’s continue to the last level: Autonomous.

Level 3: Autonomous

In this level, the last human interactions are to be automated. Evaluation previously done manually by the Subject Matter Expert is now done by Quality Assurance and evaluation tooling.

Introducing Quality Assurance and evaluation tooling allows the Structured Data Extraction system to run fully autonomously.

So what do we mean by such tooling? We want tooling that helps us guarantee the outcome and quality of our data with minimal human intervention. This can be an LLM-as-a-judge system, for example. The effort required to successfully implement such systems and run them safely, however, is high. One gets value in return from the extra automation, but whether this is worth the effort and cost is the question. Let’s compare the automation levels and summarise their potential.
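As a minimal illustration of such a quality gate, assuming the judge model is prompted to reply PASS or FAIL per extracted record (the verdict format is an assumption for this sketch):

```python
def pass_rate(verdicts: list[str]) -> float:
    """Fraction of judge verdicts that are PASS."""
    return sum(v.strip().upper() == "PASS" for v in verdicts) / len(verdicts)

def release_gate(verdicts: list[str], threshold: float = 0.95) -> bool:
    """Only let the pipeline publish data when the judged
    pass rate clears the threshold."""
    return pass_rate(verdicts) >= threshold

# Verdicts as an LLM judge might return them for a batch of extractions.
verdicts = ["PASS", "PASS", "FAIL", "PASS"]
print(pass_rate(verdicts))     # 0.75
print(release_gate(verdicts))  # False: hold this batch back
```

The gate is trivial; the hard part, as noted above, is making the judge itself reliable enough to remove the human.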

3. The automation potential

We have learned about each of the four levels of automation for data extraction using GenAI. For each level, we have also explored what a Structured Data Extraction system would look like architecturally. That is great, but where is the most potential? Let’s first summarise the automation levels as follows:

The Automation Levels for GenAI data extraction.

| Level | Automation | Description | Human-in-the-loop |
| --- | --- | --- | --- |
| 0 | Manual | Human performs all work. Human responsible for data extraction, quality assurance and consumption of data. | Yes |
| 1 | Assisted | Human is assisted with tasks but still in control and responsible. Data is extracted using an LLM. Extracted data is used to assist human users. | Yes |
| 2 | Autopilot | Tasks largely automated; human supervises. Extracted data is upserted directly into a Data Warehouse. Systematic evaluation is necessary and Subject Matter Expert involvement is key. | Yes |
| 3 | Autonomous | Tasks fully automated. Evaluation is automated using Quality Assurance and evaluation tooling. Fully automated pipelines: no human intervention or supervision. | No |

So say you are thinking about implementing a data extraction use case. What is the ultimate goal? The ultimate goal need not always be to automate as much as possible: it should be to create value. Perhaps you can benefit enough from your use case if Assisted or Autopilot automation is applied, and the extra investment in full automation is not worth it. If we take the menu card example from earlier, what would potential time savings look like?

Time savings for automating data extraction (example): 
30 minutes (manual) → 5 minutes (assisted) → 1 minute (autopilot) → 0 minutes (autonomous).
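The marginal saving of each step can be read off directly from these example numbers:

```python
# Minutes per menu card at each automation level (example figures above).
minutes = {"manual": 30, "assisted": 5, "autopilot": 1, "autonomous": 0}

# Minutes saved by each successive automation step.
levels = list(minutes)
marginal = {f"{a} -> {b}": minutes[a] - minutes[b]
            for a, b in zip(levels, levels[1:])}
print(marginal)
# {'manual -> assisted': 25, 'assisted -> autopilot': 4,
#  'autopilot -> autonomous': 1}
```

The first step saves 25 minutes per card; the last step saves only 1, while being the most expensive to build.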

We can see that the time savings are not linear: each additional automation step saves less time than the one before, yet the last step is the hardest and most expensive to implement. It is not always worth the effort and cost to implement this last step. These are the Diminishing Returns of automation, which can be plotted as follows:

Automation is nice but value is the goal. Take automation step-by-step. Partial automation is also valuable.

Let’s sum things up. To conclude, GenAI for data extraction is:

  • Useful for a wide variety of usecases with various data types including text, images, PDFs and audio.
  • More likely to provide ROI due to 1) cheaper models, 2) larger context windows and 3) more capable models able to process multi-modal data.
  • Very well suited for partial automation, which can bring business value without needing extra investment for full automation.

Good luck with your own GenAI data extraction usecases 🍀♡.