The Promise and Optimism Surrounding Generative AI Language Models 

There are some amazing innovations happening because of Generative AI Language Models. In addition to better-known cases such as chatbots, marketing automation, and enhanced online shopping experiences, companies are also improving internal efficiency on tasks that have traditionally been manual and document-based, such as: 

  • financial reporting analysis 
  • financial submission analysis 
  • medical research 
  • interactive holograms 
  • knowledge management agents 
  • code assistance 
  • no-touch decision making 
  • …and more 

And that’s merely after year 1! 

But while some of these use cases represent low-hanging fruit with relatively low risk, the second round of use cases has proven significantly harder to tackle. There are several reasons for this, including:  

  • regulatory challenges 
  • concerns about bias 
  • language model risks (e.g., hallucinated or inaccurate output) 
  • the model parameter tuning and output filtering needed to get better results 
  • lack of talent 
  • insufficient governance 
  • poor input data quality
  • …and others 

Of these, I’d like to focus this blog on the last one: poor input data quality, which is often the most time-consuming to overcome. And, specifically, I’d like to talk about unstructured data quality, a challenge that Generative AI has made newly urgent and significant. 

A Big Challenge of Language Models: Unstructured Data Quality 

Everybody says that data quality is a major challenge in implementing their language models, but what exactly does that mean? Well, let’s take a few interesting but real cases and see if you can relate. 

  1. Is the information I want to use to train or tune the model trustworthy enough for the intended purpose? 
  2. My organization’s documents are spread across any number of locations, including duplicate versions at different stages of completion. How do I find the latest versions and ensure that they, and only they, are loaded into the LM? (A sketch of one way to start appears after this list.)
  3. Many LMs do not have a feature to “back out” inaccurate or stale documents. What do I do about that? 
  4. We’ve been able to get by with loosely managed access policies around SharePoint and other document repositories because searching for and finding information in those repositories has been difficult; many times, people didn’t even know they had access to those materials. However, that changes when the documents are loaded into something as easy to search as an LM. What should I do to minimize data leakage within the organization? 
  5. …amongst so many other variations. 
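
To make case 2 concrete, here is a minimal sketch of a first-pass de-duplication scan. It assumes documents sit on ordinary file shares (the folder names and the .docx pattern are placeholders invented for the example): it drops byte-identical copies by content hash and keeps the most recently modified file of each name as the presumed latest version.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Fingerprint file content so byte-identical copies can be dropped."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def latest_versions(roots: list[str], pattern: str = "*.docx") -> list[Path]:
    """Scan the given folders, drop exact duplicates, and keep the most
    recently modified copy of each filename as the presumed latest version."""
    by_name: dict[str, list[Path]] = defaultdict(list)
    seen: set[str] = set()
    for root in roots:
        for path in Path(root).rglob(pattern):
            digest = sha256_of(path)
            if digest in seen:  # byte-identical duplicate found elsewhere
                continue
            seen.add(digest)
            by_name[path.name].append(path)
    # Among remaining copies of each name, the newest modification time wins.
    return [max(paths, key=lambda p: p.stat().st_mtime)
            for paths in by_name.values()]

if __name__ == "__main__":
    for doc in latest_versions(["./shared-drive", "./archive"]):
        print(doc)
```

Filename matching and modification times are crude proxies for “latest version,” which is exactly why a human curation team still needs to review what such a scan produces.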

What Can Be Done 

So, what can an organization do to overcome these challenges? Well, you’re not going to like this answer, but given that this is the first time we’ve had to tackle these issues on an enterprise-wide scale, it is what it is. There is no magic-bullet automated solution, and the best answer right now is to implement a human-driven operating model to curate input data. 

Life would be so much easier if there were a pre-existing solution, if documents had been well managed and locked down with zero trust, and if they were tagged with a level of specificity that AI solutions could easily pick up and understand. It sure would be nice if the data-drift solutions used to manage and monitor predictive and descriptive AI input data magically applied here too, but they don’t, and there isn’t a clear data-drift solution in this space yet. There are some technology solutions that can help with parts of the unstructured-data-quality problem; automated discovery and classification tools like BigID come to mind. But they’re still only partial solutions. 
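
To picture what “discovery and classification” means at its simplest, here is a toy, pattern-based scan. This is a generic illustration, not how BigID or any specific product works; the rules and folder name are placeholders.

```python
import re
from pathlib import Path

# Hypothetical rule set: regexes standing in for the pattern-based tagging a
# discovery tool applies. Real products use far richer signals than this.
RULES = {
    "pii:ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "pii:email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "finance:report": re.compile(r"\b(balance sheet|income statement)\b", re.I),
}

def classify(path: Path) -> set[str]:
    """Tag a text document with every rule its content matches."""
    text = path.read_text(errors="ignore")
    return {tag for tag, rx in RULES.items() if rx.search(text)}

if __name__ == "__main__":
    for doc in Path("./shared-drive").rglob("*.txt"):
        tags = classify(doc)
        if tags:
            print(doc, sorted(tags))
```

Even with richer classifiers and repository connectors, the tags only tell you what a document contains, not whether it is current, trustworthy, or appropriate to load; those are the gaps humans still have to close.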

The most reliable answer for curating unstructured data as input to your model is to have a team of humans doing it. How do we know?  

Well, let’s look at the leader in this space: OpenAI. According to reports, OpenAI has an army of 1,000 contractors curating data for ChatGPT. 

Within large corporations, we see a similar operating model emerging. Morgan Stanley has a group of 20 individuals in the Philippines who are curating data for one of their language models.

The lesson to take away here is that releasing a language model is not just standing one up, loading it with a bunch of data, and letting it run. It requires an ecosystem and operating model, involving new roles (and associated funding) that may not already exist in the organization. 

I’m not addressing every issue related to curating input data and keeping a language model as clean as possible; the point above is the most important place to start. 

Additional Mitigation Actions 

As some of these unstructured data quality challenges are worked through over the next few years, there are steps that can reduce the amount of manual curation required. These steps lead toward purpose-specific rather than general-purpose models. 

  1. Train or tune the LM to perform a specific role, using specific content as input 
  2. Add an LM validation process to check that responses fall within what is known to be true, accurate, or appropriate 
  3. Create a policy on how an LM is to be used, and be clear on what it should not be used for 
  4. Restrict prompts up-front, so out-of-scope requests never reach the model (items 2 and 4 are illustrated in the sketch below) 
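
To make items 2 and 4 concrete, here is a minimal sketch of an up-front prompt filter and a post-hoc response validator. Everything in it is invented for illustration: the blocked topics, the product list, and the `model_call` stub are placeholders for an imaginary insurance Q&A assistant, not any vendor’s API.

```python
import re

# Hypothetical guardrails for a single-purpose assistant.
BLOCKED_TOPICS = re.compile(r"\b(legal advice|medical diagnosis)\b", re.I)
APPROVED_PRODUCTS = {"auto", "home", "renters"}  # known-true reference data
PRODUCT_MENTION = re.compile(r"\b(\w+)\s+(?:insurance|policy)\b", re.I)

def allow_prompt(prompt: str) -> bool:
    """Item 4: reject out-of-scope requests before they reach the model."""
    return not BLOCKED_TOPICS.search(prompt)

def validate_response(response: str) -> bool:
    """Item 2: every product the response names must be one we actually offer."""
    return all(m.group(1).lower() in APPROVED_PRODUCTS
               for m in PRODUCT_MENTION.finditer(response))

def answer(prompt: str, model_call) -> str:
    """Wrap the model call with both guardrails."""
    if not allow_prompt(prompt):
        return "Sorry, that topic is outside what this assistant covers."
    response = model_call(prompt)
    if not validate_response(response):
        return "I couldn't produce a verified answer; please contact an agent."
    return response

if __name__ == "__main__":
    fake_model = lambda p: "We offer auto insurance and boat insurance."
    print(answer("What insurance do you offer?", fake_model))
    # Prints the fallback, because "boat" is not in APPROVED_PRODUCTS.
```

In practice the validation step would check against curated reference data or retrieved source documents, but the shape is the same: nothing goes in or comes out without passing a rule you control.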

I would love to hear any thoughts or ideas that you have! 

Wavestone Can Help 

All the above is still quite surface level, but we have a deeper set of knowledge and defined approaches to help you succeed in your Gen-AI-influenced transformation journey, along with a Change Management approach to reduce the stress your people may feel as part of these significant changes. Please reach out and we’ll partner with you in whatever way your organization needs!