Learning Data and GenAI: Securing the Source of Generative Intelligence

Corporate adoption of generative AI technologies has accelerated rapidly, with 35% of companies incorporating GenAI into their operations in 2022. But security and ethical regulations have not kept pace, and careless use of GenAI technologies can spread harmful information and lead to unethical decisions based on AI inaccuracies.

As aggregation engines, GenAI models are dependent on the learning data they are trained on. Biases in that data are inherited and amplified and, if left unaddressed, pose a risk to the model’s integrity. Knowing how to capture, triage, and process learning data securely and responsibly is a critical prerequisite to efficacy and ethical use.

In this blog, we examine 2 major risk areas inherent to the GenAI learning process, why they occur, and measures businesses can take to identify and prevent them.

Ghosts in the Machine: AI Hallucinations

What is an AI hallucination?

AI hallucinations occur when Large Language Models (LLMs) generate false information. False information covers both deviations from external facts and “internal” errors, meaning inconsistencies within the AI’s own contextual logic.

AI hallucinations illustrate a fundamental GenAI limitation: models can only produce content based on their learning data, and cannot evaluate their outputs against reality.

There are 4 broad types of AI hallucination:

Sentence contradictions. The model generates a sentence that contradicts an earlier sentence in the same output.
Prompt contradictions. The model produces content contrary to the prompt’s specifications.
Factual contradictions. The model presents fictitious information as factual.
Random contradictions. The model introduces information with no connection to the input or the rest of the output.

Why do AI hallucinations occur?

While the precise causes differ from model to model, there are general factors that affect the likelihood of hallucinations. These include:

Data provenance and quality. Inaccurate information in the AI’s learning data will manifest in its output. Incorporating data sets from less reputable sources can increase the likelihood of assimilating incorrect information.
Generation and learning processes. Training procedures can introduce errors over time. Biases towards specific words and phrases can create faulty patterns as minor inaccuracies accumulate over successive generations.
Input quality. Even models built on accurate data sets and thorough training procedures will struggle with inconsistent or contradictory prompts.

Minimizing hallucinations

Hallucinations can be difficult to spot because LLMs are trained to sound fluent and plausible. Deploy the following preventative and reactive countermeasures to reduce how often they occur:

Fact checking. Data Science teams maintaining the AI application and its learning data should conduct frequent checks to remove blatantly erroneous results.
Clear and specific prompts. Providing context can help the AI eliminate nonsensical interpretations and guide the application towards intended output. Practices include:
- Limiting possible output formats and types
- Providing relevant, factual data sources as references
- Framing the query within a role (e.g. “you are a programmer tasked with coding”), to clarify tone and positioning
Filtering and ranking methodologies. Experimenting with the model’s built-in sampling parameters, such as temperature and top-k, can reveal configurations that produce the desired content.
Multi-shot prompting. Providing complete examples of the target format, tone, and positioning can help the model recognize patterns and refine generated content.
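The prompting practices above (a role, a restricted output format, factual references, and multi-shot examples) can be combined in a single prompt template. The following is a minimal sketch in Python, not a prescribed implementation: the `PromptSpec`, `Example`, and `build_prompt` names are hypothetical, and the resulting string would still need to be sent to whichever LLM API is in use.

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    """One complete question/answer pair used for multi-shot prompting."""
    query: str
    answer: str


@dataclass
class PromptSpec:
    """Hypothetical container for the context supplied with every query."""
    role: str                                        # frames tone and positioning
    output_format: str                               # limits output format and type
    references: list = field(default_factory=list)   # factual sources to ground on
    examples: list = field(default_factory=list)     # multi-shot examples


def build_prompt(spec, query):
    """Assemble role, format constraint, references, examples, then the query."""
    sections = [spec.role, f"Respond only in this format: {spec.output_format}"]
    if spec.references:
        numbered = "\n".join(
            f"[{i}] {ref}" for i, ref in enumerate(spec.references, start=1)
        )
        sections.append("Base the answer only on these references:\n" + numbered)
    for ex in spec.examples:
        sections.append(f"Q: {ex.query}\nA: {ex.answer}")
    # The final, unanswered question is what the model is asked to complete.
    sections.append(f"Q: {query}\nA:")
    return "\n\n".join(sections)


if __name__ == "__main__":
    spec = PromptSpec(
        role="You are a programmer tasked with coding.",
        output_format="a single Python expression, no explanation",
        references=["Python built-in function sum() adds the items of an iterable."],
        examples=[Example("Add the numbers 1 to 3.", "sum([1, 2, 3])")],
    )
    print(build_prompt(spec, "Add the numbers 4 to 6."))
```

Keeping the template in one place makes it easier to test which combination of role, references, and examples reduces hallucinated output for a given use case.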

At the Point of Capture: Data Collection, Privacy, and Compliance

The act of selecting, capturing, moving, and storing data for learning purposes is fraught with legal and ethical risks. Major dangers and best practices to manage them include:

Copyright and legal exposure. The vast volumes of data involved in GenAI learning risk producing outputs based on stolen intellectual property. Such theft can provoke legal action, leading to costly reputational and financial damage. Get ahead on compliance by formulating internal ethical practices regulating the target type, source, capture, transit, and storage of data bound for GenAI applications. Internal guidelines will serve as a basis for future adaptations to legal precedents as industry regulations catch up.
Data privacy and consent. Many GenAI learning data sets inadvertently incorporate Personally Identifiable Information (PII) without individual consent. Text prompts can elicit this data, posing a serious risk to data privacy. Because many LLMs are proprietary, it is also difficult to locate personal information within their learning data. Institute frequent and regular checks to ensure deployed LLMs are not embedding PII in their data sets. Alternatively, favor open source LLMs with transparent data processes over proprietary ones. Dedicated communication channels can also give individuals a way to request PII deletion.
Changes to workforce roles. Workers are being increasingly displaced as GenAI assumes routine tasks, such as writing, coding, and analysis. Businesses must develop pathways for workers to stay relevant and contribute value. Prepare employees for new roles created by generative AI applications, like prompt engineering. Review organizational and operational structures to map affected roles and resultant skill gaps, with a comprehensive view of changes needed. Retraining also serves to retool the workforce for growth.
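The regular PII checks recommended under data privacy and consent can start with a simple scan of text before it enters a learning data set. The sketch below is illustrative only: the two regular expressions are simplified assumptions covering email addresses and phone-like numbers, the `find_pii` and `redact` helpers are hypothetical names, and a real deployment would likely need a dedicated PII detection tool and human review, since patterns like these produce both false positives and misses.

```python
import re

# Simplified, assumed patterns; real PII detection needs far broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def find_pii(text):
    """Return (label, matched_text) pairs for every pattern hit in the text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


def redact(text):
    """Replace each pattern hit with a placeholder such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com or +1 415 555 0100 for details."
    print(find_pii(sample))
    print(redact(sample))
```

Running a scan like this on every batch of candidate learning data gives a record of what was found and removed, which also supports responding to individual deletion requests.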

Ultimately, both AI hallucinations and data collection challenges are inherent to GenAI’s reliance on learning data.

Securing learning data is a bespoke challenge, with risks and requirements changing with enterprise needs. Expert advice is recommended to manage learning data safely and optimize AI deployments.

Have a question? Just ask.

Talk to a Wavestone expert for guidance on the challenges of generative intelligence and how to leverage GenAI applications for enhanced business performance.