Large language models in insurance: Hype or real productivity boost?

Published May 17, 2024

  • Data & AI
  • Insurance

A third of companies worldwide are already using generative AI, from pilot projects to full-scale implementation. However, what makes generative AI a potential game-changer compared to earlier machine learning approaches?

We provide an insight into which large language models (LLMs) are available, what needs to be considered for commercial use, and what real added value language models can offer in the insurance industry.

What are large language models?

LLMs are language models based on sequence predictions, i.e., they utilize statistical distributions. LLMs are trained with huge amounts of text from the internet, for example Wikipedia, as well as data from private sources, like news articles.

When a query, known as a prompt, is sent to a model, it determines which word most closely follows the first. This is based on the most frequent word sequences in the training data.

Technically, LLMs rely on transformer architecture – a neural network framework that learns context and meaning by analyzing relationships in sequential data like words in sentences.

Transformers are based on encoders and decoders. Encoders understand speech and are often used for classification and sentiment analysis, such as Google’s BERT. Decoders generate content and language, such as the GPT models from OpenAI.

In contrast, previous machine learning algorithms rely on fixed features and clearly defined inputs. They are therefore specialized for a specific task. LLMs, on the other hand, can independently learn complex patterns from large volumes of unstructured data.

In contrast to traditional algorithms, LLMs require less manual pre-processing and can be adapted to different tasks in many ways through transfer learning.

While offering significant advantages in natural language processing, they demand substantial computing resources and powerful hardware

Which large language models are available and how do they differ?

There are various providers of LLMs – the best known at present are OpenAI with its GPT models and Google with Gemini. They differ in three dimensions:

  1. Data basis
  2. Model weights
  3. Licensing

LLMs can be divided into two groups: Closed or private models and open source models. When using open source models, it is important to check whether the licensing of the selected open source model allows commercial use at all. For example, Meta’s LLaMA does not allow this, whereas LLaMA2 does.

Well-known open source models include LlaMa2 from Meta, Falcon, from the Technology Innovation Institute of the United Arab Emirates, and StableLM, from Stable Diffusion. Closed models are fee-based and can also be highly specialized for certain tasks, such as BloombergGPT from Bloomberg.

Well-known closed models are the GPT models from OpenAI, Gemini from Google, Claude from Anthropic, and Dolly from DataBricks.

But how do these models differ? In terms of the data basis, the models primarily differ in how up to date the data is: For example, OpenAI trained the GPT-1 to GPT-3 models using data only up to 2021, while Google’s Gemini uses real-time data. If you look at an LLM technically, it consists of several billion weights that assign a “strength” or “importance” to a word or sentence.

These weights, determined by the data points from the training sets, essentially act as the model’s fingerprint. The initial weights are first defined by training on existing data. Subsequent fine-tuning of the models involves ‘reinforcement learning with human feedback’ to distinguish between good and bad responses.

Current models have more than 100 billion weights, although the exact number of newer models such as GPT-4 or Gemini Ultra is not known – GPT-3 has 175 billion weights and the older PalM2 model from Google has 530 billion weights.

The increasing number of weights leads to a limitation of the relevant providers, as they require enormous computing capacities for training.

Incidentally, the computing power required also poses a problem for our environment. For instance, the carbon dioxide emissions from training GPT-3, over 550 tons, are roughly equivalent to what one person would emit on nearly 550 flights from New York to San Francisco.

However, the emissions also vary between models and depend on the currency of the hardware, the country, and fundamentally, the size of the models.

Current research is tackling precisely this problem by developing smaller and much more specialized models, known as mixtures of experts (MoE).

Large Language models: How to get started?

LLMs are easy to try out: Write a prompt in the input field and enjoy the fantastic answer. The responses of LLMs can also be fascinating when a model starts to hallucinate.

Hallucinating in this context means that models generate content that seems very plausible but is nevertheless incorrect. This can lead to problems, especially in a business context.

How does this happen?

This can occur due to inputs that are too short and imprecise or simply because of insufficient information. However, this should only occur to a limited extent. After all, an LLM is not an encyclopedia designed to provide factual explanations, but a tool for creating new, plausible content.

Hallucination can be mitigated by precise prompt engineering, among other strategies. This involves providing specific instructions and contextual information.

Prompt engineering can be categorized by complexity into three approaches: zero-shot, one-shot, and few-shot.


Here, an LLM is confronted with a prompt for which it has not been specifically trained. The LLM should be able to understand the prompt and respond to it, even though it has not seen a direct example of this prompt during training.

One-shot and few-shot:

An LLM receives a single example (one-shot) or a small number (few-shot) of examples of a particular task via prompt and should learn to understand and cope with these prompts.

These approaches allow LLMs to be used in a variety of ways without providing an extensive set of examples for specific prompts.

Which model is best for a particular use case depends on various parameters. For example, whether access to real-time data is required or whether the model needs to be trained again on its own data.

Various benchmarks such as the MMLU benchmark, which contains 57 questions from different areas such as mathematics, US history and law, can provide an initial indication of which models may be suitable for the use case.

However, specially developed test cases tailored to the use case are the best way to find the most suitable model.

What challenges are there in the practical use of large language models?

If companies want to introduce LLMs, these four dimensions should be taken into account:


The technical integration of LLMs into the system landscape via an interface (API) is simple: it only takes 4 to 5 lines of code to address the API. The actual work lies in prompt engineering and training.

Where in the system landscape an LLM is connected depends on the use case – as a service for generating texts for the chat or voice bot, the LLM can be integrated into the middleware between the front and back end, or as an entity extractor, further back between the process engine and the back end.


The cost of using private LLMs per interaction ranges from a few tenths of a cent to a few cents for 500 words as input and/or output.

Depending on the size of the prompts, the size of the resulting outputs and the frequency of the interactions, the costs can be less than €1,000 per year or more than €100,000 per year.

The costs vary greatly from provider to provider and can quickly increase by a factor of 10 or decrease by a factor of 100, even with new models.

If you want to have your own LLM trained via the Hugging Face service, you can get the smallest available model for just over €40,000 and the largest model for just under €17.3 million.


The data protection requirements for insurance companies are particularly important in the current use of LLMs, as many of the existing models process input in the USA.

This would put companies in breach of the EU GDPR when using LLMs, due to uncertainties about data handling. This problem can be solved, for example, by choosing a provider that hosts its LLM in Europe or offers the option of hosting the model on its own server.

In addition, no personal data may be used when training the model. Otherwise, this would violate the “right to be forgotten” or you would have to train a new model with every request to delete personal data.

However, in addition to the GDPR, the EU AI Act, which has not yet been passed, is a sword of Damocles for the future use of LLMs. Researchers at Stanford University have analyzed current LLMs and available information about them and provided an assessment for each category of the EU AI Act.

The result showed that all LLMs are not fully compliant with the current draft. In particular, compliance with the requirements relating to copyright, energy, risk and compliance with industry benchmarks is particularly poor for all LLMs.

If the EU AI comes into force in one form or another, it will have a significant positive impact on all LLMs.


The best LLM will not bring significant added value as long as the right skills are not available in the organization. On the one hand, new roles are required for the use of LLMs, an example of this is the prompt engineer.

They deal with the design of targeted prompts for training and the evaluation of LLMs for the use case.

On the other hand, companies should start change management as early as possible so as not to panic employees and inform them about the upcoming changes to their activities.

In addition, conceivable training courses should be offered to employees as early as possible.

Can large language models do insurance?

LLMs are fed with information from the internet and private data sources – which means that their knowledge about insurance only constitutes a very small proportion.

Open source models can be specialized for their use in insurance – i.e., trained on industry-specific keywords. The added value of LLMs depends on their respective use.

We therefore present an insurance-specific use case from the area of sales potential below.

Use case: Become a champion in customer dialogue, whether in sales, claims notification, or questions about contracts

Interactions with customers involve a variety of tasks:

  • Are customers getting the right information they need to conclude a contract?
  • Can customers report a claim quickly and do they receive the appropriate attention?

LLMs offer enormous potential in voice and text interactions with customers. The three key benefits of integrating LLMs are:

1. The interaction with customers becomes more individualized and natural

Requests for information that resemble filling out forms and chatbots that only understand a few phrases only lead to frustration among customers.

Truly functioning chatbots and voicebots that understand contextual information and a variety of phrases thanks to the integration of LLMs represent enormous added value for customers and increase satisfaction through a natural and personalized conversation.

2. Focus on the interactions that really matter

Increasing volumes in the customer contact center and the dwindling number of skilled employees lead to increasing time pressure and leave employees little time for individual and critical concerns.

The use of LLMs in voicebots enables simple and non-critical interactions to be resolved quickly and efficiently and gives employees enough time to deal with critical issues.

3. Fast integration options and easy scalability

Previous solutions in chat and voicebot functionalities required extensive training or customization to solve use cases, often with only partially satisfactory quality.

LLMs are quickly and easily integrated into existing bots and contact points and can be easily scaled as demand increases.

A key prerequisite for the integration of LLMs into insurance processes is an omnichannel platform that guarantees the same level of information across all sales and service channels.

Large language models offer enormous potential for language and text interaction with customers

An exemplary user journey for taking out a policy

Viola is an existing customer of an insurance company and had a quote for a new home insurance policy calculated online a week ago.

After deciding on the offer, Viola now wants to take out the policy in person via a contact center and picks up the phone.

1. Getting started

Viola calls the insurer’s service center using the phone number provided in the offer and in the email and is connected to service employee Thomas within a few minutes.

2. Transition to the service center

During a brief wait, an AI analyzes Viola’s voice in the background, recognizing her. The time until employee Thomas is free is used to ask Viola about her request.

Using a speech-to-text framework (such as AWS Polly), and by evaluating the text through an LLM, along with data from the omnichannel platform, Thomas instantly has all necessary information from Viola’s last points of contact displayed on his screen.

3. Conversation and conclusion

Thomas can address Viola directly about the created offer in the conversation starter and respond individually to Viola’s questions. Within a few clicks and with Viola’s confirmation, the deal is closed.

The consultation documentation is created in the background and an artificial recommendation score, similar to the Net Promoter Score (NPS), is generated based on the recorded conversation.

In both scenarios, speech-to-text frameworks and the integration of LLMs for text evaluation achieve high quality through natural language understanding.

In the case of the recommendation score, algorithms for calculating the score can be based much better on the processed LLM text passages and thus increase the scoring accuracy.


The use case from sales shows: The type of communication and how companies get in touch with their customers will change. In the future, there will continue to be few points of contact with customers.

However, this is precisely why these few points of contact are so valuable and must be designed and used in the best possible way in the future. LLMs in conjunction with an omnichannel platform can create a very positive experience within the contact points between customers and insurers.

Moreover, the use of LLMs is not limited to sales; similarly, an equivalent use case could be implemented in claims processing. LLMs even offer the opportunity to completely automate the processing of claims.

Thanks to their ability to link information and understand complex descriptions, LLMs could, for example, validate claims coverage and free up time for insurance employees to deal with claims that require personal interaction.

Hype or real productivity boost?

Large language models are powerful AI models based on neural networks that independently learn complex patterns from large volumes of unstructured text data.

LLMs are characterized by their ability to transfer learning and are used in particular in natural language processing, machine text generation, and question-answer systems.

Whether LLMs actually provide a real productivity boost for insurance companies depends on various factors. These include selecting the right LLM, integrating it into existing systems, and training employees.

However, the benefits of LLMs in insurance are clear.

They can help to increase customer satisfaction, increase efficiency, and reduce costs. If implementation meets user expectations, it’s only a matter of time before LLMs become widespread in the insurance industry.

However, amidst the hype surrounding new technologies, one clear priority emerges: rendering the data foundation usable. This includes defining and recording the required data, improving data quality, and making it accessible for use.

All companies, regardless of industry, are still facing challenges here.


  • Dr. Annika Bergbauer

    Senior Manager – Germany, Munich


  • Nico Gerhard

    Manager – Germany, Frankfurt am Main


  • Noah Hennes

    Senior Consultant – Germany, Cologne


  • Matthias Pierzyna

    Senior Consultant – Germany, Frankfurt am Main


  • Uta Niendorf

    Partner – Germany, Hamburg