Importance of Data Management in Enterprise Knowledge Solutions

Enterprise Knowledge is a key use case for GenAI, and it seems a new Enterprise Knowledge product is coming out every week. Salesforce, ServiceNow, Google and Microsoft are all developing Enterprise Knowledge solutions. Recently we saw the launch of Atlassian’s solution, Rovo. At Devoteam we have been focusing on Amazon Q Business to support the Enterprise Knowledge use case.

We discussed Enterprise Knowledge in part 1 of our series on this topic. Briefly, Enterprise Knowledge is all the data that your organisation owns and collects that relates to your business. In part 2 we looked at strategies and best practices for getting an Enterprise Knowledge AI assistant up and running.

An Enterprise Knowledge AI assistant is a great way to search and query data relating to your business, but it reveals another significant problem: Data Management.

As is becoming very clear, AI assistants and chatbots are only as good as the data that the AI has ingested and is using. Bad data leads to poor query results from AI assistants.

Double Cost and Carbon Penalties

In terms of costs and carbon emissions, we pay the penalties of bad data twice. First, we incur costs and carbon emissions by storing unnecessary data. Second, that data is then ingested by AI models, increasing the storage and compute requirements of the AI model and further increasing costs and carbon emissions.

Working on this use case, both within Devoteam and with our customers, has revealed large amounts of low-value or no-value data across our knowledge management tools, shared drives and websites. For example:

Duplicate data: Duplicate entries do not add value; they increase storage costs and carbon emissions, and slow down data processing.
Out-of-date data: Outdated information should be archived or deleted at source. Additionally, only the most recent version of a document should be ingested, ensuring that data remains current. We often see multiple versions of documents – e.g. Doc_v1, Doc_v2, Doc_v3 – stored in a document folder. (This is generally an anti-pattern as modern document tools support versioning without needing multiple copies.) Do we want to ingest all these versions, or only the most recent document? Older versions may contradict newer versions and can confuse AI assistants.
Unusable data: Many data sources contain binary or other data formats that are irrelevant to AI assistants. It’s best to filter such data out at the source to prevent unnecessary ingestion.
Images, pictures or videos: Currently Amazon Q Business cannot utilize visual data like images or pictures, making their inclusion in data sources redundant.
Archive or draft data: Enterprise data sources are littered with archive folders and draft documents. Ideally these should be managed at source with good data management policies, to avoid ingesting unfinished or unapproved documents.

Eliminating or filtering out this data can significantly reduce costs and carbon emissions, improve data ingestion speeds and, most importantly, lead to higher-value responses from AI assistants.
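
As a rough illustration of filtering at the source, the sketch below applies three of the checks listed above before anything reaches the AI assistant: it skips unusable or visual file types, skips archive and draft folders, and drops exact duplicates by content hash. The function name, extension list and folder names are our own assumptions for the example, not part of Amazon Q Business or any other product.

```python
import hashlib
from pathlib import PurePosixPath

# Hypothetical policy values; tune these to your own data sources.
SKIP_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".mp4", ".mov", ".bin", ".exe", ".zip"}
SKIP_FOLDERS = {"archive", "archived", "draft", "drafts"}


def select_for_ingestion(documents):
    """Return the paths worth ingesting from (path, content_bytes) pairs."""
    seen_hashes = set()
    selected = []
    for path, content in documents:
        p = PurePosixPath(path)
        # Unusable or visual formats: skip by file extension.
        if p.suffix.lower() in SKIP_EXTENSIONS:
            continue
        # Archive or draft folders: skip if any parent folder name matches.
        if any(part.lower() in SKIP_FOLDERS for part in p.parts[:-1]):
            continue
        # Exact duplicates: skip content already seen under another path.
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        selected.append(path)
    return selected


# Illustrative input: made-up paths and contents.
docs = [
    ("policies/leave.docx", b"leave policy"),
    ("policies/copy/leave.docx", b"leave policy"),
    ("policies/Archive/leave_2019.docx", b"old leave policy"),
    ("branding/logo.PNG", b"\x89PNG"),
    ("handbook/onboarding.md", b"welcome"),
]
print(select_for_ingestion(docs))
```

A hash only catches byte-for-byte copies. Near-duplicates, such as the same document exported to another format, need a different check, and many connectors offer their own include and exclude rules that may be a better place for the extension and folder filters.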

Example of bad data affecting an AI Query

Below is an example of a query to our AI Assistant that highlights the problem. The data used in the reply to the question comes from three sources:

The original “v1” document. We assume that this is the latest and that there is no “v2”.
A PDF version of the “v1” document. Is this the same data as the original “v1” document used to create the PDF?
A draft “v0.3” document. Given it is a draft, we probably didn’t want this version to be used in the query.

This example, and the problems it highlights, will be familiar to those working on Enterprise Knowledge solutions.
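
One way to avoid this situation is to pick a single file per document before ingestion. The sketch below is a minimal example under some loud assumptions: files follow a "name_vMAJOR.MINOR.ext" naming pattern, any "v0.x" version is a draft, and an editable source file is preferred over a PDF export of the same version. The file names and the function are hypothetical and only illustrate the idea.

```python
import re
from collections import defaultdict

# Assumed naming convention: "<name>_v<major>[.<minor>].<ext>", e.g. "Doc_v2.docx".
VERSION_PATTERN = re.compile(
    r"^(?P<base>.+)_v(?P<major>\d+)(?:\.(?P<minor>\d+))?\.(?P<ext>\w+)$"
)


def latest_approved(filenames):
    """Keep one file per document: the highest non-draft version."""
    candidates = defaultdict(list)
    for name in filenames:
        match = VERSION_PATTERN.match(name)
        if match is None:
            continue  # no version suffix; out of scope for this sketch
        major = int(match["major"])
        minor = int(match["minor"] or 0)
        if major == 0:
            continue  # assumption: v0.x marks a draft
        # Assumption: on a version tie, prefer the source file over a PDF export.
        is_source = match["ext"].lower() != "pdf"
        candidates[match["base"]].append(((major, minor, is_source), name))
    # max() compares the (major, minor, is_source) key first, then the name.
    return sorted(max(versions)[1] for versions in candidates.values())


# Hypothetical file names mirroring the three sources described above.
files = [
    "Strategy_v1.docx",
    "Strategy_v1.pdf",
    "Strategy_v0.3.docx",
    "Doc_v1.docx",
    "Doc_v2.docx",
    "Doc_v3.docx",
]
print(latest_approved(files))
```

File names are a weak signal, so treat this as a stopgap. Where the document tool supports built-in versioning and approval status, those are more reliable than a naming convention.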

Data Management

Whichever tool or vendor you are considering for Enterprise Knowledge, robust data management practices are fundamental to maximizing its utility. GenAI assistants, whether for Enterprise Knowledge or another use case, are only as effective as the quality and relevance of the data they access.

Large enterprises hold a lot of data that offers little or no value to an Enterprise Knowledge AI assistant. This “low/no value” data not only escalates the costs and carbon emissions associated with data storage for AI systems, but also lengthens the time it takes to ingest data and can degrade the AI’s responses to queries.

Implementing robust data management practices is crucial for enhancing the efficiency of AI assistants. Effective data management can address up to 90% of Dark Data — data that is stored and seldom used. Adopting these practices will not only improve your experience with AI assistants, but will also lead to significant organizational benefits. For more detailed guidance on implementing these practices, refer to this data management talk.

This is Part 3 of our AI for Enterprise Knowledge series. Start with the introduction in Part 1, see best practices for creating an Enterprise Knowledge AI assistant in Part 2, explore cost management in Part 4, discover the Sales Knowledge use case in Part 5, and then finish with an AI solution comparison in Part 6.