
Richard Ford at Integrity360 looks at how the widespread use of AI in business processes could lead to inadvertent data leakage
Generative AI (GenAI) is susceptible to a long list of risks, as identified by NIST in its Risk Management Framework profile of the technology. These range from confabulation (sometimes referred to as hallucination) to bias and homogenisation to data privacy, but it’s the last of these that is now causing businesses the most concern.
Protecting sensitive and proprietary data is crucial and numerous processes have been put in place to observe GDPR and protect these assets. But GenAI is now undermining those efforts by enabling unintended data leakage and the unauthorised use, disclosure or deanonymisation of that data.
GenAI effectively adds to insider risk by creating another conduit for both inadvertent and deliberate data loss. A recent study found, for example, that 20% of UK businesses have suffered data leakage through the use of AI chatbots because of the information users put into their prompts.
Even proprietary solutions, such as Microsoft’s Copilot, have fallen foul of the issue. The US House of Representatives banned staff from using Copilot on government devices, citing concerns over it transferring data to the cloud, a move that then spurred Microsoft to focus on developing a more secure government-grade version.
Copilot and other proprietary AI platforms differ from chatbots in that they rely on Retrieval Augmented Generation (RAG) systems, which search for and retrieve data to augment the LLM’s knowledge base, with access to that data aligned with the user’s access privileges. However, if those privileges have not been correctly configured, data can cross the divide, particularly if the user is malicious and is seeking to trick the system.
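To illustrate the principle of aligning retrieval with access privileges, here is a minimal Python sketch. It is not how any particular product works: the Document class, the group names and the naive keyword matching are all assumptions made for illustration. The point it shows is that the permission check happens before a document can be passed to the LLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    # Hypothetical document record; allowed_groups stands in for a real ACL.
    doc_id: str
    text: str
    allowed_groups: frozenset


def retrieve_for_user(query, documents, user_groups):
    """Return IDs of documents that match the query AND that the user may read."""
    terms = set(query.lower().split())
    results = []
    for doc in documents:
        # Enforce access first, so unreadable text never reaches the prompt.
        if not (doc.allowed_groups & user_groups):
            continue
        # Naive keyword overlap stands in for a real search or vector index.
        if terms & set(doc.text.lower().split()):
            results.append(doc.doc_id)
    return results


DOCS = [
    Document("hr-001", "salary review schedule", frozenset({"hr"})),
    Document("pub-001", "holiday schedule for all staff", frozenset({"hr", "all-staff"})),
]

print(retrieve_for_user("schedule", DOCS, {"all-staff"}))
print(retrieve_for_user("schedule", DOCS, {"hr"}))
```

If the allowed_groups on a document are wrong, the filter faithfully returns the wrong result, which is why misconfigured permissions matter more than the retrieval logic itself.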
Researchers at the University of Texas revealed in August that they’d been able to do precisely that against Copilot by manipulating the documents that the RAG system retrieved information from. The manipulated documents were also used to influence Copilot’s decision making.
In the first attack, the LLM was prevented from responding using other documents on the same topic; in the second, it no longer cited sources, making it difficult to track the malicious source; and in a third, it was prevented from answering a query due to the classification of the document. Finally, the team also showed how a deleted document could persist.
Of course, most of these attacks relied on systems not being locked down. Microsoft has itself said that Copilot requires “permission models in all available services, such as Sharepoint, [are used] to help ensure the right users or groups have the right access”.
Without the correct permissions in place, it’s been speculated that the LLM could be used to provide access to a whole host of sensitive data by issuing prompts asking for user credentials, API and access keys, M&A activity, or simply any files that are labelled ‘sensitive’. What’s more, if those documents are not assigned Data Loss Prevention (DLP) labelling, that data cannot be tracked.
The problem often lies in enacting the principle of least privilege, whereby employees only have the access they need to perform their role. But access is a moving target that multiplies for each employee and across multiple identity platforms. The State of Access 2024 report found the average organisation has 1,400 access permissions per employee and that for every 1,000 users in the organisation there were 700 groups.
Users also tend to be over-privileged, with the Guide to Tackling Admin Sprawl report finding employees often hold higher access and more permissions than they need to do their job.
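One way to spot over-privileged users is to compare the permissions each person has been granted with those they have actually exercised. The Python sketch below assumes both sets can be exported from an identity platform and its access logs; the user names, permission strings and the 90-day window are hypothetical.

```python
def find_unused_permissions(granted, used):
    """Map each user to the permissions they hold but have not exercised."""
    report = {}
    for user, permissions in granted.items():
        # Set difference: granted permissions that never appear in the usage data.
        unused = permissions - used.get(user, set())
        if unused:
            report[user] = sorted(unused)
    return report


# Hypothetical exports: what each user may do, and what they actually did.
GRANTED = {
    "alice": {"finance:read", "finance:write", "hr:read"},
    "bob": {"sales:read"},
}
USED_LAST_90_DAYS = {
    "alice": {"finance:read"},
    "bob": {"sales:read"},
}

print(find_unused_permissions(GRANTED, USED_LAST_90_DAYS))
```

Unused permissions are candidates for review rather than automatic removal, since some access is needed only occasionally.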
All of this indicates that the problem is not AI itself but the fact that insufficient groundwork has been done to prepare the way for the LLM. In addition to access controls, no business should go straight into using the technology without first carrying out an extensive audit of how data has been classified and labelled in terms of its sensitivity.
This is likely to be a time-intensive process because it will involve reviewing files in directories, email and chat, and collaborative platforms. Boundaries then need to be defined not just by role but by department or location.
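As a rough illustration of what such an audit and boundary definition might look like, the Python sketch below flags files with no sensitivity label and checks access against role, department and location together. The FileRecord and User fields and the sample values are assumptions, not taken from any specific classification tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileRecord:
    # Hypothetical inventory entry produced by a file, email or chat scan.
    path: str
    label: Optional[str]  # sensitivity label, or None if never classified
    department: str
    location: str
    allowed_roles: frozenset


@dataclass
class User:
    role: str
    department: str
    location: str


def audit_unlabelled(records):
    """List the paths of files that have no sensitivity label."""
    return [r.path for r in records if not r.label]


def within_boundary(user, record):
    """Allow access only when role, department and location all match."""
    return (
        user.role in record.allowed_roles
        and user.department == record.department
        and user.location == record.location
    )


INVENTORY = [
    FileRecord("/finance/q3.xlsx", "confidential", "finance", "UK", frozenset({"analyst"})),
    FileRecord("/finance/notes.txt", None, "finance", "UK", frozenset({"analyst"})),
]

print(audit_unlabelled(INVENTORY))
print(within_boundary(User("analyst", "finance", "UK"), INVENTORY[0]))
print(within_boundary(User("analyst", "finance", "IE"), INVENTORY[0]))
```

Unlabelled files are the ones to resolve before an LLM is given access, because no policy can be applied to data that has not been classified.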
Worryingly, many organisations have not done this due diligence. The Cloud and Threat Report: AI Apps in the Enterprise found that in large businesses where GenAI is used daily there were 183 incidents of sensitive data being sent to ChatGPT each month, with source code, followed by regulatory data, being the most common form of data leaked.
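A simple safeguard against this kind of leakage is to scan prompts before they leave the business. The Python sketch below is illustrative only: the two patterns (an AWS-style access key ID and an email address) are assumptions chosen for the example, and a real DLP control would cover far more data types.

```python
import re

# Illustrative patterns only; real deployments need a much broader set.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def redact_prompt(prompt):
    """Replace matches with placeholders and report which patterns fired."""
    findings = []
    for name, pattern in PATTERNS.items():
        if pattern.search(prompt):
            findings.append(name)
            prompt = pattern.sub(f"[REDACTED:{name}]", prompt)
    return prompt, findings


clean, found = redact_prompt("Fix this: key=AKIAABCDEFGHIJKLMNOP owner bob@example.com")
print(clean)
print(found)
```

The list of findings can also be logged, giving the business a record of how often sensitive data is being put into prompts.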
Without data sensitivity labelling and access rights in place, data can and will find its way out of the business, demonstrating that preparation is key to a secure and successful GenAI deployment.
Richard Ford is CTO of Integrity360

© 2025, Lyonsdown Limited. teiss® is a registered trademark of Lyonsdown Ltd. VAT registration number: 830519543