Sifting through data. After gaining initial access, it’s where attackers spend the bulk of their time. They trawl through files and network data to discover credentials, looking for opportunities for lateral movement and privilege escalation. Simultaneously, they hunt for high-value data like financial information, PCI/PII, or intellectual property to maximize impact and raise the stakes of the compromise.
Defending against this is difficult. EDR solutions rarely flag on an attacker simply reading files, and traditional data security tools fall short. These tools, often based on regular expressions and keyword matching, excel with highly structured data and finding easily identifiable data types like AWS access keys or credit card numbers. But they don’t fare well when faced with the massive volume of unstructured data found in file shares, cloud storage, and collaboration tools.
This is exactly where Large Language Models (LLMs) shine. LLMs move beyond rigid searches to provide true semantic understanding, allowing them to analyze data in context. This unlocks the ability to find hard-to-find credentials and assess the business risk of compromised data in a human-like way that was previously impossible.
In this post, we’ll explore how NodeZero’s new Advanced Data Pilfering (ADP) feature combines LLMs with offensive security techniques to supercharge defenders’ understanding of data at risk.
Introduction
Advanced Data Pilfering (ADP) covers two common attacker behaviors:
- Advanced Credential Pilfering: Using LLMs to find credentials hidden in files and Active Directory metadata. NodeZero doesn’t just find them – it then validates and uses those credentials to discover new paths for lateral movement and privilege escalation.
- Data Risk Inference: Using LLMs to automatically classify compromised data (like intellectual property, financial records, or PII) and assess its business risk, showing you precisely what an attacker might go after.
Underpinning both capabilities is a smart data pipeline. It’s not feasible from a cost, performance, or efficiency perspective to send all data in a network to an LLM. NodeZero solves this by using a multi-stage approach to progressively filter and assess data. It combines traditional machine learning (to score files by metadata), embedding models (to understand semantic content), and well-tuned regular expressions (for highly structured data) to ensure that only the most relevant or ambiguous data is sent to the LLM for analysis.
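To make the staged filtering concrete, here is a minimal sketch of how a file might be routed through such a pipeline. It is not NodeZero's implementation: the stage order, thresholds, scoring rules, the `dc01` host name, and the `toy_semantic_score` placeholder (standing in for a real embedding model) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Shape of an AWS access key ID, used here as the "highly structured" example.
AWS_KEY_RE = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


@dataclass
class FileRecord:
    path: str
    size_bytes: int
    text: str


def metadata_score(f: FileRecord) -> float:
    """Stage 1 stand-in: hand-written rules in place of a trained model."""
    path = f.path.lower()
    score = 0.0
    if path.endswith((".ps1", ".txt", ".env", ".config", ".xml")):
        score += 0.5
    if any(hint in path for hint in ("backup", "access", "cred", "pass")):
        score += 0.4
    if f.size_bytes > 1_000_000:  # assumption: huge files are rarely notes or scripts
        score -= 0.3
    return score


def route(f: FileRecord, semantic_score: Callable[[str], float],
          meta_threshold: float = 0.5, sem_threshold: float = 0.6) -> str:
    """Return 'skip', 'regex', or 'llm' for a single file."""
    if metadata_score(f) < meta_threshold:
        return "skip"   # cheap metadata filter runs first
    if AWS_KEY_RE.search(f.text):
        return "regex"  # structured secret: no LLM call needed
    if semantic_score(f.text) >= sem_threshold:
        return "llm"    # relevant but ambiguous: escalate to the LLM
    return "skip"


def toy_semantic_score(text: str) -> float:
    """Placeholder for an embedding-similarity score in [0, 1]."""
    return 1.0 if "credential" in text.lower() else 0.0


files = [
    FileRecord(r"\\dc01\SYSVOL\backup.ps1", 2048,
               "# builds a credential object for the nightly job"),
    FileRecord(r"C:\Users\hodor\Pictures\cat.jpg", 500_000, ""),
    FileRecord("deploy.env", 300, "AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE"),
]
decisions = {f.path: route(f, toy_semantic_score) for f in files}
```

In this sketch only the first file reaches the LLM; the image is dropped on metadata alone and the `.env` file is handled by the regex stage.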

A Word on Data Security
The LLMs used by NodeZero are hosted by AWS Bedrock. By design, NodeZero isolates its usage of Bedrock across clients and individual pentests. AWS guarantees that client data is not shared with model providers, is not used to improve base models, and is not accessible to other customers.
NodeZero’s file embedding model runs locally within the client environment. This model is part of a data pipeline that pre-filters data, and only the data snippets from files identified as relevant are sent to the LLM for analysis.
NodeZero users have full control over ADP and can configure whether data is processed by an LLM and what type of data is included. These controls are broken down by feature:
- Advanced Credential Pilfering: This feature can be disabled. When disabled, no file contents are sent to an LLM for credential analysis.
- Data Risk Inference: This feature can be configured in one of three modes:
- Disabled: No data of any kind is sent to an LLM for risk inference.
- Metadata Only: Only metadata about files (such as paths and directory structure) is sent to an LLM for analysis.
- Full Inference: Relevant data snippets and metadata are sent to the LLM for analysis.
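The three Data Risk Inference modes can be pictured as a gate on what leaves the environment. The sketch below is illustrative only: `RiskInferenceMode` and `build_llm_payload` are hypothetical names, not NodeZero configuration keys or APIs.

```python
from enum import Enum
from typing import Dict, List, Optional


class RiskInferenceMode(Enum):
    DISABLED = "disabled"
    METADATA_ONLY = "metadata_only"
    FULL_INFERENCE = "full_inference"


def build_llm_payload(mode: RiskInferenceMode,
                      metadata: List[str],
                      snippets: List[str]) -> Optional[Dict[str, List[str]]]:
    """Return what would be sent to the LLM, or None if nothing is sent."""
    if mode is RiskInferenceMode.DISABLED:
        return None
    payload = {"metadata": list(metadata)}
    if mode is RiskInferenceMode.FULL_INFERENCE:
        payload["snippets"] = list(snippets)  # file contents only in full mode
    return payload


metadata = [r"Engineering\Legal_IP\Patents"]
snippets = ["(hypothetical excerpt from a sampled file)"]
```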
Advanced Data Pilfering in Action
Let’s look at how NodeZero leverages ADP during a pentest. We built these scenarios using a modified version of the well-known Game of Active Directory (GOAD) cyber range.
Extracting Passwords from Active Directory Attributes
In a previous writeup, we showed how NodeZero got initial access in the standard GOAD environment by finding the password for samwell.tarly in that user’s Description field in Active Directory:
The contextual clue word “Password” in the description makes it feasible to use regular expressions to pull out the password “Heartsbane”. NodeZero uses well-tuned regexes to pull out obvious passwords like this, and these regexes are surprisingly effective in real-world production environments.
But regex has its limits. As an example, let’s say we modify the GOAD setup to put the password by itself in the Description field, like this:
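As a rough illustration of keyword-anchored extraction, the sketch below pulls a value that follows a "password" clue word. The pattern and the sample description string are assumptions for illustration, not NodeZero's actual regexes.

```python
import re
from typing import Optional

# A "password"-style keyword, a ':' or '=' separator, then the value itself.
KEYWORD_PASSWORD_RE = re.compile(
    r"\bpass(?:word|wd)?\b\s*[:=]\s*([^\s)]+)", re.IGNORECASE
)


def extract_keyword_password(description: str) -> Optional[str]:
    """Return the value following a password keyword, or None if absent."""
    match = KEYWORD_PASSWORD_RE.search(description)
    return match.group(1) if match else None


# Hypothetical Description value containing the clue word.
with_keyword = extract_keyword_password("Samwell Tarly (Password : Heartsbane)")
# The same secret with no surrounding context gives the regex nothing to anchor on.
without_keyword = extract_keyword_password("Heartsbane")
```

The first call yields the password; the second returns `None`, because the pattern depends entirely on the clue word being present.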
And to go further, let’s set up another user viserys.targaryen in a different domain with their password in their Notes field, enclosed in Spanish text.
These passwords would be nearly impossible to extract using a general-purpose regex, but with Advanced Data Pilfering, NodeZero extracts these passwords and compromises both users, as shown in the attack path below:
NodeZero does the following:
- Anonymously enumerates domain users and finds the password “Heartsbane” for samwell.tarly from his Description field with the assistance of an LLM.
- Logs in as samwell.tarly
- With samwell.tarly’s access, conducts cross-forest enumeration of users and finds the password “GoldCrown” for viserys.targaryen from his Notes field, again with the assistance of an LLM.
- Logs in as viserys.targaryen
This LLM-powered analysis isn’t just for Description and Notes. NodeZero also searches for passwords in the adminDescription and comment fields. These fields are editable using the Attribute Editor in Active Directory Users and Computers. By default any domain user can read the values set in these fields.
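One way to picture the LLM-assisted step is as a prompt built from a user's free-text attributes plus a strict parser for the model's reply. This is a sketch under assumptions: the prompt wording and the JSON reply format are hypothetical, the model call itself is omitted, and `info` is used because it is the LDAP attribute behind the Notes field.

```python
import json
from typing import Dict, Optional

# LDAP names of the free-text attributes discussed above (Notes is stored as "info").
FREE_TEXT_ATTRIBUTES = ("description", "info", "adminDescription", "comment")


def build_prompt(username: str, attributes: Dict[str, str]) -> str:
    """Assemble a prompt from whichever free-text attributes are populated."""
    lines = [
        f"{name}: {attributes[name]}"
        for name in FREE_TEXT_ATTRIBUTES
        if attributes.get(name)
    ]
    return (
        "You are reviewing Active Directory attributes during an authorized pentest.\n"
        f"User: {username}\n"
        + "\n".join(lines)
        + '\nReply with JSON only: {"password": <string or null>}'
    )


def parse_reply(reply: str) -> Optional[str]:
    """Accept only a non-empty string under the 'password' key."""
    try:
        value = json.loads(reply).get("password")
    except (json.JSONDecodeError, AttributeError):
        return None  # not JSON, or JSON that is not an object
    return value if isinstance(value, str) and value else None


prompt = build_prompt("samwell.tarly", {"description": "Heartsbane", "comment": ""})
```

Parsing the reply defensively matters because the candidate is only a guess at this point; it still has to be validated before it counts as a finding.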
A Word on False Positives
Finding what looks like a password is one thing; knowing if it’s a valid one is another. This is a problem for human pentesters and tools alike. A discovered string could be an old, reset password, an employee ID, or a password for a different user entirely (password reuse).
NodeZero removes this guesswork. It validates potential credentials by attempting to log in. And if successful, it goes further by abusing the compromised user’s privileges to move laterally, escalate privileges, and access sensitive data. This approach gives defenders a wealth of information and context that can be used to drive effective prioritization and remediation.
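The validation idea can be sketched as a filter over candidate credentials, where only those that pass a login check survive. Everything here is illustrative: `Candidate`, `validate`, and `fake_login` are hypothetical names, and the in-memory dictionary stands in for a real authentication attempt against the domain.

```python
from typing import Callable, Iterable, List, NamedTuple


class Candidate(NamedTuple):
    username: str
    secret: str
    source: str  # where the string was found, e.g. an attribute or file path


def validate(candidates: Iterable[Candidate],
             try_login: Callable[[str, str], bool]) -> List[Candidate]:
    """Keep candidates that authenticate, trying each user/secret pair once."""
    confirmed = []
    seen = set()
    for c in candidates:
        key = (c.username.lower(), c.secret)
        if key in seen:
            continue  # avoid repeating an identical login attempt
        seen.add(key)
        if try_login(c.username, c.secret):
            confirmed.append(c)
    return confirmed


# Stand-in for the domain: a real check would attempt an actual login.
FAKE_DIRECTORY = {"samwell.tarly": "Heartsbane"}


def fake_login(username: str, secret: str) -> bool:
    return FAKE_DIRECTORY.get(username.lower()) == secret


candidates = [
    Candidate("samwell.tarly", "Heartsbane", "Description"),
    Candidate("samwell.tarly", "EMP-004211", "Description"),  # hypothetical employee ID
    Candidate("samwell.tarly", "Heartsbane", "comment"),      # duplicate pair
]
confirmed = validate(candidates, fake_login)
```

Only the first candidate is confirmed; the employee-ID lookalike fails the login check and the duplicate is never retried.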
Extracting Credentials from Files
NodeZero also leverages LLMs to find credentials hidden within files.
In a previous writeup about GOAD, we described how NodeZero compromised a privileged domain user, jeor.mormont, by extracting his credential from a simple PowerShell script file, script.ps1, in the SYSVOL share on a domain controller. That script was simple enough that a general-purpose regex was sufficient to pull out the credential. The context clue words “user” and “password”, appearing on consecutive lines, are strong anchor words for a regex. And these regexes work reasonably well in real production environments.
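A regex anchored on "user" and "password" assignments on consecutive lines might look like the sketch below. The pattern and the script contents are assumptions for illustration (the password shown is a placeholder, not the value from GOAD), and this is not NodeZero's actual regex.

```python
import re
from typing import Optional, Tuple

# "user..." = "<value>" on one line, immediately followed by "pass..." = "<value>".
USER_PASS_PAIR_RE = re.compile(
    r"""user\w*\s*=\s*["']([^"']+)["']\s*\n\s*\$?pass\w*\s*=\s*["']([^"']+)["']""",
    re.IGNORECASE,
)


def extract_pair(script_text: str) -> Optional[Tuple[str, str]]:
    """Return (username, password) if the two assignments are adjacent."""
    match = USER_PASS_PAIR_RE.search(script_text)
    return (match.group(1), match.group(2)) if match else None


# Hypothetical script contents with placeholder values.
adjacent = '$user = "jeor.mormont"\n$password = "NotTheRealPassword"\n'
separated = (
    '$user = "jeor.mormont"\n'
    "# rotated quarterly\n"
    '$password = "NotTheRealPassword"\n'
)
pair = extract_pair(adjacent)
no_pair = extract_pair(separated)
```

A single comment line between the two assignments is enough to defeat this pattern, which hints at why layout-dependent regexes struggle as scripts get more complex.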
Let’s make it harder. Suppose we replace that simple file with a more complex script, backup.ps1:
And, to go further, let’s also place a file called access.txt in a user’s Desktop folder on one of the machines in the range, 192.168.4.22. The contents of this file contain the credential for another domain user, jon.snow:
With Advanced Data Pilfering, NodeZero can extract both credentials and chain them into a full attack, as shown below:
NodeZero does the following:
- Gets initial access as domain user samwell.tarly
- Enumerates SYSVOL using samwell.tarly’s credential and identifies the file backup.ps1 as likely to contain a credential
- Uses an LLM to extract the credential for jeor.mormont from backup.ps1
- Logs into the domain as jeor.mormont, and discovers this user is a local admin on host 192.168.4.22
- Deploys a Remote Access Tool (RAT) to the host using the privileges of jeor.mormont
- Through the RAT, identifies the file C:\Users\hodor\Desktop\access.txt as likely to contain a credential
- Uses an LLM to extract the credential for jon.snow
- Logs into the domain as jon.snow
NodeZero’s support for LLM-assisted credential discovery isn’t limited to SMB shares and compromised hosts. It extends to other common data repositories, including AWS S3 buckets, NFS shares, and Slack.
Assessing the Business Risk of Compromised Hosts and Shares
Every pentester knows that a good report is more than a list of compromised hosts. What truly matters is conveying the business impact, which often comes down to the type of data that was accessed. Was it sensitive customer data, financial records, or intellectual property? For defenders, this same context drives a better understanding of business risk and which security weaknesses to prioritize for remediation.
With Advanced Data Pilfering, NodeZero now automatically classifies the type of data it compromises and links it to tangible business risks.
For instance, in the GOAD environment, we set up an Engineering file share on the host 192.168.4.23 containing R&D data that includes engineering schematics, source code, and legal patent documentation.
📁 \\192.168.4.23\Engineering
|
|-- 📁 Legal_IP
| |-- 📁 Patents
| | |-- 📁 Applications_Pending
| | |-- 📁 Issued
| | |-- 📁 Prior_Art_Research
| |-- 📁 Trademarks
| |-- 📁 Licensing
| | |-- 📁 Inbound_Licenses (IP we use)
| | `-- 📁 Outbound_Licenses (IP we sell)
| `-- 📁 Trade_Secrets
|-- 📁 Product_Development
| |-- 📁 Alchemy_Platform (Software)
| | |-- 📁 Architecture
| | |-- 📁 Source_Code_Snapshots
| | `-- 📁 Security_Audits
| `-- 📁 Gen4_Sensor (Hardware)
| |-- 📁 BOM (Bill of Materials)
| |-- 📁 CAD_Schematics
| |-- 📁 Firmware
| `-- 📁 QA_Test_Results
|-- 📁 Research_Lab
| |-- 📁 Project_Helios (New Battery Tech)
| | |-- 📁 Data
| | |-- 📁 Lab_Notebooks_Scanned
| | |-- 📁 Formulations
| `-- 📁 Project_Quantum_Leap (AI/ML)
| |-- 📁 Models
| |-- 📁 Training_Data
`-- 📁 Strategy
NodeZero gains access to this share after compromising the user viserys.targaryen, as shown in the attack path below:
Once it has access, NodeZero applies its smart data pipeline. It gathers file metadata and samples key files, sending this data to an LLM for deep analysis.
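As a rough picture of the metadata-gathering and sampling step, the sketch below condenses a list of share-relative paths into per-folder counts, an extension histogram, and a few files to sample. The function name, the "first N files per top-level folder" sampling rule, and the file names are all hypothetical; only the folder names come from the share layout above.

```python
from collections import Counter
from pathlib import PureWindowsPath
from typing import Dict, Iterable, List


def summarize_share(relative_paths: Iterable[str],
                    samples_per_folder: int = 1) -> Dict[str, object]:
    """Condense non-empty share-relative file paths into a compact summary."""
    folders: Counter = Counter()
    extensions: Counter = Counter()
    samples: Dict[str, List[str]] = {}
    for raw in relative_paths:
        path = PureWindowsPath(raw)
        top = path.parts[0]  # top-level folder within the share
        folders[top] += 1
        extensions[path.suffix.lower() or "(none)"] += 1
        picked = samples.setdefault(top, [])
        if len(picked) < samples_per_folder:
            picked.append(raw)  # assumption: sample the first files seen
    return {
        "files_per_folder": dict(folders),
        "extensions": dict(extensions),
        "samples": samples,
    }


# Hypothetical file names placed under folders from the Engineering share.
paths = [
    r"Legal_IP\Patents\Issued\patent_0001.pdf",
    r"Product_Development\Gen4_Sensor (Hardware)\Firmware\build.bin",
    r"Product_Development\Alchemy_Platform (Software)\Architecture\overview.docx",
    r"Strategy\roadmap.docx",
]
summary = summarize_share(paths)
```

A summary like this is far smaller than the share itself, which is the point of filtering before anything is sent for analysis.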
In this example, NodeZero determined that the share contained Intellectual Property, Manufacturing/Production data, Source Code, and Strategic Business Communications. It then automatically mapped these categories to the following business risks:
- Theft of IP & R&D
- Software Delivery Disruption
- Leak of Sensitive Communications
The LLM-assisted rationale justifies why the data was categorized this way, giving defenders the specific, actionable evidence they need.
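The category-to-risk step can be thought of as a lookup from data categories to business risks. The pairing below is an assumption made for illustration: the post names the categories and the risks, but does not state which category maps to which risk, and `CATEGORY_TO_RISK` is a hypothetical name.

```python
from typing import Iterable, List

# Assumed pairing of the categories and risks named above.
CATEGORY_TO_RISK = {
    "Intellectual Property": "Theft of IP & R&D",
    "Manufacturing/Production": "Theft of IP & R&D",
    "Source Code": "Software Delivery Disruption",
    "Strategic Business Communications": "Leak of Sensitive Communications",
}


def business_risks(categories: Iterable[str]) -> List[str]:
    """Map categories to risks, preserving order and dropping duplicates."""
    risks: List[str] = []
    for category in categories:
        risk = CATEGORY_TO_RISK.get(category)
        if risk and risk not in risks:
            risks.append(risk)
    return risks


risks = business_risks([
    "Intellectual Property",
    "Manufacturing/Production",
    "Source Code",
    "Strategic Business Communications",
])
```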
Assessing the Business Risk of a Compromised Database
With Advanced Data Pilfering, NodeZero also assesses the risk of compromised databases.
In our GOAD environment, we added a synthetic database for a medical application to one of the Microsoft SQL Servers (192.168.4.22). NodeZero compromised the service account for this server, giving it full control of the database, as shown in the attack path below:
NodeZero then extracted metadata about the database – tables, columns, record counts, etc. – and used an LLM to classify the type of data it contained.
NodeZero correctly infers that the database contains Health Data and Personal Data, as evidenced by the presence of “extensive health data including diagnoses, encounters, lab results, and medications” and “personal data including patient demographics and provider information.” It then links compromise of this database to Regulatory Breach Penalties as a business risk.
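Collecting schema metadata without reading row contents can be sketched as follows. This example uses Python's built-in `sqlite3` as a stand-in so that it runs anywhere; against Microsoft SQL Server the same information would come from its own catalog views rather than `sqlite_master` and `PRAGMA table_info`. The table and column names are hypothetical.

```python
import sqlite3
from typing import Dict, List


def describe_database(conn: sqlite3.Connection) -> List[Dict[str, object]]:
    """Return table names, column names, and row counts, but no row data."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    summary = []
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        summary.append({"table": table, "columns": columns, "rows": count})
    return summary


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (id INTEGER PRIMARY KEY, full_name TEXT, date_of_birth TEXT);
CREATE TABLE lab_results (id INTEGER PRIMARY KEY, patient_id INTEGER,
                          test_name TEXT, result TEXT);
INSERT INTO patients (full_name, date_of_birth) VALUES ('Test Patient', '1970-01-01');
""")
db_summary = describe_database(conn)
```

Column names such as `date_of_birth` or `test_name` already carry a strong signal about the kind of data a table holds, even before any records are examined.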
Conclusion
The examples in this post illustrate how Advanced Data Pilfering (ADP) enhances NodeZero’s pentesting capabilities, both in discovering credentials and in bridging the gap between technical exploits and real-world business risk.
These examples are representative of the types of results NodeZero is delivering in real-world tests. For instance, in a recent real-world attack path, NodeZero used ADP to compromise the domain:
NodeZero did the following:
- Got initial access by password spraying a domain user
- Compromised a second domain user by discovering its password in an AD attribute
- Used that second domain user to identify an interesting file on a host it had access to
- Used Advanced Data Pilfering to extract a domain admin credential from that file
- Logged into the domain as domain admin
NodeZero then leveraged this domain admin access to compromise the client’s Microsoft Entra tenant.
This isn’t just a theoretical threat. Real-world threat actors are doing the exact same thing. In a report from August 2025, Anthropic described a “vibe hacker” using Claude Code to facilitate ransomware attacks against at least 17 different organizations. The attacker used Claude Code to actively find credentials and, most notably, classify and analyze stolen data to weaponize it for extortion, mirroring the two core functions of ADP.
At Horizon3, we believe the future of cyber warfare will be played out at machine speed, algorithm vs algorithm, with humans by exception. If you’re a hacker interested in AI and cybersecurity and creating autonomous production-safe solutions that work at scale with no humans in the loop, we want to hear from you.
