Adversarial Blind Pentest Thoughts

Hey all, recently I've been putting a lot of thought into the value and theory behind blind penetration tests, adversarial simulations, and purple teaming. I've had a lot of conversations with colleagues and done extensive research into other approaches on the subject. The following are some of those collected thoughts, and I'm really looking forward to feedback and further conversation on these topics.

There are many types of pentests and many reasons for having a pentest performed, but generally we can bucket them into two categories, exploratory (discovering vulnerabilities) and adversarial (testing detection and response times), or a combination of the two. Blind pentests imply a form of adversarial testing, or testing detection and response times, as opposed to just finding new vulnerabilities. These are excellent exercises and tools for management to learn what the detection team can truly detect, what response times actually look like, and how well IR teams adhere to forensic process. Normal pentesters will not be equipped to conduct these tests. They require specialized pentesters who can both write their own post-exploitation tools and understand the impact those tools will have on the targeted systems.

Adversarial tests should take several key things into account, such as existing infrastructure hardening, available Red Team expertise, and existing detection baselines. I also think there should be a clear program to get the Blue Teams ready for such exercises, as most won't be ready at first, in my experience, thus losing the true value of an adversarial test (as we covered above). Purple Teams (Blue and Red Teams together) should move through three natural phases to get ready for these adversarial tests: Academic (researching and creating detection), Controlled Testing (announced, planned tests), and finally Adversarial Simulation (unannounced tests). In the first phase it's key to deeply understand both the attacks and how you can technically detect them in the various phases of the kill chain. The second phase involves executing these attacks in a controlled scenario and ensuring you have enough basic detection in place that you can triage the resulting alerts in a meaningful way. Finally you are ready for phase three, the blind test of your actual response operations around these detection mechanisms. Without these steps in place, one will waste a lot of time doing what they perceive are "adversarial tests" when in fact they are still running "exploratory tests" that merely confirm the detection or control technologies in place.
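
To make that progression concrete, here is a minimal sketch, in Python, of how a Purple Team might track each attack technique as it matures from Academic research through Controlled Testing to Adversarial Simulation. The class, field, and rule names are hypothetical, not an existing tool; treat it as one possible way to keep the phases honest.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import List, Optional

class Phase(Enum):
    ACADEMIC = "academic"              # researching the attack and drafting detection logic
    CONTROLLED = "controlled_testing"  # announced, planned execution to validate alerting
    ADVERSARIAL = "adversarial_sim"    # unannounced, live-fire test of real response

@dataclass
class TechniqueStatus:
    technique: str                                        # one attack technique / kill-chain step
    phase: Phase = Phase.ACADEMIC
    detections: List[str] = field(default_factory=list)   # rule names covering this technique
    last_exercised: Optional[date] = None

    def ready_for_adversarial(self) -> bool:
        # Only graduate a technique to blind testing once controlled tests
        # have produced alerts the Blue Team can actually triage.
        return self.phase == Phase.CONTROLLED and len(self.detections) > 0

# Example: a technique that has passed controlled testing and can now be blind-tested.
lateral_movement = TechniqueStatus(
    technique="lateral movement via remote service creation",
    phase=Phase.CONTROLLED,
    detections=["remote_service_install_alert"],
)
print(lateral_movement.ready_for_adversarial())  # True
```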

Many of these adversarial tests need to start with a baseline across many areas and systems, to determine the most critical areas to focus on as well as the initial baseline detection and response times. Threat actions differ throughout various attacker kill chains and must be tailored to the specific environment being exploited. Detection and response technologies also differ across the enterprise, just as intrusion tools and methods carry different risks for the systems and applications within the environment. It is imperative to develop a method to map out potential threats, the risks associated with certain systems in the environment, and ultimately expose potential weaknesses in the environment relative to those risks. One can then develop plans to continually iterate on detection and response in these areas and show demonstrable improvement by measuring detection and response capabilities across the attack path and kill chain. I highly recommend Gates' and Nickerson's talk at BruCON for an overview of a structured program to carry out these exercises, with an emphasis on which parts of the kill chain to target with testing for the highest-value improvements to the security program via adversarial testing.
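
As one possible way to capture such a baseline, the sketch below aggregates detection and response times per kill-chain phase so later exercises can be measured against it. The phase names and numbers are purely illustrative.

```python
from collections import defaultdict
from statistics import median

# Each record is one observed (or exercised) attacker action:
# (kill_chain_phase, minutes_to_detect or None if missed, minutes_to_respond or None)
observations = [
    ("delivery", 12, 45),
    ("lateral_movement", None, None),   # missed entirely -- a detection gap
    ("exfiltration", 90, 240),
]

def baseline(records):
    """Median time-to-detect/respond and miss rate per kill-chain phase."""
    by_phase = defaultdict(list)
    for phase, detect, respond in records:
        by_phase[phase].append((detect, respond))
    summary = {}
    for phase, results in by_phase.items():
        detected = [d for d, _ in results if d is not None]
        responded = [r for _, r in results if r is not None]
        summary[phase] = {
            "miss_rate": 1 - len(detected) / len(results),
            "median_detect_min": median(detected) if detected else None,
            "median_respond_min": median(responded) if responded else None,
        }
    return summary

print(baseline(observations))
```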

All of this amounts to building on operational experiences and adapting our detection accordingly over time; hence the master tracking document should be a living document that links directly to tickets and postmortems from exercises. Ultimately, this document should be built into the detection tool development life cycle, requiring signatures to regularly undergo a live-fire audit and review process. There will be significant lead time for a program like this, both to develop detection baselines and to establish a proper cadence for exercises. Here it can actually help to kickstart your internal program with an external consultancy, at least until the internal teams are comfortable and these exercises have noticeably moved the SLAs and response times. It's important that neither side have direct access to the other's internals, meaning the detection rules at play or the source code of the malware in use, as this could enable them to write techniques that specifically counter the technologies used, versus employing a more general detection methodology.
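
A minimal sketch of what one entry in such a living tracking document could look like follows; the field names, rule and ticket identifiers, and 90-day review window are assumptions, not a prescribed schema.

```python
from datetime import date, timedelta

# One row of the master tracking document: a detection signature tied to the
# tickets and postmortems from the exercises that last put it under live fire.
tracking_entry = {
    "detection_rule": "psexec_style_lateral_movement",    # hypothetical rule name
    "kill_chain_phase": "lateral_movement",
    "tickets": ["IR-1042"],                               # hypothetical ticket ID
    "postmortems": ["exercises/lateral-movement-postmortem.md"],
    "last_live_fire": date.today() - timedelta(days=120),
    "review_interval_days": 90,                           # assumed audit cadence
}

def due_for_live_fire(entry, today=None):
    """Flag signatures whose last live-fire review is older than the review interval."""
    today = today or date.today()
    return today - entry["last_live_fire"] > timedelta(days=entry["review_interval_days"])

print(due_for_live_fire(tracking_entry))  # True -- 120 days since last audit > 90-day interval
```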

Thus there has to be a neutral organizing party, the deconfliction point, normally called a "White Team." This coordinating cell ensures that the exercises remain in scope and that the teams do not "game the game." The "Kill Chains" or "Attack Paths" should be well understood and planned out by the White Team prior to the Red Team executing its actions. The Red Team must be expected to record all of the critical steps along this attack path. Recording as much hard evidence as possible on both sides is critical for this deconfliction, as well as for improving their detection and response techniques. The Red Team should be able to prove with flags, network captures, terminal logs, and recorded evidence in logs that they carried out their actions, or did not carry out actions (the White Team can also tap the Red Team's egress traffic). These IOCs then allow the Blue Team to easily verify activity and identify potential detection gaps.
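
One way the Red Team might package that evidence so the White Team can vouch for its integrity is sketched below. The helper name, manifest format, and file paths are assumptions; the point is simply that every artifact gets hashed and timestamped as it is recorded.

```python
import hashlib
import json
import time
from pathlib import Path

def add_evidence(manifest_path, artifact_path, action_note):
    """Append a hashed, timestamped entry for a piece of Red Team evidence
    (pcap, terminal log, tool output) to a manifest held by the White Team."""
    artifact = Path(artifact_path)
    entry = {
        "artifact": str(artifact),
        "sha256": hashlib.sha256(artifact.read_bytes()).hexdigest(),
        "recorded_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "action": action_note,
    }
    manifest = Path(manifest_path)
    entries = json.loads(manifest.read_text()) if manifest.exists() else []
    entries.append(entry)
    manifest.write_text(json.dumps(entries, indent=2))
    return entry

# Example (hypothetical paths): record a packet capture of a lateral-movement step.
# add_evidence("white_team_manifest.json", "captures/lateral_move.pcap",
#              "SMB connection from workstation-12 to fileserver-03")
```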

Tests should be highly targeted in both the systems they are trying to breach and the detection / control methods they are testing. Here, more than ever, it is critical that testers send kickoff and end emails to the White Team, so the White Team knows for certain when live-fire hours are and which hosts are active, in the event that an action has unintended effects on production systems. It further greatly helps both sides to keep detailed logs and metrics about the attacks and the analysis. Luckily for defenders, most of their tools already keep forensically sound records and record user interaction. Attackers can emulate good record keeping by recording all of their network activity, logging their command line, and saving the output of tools like Burp and Nmap. Further, these logs can be maintained by the White Team to provide integrity for the exercises and used to give the Blue Team hints when they are utterly lost. While initially working through these exercises, it can help the White Team to break "actions" down into turn-based steps or "ticks", with both sides moving at the same cadence. The Red Team will initially control this cadence and can execute its actions every 15, 30, or 45 minutes during the controlled exercises. As the exercises advance, you can remove the 'tick' of the game entirely. These are some details to record at each 'tick' (a sketch of a shared log for these records follows the list):

Red Action: Command executed - time - user - hostname - IP - comment
Blue Detection: Rule triggered - time triggered - time acknowledged - system - automations triggered - comment
Blue Action: Investigation / Containment - time - host - control - comment
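
A minimal sketch of how the White Team might capture those records in one shared log: the CSV columns mirror the fields above, while the helper name, hosts, and rule names are purely illustrative.

```python
import csv
import os
from datetime import datetime, timezone

TICK_FIELDS = ["tick", "team", "record_type", "detail",
               "time", "host", "user_or_rule", "comment"]

def log_tick(path, tick, team, record_type, detail, host="", user_or_rule="", comment=""):
    """Append one Red or Blue tick record to the shared exercise log kept by the White Team."""
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=TICK_FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "tick": tick, "team": team, "record_type": record_type, "detail": detail,
            "time": datetime.now(timezone.utc).isoformat(),
            "host": host, "user_or_rule": user_or_rule, "comment": comment,
        })

# Hypothetical entries for one tick of a controlled exercise:
# log_tick("exercise_log.csv", 3, "red", "action", "whoami /all",
#          host="ws-12", user_or_rule="svc_backup")
# log_tick("exercise_log.csv", 3, "blue", "detection", "recon command line alert",
#          host="ws-12", user_or_rule="cmdline_recon_rule",
#          comment="acknowledged 10 minutes after trigger")
```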

In my experience, it's best to iterate on these tests in quick succession, performing multiple small tests (daily or weekly) and determining whether they were caught within the given SLAs (1-24 hours). Sometimes it helps to have a daily cadence, such that new attacks are covered in the morning and debriefed by the end of the day. I also think it's important to consider your classification of threats, or what threats you're preparing for. Threats can vary greatly, and a rockstar team at finding APTs may actually have a hard time with commodity malware outbreaks, and vice versa. To get a better understanding of how I classify various threats and some emulation strategy there, see my post on "known known" vs "known unknown" vs "unknown unknown".