Application (or software) security is a field of seemingly infinite complexity. All of us know that where there is complexity, security issues lurk in the dark corners. This post is my ramblings on how I believe Application Security and Incident Response need to come together to handle incidents. As usual, I make references to the cloud because that’s what I do for my day job, but the argument is remarkably similar for on-premises companies that are not in the cloud.
First came the test…
When software fails before it is deployed, it is typically deemed a “safe” failure, i.e., someone who is looking can do something about it. What makes software development easier in the cloud is that the cloud makes it easy for people to look. In AWS, for example, you have CloudWatch and things like liveness (health) checks in load balancers, which can “trigger” events, and you can create “operational handlers” for those events with technologies like AWS Lambda and Amazon EventBridge. This is not new to security people. For as long as I can remember, security operations and IT Ops teams (from the pre-“cyber” days) have had runbooks on what to do when something did not work as expected. Today, this happens a lot more as the complexity of software has increased. Software engineers need to ensure that they have “eyes” in their test environment even more than they have in production because, quite frankly, they can have as much instrumentation as they want in test: who cares if there is a minor performance impact there? It is still important to architect things so that observability has no impact, but in test you can go to town as long as the ends justify the means. Keeping a close eye on test failures helps avoid failures in production – this is all Captain Obvious advice.
So what does that translate to in the cloud? You can tag resources and have separate dashboards for those tags. Make sure you pay close attention to “thresholds”: when to alert and what to do when certain measures hit them. You can also get creative with anomaly detection tooling or “Machine Learning” (there … I said it!). The core point is that the more you observe your test environment, the fewer failures you see in production. And when you see fewer failures in production, you can do what I am going to talk about next.
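To make the tagging and threshold idea a bit more concrete, here is a minimal Python sketch. It is deliberately not tied to any AWS API: the metric names, limits, tag key and runbook actions are all made up for illustration, and in a real setup the samples would come from whatever monitoring service you use.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str
    limit: float
    action: str  # hypothetical runbook step to take on a breach


@dataclass
class Sample:
    metric: str
    value: float
    tags: dict  # e.g. {"environment": "test"}; the tag key is an assumption


def breaches(samples, thresholds, environment):
    """Return (metric, value, action) for samples tagged with
    `environment` whose value is above the configured limit."""
    limits = {t.metric: t for t in thresholds}
    hits = []
    for s in samples:
        if s.tags.get("environment") != environment:
            continue  # only look at the environment we were asked about
        t = limits.get(s.metric)
        if t is not None and s.value > t.limit:
            hits.append((s.metric, s.value, t.action))
    return hits


# Illustrative values only; none of these come from a real system.
THRESHOLDS = [
    Threshold("http_5xx_rate", 0.01, "page-on-call"),
    Threshold("auth_failures_per_min", 20, "open-ticket"),
]

SAMPLES = [
    Sample("http_5xx_rate", 0.03, {"environment": "test"}),
    Sample("auth_failures_per_min", 5, {"environment": "test"}),
    Sample("http_5xx_rate", 0.20, {"environment": "prod"}),
]

if __name__ == "__main__":
    for metric, value, action in breaches(SAMPLES, THRESHOLDS, "test"):
        print(f"{metric}={value} exceeded its threshold -> {action}")
```

Keeping the "what to do" next to the "when to alert" is the point here: a threshold without a runbook action is just a number on a dashboard.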
Then came the failure…
When software fails in production, assuming you have been diligent in your testing and have “exception handlers” in place either in code or in operations, it is categorized as an incident. While operational incidents are just as important and may have as much impact as security incidents, I will restrict this discussion to security incidents (because that’s what I do). What sets AWS and other mature software shops apart from the rest is their rigor in running blameless post-mortems on security incidents. Everyone loves to say “there will be no finger pointing”, but in reality that’s really, really difficult to implement, and that’s where a culture of fact-finding has a big impact on “getting to the bottom of it”. A security incident *never* happens without an error on someone’s part – this is critical to recognize. For a security incident to happen, someone had to make a mistake, but it’s important to realize that security is not an exact science, so mistakes are inevitable. What “security maturity” means is that you don’t make a knee-jerk reaction (aka scorched earth mentality) and instead recover stronger.
Mature companies dive into whether the security incident involved a requirements error, a design error, an implementation error or a deployment error. The answer tells application development teams where they can improve their development process and their developer tooling so the misses don’t recur. It tells their Application Security engineers where they can improve their pre-deployment checkers or integration tests, as well as the tooling that detects deviations from security expectations (or “security invariants”). The operations teams should also determine where their checkers can do better and how response times can be reduced to, say, half of what they were.
Asking the question “why?” repeatedly when reasoning about the events during an incident can give unique insight into how to improve.
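One way to keep that classification honest is to record it in a structured form for every post-mortem and look at the totals over time. The Python sketch below is just an illustration of that idea: the field names, incident IDs and numbers are invented, and the "half of what it took" target is simply the rule of thumb from above written down as arithmetic.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    """Where in the lifecycle the error behind the incident was introduced."""
    REQUIREMENTS = "requirements"
    DESIGN = "design"
    IMPLEMENTATION = "implementation"
    DEPLOYMENT = "deployment"


@dataclass
class PostMortem:
    incident_id: str          # hypothetical identifier
    phase: Phase
    minutes_to_respond: int
    follow_up: str            # the improvement agreed on, not a person to blame


def errors_by_phase(post_mortems):
    """Count incidents per lifecycle phase to show where to invest."""
    return Counter(pm.phase for pm in post_mortems)


def response_targets(post_mortems):
    """Next-time response target per incident: half of what it took."""
    return {pm.incident_id: pm.minutes_to_respond / 2 for pm in post_mortems}


# Invented example data, for illustration only.
EXAMPLES = [
    PostMortem("INC-1", Phase.DESIGN, 90, "add threat-model review"),
    PostMortem("INC-2", Phase.DEPLOYMENT, 40, "add pre-deployment checker"),
    PostMortem("INC-3", Phase.DEPLOYMENT, 60, "alert on config drift"),
]

if __name__ == "__main__":
    for phase, count in errors_by_phase(EXAMPLES).most_common():
        print(f"{phase.value}: {count}")
    print(response_targets(EXAMPLES))
```

Note that the record has a `follow_up` field and no "who did it" field, which is a small way of baking the blameless part into the tooling.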
Then came the win!
When the AppSec, AppDev, SecOps and IT Ops teams collaborate and help each other run such blameless post-mortems, they end up with better telemetry, better tooling and process, better detection when something goes wrong (“alerts”) and a healthier environment of positive feedback. No one is blameless: recognize it, improve from it and help each other toward better organizational security.