Machine Learning Security in the age of Supply Chain Attacks


The recent discovery of the “xz attack” suggests that nation states have realized that the software supply chain is likely the “best” vector for impacting large-scale systems in big organizations. With cloud computing providers being the “source of computing” for most large corporations today, we should anticipate that a larger portion of attacks will fall into this category. Also, just like “sleeper cells” in traditional espionage, such “sleepers” may exist in numerous OSS projects. Does that mean we should stop using open source? Hell no. All it means is that we need to be careful. Can we detect these attacks? They are tough to detect, but yes, we can detect them with good old-school telemetry and observability.

But that’s not what this blog post is about. For me, the most interesting bit of the xz attack was that libraries that are harder to debug and decode are much juicier targets. Why does that matter? Super popular ML libraries such as PyTorch and TensorFlow are quite hard to compile from scratch, out of band. Such libraries can have interesting attack vectors that allow nice pickle compromises, where loading a serialized model file ends up executing attacker-controlled code. I say “nice” because the insecure deserialization family of weaknesses has existed in CWE since 2006! It’s older than many other issues and will continue to exist.
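
To make the pickle risk concrete, here is a minimal sketch of one common mitigation: a restricted unpickler that only resolves globals on an allow-list. The allow-list contents, the `Malicious` class and the function names are made up for illustration, and this is not how any particular ML library loads models. It only narrows what a pickle can reference; it does not make untrusted pickles safe in general.

```python
import io
import pickle

# Hypothetical allow-list for illustration: only these (module, name) pairs may be resolved.
ALLOWED_GLOBALS = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global not on the allow-list."""

    def find_class(self, module, name):
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


def restricted_loads(data: bytes):
    """Deserialize bytes while rejecting unexpected globals."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Malicious:
    """Stand-in for an attacker-controlled object hidden in a model file."""

    def __reduce__(self):
        # A real payload would name something like os.system; print is harmless here.
        return (print, ("arbitrary code ran during unpickling",))
```

Plain data such as `{"weights": [1.0, 2.0]}` still round-trips through `restricted_loads`, while the pickled `Malicious` object is rejected before its callable is ever invoked.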

My only hope is that maintainers of core ML projects such as PyTorch, TensorFlow, Keras and others start showing a slightly higher level of paranoia and invest in build reproducibility, so that supply chain attacks on such harder-to-debug libraries can be avoided.
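
Reproducible builds pay off because anyone can rebuild an artifact and compare digests. A minimal sketch of the consumer side of that check, assuming you already have a trusted SHA-256 value from somewhere (the function names are mine, not from any of these projects):

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large wheels or model files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, expected_sha256: str) -> bool:
    """True only if the file's digest matches the trusted, published digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_sha256.lower())
```

If an independently rebuilt artifact does not produce the same digest as the published one, that mismatch is exactly the signal a reproducible build is meant to surface.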


Security Considerations in Blue-Green Deployments


tl;dr: Blue-Green deployment is a strong strategy for applications with critical uptime requirements, but if a deployment fixes critical security issues, be sure that the definition of “deployment complete” is the decommissioning of the “blue” environment and not just the successful deployment of “green”.

Organizations have gotten used to following Continuous Integration/Continuous Deployment (CI/CD) for software releases. The use of cloud solutions such as AWS Code* utilities, Azure DevOps or Google Cloud Source repositories enables enterprises to quickly and securely accomplish CI/CD for software release cycles. Software upgrades undergo the same CI/CD tooling and the dev teams need to choose how to do the upgrades.

There are a few different ways development teams upgrade their applications: full cut-over (same infrastructure, new codebase deployed directly and migrated in one go), rolling deployments (same or new infra, gradually upgrading all instances), immutable (brand new infra and code for each deployment, migrated in one go) and blue-green deployment (old and new stacks deployed simultaneously in prod, gradually phasing out the old instances once tests on a section of traffic succeed). Because it allows a new release to be validated against real traffic before the old one is retired, the blue-green deployment strategy has become quite popular for modern software deployment.

What is a Blue-Green deployment?
When you release an application via the Blue-Green deployment strategy, you gradually shift traffic as tests succeed and your observability (CloudWatch alarms, etc.) does not indicate any problems. You can do that with containers (ECS/EKS/AKS/GKE) or serverless functions (AWS Lambda/Azure Functions/Google Cloud Functions), and traffic shifting can be done with the help of DNS solutions (Route 53/Azure DNS/Google Cloud DNS) or load balancing solutions (AWS Elastic Load Balancing/Azure Load Balancer/Cloud Load Balancing). Put simply: take your current (blue) deployment, create a full new stack (green), and use either DNS or load balancers to slice out a section of traffic and test the “green” stack. This is all happening in production, by the way. Once everything looks good, direct all the traffic to green and decommission the “blue”. This helps maintain operational resilience, which is why this is a popular deployment strategy. AWS has a solid whitepaper that I recommend reviewing if you are interested in diving in from a solution architecture standpoint.

Security Considerations

Some critical security issues (e.g., remote code execution via Log4j, remote code execution via Struts, etc.) demand immediate fixes because of their severity. If your blue-green deployments take days because your tests run over a very long period, then any security fix you make only takes full effect after a successful “green” deployment and the “blue” decommission. If an attacker managed to get a foothold in the impacted “blue” environment during that window or before it, then decommissioning “blue” becomes critical to being able to claim that the issue is fully remediated. Typically, incident responders and security operations professionals breathe a sigh of relief when the fix is deployed. Typically, software engineering teams consider a fix deployed when “green” is fully handling all traffic (which is unrelated to the decommissioning of the “blue” environment). In this case, the incident responders need to remember that the risk is not truly mitigated at deployment time; it’s mitigated after the cut-over to green is complete and blue is decommissioned. There is a subtle, yet important, difference here, and it really comes down to the use of shared vocabulary. As long as security operations and software development teams share this definition of what deployment means, there are no misunderstandings.
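
One way to make that shared vocabulary explicit is to encode both definitions side by side. This is only a sketch; the `RolloutState` fields and function names are hypothetical and not tied to any cloud provider's API.

```python
from dataclasses import dataclass


@dataclass
class RolloutState:
    green_traffic_percent: int   # share of production traffic served by the fixed stack
    blue_instances_running: int  # instances of the old, vulnerable stack still alive


def deployment_complete(state: RolloutState) -> bool:
    """Engineering view: green handles all traffic."""
    return state.green_traffic_percent == 100


def remediation_complete(state: RolloutState) -> bool:
    """Security view: green handles all traffic AND blue is gone."""
    return deployment_complete(state) and state.blue_instances_running == 0
```

With `RolloutState(green_traffic_percent=100, blue_instances_running=3)`, the first check passes and the second does not, which is precisely the window in which a foothold on “blue” still matters.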


Security Considerations in use of AI/ML


The world of Artificial Intelligence (AI) seems to be exploding with the release of ChatGPT. But as soon as the chatbot came into the hands of the public, people started finding queries that made it sabotage itself (exploitable issues, at worst), along with some weird interactions through which people could write malware that stayed undetected by Endpoint Detection and Response (EDR) tools.

What is AI?

Very simply, Artificial Intelligence (per Wikipedia) is intelligence demonstrated by machines. Technically, though, it is a set of algorithms that lets a machine do things a human does by making inferences, much as humans do, on the basis of data that was historically provided as a “reference” for making decisions. This reference data is called training data. The data used to test how effectively the algorithm arrives at a decision on the basis of that reference is called test data. Any good machine learning course teaches how to design data, how much data to use for training versus testing, and which performance metrics to use, but that is not relevant to our discussion here. What’s important is that the data you provide controls the decision-making in an artificially intelligent algorithm. This is a key difference from typical algorithms, where the code is more or less static and makes decisions based on certain states in the program; in an artificially intelligent system, the program can arrive at different decisions depending on how one decides to “train” the algorithms.

What is ML?

Machine Learning (ML) is a subset of Artificial Intelligence (AI) in which the algorithms evolve their decision-making on the basis of data that has been processed and tagged as training data. ML systems have been used for spam classification and for anomaly detection in computer security. These systems tend to use statistical inference to establish a baseline and highlight situations where the input data does not fall within the norm. When operational data is used to train an ML-based system, one has to be careful not to incrementally alter the baseline of what’s normal and what’s not. Such “tilting” may happen over time, and it’s important to protect such systems against drift. Some “drift” is OK but “bad drift” is not, and telling them apart ahead of time is hard. E.g., if you classify some data inaccurately and accidentally or maliciously end up using it to train your ML models, and it alters the behavior of the model, then the model becomes unreliable.
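
A minimal sketch of one way to watch for that tilting: keep a frozen baseline and measure how far recent data has moved from it, instead of silently folding operational data back into the baseline. The three-standard-deviation threshold and the function names are assumptions of mine; real systems use richer drift measures.

```python
from statistics import mean, stdev


def drift_score(baseline, recent):
    """How many baseline standard deviations the recent mean has moved."""
    spread = stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variance; pick another drift measure")
    return abs(mean(recent) - mean(baseline)) / spread


def is_bad_drift(baseline, recent, threshold=3.0):
    """Flag drift beyond the threshold (3.0 is an arbitrary illustrative value)."""
    return drift_score(baseline, recent) > threshold
```

The design choice that matters here is that `baseline` stays fixed until a human decides to move it, so an attacker cannot shift “normal” a little at a time.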

What is Adversarial AI?

Adversarial Artificial Intelligence (AI) threats are ones where malicious actors design inputs to make models predict erroneously. There are a couple of different types of attack here: a poisoning attack (where models are trained with bad data controlled by adversaries) and an evasion attack (where the artificial intelligence system is made to draw a bad inference with a security implication). The way to understand these attacks is that the poisoning attack is basically “garbage in, garbage out”, but it’s a really “special” garbage. This is garbage that changes the behavior of the algorithm so that it returns an incorrect result when it has to make a decision. Evasion attacks are different in that the decision is wrong because the input appears different to the ML algorithm than it does to humans. E.g., Gaussian noise being classified as a human, or a fingerprint being matched incorrectly.
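
To illustrate the evasion idea, here is a toy sketch against a hand-written linear classifier. The weights and inputs are invented, and the perturbation is a simple sign-of-gradient step in the spirit of the fast gradient sign method, which is my choice of technique for illustration rather than anything from a specific attack described above.

```python
def score(weights, bias, x):
    """Linear decision score; positive means class 1."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict(weights, bias, x):
    return 1 if score(weights, bias, x) > 0 else 0


def evasion_perturb(weights, x, epsilon):
    """Nudge every feature by at most epsilon in the direction that lowers the score."""
    def sign(value):
        return (value > 0) - (value < 0)

    return [xi - epsilon * sign(w) for w, xi in zip(weights, x)]
```

With weights `[1.0, -2.0, 0.5]`, bias `-0.25` and input `[0.6, 0.1, 0.3]`, the original input is classified as 1, while the input perturbed with `epsilon=0.1` is classified as 0, even though no feature moved by more than 0.1.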

Can we attack these systems in other ways?

In a paper, Google researchers presented a tool (TensorFuzz) with which they demonstrated finding a few varieties of bugs in Deep Neural Networks (DNNs). So typical software attack techniques do work against deep neural networks too. Fuzzing has been used for decades and has been causing faults in code forever. At its core, fuzzing is simple: send garbage input that causes a failure in the program. It’s just that the failures in a DNN are different, and you want to ensure that the software relying on the DNN to make a decision handles such failures appropriately, with secure defaults, so that they do not turn into security failures.
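
A small sketch of what “secure defaults” can look like in the calling code: if the model crashes or returns something nonsensical, the caller fails closed. `fragile_model` is a toy stand-in I made up, not a real DNN, and the `allow`/`deny` labels are hypothetical.

```python
import math


def fragile_model(payload: bytes):
    """Toy stand-in for a DNN: crashes or emits NaN on some inputs."""
    if not payload:
        raise ValueError("empty input")
    if payload.startswith(b"\xff"):
        return ("allow", float("nan"))
    return ("allow", 0.9)


def classify_with_secure_default(model, payload, deny_label="deny"):
    """Return the model's label, or deny_label if the model fails or misbehaves."""
    try:
        label, confidence = model(payload)
    except Exception:
        return deny_label  # a crash must never turn into an implicit "allow"
    if label not in ("allow", "deny"):
        return deny_label
    if not isinstance(confidence, float) or math.isnan(confidence):
        return deny_label
    if not 0.0 <= confidence <= 1.0:
        return deny_label
    return label
```

Feeding the wrapper fuzzed inputs then becomes a cheap regression test: whatever the model does internally, the only acceptable outcomes are a valid label or a denial.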

Protection mechanisms

There are a few simple ways to look at ML systems and their security. Microsoft released an excellent how-to on threat modeling ML systems. Additionally, using adversarial training data is imperative to ensure that artificially intelligent algorithms perform as you expect them to in the presence of adversarial data. When you rely on ML-based systems, it’s all the more important that you test them appropriately and continue to do so against baselines. Unfortunately, for Deep Neural Networks, transparency of decision-making continues to be an issue, and AI/ML researchers still need to establish appropriate transparency measures.


What to do when things go wrong?


I blogged earlier about blameless post-mortems and how one gets to the point of being able to do them: by having operational rigor and observability. This is more of a lessons-learnt post about what you should and shouldn’t do when things go wrong.

Focusing on the Who?

A lot of the time, focusing on “who reported the issue?” is focusing on the wrong thing. Whether a report comes from a penetration test, a bug bounty researcher or an internal security engineering resource, you need to make sure that the impact and likelihood are clearly understood. There are times when customers (who pay or intend to pay for your service) report problems, and these are obviously more important.

Focusing on the How?

How a security issue gets reported is important. For example, you might learn about a security issue via a bug report (1), via your own telemetry (2), or on Twitter! There is a potential for legal ramifications in each of these cases, and the risks might be different. When things become public without you having been notified, you have a role to play in instilling confidence in your current customers. The best approach here tends to be sticking to the facts without any speculation. If you are working on an incident, say so. Don’t say “we are the most secure” when you are the subject of a breach discussion, especially because you already have data showing that you are not that secure. Identification and containment of the security issue are the top priorities. Do not pull the people doing that work away to make sure public relations look good; doing that will eventually make public relations bad! Involve lawyers in your communications process and mark communications with the right legal tags (“attorney-client privileged material”) so that, if litigation happens, you can clearly demarcate which evidence can and cannot be part of discovery.

Focusing on the What?

“What” needs to be done has to be made clear with the help of an incident manager. The incident manager is the most well-read person on the incident, a subject matter expert who leads the response process. Having this single-threaded ownership of the incident is incredibly important. The role of the incident manager is to ensure they have all the information they need to make decisions. This also streamlines public relations, legal needs and incident cleanup (eradication and recovery), and helps with swift and focused decision-making. Depending on impact, this can sometimes be crisis management; otherwise it can be just another day in the security operations office. The key trait here is focused, goal-based decision-making. Adrenaline can run high and tempers can flare, which typically happens when you are unprepared to handle security incidents. The tempers and nervousness can be avoided by proactively doing tabletop exercises and incident dry runs and by having good runbooks. But all practice games do is prepare you for the real thing, and the real thing is how you handle a true incident. Use the help of key stakeholders to arrive at the best decisions. There will often be situations where no answer looks good, and that is where customer focus comes in: if you focus on the well-being of customers, you will rarely go wrong.

Focusing on the Why?

Capture incident response logs in tickets and communications so that the full timeline and all actions are properly documented. After the recovery is completed, do a blameless post-mortem of how you got there. Ensure you commit to an agreed-upon timeline for the agreed-upon corrective actions, and don’t waver; this is part of the operational rigor one needs to follow to really keep incidents from happening in the future. Typically, issues happen because something was not prioritized as it should have been. Reassess and reprioritize. Sometimes the size of the incident makes your reprioritization feel almost coerced, and it’s OK to be coerced in that direction. You will find that the coercion is simply an acceleration of actions that you should have taken earlier. No one is perfect, just come out of it better!

Focusing on the Where?

Where you discuss the issue is important. When sizable incidents happen, discuss them openly with the business leaders so that full awareness and feedback are provided in “powerful forums”. This obviously does not mean that you break your attorney-client privilege; it just means discussing the incident with the most senior leaders in a manner where action items, impact and post-mortem results are provided. This enables the business to become resilient over time and to develop confidence in the security teams. If you need to do public releases, ensure that lawyers, security SMEs and business leaders have all read them, and only then release. Don’t ever let a situation occur where the left hand doesn’t know what the right hand is doing. This instills customer confidence in your process.


This was just an attempt to pen down my thoughts as they appeared in my brain. I am sure I forgot a lot, such as the stress of incident handling, avoiding knee-jerk reactions, etc., but these are the most important things that I felt were necessary to share. Remember, incident handling gets better with practice, and you want that practice to happen in practice games, not in the Olympics! 🙂


The historical evolution of Cross-Site Request Forgery


Having been in application security for more than two decades, and now officially completing my 18th year of being meaningfully employed in that space, there is just a lot of crud that I have gathered in my brain. Most of it is the history of how things came to be. That stuff is likely not interesting to most, but I find it intriguing how some seemingly minor decisions by one software vendor can have a massive impact on the web application security industry.

Oh the dreaded IE…
Internet Explorer 4, 5 and 6, from around the Windows XP days (or even earlier, can’t recall), had a quirk: the cookie jar was not shared. That is, if you opened a new window to a site, you would have to log in again unless you used the “Ctrl+N” key to open a new window from an existing session. Each new process would have its own cookie jar. For the uninitiated, the “cookie jar” is the browser’s internal storage for cookies. Cookies are random-looking strings that act as a “trust token” that a web server places in the browser. Since HTTP is a stateless protocol, the cookie is what preserves the “state”, and this is exactly what authorization decisions in an HTTP context are typically based on. In the cookie jar, each cookie is stored with its name, value, domain and path (today there are a few other attributes, but that wasn’t the case back in 2004-2005). The browser gets all these parameters from the Set-Cookie HTTP response header. Microsoft, the vendor of Internet Explorer, made a decision that each new window of IE should have its own set of stored cookies that were not shared. Mozilla Firefox and Google Chrome always had a shared cookie jar, if I recall correctly.

Along came a Cross-Site Request Forgery (CSRF)…

Jesse Burns from iSEC Partners (an NYC-based security consultancy that was later acquired by NCC Group) wrote a paper back then which I think was the seminal paper on Cross-Site Request Forgery. They called it “XSRF” because “XSS” was already in parlance. Thereafter, in 2006, there were presentations about it by Microsoft. The whole attack was simple. The victim has a browser tab open in which they are logged into a site that has issued a cookie value for that session. The browser resends that cookie, as a Cookie HTTP request header, whenever an HTTP request is sent to that site, no matter which page triggered the request. The browser same-origin policy (a concept that Netscape designed in 1995), which compares domain, protocol (“scheme”) and port, stops the attacker’s page from reading the response but does not stop the request from being sent. There were a few idiosyncrasies of IE (which made it infamous back then), such as the fact that IE did not complain if the port number did not match and would treat the access as allowed per the Same-Origin Policy (SOP). What does that mean? http://example.com and http://example.com:81 would be treated as the same origin! Weird, right? That wasn’t the case with other browsers. This was also documented in Michal Zalewski’s book The Tangled Web in 2011. Where am I going with this? While IE did some weird things, it did one good thing: it isolated cookie jars. So if you opened a new window in which the attacker ran a payload that sent a request to the site that had handed you a cookie, the new IE window would have no interesting cookies to share with that site, inadvertently protecting the user from becoming a victim of CSRF. Yes, the hated IE protected users from being victims of CSRF! Who would have thought? That’s how weird 2005 was 🙂

Fast forward…

Today, all browser vendors have the concept of a shared cookie jar, because who doesn’t like opening new tabs of their favorite cloud consoles without having to log in again, right? So what did we the people do? We came up with another attribute that could be added to a Set-Cookie HTTP response header: the SameSite attribute, which restricts the cookie from being sent unless the request originates from a page on the same site as the cookie issuer.
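
For the curious, a minimal sketch of what setting that attribute looks like on the server side, using Python’s standard `http.cookies` module (which supports `samesite` from Python 3.8). The cookie name `session`, the function name and the choice of `Lax` are assumptions for illustration.

```python
from http.cookies import SimpleCookie


def build_session_cookie(session_id: str) -> str:
    """Build the value of a Set-Cookie header for a session cookie."""
    cookie = SimpleCookie()
    cookie["session"] = session_id
    cookie["session"]["path"] = "/"
    cookie["session"]["secure"] = True     # only send over HTTPS
    cookie["session"]["httponly"] = True   # hide from page JavaScript
    cookie["session"]["samesite"] = "Lax"  # withhold on most cross-site requests
    return cookie["session"].OutputString()
```

`Strict` withholds the cookie on all cross-site requests, while `Lax` still sends it on top-level navigations such as following a link, which is usually the friendlier trade-off for sites people arrive at from elsewhere.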

So there you have it… the history of SameSite and how one of the most hated browsers of the day (IE) did one good thing for users – protect them from CSRF! 🙂


Application Security and the Incident Response Process


Application or software security is a field of infinite complexity. All of us know that where there is complexity, security issues lurk around in those dark corners. This post is about my ramblings on how I believe Application Security and Incident Response need to come together to handle incidents. As usual, I make references to the cloud because that’s what I do for my day job but the argument is remarkably similar even for on-premises companies that are not in the cloud.

First came the test….

When software fails before it is deployed, it is typically deemed a “safe” failure, i.e., someone who is looking can do something about it. What makes software development easy in the cloud is that the cloud makes it easy for people to look. In AWS, for example, you have CloudWatch, and you also have things like liveness checks in load balancers which can “trigger” events; you can create “operational handlers” for those trigger events via technologies like AWS Lambda and Amazon EventBridge. This is not new to security people. For as long as I can remember, security operations and IT Ops teams (from the pre-“cyber” days) have always had runbooks on what to do when something did not work as expected. Today this happens a lot, as the complexity of software has increased. Software engineers need to ensure that they have “eyes” in their test environment even more than they have in production because, quite frankly, they can have as much instrumentation as they want in test environments; who cares if there is a minor performance impact? It’s important to re-architect so that observability has no impact, but you can still go to town as long as the ends justify the means. Keeping a close eye on test failures helps avoid failures in production. This is all Captain Obvious advice. So what does that translate to in the cloud? You can tag resources and have separate dashboards for those tags. Pay close attention to “thresholds”: when to alert, and what to do when certain measures hit them. You can also get creative and use anomaly detection tooling or “Machine Learning” (there … I said it!). The core point is that the more you observe your test environment, the fewer failures you see in production. And when you see fewer failures in production, you can do what I am going to talk about next.

Then came the failure…

When software fails in production, assuming you have been diligent in your testing and have accounted for “exception handlers” either in code or in operations, it is categorized as an incident. While operational incidents are just as important and may have as much impact as security incidents, I will restrict this discussion to security incidents (because that’s what I do). What sets organizations such as AWS and other mature software shops apart from the rest is their rigor and their approach of blameless post-mortems on security incidents. Everyone loves to say “there will be no finger pointing”, but in reality that’s really, really difficult to implement, and that’s where a culture of fact-finding has a big impact on “getting to the bottom of it”. A security incident *never* happens without an error on someone’s part, and this is critical to recognize. For a security incident to happen, someone had to make a mistake, but it’s important to realize that security is not an exact science, so mistakes are inevitable. What “security maturity” means is that you don’t have a knee-jerk reaction (aka a scorched-earth mentality) and that you recover from the incident stronger. The more maturity a company shows in diving into whether the security incident involved a requirements error, design error, implementation error or deployment error, the more it tells application development teams where they can improve their development process. It also tells their application security engineers where they can improve their pre-deployment checkers or integration tests, as well as where tooling should be developed to detect deviations from security expectations (or “security invariants”). It tells the development teams where they can improve their developer tooling so the misses don’t recur. The operations teams, too, should determine where their checkers can do better and how response times can be reduced to, say, half of what they were.
Repeatedly asking “why?” while reasoning about the events of an incident can give unique insight into how to improve.

Then came the win!

Organizations where the AppSec, AppDev, SecOps and IT Ops teams collaborate and help each other do such blameless post-mortems end up with better telemetry, better tooling and processes, better detection when something goes wrong (“alerts”) and also a healthier environment of positive feedback. No one is blameless; recognize it, improve from it and help each other toward better organizational security.


A brief history of SSRF


Server-Side Request Forgery (SSRF) is a security issue in applications where an attacker is able to get a server to send some type of request (these days mostly HTTP/S requests) that the server should not be sending. This issue is the classic abuse-of-trust vulnerability: the server tends to sit in a “trusted” environment (e.g., a DMZ, your cloud VPC, etc.) while the users of the application sit outside the trust boundary (e.g., mobile devices, cafes, homes, corporate environments, API clients within and outside the cloud).
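
A minimal sketch of the kind of destination check a server can run before making a request on a user’s behalf: resolve the hostname and refuse anything that is not a globally routable address. The function names and the injectable `resolver` parameter are my own for illustration; this is one layer of defense, not a complete SSRF fix.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def resolve_host(host: str):
    """Return every IP address the host resolves to (uses the system resolver)."""
    return sorted({info[4][0] for info in socket.getaddrinfo(host, None)})


def is_safe_destination(url: str, resolver=resolve_host) -> bool:
    """Allow only http(s) URLs whose host resolves exclusively to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = resolver(parts.hostname)
    except OSError:
        return False  # unresolvable hosts are rejected, not retried
    if not addresses:
        return False
    try:
        return all(ipaddress.ip_address(addr).is_global for addr in addresses)
    except ValueError:
        return False  # anything that is not a parseable IP address is rejected
```

Note that the check and the actual request must use the same resolved address; otherwise an attacker who controls DNS can answer differently the second time. Loopback, private ranges and link-local addresses such as the cloud metadata endpoint at 169.254.169.254 all fail the `is_global` test.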

A brief history
In this blog post, though, I won’t be talking about all the fancy new things that have had SSRF issues; you can likely find a few hundred of those anyway! I am going to give a brief history of this issue and of what happened before we gave it a “name”: SSRF. The earliest references to the name “SSRF” appear to come from a talk given at Black Hat USA 2012, and the Wayback Machine tells me that the CWE-918 page was authored sometime around 2013. If you look closely at the CWE-918 page, though, you will find old CVEs dating back to 2002 and 2004. There was a ShmooCon talk about it in 2008 too, but the term SSRF was not established until 2012.

The Issue

Circa 2010, I was working on a penetration test for a financial services firm, targeting a popular load balancer that offered a GTM (Global Traffic Manager) solution. It allowed folks to log in and obtain restricted execution environments through which only certain applications were exposed, for use by remote workers or untrusted entities to whom you want to expose only those applications. The issue was in a POST request after login and, IIRC, even before login (though my memory of the details is fuzzy). To my knowledge the issue was not issued a CVE and wasn’t associated with a knowledge base article, though I can’t be 100% sure. The guidance from the vendor was simple: update the software and move on. The issue was this: an authenticated or unauthenticated user could send an HTTP POST request to a page with a base64-encoded parameter that included a hostname, which would trigger a DNS request on the back end of the GTM site. The time it took for the response to come back would indicate whether the hostname was legitimate or not. So, sitting outside on the Internet, I took a popular dictionary and enumerated which hostnames from that dictionary were legitimate and which were not, thereby mapping the hosts on the internal network.

The backstory
What the vendor of the GTM software did not know was how critical this application was to the customer’s business. They seemed to be dragging their feet without providing updates, and meanwhile the customer, a financial institution with lots at stake, could not go live. The pressure mounted on the IT staff to fix the issue, and the vendor, while responsive, was unable to give a firm date quickly. Remember, this was 13-14 years ago, before bug bounties, when responsible disclosure was still quite clunky! The customer was also asking me to push the software vendor so we could discuss the issue. Thankfully, on the vendor side, there was a solid security person who immediately understood the issue and its impact and advised the software teams to do what was right. They moved the process behind authentication, and they also added tokens, limits and constant-time responses to fix the issue.
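
I never saw the vendor’s actual fix, so the following is only a sketch of the constant-time idea under my own assumptions: pad every response up to a fixed floor so that a fast “host exists” answer and a slow “host does not exist” answer look the same from outside. The function name and the injectable `clock` and `sleep` parameters are hypothetical.

```python
import time


def respond_in_constant_time(handler, floor_seconds, clock=time.monotonic, sleep=time.sleep):
    """Run handler() and hold the result until floor_seconds have elapsed."""
    started = clock()
    try:
        result = handler()
    except Exception:
        result = None  # failures look the same as "not found" to the caller
    remaining = floor_seconds - (clock() - started)
    if remaining > 0:
        sleep(remaining)  # pad fast answers up to the floor
    return result
```

This only hides differences smaller than the floor, so the lookup inside `handler` also needs a timeout below `floor_seconds`; combined with per-request tokens and rate limits, it makes dictionary-style enumeration far less practical.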

Fast forward
Today, obviously things are a lot better. And I wrote this blog post so the old me can look back and point to this in a meaningful way without forgetting the old experiences among the new.