Machine Learning Security in the age of Supply Chain Attacks


As can be seen from the recent “xz attack” discovery that there nation states have realized that this is likely the “best” vector to impact large-scale systems in big organizations. With the cloud computing providers being the “source of computing” for most large corporations today, we should anticipate that a larger portion of the attacks will fall into this category. Also, just like “sleeper cells” in traditional espionage, such “sleepers” may exist in numerous OSS projects. Does that mean we should stop using open source – hell no. All that means is we just need to be careful. Can we detect these attacks? It’s tough to detect but yes we can detect them by good ol’ school, telemetry and observability.

But that’s not what this blog post is about. I think the most interesting bit from the xz attack for me was that the libraries that get harder to debug and decode are much juicier targets. How does that matter? The ML libraries that are super popular like pytorch and tensorflow and others are quite hard to compile out of bound from scratch. Such libraries can have interesting attack vectors which allow nice pickle compromises. I say “nice” because the family of insecure deserialization has existed in CWE since 2006! It’s older than many other issues and will continue to exist.

My only hope is that maintainers of core ML projects such as PyTorch, Tensorflow, keras and others start showing a slightly higher level of paranoia and build reproducibility so the supply chain attacks can be avoided on such harder to debug libraries.


Security Considerations in Blue-Green Deployments


tl,dr; Blue-Green deployments for critical uptime applications is a strong deployment strategy but if a deployment fixes critical security issues be sure that the definition of “deployment complete” is decommissioning of the “blue” environment and not just deployment of “green” successfully.

Organizations have gotten used to following Continuous Integration/Continuous Deployment (CI/CD) for software releases. The use of cloud solutions such as AWS Code* utilities, Azure DevOps or Google Cloud Source repositories enables enterprises to quickly and securely accomplish CI/CD for software release cycles. Software upgrades undergo the same CI/CD tooling and the dev teams need to choose how to do the upgrades.

There are a few different ways development teams upgrade their applications – full cut-over (same infrastructure, new codebase deployed directly and migrated in one go), rolling deployments (same infra or new infra gradually upgrading all instances), immutable (brand new infra and code for each deployment and migrated in one go), blue-green deployment (same infra, simultaneously deployed in prod, gradually phasing out old instances upon successful tests from a section of traffic). The Blue-green deployment strategy has, therefore, become quite popular for modern software deployment.

What is a Blue-Green deployment?
When you release an application via Blue-Green deployment strategy you gradually shift traffic as tests succeed and your observability (Cloudwatch alarms, etc.) does not indicate any problems. You can do that via Containers (ECS/EKS/AKS/GKE) or AWS Lambda/Azure Functions/Google Cloud Functions and traffic shifting can be done with the help of DNS solutions (Route53/Azure DNS/Google Cloud DNS), Load balancing solutions (AWS Elastic Load Balancer/Azure Load Balancer/Cloud Load Balancing). Simplistically, take your current (blue) deployment and create a full stack (green) and use either DNS or load balancers to slice out a traffic section and test the “green” stack. This is all happening in production, by the way. Once everything looks good, direct all the traffic to green and decommission the “blue”. This helps maintain operational resilience and, therefore, this is a popular deployment strategy. AWS has solid whitepaper I recommend to review to dive in from a solution architecture standpoint if you are interested.

Security Considerations

Some critical security issues (e.g., remote code execution via Log4j, remote code execution via struts, etc.) demand immediate fixes because of their severity. If your blue-green deployments are going to take days and your tests will run over a very long period (say days) then any security fixes you make also will get fixed after a successful “green” deployment and “blue decommission” only. If during that window or prior, an attacker managed to get a foothold into the impacted “blue” environment, then even decommissioning of the “blue” becomes critical to claim the issue is fully remediated. Typically, when incident responders and security operations professionals breathe a sigh of relief is when the fix is deployed. Typically, the software engineering teams consider fix as deployed is when the “green” is fully handling all traffic (and its unrelated to the decommissioning of the “blue” environment). In this case, the incident responders need to remember its not the deployment time when the risk is truly mitigated, its mitigated after completion of cut-over to green and decommissioning of the blue. There is a subtle, yet important, difference here – and it really comes down to the use of shared vocabulary. As long as security operations and software development teams both have this shared definition of what deployment means, there are no misunderstandings.


Security Considerations in use of AI/ML


The world of Artificial Intelligence (AI) seems to be exploding with the release of ChatGPT. But as soon as the the chat bot came into the hands of public people started finding self-sabotaging queries at worst (exploitable issues) and some weird interactions whereby people could write malware that could stay undetected by Endpoint Detection and Response (EDR) bypasses.

What is AI?

Very simply, Artificial Intelligence (per Wikipedia) is intelligence demonstrated by machines. But technically, it is a set of algorithms that can make do things that a human does by making an inference, similar to humans, on the basis of data that was historically provided as “reference” to make the decisions. This reference data is called as training data. And the data which is used to test the effectiveness of the algorithm to arrive at a decision on the basis of that reference, is called as test data. Any good machine learning course teaches how do you design data and how much data to use for training and how much to use for testing and metrics of performance but that is not relevant to our discussion here – however, what’s important is that it is the data that you provide that controls the decision-making in an artificially intelligent algorithm. This is a key difference between typical algorithms (where the code is more or less static and makes decisions on certain states in the program) whereas in an artificially intelligent system you can have the program arrive at different decisions depending on how one decides to “train” the algorithms.

What is ML?

Machine Learning (ML) is a subset of Artificial Intelligence (AI) where the artificial intelligent algorithms evolve their decision-making on the basis of data that has been processed and tagged as training data. ML systems have been used in classifying spam or anomaly detection in computer security. These systems tend to use statistical inference to establish a baseline and highlight situations where the input data does not fall within the norm. When operational data is being used to train ML-based system one has to be careful that we are not incrementally altering the baselines of what’s normal and what’s not. Such “tilting” may happen over time and its important to protect against drift of such systems. Some “drift” is ok but “bad drift” is not – which is hard to predict. E.g., let’s say you classify some data inaccurately and accidentally/maliciously end up using it for training your ML-models but if it inherently alters the behavior of the ML model, then the model becomes unreliable.

What is Adversarial AI?

Adversarial Artificial Intelligence (AI) based threats are ones where malicious actors design the inputs to make models predict erroneously. There are a couple of different types of attack here – poisoning attack (where you train models with bad data controlled by adversaries) or an evasion attack (where you make the artificial intelligence system make a bad inference with a security implication). The way to understand these attacks is that the poisoning attack is basically “Garbage-in-garbage-out” but its this really “special” garbage. This is garbage that changes the behavior of the algorithm in a way that the algorithms returns an incorrect result when it has to make a decision. The inferential attacks are different in that the decision made is wrong because the input is such that it appears differently to the ML algorithm than it does to humans. E.g., Gaussian noise being classified as a human or a fingerprint being matched incorrectly.

Can we attack these systems in other ways?

In a paper presented by Google researchers created a tool (TensorFuzz) that they were able to demonstrate finding a few varieties of bugs in Deep Neural Networks (DNNs). So typical software attack techniques do work against the deep neural networks too. Fuzzing has been used for decades and has caused faults in code forever. At its core, fuzzing is simple, send garbage input that causes a failure in the program. It’s just that the failures in DNN are different and you want to ensure the software relying on the DNN to make a decision handles such failures appropriately and do not cause a security failure with secure defaults.

Protection mechanisms

There are a few simple ways to look at ML systems and security thereof. Microsoft released an excellent howto on how to threat model ML systems. Additionally, using adversarial training data is imperative to ensure that artificially intelligent algorithms performs as you expect them to in the presence of adversarial data. When you rely on ML-based systems, its all the more important that you test it appropriately and continue to do so against baselines. Unfortunately, for Deep Neural Networks transparency of decision making continues to be an issue and needs the AI/ML researchers to establish appropriate transparency measures.


What to do when things go wrong?


I blogged earlier about blameless post-mortems and how one gets to a point that they are able to do blameless post-mortems – by having an operational rigor and observability. This is more of a lessons learnt post about what do you do and what you don’t when things go wrong?

Focusing on the Who?

A lot of times focusing on the “who reported the issue?” can be focusing on a wrong thing. If a report comes from a penetration test or a bug bounty researcher or an internal security engineering resource you need to make sure that the impact and likelihood is clearly understood. There are sometimes where customers (who pay or intend to pay for your service) report problems – these are obviously more important.

Focusing on the How?

How a security issue gets reported is important. As examples where you learn about a security issue via a bug report(1), or where you learn about it via your own telemetry(2) or you learn about it on Twitter! There is a potential for legal ramifications in each of these cases and the risks might be different. When things become public without your knowledge where you were not notified and the information is now public you do have a role to instill confidence in your current customers. The best approach here tends to be of sticking to facts without any speculations. If you are working on incident say so. Don’t say we are most secure when you are the subject of a breach discussion especially because you already have the data that you are not as secure. Identification and Containment of the security issue are top priorities – do not take resources that are doing these actions away to ensure Public Relations are good – doing that will eventually make public relations bad! Involve lawyers in your communications process and mark communications with right legal tags (“attorney client privileged material”) so that if a litigation happens you can clearly demarcate evidence that can be or cannot be part of a discovery.

Focusing on the What?

“What” needs to be done has to be clear with the help of an incident manager. The incident manager is the person who is most well read, subject matter expert, and leads the response process. Having this single-threaded ownership of leading the incident is incredibly important. The role of the incident manager is to ensure they have all the information that they need to make decisions. This also streamlines the process of public relations, legal needs, incident cleanup (eradication and recovery), and helps with swift and focused decision-making. This can sometimes be crisis management depending on impact and otherwise it can be just another day in the Security operations office. The key trait here is focus and goal-based decision making. Adrenaline can run high, tempers can flare – that typically happens when you are unprepared to handle security incidents. The tempers and nervousness can be avoided by being proactive in doing tabletop exercises, incident dry-runs and having good runbooks. But all practice games do is prepare you for the real thing – the real thing is how you handle a true incident. Use the help of key stakeholders to derive best decisions – there often tend to be situations where no answer looks good – and therein comes the customer focus – if you focus on the well being of customers you will rarely go wrong.

Focusing on the Why?

Capture incident response logs in tickets and communications so all the timeline and actions get captured properly with documentation. After the recovery is completed, do a blameless post-mortem of how you got there. Ensure you put a timeline of taking on agreed-upon corrective actions on a timeline that is agreed and don’t waiver – this is a part of operational rigor one needs to follow to really avoid incidents from happening in future. Typically, the reason why issues happen is because something was not prioritized as it should have been. Reprioritize to make sure you can reassess. Sometimes the size of the incident makes it your reprioritization almost coerced – it’s ok to be coerced in that direction. You will find that coercion is simply an acceleration of the actions that you should have taken up earlier. No one is perfect – just come out of it better!

Focusing on the Where?

Where you discuss the issue is important. When sizable incidents happen discuss is openly with the business leaders so that full awareness and feedback is provided in “powerful forums”. This obviously does not mean that you break your attorney client privilege – it just means discuss with the highest leaders in a manner where action items, impact and post-mortem results are provided. This enables business to become resilient over time and develop confidence in the security teams. If you need to do public releases then ensure that lawyers read it and security SMEs read it as well as business leaders read it – only then do such releases. Don’t let the “left hand meet right” situation ever occur. This instills customer confidence in your process.


This was just an attempt for me to pen-down my thoughts as they appeared in my brain. I am sure I forgot a lot such as stress of handling, avoiding knee-jerk reactions, etc. but these are top most important things that I felt were necessary to share. Remember, incident handling gets better with practice – you want the practice be done in practice games not in the olympics! 🙂


The historical evolution of Cross-Site Request Forgery


Having been in application security for more than 2 decades now and officially completing my 18th year now of being meaningfully employed in that space there is just a lot of crud that I have gathered in my brain. Most of that is history of how things came about to be. That stuff is likely not interesting to most but I find it intriguing as to how some seemingly minor decisions of one software vendor can have massive impact to the web application security industry.

Oh the dreaded IE…
Internet Explorer 4, 5 and 6 that started in the Windows XP days (or even earlier, can’t recall) had a setting – the cookie jar was not shared – i.e., if you opened a new window to a site, you would have to log in again unless you used “Ctrl+N” key to open a new window from an existing session. Each new process would have its own cookie jar. For the uninitiated, the “cookie jar” is the internal browser storage of cookies. Cookies are random looking strings that indicate a “trust token” that a web server places in the browser. Since HTTP is a connectionless protocol, this cookie is what preserves the “state” and this is exactly what authorization decisions in HTTP context are typically based on. These cookies are stored in a web browser data storage called cookie jar where each cookie gets stored with the name, value, domain, path (and today, there are few other attributes but that wasn’t the case back in 2004-2005). The browser gets all these parameters from the HTTP response header Set-Cookie. Microsoft, the vendor for Internet Explorer, made a decision that each new window of IE should have its own set of stored cookies that were not shared. Mozilla Firefox and Google Chrome always had a shared cookie jar if I recall correctly.

Along came a Cross-Site Request Forgery (CSRF)…

Jesse Burns from iSecPartners (an NYC-based security consultancy that was acquired by NCC group) back then wrote a paper which I think was the seminal paper on Cross-Site Request Forgery. They called it “XSRF” back then because “XSS” was already in parlance back-then. Thereafter, there were presentations in 2006 about the same by Microsoft. The whole attack was simple. The victim has a browser tab open in which they are logged into a site that has issued that session a cookie value. Due to the browser same-origin policy (a concept that Netscape designed in 1995) that cookie would be resent by the browser in the request as a Cookie HTTP request header whenever an HTTP request was sent to the same domain, protocol (“scheme”) and port. There were few idiosyncracies of IE (which made it infamous back then) such as if the port number did not match IE did not complain and would think that access was allowed per the Same-Origin Policy (SOP). What does that mean? http://example.com and http://example.com:81 would be treated as the same origin! Weird right? It wasn’t the case with other browsers. This was also documented in Michal Zalewski’s book Tangled Web in 2011. Where am I going with this? So while IE did some weird things, it did one good thing – isolate cookie jars. So if you opened up a new window where the attacker ran a payload that sent a request to the site which had handed you a cookie, the new IE window would have no interesting cookies to share with that site – inadvertently protecting the user from being a victim to a CSRF issue. Yes, the hated IE protected the users from being victim to CSRF! Who would have thought? That’s how weird 2005 was 🙂

Fast forward…

Since all browser vendor today have concept of shared cookie jars because who doesn’t like opening new tabs of their favorite cloud consoles without having to re-login right? So what did we the people do? We came up with another attribute that could be added to a Set-Cookie HTTP response header – SameSite attribute which restricted the cookie from being sent unless the request originated from a page on the same site as the cookie issuer.

So there you have it… the history of SameSite and how one of the most hated browsers of the day (IE) did one good thing for users – protect them from CSRF! 🙂