January 24, 2023 started off as a normal Tuesday morning in the U.S. equity markets. Trading firms, retail investors, and other market participants entered their orders ahead of the opening bell at the New York Stock Exchange. As the pre-market session came to a close, the cutoff for orders for the opening auction came and went. The Designated Market Makers prepared for the start of the core session at 9:30 AM Eastern time, ready to provide liquidity on both the buy and sell side based on the opening print from the auction.
But on this morning, when the opening bell rang at 9:30 AM, the opening auction didn’t happen for thousands of stocks listed on the NYSE. The knock-on effects were massive: without an official opening print, millions of dollars of stock trades entered for the open never executed; market makers lacked a reference price to stabilize the market in individual securities; price bands (known as Limit Up-Limit Down) couldn’t be set without a reference price from which to calculate them. In the first few seconds and minutes of trading, prices fluctuated wildly, in some cases triggering automatic pauses, or halts, for certain stocks.
The chaos subsided about twenty minutes later. Within twenty-four hours, Intercontinental Exchange, which owns NYSE, had identified a root cause: a misconfigured setting in the disaster recovery system, the result of human error. Bloomberg’s reporting clarified NYSE’s explanation, stating that the DR system, located in Chicago, was left partially turned on after routine nightly testing.
It’s easy to classify this event as just a financial systems hiccup or a failure of IT operations. Look more closely, though, and you’ll find lessons for cyber security practitioners as well. Here are four takeaways from the NYSE opening auction glitch.
Lesson 1: Test your disaster recovery systems regularly.
As reported by Bloomberg, NYSE runs an internal DR test nightly, bringing their systems online and then shutting them down (except on that one Monday night). That’s on top of the industry-wide yearly tests with the major exchanges and market participants. All of this is in response to government regulation, dictated by the U.S. Securities and Exchange Commission (SEC) under what’s known as “Regulation Systems Compliance and Integrity,” or Reg SCI.
Whether your DR requirements come from the government or the C-suite, the more you test your DR plan, the more your engineering teams become comfortable executing it. They know what to expect, and no one is surprised working through checklists and playbooks. When you’re in crisis mode, you don’t need the added stress of an unfamiliar process.
It’s not uncommon for teams to cite a long list of reasons why now isn’t a good time to plan or execute a DR test:
- We operate 24/7 and can’t take the downtime
- We don’t have enough people to keep the production system running *and* separately test the DR
- It’s not a full test, so it’s not worth the effort
- We’re worried we might break something
The fear and worry associated with a DR test exist because the DR hasn’t been tested. The easiest way to put that fear and worry to bed? Regularly test your DR. It’s the only way to know that your plan is correct, well-structured, and easily executed by your technical teams.
The lesson for security teams? Testing your DR is also the only way to know that you can recover from a security incident, like ransomware.
And remember: it’s not a DR plan if you’ve never tested it at all.
Lesson 2: Make sure your business processes are well-documented.
There’s a fascinating quote in the Bloomberg article, based on conversations with unnamed sources at NYSE:
“NYSE executives spent hours pinpointing the problem until they were confident there wouldn’t be further fallout, said the people, who spoke on the condition they not be identified discussing the internal matter.” (Bloomberg, January 25, 2023)
When an anomaly occurs in a technology system in the 2020s, the first thought is “Have we been hacked?” It’s not an unreasonable concern, given the list of high-profile intrusions that piled up in the previous decade. So when you’re able to identify a technical root cause, rather than a security incident, the executive team wants to be absolutely sure there isn’t another shoe waiting to drop.
I subscribe to Occam’s Razor for troubleshooting technical problems. More often than not, the hypothesis built on the fewest assumptions is not only a good starting point, but frequently the correct answer. Unless there’s an obvious security angle — a ransomware screen, a defaced website, etc. — to an incident, start with technical root causes, which generally make fewer assumptions.
Investigating a technical issue is significantly easier when your business processes are documented. In NYSE’s case, Pillar (their trading platform) probably generates haystacks’ worth of logs to cover compliance and audit requirements. Logs likely led the technical teams to the root cause — “knock-on effects from leaving the DR turned on” — but wouldn’t explain why the software was programmed to short-circuit the opening auction if the markets were “open.”
Documentation of business processes, however, would explain why the software was designed that way. If the purpose of the opening auction is to establish an orderly open to the core session, then it makes sense that the opening auction can only happen when the market is closed. Presumably the software developers took that business process and encapsulated it in code, and, contrary to popular belief, computers generally only do what we humans instruct them to do.
Why should the security team care? Well-documented business processes can quickly answer questions about the behavior of systems implementing those processes. When you’re dealing with a potential security incident, time is your most valuable asset, and the executive team wants answers with confidence. Having business process documentation on hand makes that happen.
Lesson 3: Your ITOps team should be monitoring your DR site.
Your IT Operations (ITOps) team exists, in part, to monitor your infrastructure, systems, and critical applications. That monitoring comes from event logs and metrics generation, gathered across all of your assets. When standing up your disaster recovery site, it’s important to add your DR systems to your ITOps monitoring solution, whether it’s your existing platform or a DR-specific one.
A cold DR site won’t generate logs and metrics when everything’s powered off, but that’s exactly why you should be actively monitoring those systems. If you start to see DR systems come online and you’re not executing your DR plan in response to a test or a real crisis, it’s time to pick up the phone and ask questions.
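In practice, "any activity from a cold DR host outside a sanctioned window" is a simple rule to encode. Here’s a minimal sketch in Python, assuming a hypothetical host inventory and exercise calendar (the hostnames and data shapes are illustrative, not from any real environment):

```python
from datetime import datetime

# Hypothetical inventory of cold DR hosts; in reality this would come
# from your CMDB or asset management system.
DR_HOSTS = {"web01.dr.example.com", "db01.dr.example.com"}

def unexpected_dr_activity(events, exercise_windows):
    """Return events emitted by DR hosts outside any scheduled exercise.

    events: iterable of (timestamp, hostname) tuples
    exercise_windows: list of (start, end) datetime tuples covering
        approved DR tests or real DR activations
    """
    def in_window(ts):
        return any(start <= ts <= end for start, end in exercise_windows)

    # A cold site should be silent: any event from a DR host outside a
    # sanctioned window is worth a phone call.
    return [(ts, host) for ts, host in events
            if host in DR_HOSTS and not in_window(ts)]
```

The same logic maps directly onto an alert in most monitoring platforms: a scheduled search over DR-host events, suppressed during known exercise windows.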
A warm DR site, on the other hand, can be tricky for your ITOps team to monitor. You want to make sure that core infrastructure, like networking devices and virtualization hosts, is running and healthy. But other production systems — and this may have been the case with NYSE’s glitch — shouldn’t be running except during a DR exercise or event. While it results in a more complex monitoring baseline, it’s important for your ITOps team to understand how the DR site should look in your monitoring platform, and set alarms and thresholds accordingly.
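That more complex warm-site baseline boils down to two lists: what should always be up, and what should be up only during a DR exercise. A sketch, with hypothetical host names standing in for your own inventory:

```python
# Hypothetical warm-site baseline: core infrastructure stays on;
# application tiers stay off outside DR exercises.
EXPECTED_UP = {"fw01.dr", "esxi01.dr", "esxi02.dr"}   # always-on core
EXPECTED_DOWN = {"app01.dr", "app02.dr"}              # up only during DR

def baseline_violations(hosts_seen_up):
    """Compare observed hosts against the warm-site baseline.

    Returns hosts that are up but shouldn't be (a possible leftover
    from last night's test) and core hosts that are down but shouldn't be.
    """
    up = set(hosts_seen_up)
    return {
        "unexpectedly_up": sorted(up & EXPECTED_DOWN),
        "unexpectedly_down": sorted(EXPECTED_UP - up),
    }
```

Both deviations matter: an unexpectedly-down firewall is a reliability problem, while an unexpectedly-up application server is exactly the kind of leftover that bit NYSE.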
Avoid surprises caused by your DR site: make sure your ITOps team is monitoring it, at all times. They may spot a security issue before your Security Operations (SecOps) team does.
Lesson 4: Your SecOps team should [also] be monitoring your DR site.
Your Security Operations Center (SOC) has one big advantage in this scenario: its primary job is to spot things that look weird.
Services running at a DR site during normal business operations? Depending on how your systems are architected, that might be weird. And the analytics that your SOC relies on to do its job — to identify, assess, and respond to security incidents — will help you spot weird things happening in your network.
ITOps focuses on performance and reliability. That’s great for wringing out the last bit of efficiency from a server or plugging a memory leak in a Java virtual machine (JVM), but it won’t necessarily pick up on the fact that someone left a server running from last night’s DR test. A security analyst, on the other hand, might look at the four-week trend of network traffic within the DR datacenter, stacking by day of the week. Suddenly, today’s traffic appears elevated compared to the three previous Tuesdays, as if some process were still running…
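The analyst’s day-of-week comparison can be reduced to a simple rule: compare today’s total against the same weekday in prior weeks and flag a large deviation. A minimal sketch, with the 1.5x threshold chosen purely for illustration:

```python
from statistics import median

def elevated_vs_prior_weekdays(today_bytes, prior_weeks_bytes, threshold=1.5):
    """Flag today's DR-site traffic if it exceeds the median of the same
    weekday over prior weeks by more than `threshold` times.

    prior_weeks_bytes: totals for the same weekday, e.g. the last
        three Tuesdays, as in the scenario above
    """
    baseline = median(prior_weeks_bytes)
    return today_bytes > threshold * baseline, baseline
```

A real detection would add smarter baselining (seasonality, holidays, standard deviations rather than a fixed multiplier), but even this crude check would light up if a DR server were accidentally left running overnight.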
There are products, like Cisco Secure Network Analytics, that have this kind of functionality built-in, albeit with limited knobs and switches to control dynamic thresholds for alerting. Many security information and event management (SIEM) platforms, such as Splunk Enterprise, support custom analytics with more granularity. Take this hypothetical Splunk Processing Language (SPL) search for web server logs at a DR site located in Chicago:
SPL> index=rjf-services* sourcetype=apache host="*.cermak.rjf.com" | timechart span=15m count by host
It’s a simple query — the total number of Apache events by server running at the Cermak datacenter, in 15-minute splits — but your SOC can start to build analytics around it quickly, perhaps leveraging the Splunk Machine Learning Toolkit for outlier detection. Or, for a more advanced approach, take advantage of the speed of a metrics store or a summary index (either of which your ITOps team may already have implemented in Splunk), and run trend analysis over weeks’, months’, or even years’ worth of data in seconds.
Think of SecOps as an ITOps force multiplier: sometimes, an operations issue can look like a security issue.
NYSE’s glitch may not have been the result of a security incident, but in this day and age, technical issues are scrutinized for any possible security nexus. And even when the root cause is determined to be technical in nature, the security team can expect to be brought in for their own insight and analysis. Organizations and cyber security practitioners can learn from this event and improve their technical and security posture around disaster recovery plans and systems.