
Security Reference Architecture – Technology

Jun 01, 2020

ThetaPoint’s Security Reference Architecture has abstracted the Security Operations Center’s (SOC) technology stack into four simple buckets: Input, Output, Transform, and Analyze (IOTA).

These categories focus on the critical services that the SOC performs. Many proprietary tools provide more than one of these services, but since vendor independence is one of our overarching goals, this division allows us to focus on the key features needed to support our production use cases. This simplified stack helps frame discussions about organizational responsibilities in the context of supporting the SOC as well as how the SOC can leverage parts of the enterprise technology stack to meet its mission.

The Baseline Technology stack shows a natural progression of data through the SOC. Data enters through the ingest pipeline (Input), is manipulated and transported through the utility layers (Output and Transform), and lands in a robust workflow (Analyze). This emphasizes the idea that all source data should be utilized and valuable to support a production workflow. The advancement of data also maps to a progression of added value. At each layer, a transformation of data converts specific records into something more valuable and easily digestible.

All four pipeline stages (Input, Output, Transform, Analyze) are supported by an underlying representation of the entities being monitored (infrastructure, business domain objects, applications, identities, etc.). This Model is critical for automatically performing enrichment and data reduction tasks. A consistent representation of the monitored entities ensures that naming conventions are reliable, that high-criticality entities are tagged as such, and that low-value data can be dropped or aggregated. The technologies used in a SOC and their primary features and functionalities also map to the four stages.

Input (Ingest)

Data ingest tends to be what most people perceive as the Security Monitoring service. Many organizations have overfitted their security operations to the specific sensor and SIEM platforms they have purchased. That there are many ways to consume security data should come as no surprise. It should also not be a surprise that the security technology space changes quickly, and our expectation is that the value of today’s product investments will inevitably decrease over time. Given this certainty, you should be well positioned to swap source systems in and out as your business needs change. In our experience, we’ve found it useful to corral data collection into a consolidated infrastructure that can cover a variety of scenarios.

Configuration churns frequently at the ingest layer due to transient and operational change. A thin layer of abstraction is often needed to mitigate this churn. This can take the form of a well-known naming convention for receivers, configuration conventions for pollers and collectors, or standard scoping rules for data sweeps. Load balancing can be used in either direction to facilitate this data flow. By standardizing the mechanics of the different data collection modes and their abstractions, configuration churn can be minimized, and the data flow can be better instrumented.

Modes of Data Collection

IT systems use many APIs, protocols, formats, and strategies to generate usable event data. It can get confusing if your approach to Windows logging is wildly different from your approach to application logging when an automated analytical service needs events from both. Pulling all collection into a common layer can be simplified if you think of all collection as falling into one of four Modes:

Receive
- Network services configured at the monitored device to receive event and metric data
- Syslog, Windows Event Forwarder, statsd, graphite
Poll
- Services that periodically wake up and query a specific resource
- Periodic database queries, active Windows event log queries
Collect
- Pollers that collect from many predefined resources or temporary spools
- Web log collection
Sweep
- Collectors that operate on environment-wide scopes where the quantity and content of data is not known beforehand
- Vulnerability scans
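One way to standardize the four modes is to describe every collector with the same small configuration shape. The following is a minimal Python sketch under stated assumptions: the `CollectorConfig` class, its fields, the example scopes, and the `soc-<mode>-<source>` naming convention are hypothetical illustrations, not part of the Reference Architecture.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    RECEIVE = "receive"   # listens for pushed data (e.g. syslog)
    POLL = "poll"         # periodically queries one specific resource
    COLLECT = "collect"   # gathers from many predefined resources
    SWEEP = "sweep"       # scans an environment-wide scope


@dataclass(frozen=True)
class CollectorConfig:
    mode: Mode
    source: str           # hypothetical short source label
    scope: str            # endpoint, resource list, or network range

    @property
    def name(self) -> str:
        # Hypothetical naming convention: soc-<mode>-<source>
        return f"soc-{self.mode.value}-{self.source}"


# Example entries; every scope value here is made up for illustration.
collectors = [
    CollectorConfig(Mode.RECEIVE, "syslog", "0.0.0.0:514"),
    CollectorConfig(Mode.POLL, "appdb", "audit_log table"),
    CollectorConfig(Mode.COLLECT, "weblogs", "/var/log/httpd/*"),
    CollectorConfig(Mode.SWEEP, "vulnscan", "10.0.0.0/8"),
]

by_mode = {c.mode: c.name for c in collectors}
```

With a shared shape like this, swapping a source system means changing one entry rather than reworking a bespoke collection path.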

Infrastructure Abstraction

Data ingest should be pervasive enough that it should be easy to gather any event you need to support a use case. This could require some base technology investment in load balancing, service discovery, network topology, or configuration management to expose the data collectors to the infrastructure they need to service. This is the sort of service that should be provided by an IT department, but its importance as a SOC enabler cannot be overstated.

Output

Outbound Data Flows

Finally, your Ingest stack can be further simplified by dividing its outputs into Events or Metrics. Events are records of a specific action happening on a specific system. Metrics are measurements or summaries of events over time. Events tend to provide most of the raw material consumed in the automated analytical services, but metrics can be just as important for analysis as they are for keeping an eye on the overall system health.
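The relationship between the two output types can be illustrated with a short Python sketch. The `Event` fields and the per-minute bucketing are assumptions chosen for the example; the point is only that a metric is a summary computed over events.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    host: str
    action: str


def events_per_minute(events):
    """Summarize events into a metric: event count per (host, minute)."""
    counts = Counter()
    for e in events:
        minute = e.timestamp.replace(second=0, microsecond=0)
        counts[(e.host, minute)] += 1
    return counts


events = [
    Event(datetime(2020, 6, 1, 12, 0, 5), "web01", "login_failed"),
    Event(datetime(2020, 6, 1, 12, 0, 40), "web01", "login_failed"),
    Event(datetime(2020, 6, 1, 12, 1, 10), "web01", "login_ok"),
]
metric = events_per_minute(events)
```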

Since tuning is essential to reducing workload and increasing sensitivity to critical alarms, we have found it most helpful to lift the service into its own abstraction and to make it more accessible to SOC analysts within their critical path workflows. This tends to manifest itself as simple white/black lists or more organization-specific model information (“ignore everything from guest network ranges”, “sound the alarms for access violations on critical data stores”).

In the Reference Architecture, tuning is so critical that it’s wired into everything that passes to the SOC. Every automated alert should have some abstraction for ignoring events or boosting their criticality. This concept should be a mouse click away for the front-line analysts so that the knowledge gained in a case adjudication can propagate as quickly as possible into the rest of the SOC.
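A tuning abstraction of this kind can be very small. The sketch below is one possible shape in Python, assuming alerts are dictionaries with `src_ip` and `dst_ip` fields; the `TuningRule` class, the `apply_tuning` function, and the network ranges are all hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class TuningRule:
    field: str          # alert field to test, e.g. "src_ip" or "dst_ip"
    network: str        # CIDR range the rule applies to
    action: str         # "ignore" or "boost"
    note: str = ""      # analyst's reason, kept for later review


def apply_tuning(alert, rules):
    """Return the tuned alert, or None if a rule says to ignore it."""
    tuned = dict(alert)
    for rule in rules:
        value = alert.get(rule.field)
        if value is None:
            continue
        if ipaddress.ip_address(value) in ipaddress.ip_network(rule.network):
            if rule.action == "ignore":
                return None
            if rule.action == "boost":
                tuned["severity"] = "critical"
    return tuned


# Made-up ranges standing in for a guest network and critical data stores.
rules = [
    TuningRule("src_ip", "192.168.50.0/24", "ignore", "guest network"),
    TuningRule("dst_ip", "10.20.0.0/16", "boost", "critical data stores"),
]

guest = apply_tuning(
    {"src_ip": "192.168.50.7", "dst_ip": "10.1.1.1", "severity": "low"}, rules)
store = apply_tuning(
    {"src_ip": "172.16.0.9", "dst_ip": "10.20.3.4", "severity": "low"}, rules)
other = apply_tuning(
    {"src_ip": "172.16.0.9", "dst_ip": "10.1.1.1", "severity": "low"}, rules)
```

Because the rules are plain data, an analyst-facing interface could append one at the end of a case adjudication without a policy change at the source system.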

Message Bus

A message bus is necessary to allow all the technologies in your SecOps environment to communicate with each other in a cohesive manner. Applications publish and subscribe to a high-throughput, low-latency, fault-tolerant bus, which ensures that events are delivered and received accordingly.

A perfect application of this technology is the convergence of SIEM and Big Data. As more and more logs are collected, finding an efficient and cost-effective way to stream this data to the appropriate purpose-built application is a tremendous challenge. This data is relevant as Threat Researchers and Incident Response teams look for Advanced Persistent Threat (APT) or low and slow attacks within their organization. In that model, the more data made available to the team, the better the opportunity to identify outliers or abnormal behavior and activity. Prior to the introduction of Hadoop and cloud computing, this was an expensive proposition, both in terms of network bandwidth constraints and the cost of storage and compute power. Today, logs can be collected once and applications can subscribe to the events that matter to them, saving a tremendous amount of time and energy, not to mention network and compute costs.
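The collect-once, subscribe-many pattern can be sketched with a toy in-process bus in Python. This is an illustration of the publish/subscribe idea only: the `MessageBus` class and topic names are hypothetical, and a production deployment would rely on a dedicated broker for throughput and fault tolerance.

```python
from collections import defaultdict


class MessageBus:
    """Toy in-process bus: topics map to lists of subscriber callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Every subscriber to the topic receives the same message.
        for callback in self._subscribers[topic]:
            callback(message)


bus = MessageBus()
siem_inbox, lake_inbox = [], []

# The SIEM subscribes only to authentication events;
# the data lake subscribes to everything.
bus.subscribe("auth", siem_inbox.append)
bus.subscribe("auth", lake_inbox.append)
bus.subscribe("netflow", lake_inbox.append)

bus.publish("auth", {"user": "alice", "result": "failure"})
bus.publish("netflow", {"src": "10.0.0.5", "bytes": 1200})
```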

Transform

Data rarely arrives out of the Input layer completely pure and ready to use. At a minimum, there are usually issues related to incompatible formats and transport protocols that make specific data flows less than useful or difficult to integrate.

Furthermore, many events arrive lacking critical contextual information that can make workflows time consuming and frustrate analysis. A classic example of this would be IDS/IPS alarms that only arrive with a signature ID, and source/destination addresses. Discovering useful metadata related to the asset endpoints based purely on IP address can be a time-intensive process to say the least.

We’ve found it useful to build a consolidated platform that can perform basic processing before transport to the workflow components. This utility platform can drop or aggregate low-signal events, join events with contextual data, or even do basic analytics while also serving as a transport substrate. By doing commodity processing, transportation, and integration, this infrastructure lowers the amount of effort needed to compute automated analytics or perform manual analysis workflows.
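As a rough illustration of this kind of commodity processing, the Python sketch below drops a low-signal event and joins the remaining IDS alarm with asset context by destination address. The `ASSET_MODEL` lookup, the signature IDs, and the field names are all hypothetical.

```python
# Hypothetical model extract keyed by IP address.
ASSET_MODEL = {
    "10.20.3.4": {
        "hostname": "db-prod-01",
        "owner": "finance",
        "criticality": "high",
    },
}

# Hypothetical signature IDs that have been judged low-signal.
LOW_SIGNAL_SIGNATURES = {1000001}


def transform(event):
    """Drop low-signal events and join the rest with asset context."""
    if event["signature_id"] in LOW_SIGNAL_SIGNATURES:
        return None
    enriched = dict(event)
    # Unknown addresses get an empty context rather than failing.
    enriched["dst_asset"] = ASSET_MODEL.get(event["dst_ip"], {})
    return enriched


noisy = transform(
    {"signature_id": 1000001, "src_ip": "10.0.0.5", "dst_ip": "10.20.3.4"})
useful = transform(
    {"signature_id": 2000002, "src_ip": "10.0.0.5", "dst_ip": "10.20.3.4"})
```

The analyst receiving `useful` no longer has to look up which system sits behind the destination address.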

Model

The systems model is the most essential ingredient for the Baseline Technology stack. The Model provides a common set of abstractions and representations for all the monitored elements, their interactions, and relationships. All components within the Baseline Technology stack consume the Model in a machine-readable format. This universal frame of reference makes observed event data more consistent and simplifies the interfaces between stack components.

Whereas the Input, Output, and Analyze technologies are usually concrete implementations of services on a computing platform, the Model is purely an abstraction meant to represent what is being monitored and its context. We’ve found that more work put into modeling often translates into more value extracted from the workflows (and thus from the SOC overall).

We have found it valuable to synthesize the Model itself from streams of event data. We use the Input and Transform components to transform data repositories, such as Configuration Management Databases (CMDBs), enterprise directories, vulnerability scans, and spreadsheets, into timestamped event records. This allows your operations team to use powerful event processing features and logic to create a living model from a constant stream of incremental updates. Compared to “one-shot” loads of network maps, the event-based approach ensures both recency and high fidelity, and it allows your organization to watch model data as though it were a critical event flow.
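The idea of folding incremental updates into a living model can be sketched in a few lines of Python. The record layout (`timestamp`, `entity`, `source`, `attributes`), the last-write-wins merge, and the placeholder patch identifier are assumptions made for this example, not a description of any specific product.

```python
from datetime import datetime


def build_model(updates):
    """Fold timestamped update records into the latest view of each entity."""
    model = {}
    # Apply updates oldest first so newer attributes overwrite older ones.
    for u in sorted(updates, key=lambda u: u["timestamp"]):
        entity = model.setdefault(u["entity"], {})
        entity.update(u["attributes"])
        entity["last_seen"] = u["timestamp"]
    return model


updates = [
    {"timestamp": datetime(2020, 6, 1, 9), "entity": "db-prod-01",
     "source": "cmdb", "attributes": {"owner": "finance"}},
    {"timestamp": datetime(2020, 6, 2, 9), "entity": "db-prod-01",
     "source": "vuln_scan", "attributes": {"missing_patch": "PATCH-PLACEHOLDER"}},
    {"timestamp": datetime(2020, 6, 1, 8), "entity": "db-prod-01",
     "source": "directory", "attributes": {"owner": "it-ops"}},
]
model = build_model(updates)
```

Because each update carries a timestamp, stale sources are easy to spot: an entity whose `last_seen` value stops advancing is itself a signal worth watching.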

There are very few vendor products that provide this service. Even well-executed CMDBs can fail to represent the types of relationships a SOC needs to model to perform its analytical functions (such as who can access what, how traffic should flow, or what systems are currently lacking a specific security patch). As such, even though an enterprise CMDB can perform many types of modeling functions, we have found it more useful to lift the Model into a separate repository that has its own SOC-supporting features and governance.

Analyze

We’ve banished the distinction between SIEMs, data lakes, task tracking, and analytics and have lumped it all into the heading of ‘Analyze’. This keeps the focus on what is being done (the analytical work) and erases the barriers that often arise because of product silos. Alert generation and tuning are the most essential functions within this stack, followed by the workflow support utilities that help track open cases, communicate events to analysts, or execute automated actions on their behalf.

Automation

The frontline defense for your SOC is its automated analytical function. For many years, this space has been the domain of proprietary SIEM software that makes it possible to comb through the billions of events looking for the proverbial needle in a haystack. In the Reference Architecture, it is just a socket into which any type of analytical system can be plugged. Many automated analytics are simple filters on field data or joins with an external reference table. Advanced organizations can have more sophisticated online software backed by machine learning and artificial intelligence. They should both have a place to provide their value without having to think about each analytical service as a separate technology domain.

The Automated Analytics also provide a couple of key mapping functions. Their results should be alerts that have been mapped into a common threat taxonomy provided by the model. The entities referenced in the alerts should also be mapped onto the model so that related asset and identity information is contained within the alert itself. This ensures that the alerts are consistently marked, and that relevant asset and identity information is readily available for an analyst to make a speedy adjudication. Pushing this enrichment step down to an automated layer saves time and prevents errors compared to performing it manually in a later investigation step.
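A simple analytic that performs both mapping functions might look like the following Python sketch. The taxonomy label, the identity lookup, the threshold, and the event fields are hypothetical; the sketch only shows an alert leaving the automated layer already categorized and enriched.

```python
# Hypothetical extracts from the Model.
THREAT_TAXONOMY = {"login_failed": "credential-access"}
IDENTITY_MODEL = {"alice": {"department": "finance", "privileged": True}}


def failed_login_analytic(events, threshold=3):
    """Filter-and-count analytic that emits model-mapped alerts."""
    counts = {}
    for e in events:
        if e["action"] == "login_failed":
            counts[e["user"]] = counts.get(e["user"], 0) + 1
    return [
        {
            "category": THREAT_TAXONOMY["login_failed"],
            "user": user,
            "count": n,
            # Identity context travels inside the alert itself.
            "identity": IDENTITY_MODEL.get(user, {}),
        }
        for user, n in counts.items()
        if n >= threshold
    ]


events = [
    {"user": "alice", "action": "login_failed"},
    {"user": "alice", "action": "login_failed"},
    {"user": "bob", "action": "login_failed"},
    {"user": "alice", "action": "login_failed"},
]
alerts = failed_login_analytic(events)
```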

Tuning

Anyone who has operated a security product for any length of time knows that tuning is essential to extracting value from your investment. However, we have observed that many organizations can only tune security events by policy at the source system. This style of tuning often rises to the level of a major change and can lock your operational effectiveness to the pace of organizational bureaucracy.

To help illustrate this point, let us assume that you have correctly tuned an existing Windows audit policy for your production environment. The marketing team introduces a new application that generates a new Windows event code. This may not have triggered any analytics in a development environment, but once live in production, the application generates such a high volume of events that your analysts become overwhelmed with alerts. If your organization can only tune the event flow by performing a domain-wide policy change, you may be forced to endure the extra workload (both in computing and analyst hours) for weeks or months as you socialize the change through the bureaucracy of your change control processes.

Workflow

The workflow layer has two primary functions: presenting relevant information for analysis and capturing outcomes from a case adjudication. These outcomes can come in the form of knowledge gained during analysis (“event code XXX is critical on network YYY”) or actions initiated by an analyst (“quarantine system ZZZ for recovery”).

Analysts spend most of their time on a workflow platform. It should make common tasks easier and faster. The most common tasks are queries to event repositories and visualization of those events. Analysis grinds to a halt when those critical paths slow down. It’s not entirely obvious that repository query performance can have an impact on your organization’s risk posture, but there is a direct correlation between enabling the investigation and analytical tasks performed by humans and your SOC’s overall effectiveness.

Another aspect of the workflow layer is that it should reduce analytical uncertainty. This can be enforced by preferring dropdown-style event codes over free text inputs or pre-computing related events before presenting the case for analysis. We’ll have much more to say about making workflow technology more effective when we describe our Analytical Use Case Development Process soon.

Finally, the workflow layer should encapsulate and abstract the possible actions your analysts can take after case adjudication. This can be as simple as providing a common set of tools and scripts to kick off incident response tasks, or something more sophisticated like an API of serverless functions provided by a vendor platform. Encapsulating these actions creates an audit trail and provides critical information when it comes to determining the amount of work that is being done as well as the value provided by your SOC.
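Encapsulating actions behind a single entry point is what makes the audit trail possible. The Python sketch below assumes a hypothetical registry (`ACTIONS`), a placeholder `quarantine` action, and an in-memory `AUDIT_LOG`; a real deployment would call actual response tooling and persist the log.

```python
from datetime import datetime, timezone

AUDIT_LOG = []
ACTIONS = {}


def action(name):
    """Register a response action under a stable name."""
    def register(func):
        ACTIONS[name] = func
        return func
    return register


@action("quarantine")
def quarantine(system):
    # Placeholder: a real implementation would call a response tool here.
    return f"{system} quarantined"


def run_action(name, analyst, case_id, **kwargs):
    """Execute a registered action and record who ran it for which case."""
    result = ACTIONS[name](**kwargs)
    AUDIT_LOG.append({
        "time": datetime.now(timezone.utc),
        "analyst": analyst,
        "case_id": case_id,
        "action": name,
        "args": kwargs,
        "result": result,
    })
    return result


outcome = run_action("quarantine", analyst="jdoe", case_id="CASE-1", system="ZZZ")
```

Counting entries in the log by action or by analyst gives a first approximation of the work being done by the SOC.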

IOTA in Action

To bring everything together, we display a sample IOTA Architecture that focuses on modernizing a legacy ArcSight implementation while adding complementary technology into a cohesive SecOps architecture.

For more examples on the various ways the SRA can be utilized, please visit our IOTA in Action Blog Series.

Key Takeaways

Simplifying the tech stack into four boxes helps frame discussion of their complex interactions in terms of the business services they provide.
The simplified technology model also makes it easier to consider service composition from existing enterprise technologies or replacement of existing technologies when investments are made.
The “forcing function” of your SOC always rests in your workflows – constraints there degrade your investments, and increased focus on workflow enablement will increase the value of your SOC.

What’s Next

We have published the framework in high-level detail on our Blog and hope to engage you in a collaborative discussion of the challenges you are experiencing and the solutions we have developed. Please contact us to continue the dialog.

SRA – Solution: https://www.theta-point.com/solutions/security-reference-architecture/
SRA – Blog Series: https://www.theta-point.com/blog/category/security-reference-architecture/
SRA – IOTA in Action Blog Series: https://www.theta-point.com/blog/category/security-reference-architecture/iota-in-action/

About ThetaPoint, Inc.

ThetaPoint is a leading provider of strategic consulting and managed security services. We help clients plan, build, and run successful SIEM and Log Management platforms and work with the leading technology providers to properly align capabilities to clients’ needs. Recognized for our unique technical experience and our ability to quickly solve complex customer challenges, ThetaPoint partners with some of the largest and most demanding clients in the commercial and public sectors. For more information, visit www.theta-point.com or follow us on Twitter or LinkedIn.
