Sessionize Heterogeneous Event Types With This One Weird Trick

Context makes all of the difference for alert disposition and adjudication. Having all related events readily available shortens the decision cycle and eases the cognitive load associated with finding sufficient evidence to inform judgement. I have already discussed how important it is to front-load event processing inside the utility stack, but I have not yet shown how to connect those two goals in practice.

How I Learned to Stop Worrying and Love the community_id

Analysts typically rely on comparisons of equality (or approximate equality) to stitch more than one record together and form context around an alert. When you write event_a.src_ip == event_b.src_ip or event_c.src_user CONTAINS(event_d.dst_user), you have Boolean predicates that filter different records in or out, potentially from different instruments, to group things together into a single picture.

Learning how to key and pivot across different event sources can frustrate even the most experienced analysts. Even when parsed into a well-structured representation and a formally-specified schema, at the end of the day, what one vendor calls a destination_address, another might call target_ip. Finding these algebraic relations between record types and fields with Boolean predicates becomes combinatorially complex (O(n!m!)), and increasingly brittle.
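To make the brittleness concrete, here is a minimal sketch of an equality predicate across two sources. The vendor labels, the "vendor" key, and the sample records are all hypothetical; only the two field names come from the paragraph above.

```python
# Hypothetical mapping from event source to the field that holds the destination IP.
# Every new vendor (or vendor schema change) means another entry to maintain.
DST_IP_FIELD = {
    "vendor_a": "destination_address",
    "vendor_b": "target_ip",
}


def dst_ip(event: dict) -> str:
    """Look up the destination IP using the per-vendor field name."""
    return event[DST_IP_FIELD[event["vendor"]]]


def same_destination(event_a: dict, event_b: dict) -> bool:
    """Boolean predicate: do both events point at the same destination?"""
    return dst_ip(event_a) == dst_ip(event_b)


# Made-up records using documentation-range addresses.
alert = {"vendor": "vendor_a", "destination_address": "203.0.113.7"}
flow = {"vendor": "vendor_b", "target_ip": "203.0.113.7"}
print(same_destination(alert, flow))  # True
```

Multiply this by every field you pivot on and every pair of sources, and the maintenance burden adds up quickly.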

Instead of chasing vendor schema changes, parser breakage, and other predictable sources of errors that may interfere with your precious alerting content, why not synthesize a data field that can be consistently calculated over large groupings of events? Enter community_id, a consistent hash that can be computed over any record containing an IP 5-tuple.

Wait a Minute, What's a 5-tuple?

The 5-tuple is any representation of an IP connection that contains the IP protocol number, and source/destination address/port. It's called the 5-tuple because that's 5 fields (go figure). Most records contain portions of the complete 5-tuple, but even incomplete information can be supplemented through inference or external lookups.
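As a rough sketch, the 5-tuple can be modeled as a small record type. The FiveTuple name, the input keys, and the helper below are illustrative assumptions, not part of any spec; the protocol numbers are the standard IANA values.

```python
from typing import NamedTuple, Optional

# IANA protocol numbers for the most common cases.
PROTO_NUMBERS = {"icmp": 1, "tcp": 6, "udp": 17}


class FiveTuple(NamedTuple):
    proto: int   # IP protocol number
    saddr: str   # source address
    sport: int   # source port
    daddr: str   # destination address
    dport: int   # destination port


def five_tuple_from_record(record: dict) -> Optional[FiveTuple]:
    """Build a FiveTuple from a record (hypothetical keys), or None if a field is missing."""
    proto = record.get("proto")
    if isinstance(proto, str):
        # Simple inference: translate a protocol name into its number.
        proto = PROTO_NUMBERS.get(proto.lower())
    fields = (
        proto,
        record.get("saddr"),
        record.get("sport"),
        record.get("daddr"),
        record.get("dport"),
    )
    if any(field is None for field in fields):
        return None  # incomplete; would need an external lookup to fill in
    return FiveTuple(*fields)
```

A record that returns None here is a candidate for the inference or external lookups mentioned above.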

Alright, How Do I Compute the community_id?

Corelight published a reference spec for version 1 that fits in half a page of pseudocode, but if that's still too much for you, the basic gist is this:

"1:" + base64(sha1(seed + saddr + daddr + proto + 0 + sport + dport))

Each address and port field is packed in network byte order, and the seed is chosen as a site-specific integer nonce, which defaults to 0. The spec also orders the two endpoints before hashing, so both directions of a flow produce the same value. Since SHA-1 crams everything down into a single 20-byte representation, the general algorithm works on both IPv4 and IPv6 flows.
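Here is a minimal Python sketch of that formula for TCP and UDP flows. It is not the reference implementation: the function name and signature are my own, and it deliberately skips the special port handling the spec defines for ICMP. It packs the seed as a 16-bit value and applies the spec's endpoint ordering (smaller address/port pair first), so both directions of a flow hash identically.

```python
import base64
import hashlib
import ipaddress
import struct


def community_id_v1(saddr: str, daddr: str, sport: int, dport: int,
                    proto: int, seed: int = 0) -> str:
    """Sketch of Community ID v1 for TCP/UDP (no ICMP handling)."""
    # Pack addresses into network byte order (4 bytes for IPv4, 16 for IPv6).
    src = ipaddress.ip_address(saddr).packed
    dst = ipaddress.ip_address(daddr).packed

    # Order the endpoints so the "smaller" (address, port) pair comes first.
    if (src, sport) > (dst, dport):
        src, dst = dst, src
        sport, dport = dport, sport

    data = (
        struct.pack("!H", seed)                          # 2-byte seed
        + src
        + dst
        + struct.pack("!BBHH", proto, 0, sport, dport)   # proto, pad byte, ports
    )
    digest = hashlib.sha1(data).digest()
    return "1:" + base64.b64encode(digest).decode("ascii")


# Both directions of the same TCP flow yield the same ID.
forward = community_id_v1("10.0.0.1", "10.0.0.2", 49152, 443, proto=6)
reverse = community_id_v1("10.0.0.2", "10.0.0.1", 443, 49152, proto=6)
print(forward == reverse)  # True
```

For production use, check your results against the test vectors published with the spec rather than trusting this sketch.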

What's the Big Deal?

Out of the box, you can add community_id to Zeek records with the zkg plugin, and Suricata added runtime support via configuration. More interestingly, Elastic has started to add support to various components of its Beats framework. Now we're getting somewhere. With a consistent calculation of the community_id hash, we can tie NIDS records to host-generated audit records like socket creation and teardown. Now you can correlate a specific IDS alert with a specific PID on a specific host at a specific time, all without any EDR involved.
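Once every record carries the same key, sessionizing reduces to a group-by. The sketch below assumes each event is a dict with a "community_id" key; the event shapes, source labels, and placeholder IDs are made up for illustration and are not real tool output.

```python
from collections import defaultdict


def sessionize(events: list) -> dict:
    """Group heterogeneous events by their community_id value."""
    sessions = defaultdict(list)
    for event in events:
        cid = event.get("community_id")
        if cid is not None:  # records without a 5-tuple simply stay ungrouped
            sessions[cid].append(event)
    return dict(sessions)


# Made-up records; the IDs are placeholders, not real hashes.
events = [
    {"source": "nids", "community_id": "1:placeholder-flow-a", "alert": "example signature"},
    {"source": "host_audit", "community_id": "1:placeholder-flow-a", "host": "host-1", "pid": 4242},
    {"source": "host_audit", "community_id": "1:placeholder-flow-b", "host": "host-2", "pid": 99},
]

sessions = sessionize(events)
# Find the process behind the alerting flow, with no per-vendor field mapping.
pids = [e["pid"] for e in sessions["1:placeholder-flow-a"] if "pid" in e]
print(pids)  # [4242]
```

The join no longer cares what each vendor calls its address and port fields, only that each pipeline computed the hash the same way (including the same seed).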

What's Next?

This was just a small blog snack (and an excuse to trot out some clickbait tropes). If you'd like to see some more specifics (and the lab setup showing how it all works together), please feel free to book some time with me, or contact us for more information on how you can make your analysts more effective and accurate through one of our many services.
