Pushing and pulling: Best practices for getting data into your anti-abuse system
Getting event data out of your services and ready for analysis can be a surprisingly big challenge. Particularly in cases where you are trying to aggregate many streams of heterogeneous data from different sources, loading and transforming data to prepare it for analysis can be one of the more difficult and time-consuming parts of creating an anti-abuse system. Taking time to plan and architect your data collection with an eye on the important issues can help you dodge the traps and get to a good implementation with the minimum headache.
This post is not a technical guide, though please do stay tuned for a more technical post on this topic. Instead, the focus here is on best practices, tips, and tricks to help you make the right decisions in planning and implementing your data collection.
Fast, cheap, good: pick two
As the “front door” of your anti-abuse system, your data gathering needs to quickly get the data into a form ready for analysis, and it needs to do so reliably. Slow-to-arrive or dropped data results in bad decisions or missed incidents. Often the first instinct is to shoehorn the data in using whatever approach is cheapest and easiest. While that’s not a terrible idea - you should definitely be building the cheapest system that meets your needs - considering other concerns can help optimize your effort so you can get the most out of your investment.
Cost
As I mentioned above, the first (and often the only) constraint considered when getting data into an anti-abuse system is cost. There are two primary costs: build costs and operational costs. Build costs are incurred up front, in terms of the effort and time needed to engineer a solution. In general, the build cost goes up as the complexity of the system goes up. Operational costs are ongoing, and include computational resources and maintenance. Operational costs rise with data volume, with the sophistication of the analysis, and as latency requirements tighten. Both types of costs should be considered when evaluating a potential solution. Finding the right tradeoff on cost is often a matter of cutting complexity and sophistication until the system solves the problem well enough for your needs.
Latency
From the perspective of this topic, the latency of an event is the time between when the event was generated (or perhaps when the action occurred) and when you have performed your analysis and can take action. This latency can be further broken down into component parts depending on the architecture of your system.
For example, perhaps logs are flushed from the application on 5-second intervals. Those logs are picked up off disk by an aggregation tool when the file reaches a certain size limit and uploaded to a central log store; on average this happens every minute. A loading process runs every 5 minutes to sweep up the new files from the log store and transform them, writing them into an indexed data store. Once every 10 minutes, a job runs against the data store to find instances where account send rates have gotten too high. Each of these steps contributes to latency. You can sum half of each component’s interval to get a rough estimate of the average total latency. In this case, the average latency would be 2.5 + 30 + 150 + 300 = 482.5 seconds. Of that time, 2.5 + 30 + 150 = 182.5 seconds comes from data loading.
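As a quick sanity check, you can sketch this arithmetic in a few lines of code; the stage names and interval values below are just the hypothetical ones from the example above.

```python
# Rough average-latency estimate: each periodic stage contributes about half
# of its interval, since an event arrives at a random point in the cycle.
# Intervals (in seconds) are the hypothetical values from the example above.
stage_intervals = {
    "application log flush": 5,
    "upload to log store": 60,
    "load into data store": 300,
    "analysis job": 600,
}

stage_latency = {name: interval / 2 for name, interval in stage_intervals.items()}

loading_latency = sum(v for k, v in stage_latency.items() if k != "analysis job")
total_latency = sum(stage_latency.values())

print(f"Data loading latency: ~{loading_latency:.1f}s")   # ~182.5s
print(f"Total latency to action: ~{total_latency:.1f}s")  # ~482.5s
```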
For some systems, this may be fine. For others, there may be a need or desire for the latency to be shorter. Understanding both the sources of latency and the latency requirements you have for your system will guide your architectural choices. In general, improved latency comes at the cost of increased complexity and decreased extensibility, so here again it’s important not to overbuild.
Resilience and monitoring
An often-overlooked aspect of your data ingestion is how well it can be administered. There are two primary pieces to administration.
First, the system needs to be monitorable so that ops personnel can identify when it’s not working. Failure to monitor the system well results in lost data, and lost data results in poor analysis and thus poor outcomes. Good data ingestion monitoring examines each stage of the system for failures and has processes for automatically escalating non-transient issues. This means monitoring the processes that produce logs, the processes that transport logs, and the volume seen by the downstream consumer or consumers. Handling application failures gracefully by shunting logs into a side channel for later human review (and noting that there were failures at all!) is another important part of monitoring. Many systems have built-in dead letter queuing, but for some infrastructure you may need to roll your own. Beyond surfacing and helping to diagnose coding errors, periodic review of what lands in the dead letter queue can uncover areas where monitoring can be improved.
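If your transport doesn’t give you dead letter queuing out of the box, a hand-rolled version can start out very simple: catch processing failures, note them, and park the offending payload somewhere durable for later review. The sketch below is illustrative only; `process_event` and the spool path are hypothetical stand-ins for your own pipeline.

```python
import json
import logging
import time

logger = logging.getLogger("ingest")

DEAD_LETTER_PATH = "/var/spool/antiabuse/dead_letter.jsonl"  # hypothetical location


def handle_event(event: dict, process_event) -> None:
    """Process one event; on failure, record it for later human review."""
    try:
        process_event(event)
    except Exception as exc:
        # Don't silently drop the event: log the failure and park the payload.
        logger.warning("event processing failed: %s", exc)
        with open(DEAD_LETTER_PATH, "a", encoding="utf-8") as dlq:
            dlq.write(json.dumps({
                "failed_at": time.time(),
                "error": str(exc),
                "event": event,
            }) + "\n")
```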
Second, the system needs to be designed for resiliency in the distributed-systems sense of the word. Networks fail, processes die, disks fill up. Without considering these failure modes and how the system or its administrators can respond, you may wind up with a system that takes a long time to recover from issues, again resulting in lost data. Your choice of infrastructure components can help greatly, but to build truly resilient systems you must take a holistic view of the whole system. As with latency, your own specific requirements should guide your choices.
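One concrete pattern that helps with transient failures is to retry with backoff while keeping events buffered until the downstream system acknowledges them. A minimal sketch, where `send_batch` is a hypothetical callable into your transport and the caller is assumed to keep the batch buffered (on disk or in a queue) until the send succeeds:

```python
import random
import time


def send_with_retry(send_batch, events, max_attempts=5):
    """Retry a batch send with exponential backoff and jitter.

    `send_batch` is a hypothetical callable that raises on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send_batch(events)
            return
        except Exception:
            if attempt == max_attempts:
                raise  # let the caller park the batch in its dead letter queue
            # Back off exponentially, capped at a minute, with a little jitter.
            delay = min(60, 2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
```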
Implementing your collection
Push or pull?
When you have control over the infrastructure that is producing the logs, and particularly when you write the software that is doing the logging, there can be some question as to whether you should push the logs directly to your anti-abuse system or pull them from an API. In other cases, like cloud services, you might not have a choice: use the provided API or hit the highway.
One good implementation is to unify the two by having the anti-abuse system listen for events regardless of source, and to have integration shims upstream to handle collection from the source. Those shims collect events from their given source, handle any batching, and push the results to the anti-abuse platform. For third-party software producing logs on disk, tools like fluentd already do a lot of this. For API-driven collection, the shim can be a periodic polling process specific to the service. In either case, the shim abstracts out the details of where the logs are coming from, making future integrations easier and reducing component coupling.
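For an API-driven source, such a shim can be little more than a loop that polls for new events since the last checkpoint and forwards them in batches. A rough sketch, where `fetch_events`, `push_batch`, and the checkpoint helpers are hypothetical placeholders for your service’s API and your anti-abuse platform’s ingest endpoint:

```python
import time

POLL_INTERVAL_SECONDS = 60  # hypothetical polling cadence
BATCH_SIZE = 500            # hypothetical batch size


def run_shim(fetch_events, push_batch, load_checkpoint, save_checkpoint):
    """Poll a source API and push events downstream, batching as we go."""
    cursor = load_checkpoint()  # e.g. a timestamp or opaque API cursor
    while True:
        events, cursor = fetch_events(since=cursor)
        # Push in fixed-size batches so a large backlog doesn't become one huge request.
        for i in range(0, len(events), BATCH_SIZE):
            push_batch(events[i:i + BATCH_SIZE])
        # Only checkpoint after the events have been handed off successfully.
        save_checkpoint(cursor)
        time.sleep(POLL_INTERVAL_SECONDS)
```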
Transformation
Another common hurdle is the decision of if and how to transform the data once it’s in a place for processing. The first layer of transformation, from unstructured log data to a structured format like JSON, is a requirement for future processing. Depending on the source of your logs, you may be able to configure (or write!) the capability to output structured logs natively. If not, consider using a tool like fluentd or logstash, which come with a rich set of capabilities and plugins. These tools and their ecosystems can bear the brunt of the hard work of parsing logs and, heaven forbid, writing regexes. This approach also pushes the computational task of log transformation to the edge instead of onto your central processing.
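If you do end up parsing at the edge yourself, this first transformation is usually just pattern-matching a text line into named fields. A toy sketch, assuming a made-up log line format rather than any real service’s output:

```python
import json
import re

# Hypothetical unstructured line: '2024-01-15T10:23:01Z user42 SEND ok 250ms'
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<account>\S+) (?P<action>\S+) "
    r"(?P<status>\S+) (?P<duration_ms>\d+)ms"
)


def parse_line(line: str) -> str | None:
    """Turn one raw log line into a structured JSON event, or None if it doesn't match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None  # a real pipeline would route this to the dead letter queue
    fields = match.groupdict()
    fields["duration_ms"] = int(fields["duration_ms"])
    return json.dumps(fields)


print(parse_line("2024-01-15T10:23:01Z user42 SEND ok 250ms"))
```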
The next, optional layer of transformation is to convert the logs from all sources into a common event schema. This has its pros and its cons. On the plus side, it makes it easier to do cross-service analysis, finding commonalities among the different event sources integrated into your system. This can be a big plus if the systems are similar in scope or operate on the same data. On the minus side, it can be challenging to choose an event format that is flexible enough to support adding new log sources or changing existing ones. If you have to change the common schema every time you add a new source, you’ve given up most of the benefits of having a common schema in the first place. Schemas that are flexible enough to encompass different types of services are typically too diffuse to make analysis that much easier.
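To make this concrete, the sketch below shows what the core of a common schema might look like: a small set of shared fields plus a bag for source-specific detail. The field names and the mapping function are made up for illustration, not drawn from any particular standard.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CommonEvent:
    """Illustrative common event: a few shared fields plus source-specific detail."""
    timestamp: str          # when the action occurred, e.g. ISO 8601
    source: str             # which service or shim produced the event
    actor: str              # account, IP, or other identity performing the action
    action: str             # normalized action name, e.g. "message.send"
    detail: dict[str, Any] = field(default_factory=dict)  # anything source-specific


def from_mail_log(parsed: dict[str, Any]) -> CommonEvent:
    """Hypothetical mapping from one source's parsed log fields into the common schema."""
    return CommonEvent(
        timestamp=parsed["timestamp"],
        source="mail-service",
        actor=parsed["account"],
        action="message.send",
        detail={k: v for k, v in parsed.items()
                if k not in ("timestamp", "account")},
    )
```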
If you do decide to go with a common schema, take a look at existing standards for interoperability in the domain where you’re building your anti-abuse platform. For example, in the healthcare domain there are standards like FHIR that provide guidance, if not a complete template, for what format you should use. There may also be useful standards suggested by your tooling, like the Elastic Common Schema. In general, if you can find a standard that fits your needs, it can save you a lot of pain.