
Processing CDN Logs: Unlocking the Door to Better Streaming


Access to CDN log data has not been an issue for years. Most of the bigger CDNs provide programmatic access to logs so that streaming operators can pull them into systems like New Relic or Splunk for deeper analysis. But these logs can be tremendously large, taking a significant amount of time to ingest and process. Combine that with a multi-CDN environment and an organisation can quickly get overwhelmed, especially when those logs are not standardised or normalised. Unfortunately, each CDN represents its log data in its own way, which creates more work for operators who want to understand delivery-related issues. Still, CDN log data can provide invaluable insight into streaming performance and viewer QoE, and operators are desperate for a strategy that helps them make better, more effective use of it.

More data, more complexity

You would think that including more CDN data would mean more insight and better understanding. But because of the challenges of normalisation, standardisation, and data volume, that is not necessarily the case. Within multi-CDN architectures, more data can actually create more complexity and lengthen the time to derive insights. When an operator is dealing with a real-time event, like a live sports match, adding time to data analysis can result in greater viewer dissatisfaction and even churn. Operators need a way to deal with a set of core challenges related to making the most effective and efficient use of CDN log data:

  • Ingestion. First and foremost, CDN logs must be accessible programmatically. Through API ingestion, operators can fold CDN log data into operational workflows through automation; they shouldn’t have to request their logs. But during live events, ingestion may be delayed (sometimes as a function of API rate limiting by the CDN) because the CDN network is already under stress, generating and delivering logs in real-time on top of serving the event.
  • Normalisation. When pulling log data from multiple CDNs, it needs to be normalised before it can feed dashboards and other tools. This can be time consuming as well, especially on large datasets, and it is less a technical challenge than a people issue. Like ingestion, it needs to be automated: normalisation shouldn’t wait until the data has already landed. Data should be normalised to an internal data dictionary before it hits internal storage (see the sketch after this list).
  • Storage. Speaking of storage, with a multi-CDN architecture this can get out of hand quickly. The challenge isn’t just storing the volume of data, it’s also making use of the data in storage. As the data volume grows, analysis slows down: indexing must be carried out continually, and the time to index grows along with the size of the data pool. What’s more, operators must strike a balance between data available in real-time and data kept in long-term storage.
  • Querying. Much like indexing, querying is significantly impacted by the size of the data pool and the size of the index. Operators want to use as much data as possible to get into the granularity of issues related to CDN delivery and performance. But if queries take minutes instead of seconds, viewer QoE issues can’t be addressed in real-time or near real-time.
  • Visualisation. Of course, the end goal of any data set for a streaming operator is to see the results in a dashboard. But when data sets grow, when normalisation drags, when indexing slows down, and when queries take longer, the impact flows into visualisations: dashboards are populated minutes after data comes in, rather than seconds, and operations engineers are unable to solve problems in real-time.
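
To make the normalisation point concrete, here is a minimal sketch of mapping raw records from two CDNs onto a single internal data dictionary before they hit storage. The vendor names, field labels, and schema are illustrative assumptions for the sake of the example, not any CDN’s actual log format.

```python
# Minimal sketch: normalising per-CDN log lines to an internal data dictionary
# before they hit storage. Vendor names, field labels, and the schema are
# illustrative only -- real CDN log formats differ and should be mapped from
# each vendor's documentation.

from datetime import datetime, timezone

# The internal data dictionary: the canonical fields we store and query on.
CANONICAL_FIELDS = {"timestamp", "client_ip", "status", "bytes_sent", "cache_status", "url"}

# Per-vendor field mappings (hypothetical vendors and field labels).
FIELD_MAPS = {
    "cdn_a": {"time": "timestamp", "c_ip": "client_ip", "sc_status": "status",
              "sc_bytes": "bytes_sent", "x_cache": "cache_status", "cs_uri": "url"},
    "cdn_b": {"ts": "timestamp", "clientip": "client_ip", "httpstatus": "status",
              "respsize": "bytes_sent", "cachehit": "cache_status", "path": "url"},
}

def normalise(record: dict, vendor: str) -> dict:
    """Map one raw log record from a given CDN onto the canonical schema."""
    mapping = FIELD_MAPS[vendor]
    out = {canonical: record.get(raw) for raw, canonical in mapping.items()}
    assert set(out) == CANONICAL_FIELDS
    # Coerce types once, here, so every downstream consumer sees the same shapes.
    out["timestamp"] = datetime.fromtimestamp(float(out["timestamp"]), tz=timezone.utc)
    out["status"] = int(out["status"])
    out["bytes_sent"] = int(out["bytes_sent"])
    return out

# Two raw records from different CDNs end up in one canonical form.
print(normalise({"time": "1718000000", "c_ip": "203.0.113.7", "sc_status": "200",
                 "sc_bytes": "512000", "x_cache": "HIT", "cs_uri": "/live/chunk_42.ts"}, "cdn_a"))
print(normalise({"ts": "1718000003", "clientip": "198.51.100.9", "httpstatus": "503",
                 "respsize": "0", "cachehit": "MISS", "path": "/live/chunk_42.ts"}, "cdn_b"))
```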

All of these challenges share a related vector as well: cost. As the data grows, storage costs grow. As the data grows and queries take longer, CPU cycles increase, and that means more cost. And failing to solve QoE issues in real-time, a direct result of all the time it takes to process and work with CDN log data, can result in churn, which of course carries a much bigger cost to the business.

Making CDN log data analysis real-time

Thankfully, there are ways to address these challenges, cutting both the time they add and the costs associated with handling all that data.

Strategy #1: Data subsets

First, operators need to deal in data subsets. Key fields, which collectively might be used to generate a single metric, should be collated in a separate table. For example, if a specific metric indicates a potential performance issue, the faster an operator can see the problem, the sooner the issue can be addressed, before it becomes a QoE problem. Data subsets like this can improve indexing and query performance to populate graphs in dashboards.
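
As a rough illustration, a subset table might collapse full log records into just the handful of columns needed for one metric. The field names and the 5xx-based error ratio below are assumptions for the sake of the sketch, and it presumes records have already been normalised and tagged with the CDN they came from.

```python
# Sketch of a data subset: from the full normalised log stream, keep only the
# fields needed for one operational metric (here, a per-minute error ratio per
# CDN). Table and field names are illustrative, not a reference schema.

from collections import defaultdict

def build_error_subset(records):
    """Collapse full log records into a small (cdn, minute) -> error ratio table."""
    counts = defaultdict(lambda: {"total": 0, "errors": 0})
    for r in records:
        key = (r["cdn"], r["timestamp"].replace(second=0, microsecond=0))
        counts[key]["total"] += 1
        if r["status"] >= 500:
            counts[key]["errors"] += 1
    # The subset carries only a few columns per CDN per minute, so it stays
    # small enough to index and query in seconds while the raw logs keep growing.
    return [
        {"cdn": cdn, "minute": minute,
         "error_ratio": c["errors"] / c["total"], "requests": c["total"]}
        for (cdn, minute), c in counts.items()
    ]
```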

Strategy #2: Data sampling

Second, operators need to understand that they don’t need all the data all the time. Going back to data subsets, let’s say the operator creates a subset around buffer ratio, a metric directly related to potential CDN problems. The operator pulls all the data in real-time; the subset is carved out and the rest is dumped into long-term storage (so it can be examined later as needed and pruned if it isn’t relevant). The subset, then, is a real-time data source, but it only represents every 10th session. The visualisation displays that metric and flags when it hits a threshold set by the operator. When that happens, the system switches gears and starts pulling in all of the sessions, allowing the operator to drill into the full data set, understand the problem, and take immediate action.
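
A minimal sketch of that switch-over logic, assuming a 1-in-10 sample rate and an operator-chosen buffer-ratio threshold (both values are placeholders):

```python
# Threshold-driven sampling: keep every 10th session in the real-time subset
# until a metric (here, a hypothetical rolling buffer ratio) crosses the alert
# threshold, then switch to capturing every session so engineers can drill in.

SAMPLE_RATE = 10           # keep 1 in 10 sessions during normal operation (assumed)
BUFFER_RATIO_ALERT = 0.02  # operator-defined threshold (assumed value)

class SessionSampler:
    def __init__(self):
        self.full_capture = False
        self._counter = 0

    def should_keep(self, session: dict) -> bool:
        # Once the rolling metric has tripped the threshold, keep everything.
        if self.full_capture:
            return True
        self._counter += 1
        return self._counter % SAMPLE_RATE == 0

    def observe_metric(self, rolling_buffer_ratio: float) -> None:
        # Switch to full capture when the subset's metric crosses the threshold;
        # fall back to sampling once it recovers.
        self.full_capture = rolling_buffer_ratio >= BUFFER_RATIO_ALERT
```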

Strategy #3: De-couple data storage and reporting

Third, operators should treat reporting (visualisation) and data storage separately. Unfortunately, that is not often the case: dashboards are connected directly to the CDN log data pool. But if operators embrace subsets, the CDN log storage pool can be kept separate and the visualisation can operate against the subset. This reduces overall costs by cutting down on the compute cycles needed for indexing and by keeping a much smaller volume of data in active storage (which is more expensive than long-term storage).
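
One way to picture the de-coupling is a simple routing step: the full normalised record is always archived to cheap long-term storage, while only the dashboard-facing fields land in the real-time store the reporting layer reads from. The store names and field choices below are placeholders, not a prescribed architecture.

```python
# Sketch of de-coupling reporting from the raw log pool. COLD_STORE and
# HOT_STORE are stand-ins for an archival tier and a real-time store.

COLD_STORE: list[dict] = []   # stands in for object storage / archival tier
HOT_STORE: list[dict] = []    # stands in for the store the dashboards query

SUBSET_FIELDS = ("timestamp", "cdn", "status", "bytes_sent")  # assumed field names

def archive(record: dict) -> None:
    # Every complete record goes to long-term storage, once and cheaply.
    COLD_STORE.append(record)

def publish_subset(record: dict) -> None:
    # Only the dashboard-facing fields land in the hot store, keeping it small
    # enough to index and query in seconds.
    HOT_STORE.append({k: record[k] for k in SUBSET_FIELDS if k in record})
```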

Strategy #4: Create a real-time CDN data pool 

Ultimately, every streaming operator should be looking at how to create a real-time CDN data pool. This pool is a subset of the long-term store of the complete logs, combining data from all the different vendors and dealing only with the specific data points needed for operational metrics. It is sampled, too, until the moment every record is needed, and it is normalised as part of creating the subset. Everything is automated, performant, and cost-optimised.
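
Pulling the earlier sketches together, the real-time pool amounts to a short pipeline: normalise against the data dictionary, archive the full record, and let sampling decide what reaches the subset the dashboards query. The outline below is illustrative only, wired to the hypothetical helpers sketched above rather than to any particular product.

```python
# Illustrative outline of a real-time CDN data pool, tying together the pieces
# sketched earlier (normalise, SessionSampler, archive, publish_subset).
# All helper names are hypothetical.

def process_raw_log(raw_record: dict, vendor: str, normalise, sampler,
                    archive, publish_subset) -> None:
    # 1. Normalise to the internal data dictionary before anything is stored.
    record = normalise(raw_record, vendor)
    record["cdn"] = vendor
    # 2. The complete record always lands in cheap long-term storage.
    archive(record)
    # 3. Only sampled sessions (or every session, once a QoE threshold trips)
    #    feed the small real-time subset that powers the dashboards.
    if sampler.should_keep(record):
        publish_subset(record)
```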

CDN log handling doesn’t have to be as difficult (or costly) as it is

CDN logs are crucial to streaming operations and, thankfully, they are accessible programmatically, which means they can be ingested into operational workflows in an automated fashion. But numerous issues make it hard to use large volumes of CDN log data in real-time. Still, those challenges can be solved in ways that improve query and indexing performance while also driving down storage costs.

Interested in learning more about this topic? Watch the workshop “Processing Video CDN Logs at Scale in a Cost-Effective Way” where you’ll hear from streaming operators as well as leading vendors on technologies you can use today to improve how you handle and use CDN log data.