by Tye Gaywood 28th March 2019

Wait, What Happened During That Stream?

Today, streaming video service operators have a lot of valuable data at their fingertips through a wide range of monitoring tools, from ingest right through to delivery. The problem is that the data is often disconnected, requiring a significant amount of manual labor to piece together the story around a streaming incident. This is especially problematic when the incident is undermining Quality of Experience (QoE). Without a way to quickly uncover the root issues, subscribers might begin to flee and voice their dissatisfaction on social media. Ultimately, that was the genesis for our StreamE2E platform: a real-time view into the entire end-to-end monitoring chain of video streams. But just having a live view of the data doesn’t necessarily make it easy to sleuth out the underlying problems, especially when a problem is the result of a cascade of related events.

Moving from Triage to Post-Mortem

For many OTT service providers, resolution time is a critical KPI. Although there aren’t any SLAs with regard to service performance, consumers are unforgiving about streaming quality. Raised on the consistency of broadcast television, people will take to online channels to voice their anger over poor quality and other service degradation.
Because of that, when problems do happen, whether with availability, bitrate performance or latency, operations teams spring into action. They work with vendor partners, like CDNs, and internal teams to triangulate the problem and source a solution as quickly as possible. That’s why the StreamE2E platform has been so groundbreaking: it enables those operations folks to pull data from a variety of sources and see the complete picture in a linear progression through a single dashboard.
Triage, though, is just a process to fix the issue. It doesn’t provide a way to understand what was really going on or what caused the problem in the first place. And once an incident has passed, it’s almost impossible to visualise and analyse it again in the same way as when it was actually happening. Yet that’s exactly what’s needed. Operations must come together with their vendors to understand how the problem occurred, which is often a tedious, manual process of piecing together those disparate data sources. This post-mortem is critical to making the technical adjustments required to ensure high-quality QoE in the future.

Context is Everything in Analytics

Part of the problem with doing any sort of post-mortem is that you are only looking at the data, not the context. It’s hard to imagine the data in the linear delivery workflow without actually seeing it there, but that context is critical to understanding how one step in the workflow impacted the next or others downstream. So how can OTT operations teams replay an event to watch the data flow through the delivery workflow, follow the cascade of events, and see where the issues lie?

A DVR For Data?

The answer to that is our new StreamE2E Incident Playback service. This service allows operators to replay historical incidents as though they are happening right now.

Figure 1: StreamE2E Incident Playback Interface

As shown in Figure 1, by using this new service, operators can play back an incident in full or from any point in time in the streaming workflow, giving them a view of the end-to-end data in one intuitive visualisation. In the screenshot above, the top “availability” section of the dashboard shows where incidents have occurred. An operator simply needs to click anywhere on the timeline (indicated by the red line) and then rewind or go forward in time to replay the event. The tool mimics each monitored element, presenting an exact visual replica of the actual workflow. This is a much more granular way to examine problems that may have contributed to the slowdown or failure. It’s like a DVR for stream monitoring data. When employed by operations teams and delivery partners working on issues post-mortem, it can enable them to rewind the delivery workflow. Imagine being able to see each step of delivery and where performance diverged from expected thresholds.
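To make the “DVR for data” idea concrete, here is a minimal sketch of how a replay might reconstruct the state of each workflow element at a chosen moment and flag where performance diverged from thresholds. It is not the StreamE2E implementation: the element names, metric values, thresholds, and the functions state_at and breaches are all assumptions made up for illustration, and each element’s samples are assumed to be sorted by timestamp.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    timestamp: int   # seconds since the start of the recording (hypothetical unit)
    value: float     # metric value, e.g. response latency in ms (hypothetical metric)


def state_at(history, t):
    """Return, per element, the latest sample at or before time t (None if none yet)."""
    snapshot = {}
    for element, samples in history.items():
        times = [s.timestamp for s in samples]  # assumes samples are sorted by time
        i = bisect_right(times, t)
        snapshot[element] = samples[i - 1] if i else None
    return snapshot


def breaches(snapshot, thresholds):
    """List the elements whose value at the replayed moment exceeds its threshold."""
    return [
        element
        for element, sample in snapshot.items()
        if sample is not None and sample.value > thresholds[element]
    ]


# Invented sample data: three workflow elements, one latency reading every 10 s.
history = {
    "encoder": [Sample(0, 40.0), Sample(10, 42.0), Sample(20, 41.0)],
    "origin": [Sample(0, 80.0), Sample(10, 310.0), Sample(20, 95.0)],
    "cdn": [Sample(0, 120.0), Sample(10, 130.0), Sample(20, 450.0)],
}
thresholds = {"encoder": 100.0, "origin": 200.0, "cdn": 300.0}

# "Scrub" the timeline: move the playhead and see which elements are out of bounds.
for t in (5, 12, 25):
    print(t, breaches(state_at(history, t), thresholds))
```

In this invented data set, moving the playhead from 5 to 12 to 25 seconds shows the origin breaching its threshold first and the CDN breaching later, which is the kind of cascade a replay is meant to expose.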

Figure 2: Complete View of All Metrics

But seeing the event replay is only half of the solution. Everyone needs to be on the same page, even those without access to the dashboard. That’s made possible by the “metrics view” shown in Figure 2. Once the operations team has selected a point in the timeline, they can opt to view all the metrics (from every tracked workflow element) and even export them to Excel or PDF to share with vendors and internal partners. This ensures that all parties responsible for specific workflow elements see how they are impacting each other throughout video delivery.
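As a rough illustration of sharing a point-in-time metrics view with people who lack dashboard access, the sketch below flattens a snapshot into CSV, which spreadsheet tools such as Excel can open. CSV is a stand-in chosen for simplicity, not the product’s actual Excel or PDF export, and the snapshot layout, metric names, timestamp, and the function snapshot_to_csv are hypothetical.

```python
import csv
import io


def snapshot_to_csv(snapshot, replay_time):
    """Flatten {element: {metric: value}} into CSV text, one row per metric."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["replay_time", "element", "metric", "value"])
    # Sort elements and metrics so every recipient sees rows in the same order.
    for element in sorted(snapshot):
        for metric, value in sorted(snapshot[element].items()):
            writer.writerow([replay_time, element, metric, value])
    return buffer.getvalue()


# Invented snapshot of two workflow elements at one moment on the timeline.
snapshot = {
    "origin": {"latency_ms": 310.0, "availability_pct": 99.2},
    "cdn": {"latency_ms": 130.0, "availability_pct": 100.0},
}
report = snapshot_to_csv(snapshot, "2019-03-28T10:15:00Z")
print(report)
```

Stamping every row with the replay time keeps the exported figures tied to the moment being discussed, so each party is looking at the same point in the incident.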

Cooperative Analysis: The New Norm

Although having the opportunity to “rewind” an incident is great, and a single visual dashboard of the delivery workflow is very insightful, what the StreamE2E platform and the Incident Playback service truly offer is a collaborative approach to streaming video data analysis. It enables everyone, internal operations teams and external vendor partners alike, to “speak the same language.” No, it doesn’t ensure that everyone agrees on the KPIs. But it does ensure that everyone can see the data behind them and each KPI’s relation to other elements within the delivery workflow. Equally important, Incident Playback doesn’t require manual effort to connect the dots. The StreamE2E platform does that automatically by keeping the data in context, which allows teams to work through the issue, while Incident Playback enables operators to move back and forth through a performance-related event to see how each workflow element fared.

Problem Management: A Revolutionary Approach?

We’d love to say that our approach to reviewing incident data is our own revolutionary idea, but actually, we are drawing from some of the most important ITIL concepts: Problem and Incident Management. Touchstream’s Incident Playback represents what ITIL defines as “Problem Management”, a process that reviews incidents to identify and subsequently remove the underlying cause. What’s revolutionary is that we are bringing robust and proven ITIL processes into the streaming video Quality of Service (QoS) world. The ITIL framework was designed to help businesses manage risk and improve customer relations, as well as to enable the creation of a stable IT environment. Think about it: the StreamE2E platform with the new Incident Playback service is cloud based, requiring nothing to be installed. That’s stability. And by enabling operations to see streaming data across the workflow while also allowing them to replay incidents, the risk of subscriber churn can be radically reduced.

Conclusion

The days of manually piecing together data from disparate systems to understand an incident are gone. With Touchstream’s StreamE2E Incident Playback service, you can now replay incidents within the context of the delivery workflow. This can significantly improve an OTT service provider’s ability to diagnose the root cause of issues, rather than just treating the symptoms, and helps ensure a high-quality viewer experience. In the long term, employing such a system can significantly mitigate the risk of subscriber churn while also adding to the overall stability of the entire IT system.