This guide is for those using STUN/TURN with WebRTC.
It will help you decide why you might, or might not, want to monitor and track the health and performance of your STUN/TURN servers.
This will also help you decide how to implement monitoring if you decide to do it on your own.
If you need STUN/TURN monitoring, but don’t want to build it yourself, RTC9.COM offers free and paid services.
If you choose to run your own server and build your own monitoring infrastructure, you can use rtc9-turnhealthmonitor, which provides a standard HTTP Prometheus endpoint compatible with most monitoring systems.
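To make that concrete: a Prometheus endpoint is just plain text served over HTTP. The metric names below are hypothetical, purely to illustrate the exposition format such a monitor would serve; I am not quoting the actual rtc9-turnhealthmonitor output:

```
# HELP turn_probe_rtt_seconds Round-trip time of the most recent TURN echo probe.
# TYPE turn_probe_rtt_seconds gauge
turn_probe_rtt_seconds{server="turn1.example.com"} 0.042
# HELP turn_probe_loss_ratio Fraction of recent probes that went unanswered.
# TYPE turn_probe_loss_ratio gauge
turn_probe_loss_ratio{server="turn1.example.com"} 0
```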
Health monitoring for TURN for WebRTC provides insight into the health and performance of running STUN/TURN servers.
TURN packet loss, latency, jitter, or plain downtime can all degrade WebRTC call quality. Without monitoring or insight, when users complain about call quality, you have no way to tell whether the cause is transient network packet loss, an overloaded or malfunctioning TURN server, or a TURN server simply running on an oversubscribed, busy VPS.
By continuously monitoring a few key performance metrics of your TURN servers, you can quickly confirm or rule out the TURN servers as the issue, and proceed to get your entire WebRTC system operating at peak performance.
During development of WebRTC systems, reliability and robustness are usually a much lower priority than just getting the application working. Sometimes they are about the last concern compared to getting to market.
When real customers, beta customers, management, and prospects start using WebRTC services that need STUN/TURN, concerns about reliability can come to the fore in a few different ways:
No! TURN health monitoring alone probably won't get you to five-nines of reliability.
You really need a few more things to get there, like system-wide redundancy. Subscribe to my newsletter to get more info on achieving the highest levels of availability.
This part is not for managers, except the most curious! This is a high-level guide to implementing TURN health monitoring. This guide requires an experienced devops person to implement.
This is an image of the first version of the continuous TURN session monitoring system we built. You might decide on different metrics. This image and section are just to show what's possible, and do not include setup details for Grafana.
This image shows:
The three key metrics for monitoring TURN server health are jitter, latency, and loss, as mentioned in the first part of this guide.
You need an automated way to run TURN sessions and measure jitter, latency, and loss.
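If you build that measurement piece yourself, the arithmetic behind those three numbers is simple. Here is a minimal sketch in Go; the Probe type and the sample values are invented for illustration, and a real monitor would fill them in from echo packets relayed through an actual TURN allocation. Jitter uses the RFC 3550 interarrival estimator.

```go
// Sketch: derive latency, jitter, and loss from a batch of probe results.
// Probe and the sample data below are hypothetical stand-ins for packets
// echoed through a real TURN session.
package main

import (
	"fmt"
	"time"
)

// Probe records one echo packet sent through the TURN session.
type Probe struct {
	Sent     time.Time
	Received time.Time // zero value means the reply never arrived (loss)
}

func main() {
	now := time.Now()
	probes := []Probe{ // fabricated sample data, purely for illustration
		{Sent: now, Received: now.Add(40 * time.Millisecond)},
		{Sent: now.Add(1 * time.Second), Received: now.Add(1*time.Second + 55*time.Millisecond)},
		{Sent: now.Add(2 * time.Second)}, // lost probe
		{Sent: now.Add(3 * time.Second), Received: now.Add(3*time.Second + 45*time.Millisecond)},
	}

	var rttSum, prevRTT, jitter float64
	answered := 0
	for _, p := range probes {
		if p.Received.IsZero() {
			continue // counted as loss below, not in latency/jitter
		}
		rtt := p.Received.Sub(p.Sent).Seconds()
		rttSum += rtt
		if answered > 0 {
			// RFC 3550 interarrival jitter: smooth |RTT delta| with a 1/16 gain.
			d := rtt - prevRTT
			if d < 0 {
				d = -d
			}
			jitter += (d - jitter) / 16
		}
		prevRTT = rtt
		answered++
	}

	loss := 1 - float64(answered)/float64(len(probes))
	fmt.Printf("latency=%.1fms jitter=%.2fms loss=%.0f%%\n",
		rttSum/float64(answered)*1000, jitter*1000, loss*100)
}
```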
So, whether you use rtc9-turnhealthmonitor or cobble together your own methods to periodically measure health metrics, you ideally want to collect those metrics for real-time monitoring and later analysis, if needed.
The three systems below stood out to me when building a TURN health monitoring system:
I currently use Prometheus for collection and storage of the metrics discussed in this article, but after having set it up and managed it for a while, I honestly would look for a simpler solution. Granted, I am always looking for a simpler solution!
If I had not discovered TimescaleDB, my second choice would likely be InfluxDB. I suspect the installation and maintenance of InfluxDB would be simpler than Prometheus, which I had installed using Ansible playbooks from CloudAlchemy.
While everything is currently working fine, I worry about two main things:
Each of these topics is worth an article in itself; please write me if this interests you, and/or join my mailing list.
So, once you have chosen a tool to periodically measure current jitter, latency, and loss, and you have chosen a metrics database and monitoring system, you actually need to get the metrics into the database.
If you chose Prometheus for your monitoring system, rtc9-turnhealthmonitor provides its metrics in Prometheus format, and it is standard, straightforward work to wire the two together so your metrics end up in Prometheus.
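A minimal sketch of that wiring follows. The job name, host, and port are placeholders of mine, not documented rtc9-turnhealthmonitor values; point the target at wherever your monitor actually serves its endpoint.

```yaml
# prometheus.yml fragment; target host and port are placeholders.
scrape_configs:
  - job_name: "turn-health"
    scrape_interval: 30s
    static_configs:
      - targets: ["turn-monitor.example.com:8080"]
```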
If you chose InfluxDB and Telegraf for your monitoring system, Telegraf will poll and ingest metrics into InfluxDB from standard Prometheus HTTP endpoints. Get rtc9-turnhealthmonitor working, and start polling it with Telegraf.
The Prometheus input plugin for Telegraf is described in the Telegraf documentation.
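As a sketch, that wiring might look like the fragment below; the endpoint URL and database name are placeholders I made up, not values from any of these projects' docs.

```toml
# telegraf.conf fragment: scrape the monitor's Prometheus endpoint and
# write the samples into InfluxDB. URL and database name are placeholders.
[[inputs.prometheus]]
  urls = ["http://turn-monitor.example.com:8080/metrics"]

[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "turn_metrics"
```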
If you chose TimescaleDB for your metrics data store, you probably need to be using timescale-prometheus for metrics collection and ingest, which simply uses Prometheus for metrics collection. Even though it is sunsetted, I personally like the simplicity of pg_prometheus, but I wouldn't recommend that route unless you are very experienced, a brave soul, or both. If I get more experience with timescale-prometheus, I'll write about it and share it with my newsletter list.
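With timescale-prometheus, Prometheus stays in the picture and ships its samples to the adapter via remote write. A sketch, assuming the adapter listens on port 9201 (an assumption on my part; verify against the timescale-prometheus docs):

```yaml
# prometheus.yml fragment; the adapter address is an assumption.
remote_write:
  - url: "http://localhost:9201/write"
remote_read:
  - url: "http://localhost:9201/read"
```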
At RTC9.com, if we use TimescaleDB for metric storage, we likely won't use Prometheus for metric collection; we might look for a simpler route. If you look at the help docs for pg_prometheus, a single bash command-line example shows how curl can be used to pull metrics and insert them into the database.
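From memory, it is along the lines of the sketch below; the endpoint address, connection flags, and table name here are placeholders, so check the pg_prometheus README for the exact invocation.

```bash
# Pull the current samples from the exporter's Prometheus endpoint and
# COPY them into a pg_prometheus-backed table. Addresses and names are
# placeholders, not documented values.
curl -s http://localhost:8080/metrics \
  | grep -v '^#' \
  | psql -h localhost -U postgres -c "COPY metrics FROM STDIN"
```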
There is not much more to say about Grafana here; it seems to be the most popular open-source package for graphing and viewing real-time monitoring metrics.
It can take a bit of time to figure out the right thresholds for alerting. While we use bare metal for our TURN servers, we use VPSs for our TURN polling servers. Because VPSs are subject to a wide performance range, it took us a while to find the best trade-off between alerting too soon and waiting too long before a metric range is considered alertable.
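The `for:` clause in a Prometheus alerting rule is one way to express that trade-off: the condition must hold for the whole window before the alert fires, which rides out short VPS hiccups. A sketch, with a hypothetical metric name and deliberately arbitrary thresholds to tune for your own setup:

```yaml
# Example alerting rule; turn_probe_rtt_seconds is a hypothetical metric
# name, and 250ms / 10m are starting points, not recommendations.
groups:
  - name: turn-health
    rules:
      - alert: TurnLatencyHigh
        expr: turn_probe_rtt_seconds > 0.25
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "TURN probe RTT above 250ms for 10 minutes"
```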
If you are in the development phase of your WebRTC STUN/TURN effort and running your own TURN server, TURN health monitoring might be overkill, but you may want to consider your plan for when you go to production.
If you have prospects, customers, or management using your WebRTC applications, and you are running TURN servers without any kind of inside performance-metrics visibility, ask yourself: what are you going to do if and when people say they had a problem with a call? Being able to research problem reports and proactively fix performance issues makes the difference between systems that are a nightmare and systems a devops person can be proud to run.
STUN and TURN are services, often used in WebRTC, that help endpoints do the following: discover the public IP address and port assigned to them by a NAT (STUN), and relay traffic through an intermediary server when no direct peer-to-peer path can be established (TURN).
It is generally reported that, across a large set of peer connections, TURN relay services are needed about 15% to 20% of the time. Some scenarios should rarely, if ever, need TURN; for example, non-firewalled, non-NATted hosts in WebRTC sessions.