Round-Robin Business Intelligence (RRDs Explained)
Recently I’ve needed to integrate a monitoring system with a BI reporting system. The reporting system has a SQL back-end, and the input from the monitoring system arrives as Round Robin Database (RRD) files, the open storage format popularized by RRDtool.
RRDs store data cyclically, so the database occupies a predefined, fixed amount of space and is never allowed to grow. What makes this storage format unique is that a fixed number of data points are stored at different granularities, with specific consolidation functions used to aggregate the data. This allows a variety of data resolutions, alongside long-term storage of data (with limitations).
That explanation may not have made sense, so I’ve whipped up a Tableau Public chart to help illustrate what’s going on.
A long, but trivial example
For the sake of an example, let’s say I’m interested in how the temperature in my bedroom fluctuates with respect to outdoor temperature (maybe I want to really optimize my heating/cooling bills and knowing how much my bedroom temperature lags outdoor temperature is a key measure). Also, for the sake of curiosity, I want to know what time these temperatures all occur at.
This means I’ll have 2 digital thermometers attached to a computer sampling inside and outside temperature every 60 seconds (both sample at exactly the same time). This data is continually recorded and stored for me to look at when I have time to make a clever temperature control system.
After 1 hour, I have 60 rows of data in a time-value series with 180 distinct values (60 time points as seconds, 60 indoor temperature data points, 60 outdoor temperature data points) representing the minute-by-minute temperature fluctuations of my room over the past hour. Great!
Being able to measure the temperature of a room over an hour is cool and all, but if I want to actually look at that data and get something out of it, I might be more interested in how the temperature of my room fluctuates across a full 24 hours. So, if I leave my system to run, at the end of the 24 hours I have 1440 samples of data (for the sake of simplicity, each row of time/indoor temperature measurement/outdoor temperature measurement will be referred to as 1 data sample).
Measuring one day out of the year might give some information, but that’s just one day. I should probably get more data and sample for 2 weeks (20,160 samples), or maybe even 4 weeks (40,320 samples).
Hmm, but picking just one month out of the year doesn’t tell me how indoor temperature fluctuation changes based on seasons, so I probably need a full year of data (525,600 samples). But, what if this year was a mild one? I should keep my experiment running for, maybe, 3 years just to be sure (1,576,800 samples).
What if this experiment wasn’t limited to my house? What if I was measuring the effect of external temperatures on houses of different ages and qualities across a city? In a city of a few million people, I might want to sample 5,000 properties in the same way (7,884,000,000 samples representing 23,652,000,000 distinct data points).
Storage requirements for all this data
What’s the impact on data storage, you ask? For simplicity, let’s use 4-byte data types for each measurement. Ignoring compression, this gives us about 88 GB of space needed to store all that data. 88 GB isn’t a lot of data in the grand scheme of things, but that much data just for making two temperature measurements over a few years at a few thousand sites? A bit excessive…
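The arithmetic behind that 88 GB figure can be checked with a few lines of Python (assuming 4 bytes per stored value and no compression or per-row overhead):

```python
# Back-of-the-envelope storage estimate for the scenario above:
# 5,000 properties sampled once a minute for 3 years, each sample
# holding a timestamp plus two temperature readings.

samples_per_day = 24 * 60                      # one sample per minute
samples_3_years = samples_per_day * 365 * 3    # 1,576,800 per property
properties = 5_000

total_samples = samples_3_years * properties   # 7,884,000,000
values_per_sample = 3                          # time, indoor, outdoor
bytes_per_value = 4                            # 4-byte data types

total_bytes = total_samples * values_per_sample * bytes_per_value
print(total_samples)                # 7884000000
print(total_bytes / 1024 ** 3)      # roughly 88 GB
```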
I think we can all agree that measuring and storing temperature fluctuations over 3 years on a minute-by-minute basis is a little ridiculous. But for some monitoring solutions, having data recorded every minute on the minute is very important.
Any business requiring high availability needs to keep track of outages on the order of minutes or face severe issues (financial and possibly others). I work in the healthcare industry, and minutes of downtime have huge implications (the biggest being patient safety).
How can we cut storage costs?
With any monitoring solution, the question that will always arise is, “What resolution do we REALLY need?”
In the thermometer example, given that we’re only measuring ambient temperatures, we can safely assume that there won’t be vast minute-by-minute fluctuations. In fact, we can probably drop our measurement frequency to every hour and not lose much useful information. But, the thermometers we bought automatically sample every minute, so we still have this flood of data coming in. How do we deal with it?
The answer is that we measure minutely, but record hourly. This will drop the total number of samples (and thus, storage required) by a factor of 60. So, do we just throw away 59 measurements and take every 60th? Well, we could, but since we have that data anyway why not make use of it? Or, what if there was a glitch on that 60th sample and we never received the measurement? That would cause us to effectively lose an entire hour of data.
The way we make use of all the measurements is by aggregation. We take multiple measurements and intelligently join them together to give us a single measurement of (hopefully) more significant meaning.
But how do we want to aggregate that hour of data? Sum up all the values? Take an average? Take the maximum or minimum temperatures over that hour? Some combination of the above? Or do we even need hourly data? Couldn’t we just jump straight to daily data (like many weather services do)? Again, how do we aggregate that daily data?
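As a sketch of the idea (plain Python, not how any real RRD implementation works), here is what consolidating one hour of minutely readings into a single hourly record might look like, keeping the average, minimum, and maximum so a brief spike isn’t lost, and tolerating a glitched sample:

```python
def consolidate(readings):
    """Collapse a list of minutely readings into one hourly record.

    Keeping min and max alongside the average means a brief spike
    doesn't silently vanish, and a missing sample (None) is simply
    skipped rather than wrecking the whole hour.
    """
    valid = [r for r in readings if r is not None]  # drop glitched samples
    if not valid:
        return None  # the entire hour was lost
    return {
        "avg": sum(valid) / len(valid),
        "min": min(valid),
        "max": max(valid),
    }

# One hour of indoor readings, with the 31st sample glitched:
hour = [20.0] * 30 + [None] + [21.0] * 29
print(consolidate(hour))
```

Note that this is exactly the alternative to “throw away 59 measurements and keep every 60th”: all 59 surviving samples contribute to the hourly record.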
Let’s say we decide on some aggregation function and some frequency of recording data… How do we deal with adding new houses? And where are we going to store all this information? Our data storage is currently growing without bound, and as we add more sites, it will grow even faster!
RRDs to the rescue!
… Sort of… I have a love/hate relationship with RRDs. I completely see the value in them, but I’m also stuck with dealing with their quirks.
RRDs work by storing measurements at different granularities for different amounts of time. In many monitoring solutions, we need very fine resolution measurements for a short period of time, but then after that we need progressively less resolution for longer stretches of time, and then after some time, we no longer care about the data. RRDs work using this concept.
As an example, I have an RRD file open right now. I have data recorded every minute for the past 2 days, every 5 minutes for the past 10 days, every 30 minutes for the past 3 months, every 6 hours for the past 4 years, and nothing past 4 years.
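Those retention windows translate directly into a fixed number of rows per archive (retention divided by resolution). A quick sketch in Python shows the totals; the archive labels are my own, not taken from the actual file:

```python
# (resolution in minutes, retention in minutes) for each archive
archives = {
    "1min_for_2days":  (1,       2 * 24 * 60),
    "5min_for_10days": (5,      10 * 24 * 60),
    "30min_for_3mo":   (30,     90 * 24 * 60),
    "6h_for_4yr":      (6 * 60, 4 * 365 * 24 * 60),
}

for name, (step, span) in archives.items():
    print(f"{name}: {span // step} rows")

total_rows = sum(span // step for step, span in archives.values())
print("total rows per metric:", total_rows)
```

However far ahead the clock runs, those row counts never change, which is the whole point of the format.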
This gives the high resolution I need in the short term, combined with my progressively lower resolution needs in the medium term, and finally my lack of need for data past a certain point. All of this data came from measurements taken every 5-15 seconds.
Back to the question of how this data is aggregated. Well, you choose that too. What’s even more valuable is that you can choose multiple ways of aggregating the same data. In that same RRD file, the resolutions listed above are each aggregated in three ways: averages, minimums, and maximums over their respective time periods.
Why go through the hassle?
What’s the benefit of all this? Well, because RRD files are cyclical, they offer these aggregated values at varying granularities on a rolling basis. That means that the most recent minute of data will occupy the space previously taken up by a minute of data 2 days ago.
What does that mean? RRD files don’t grow past a certain size! You specify the aggregations, resolutions, and time periods, and because data is overwritten cyclically, there is a definite maximum amount of storage required. As you add more and more sites (or more information to be monitored), each new site or monitoring requirement has a fixed storage cost. In my example, each new property takes up exactly an extra 5 MB of storage - instead of growing without bound - making my overall temperature monitoring system extremely scalable.
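The round-robin idea itself fits in a few lines. This is a minimal sketch in plain Python (nothing to do with the on-disk RRD format): storage is allocated once, and after the buffer fills, each new value overwrites the oldest one:

```python
class RoundRobinArchive:
    """Fixed-size archive: new values overwrite the oldest ones."""

    def __init__(self, size):
        self.slots = [None] * size   # all storage allocated up front
        self.next = 0                # index of the slot to overwrite next

    def record(self, value):
        self.slots[self.next] = value
        self.next = (self.next + 1) % len(self.slots)  # wrap around

    def values(self):
        """Stored values, oldest first (skipping unfilled slots)."""
        ordered = self.slots[self.next:] + self.slots[:self.next]
        return [v for v in ordered if v is not None]

archive = RoundRobinArchive(size=3)
for temp in [20.1, 20.3, 20.2, 20.6]:   # 4 writes into 3 slots
    archive.record(temp)
print(archive.values())                 # the oldest sample (20.1) is gone
```

No matter how many values are recorded, `len(archive.slots)` never changes - the same property that caps each RRD file at a fixed size.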
That’s all for my explanation of what RRD files are and how they work (essentially). My next post(s) will deal with my trials and tribulations of using RRD data for BI purposes using SQL Server Business Intelligence Development Studio.