by Thomas Sandholm, Architect @BlueFox.IO
We discuss lessons learned from scaling our analytics backend using state-of-the-art time-series database technology. InfluxDB has a lot to offer, if used the right way. We take you through some of our observed sweet spots and pitfalls.
We were having some scalability challenges with our existing analytics backend, which comprised a wild combination of Cassandra, Elasticsearch, MySQL, and Redis.
Disks were filling up, and even databases running on some of the most powerful AWS instance flavors were having performance issues. To add insult to injury, we also needed to scale fast to meet customer demand, without increasing the already astronomical AWS bill.
Cassandra and Elasticsearch are great tools, but for our particular use case they weren’t exactly right for the job. At the very least, they weren’t able to provide the full solution. A lot of time was spent sending data back and forth between the database and our application, so that we could do our custom analytics and then write data back to serve queries.
After a reevaluation of our core features, it became clear that a simple time-series database would give us almost all the functionality we needed, while still keeping most of the processing within the database server. Enter InfluxDB.
We are generally very happy with InfluxDB; it has run in production for six months without any issues. The main benefit is resource efficiency: we can achieve a lot with a very small resource footprint.
It comes as no surprise that time-slot-aggregated data, i.e. sums of metrics in hourly and daily buckets, is where InfluxDB shines. This ability to efficiently aggregate time series with a simple query was well worth the migration alone.
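As a sketch of what that single query looks like, here is an InfluxQL hourly rollup built as a Python string; the `events` measurement and `visits` field names are hypothetical, not taken from our actual schema:

```python
# Hypothetical InfluxQL rollup: sum a field into hourly buckets.
# Measurement ("events") and field ("visits") names are illustrative only.
def hourly_sum_query(measurement: str, field: str, hours: int = 24) -> str:
    return (
        f"SELECT SUM({field}) FROM {measurement} "
        f"WHERE time > now() - {hours}h "
        f"GROUP BY time(1h) fill(0)"
    )

print(hourly_sum_query("events", "visits"))
```

The `fill(0)` makes empty buckets explicit, which is usually what a charting front end wants.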
At a close second comes InfluxDB’s retention policies. As your product matures and the infrastructure scales up with demand, it’s great to have an easy knob to adjust data retention up or down and avoid catastrophic disk-full crashes. In essence, you create a retention policy for a class of data, e.g. customer visit frequency, then set how long you want to keep data tagged with this policy. Sounds simple, and it is. Considering the alternative of fiddling with TTL configurations in application code, the usefulness of this feature cannot be overstated.
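The knob really is a one-liner. Below is the InfluxQL statement shape, built as a string for illustration; the policy and database names are made up for the example:

```python
# Hypothetical retention-policy DDL for InfluxDB, built as a string.
# Policy name, database name, and duration are illustrative assumptions.
def create_retention_policy(name: str, db: str, duration: str,
                            default: bool = False) -> str:
    stmt = (f'CREATE RETENTION POLICY "{name}" ON "{db}" '
            f"DURATION {duration} REPLICATION 1")
    return stmt + " DEFAULT" if default else stmt

print(create_retention_policy("visits_90d", "analytics", "90d", default=True))
```

Dropping old data then happens inside the server, with no TTL bookkeeping in application code.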
Another life-saver for us was the INTO clause. You can run a query and write the results directly back into the database without round-tripping to the application client. As I mentioned above, that round trip used to be a common pattern for us, so this feature alone improved our processing pipeline time by an order of magnitude.
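The shape of such a write-back query is simple; in this sketch the source and destination measurement names are hypothetical, and the trailing `, *` preserves all tags when copying the rollup into the new measurement:

```python
# Hypothetical InfluxQL rollup that writes its result straight back into
# the database via the INTO clause, with no application round trip.
def rollup_into_query(src: str, dst: str, field: str) -> str:
    return (
        f"SELECT SUM({field}) AS {field} INTO {dst} FROM {src} "
        f"GROUP BY time(1d), *"
    )

print(rollup_into_query("visits", "daily_visits", "count"))
```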
The final observed benefit of InfluxDB is what sits under the hood. All data are sharded to allow for parallelism and scale-out, using the same core storage technology found in most popular NoSQL databases today (Cassandra, LevelDB, MongoDB, RocksDB): log-structured merge trees (LSM trees). In InfluxDB, they are aptly called time-structured merge trees. This means that similar scalability designs work well, write speeds are fast, and access to data in the same shard is efficient. Combine this with a schema-less data design and you have a winning configuration: you add tables (called measurements) and columns as you go, and only need to create the database itself, which doesn’t require any schema.
So performance is generally great, but there are definitely lots of opportunities to mess things up along the way, which leads us to…
If your shard layout is not designed properly, performance will at some point start to suffer. As I mentioned, InfluxDB shards all data akin to other LSM-tree databases, but by default a new shard is created each week and data are kept forever (infinite retention). Depending on your ingest load, write times will eventually suffer under such a configuration; on the other hand, being too aggressive in splitting the data into shards (done through shard durations) can render queries astonishingly slow. So it’s a tradeoff: the sweet spot is where shards are small enough to keep writes fast while still serving most queries from a single shard.
Another concern is that shard configuration has an interesting dependency on retention periods. When data are discarded because the retention period expires, the entire shard is dropped at once. Therefore, if shards are too large (say, one month) and you have a retention period of one month, up to two months of data have to be kept on disk.
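That worst case is simple arithmetic: a shard is only dropped once every point in it has expired, so data can sit on disk for up to the retention period plus one full shard duration. A minimal sketch:

```python
# Worst-case window of data kept on disk: retention period plus one full
# shard duration, since a shard is dropped only when ALL its points expire.
def worst_case_on_disk_days(retention_days: int, shard_days: int) -> int:
    return retention_days + shard_days

# One-month retention with one-month shards keeps up to two months on disk.
print(worst_case_on_disk_days(30, 30))  # 60
```

This is why shrinking shard durations also shrinks the disk-space overshoot, at the query-speed cost described above.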
Another drawback is that the open source version does not come with support for distributed deployments (clustering), in contrast to tools like Cassandra and Elasticsearch, despite the fact that the underlying database was designed for it. You need to upgrade to the paid InfluxEnterprise or InfluxCloud offerings to distribute your database across nodes.
This is a bit of a selfish item, because our use case depends heavily on it: the lack of support for the histogram function. It is a great feature that was available in v0.8 but has been dropped ever since, including in the recently released v1.3. It would have saved us lots of headaches, but since we also wanted all the performance improvements of versions 0.9+, downgrading was not an option. We dabbled around a bit with percentiles, but they generated graphs that were hard for our users to understand, and doing a mathematical conversion is not feasible unless you have very smooth distributions and many data points, which in turn kills performance. We ended up with a somewhat restricted custom solution, but we would still move back to the histogram feature in a heartbeat if it’s ever reintroduced.
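Our custom solution isn’t described here, but the general shape of a client-side fallback is to pull raw values and bucket them in the application; a minimal sketch, with the bucket width as an illustrative assumption:

```python
from collections import Counter

# Minimal client-side histogram over raw values, a generic stand-in for
# the dropped HISTOGRAM function. Bucket width is a tuning assumption.
def histogram(values, bucket_width=10):
    counts = Counter((v // bucket_width) * bucket_width for v in values)
    return dict(sorted(counts.items()))

print(histogram([3, 7, 12, 15, 27], bucket_width=10))  # {0: 2, 10: 2, 20: 1}
```

The obvious drawback is exactly the round-tripping of raw data that InfluxDB otherwise lets you avoid, which is why a server-side histogram would be so welcome.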
The other gripe we have is that we’d like to use the Cloud offering, but its backup retention policies are too limited. We don’t want to keep all the data in the live InfluxDB database, as it impairs performance (see the discussion above on shards and retention policies) and forces us to buy the more expensive live clusters with increased disk space. Instead we want to keep backups of archived data in something like S3, which allows us to do one-off analysis of old data for R&D purposes. Again, we ended up implementing our own solution that does exactly that.
So What’s the Verdict?
One of the main lessons learned when scaling our solution was to introduce Nagling between our application and InfluxDB. Buffering measurement points on a per-sensor-stream basis, and then writing them in a single batch when the buffer fills up, allowed us to improve write throughput by up to 10x. Because our application containers (in AWS ECS) are stateless, we had to implement buffering with an external persistence service. Originally we used Memcached (appends), but we then switched to Redis (lists), as it was more reliable and had no impact on performance.
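A minimal in-memory sketch of that batching layer is below; in production the per-stream buffers lived in Redis lists (e.g. RPUSH) rather than a Python dict, and `write_batch` would be the real InfluxDB batch write. Names and the batch size are illustrative:

```python
from collections import defaultdict

# Nagling sketch: buffer points per sensor stream and flush each buffer
# as one batch write when it fills. `write_batch` stands in for the real
# InfluxDB write call; batch_size is a tuning assumption.
class BatchWriter:
    def __init__(self, write_batch, batch_size=100):
        self.write_batch = write_batch
        self.batch_size = batch_size
        self.buffers = defaultdict(list)  # stream id -> pending points

    def add(self, stream, point):
        buf = self.buffers[stream]
        buf.append(point)
        if len(buf) >= self.batch_size:
            self.write_batch(stream, buf)
            self.buffers[stream] = []

batches = []
writer = BatchWriter(lambda s, b: batches.append((s, list(b))), batch_size=3)
for i in range(7):
    writer.add("sensor-1", i)
# two full batches flushed; one point remains buffered
```

Moving the buffer out of process (Memcached, then Redis) is what made this safe with stateless ECS containers: any container can append to a stream’s buffer, and a flush drains it atomically.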
Another major performance breakthrough came when we started caching InfluxDB query results in our front-end database cache for semi-structured retrieval of sensor summary statistics. This architecture allowed our customer-facing application UIs to be performance-isolated from the data ingest and analytics processing services.
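The essence of that isolation is a TTL-keyed cache sitting in front of the query call; a hedged sketch, where the TTL, the key scheme (the raw query string), and `run_query` are all illustrative assumptions rather than our exact implementation:

```python
import time

# Tiny TTL cache in front of a query function, isolating UI reads from
# the ingest/analytics path. The 60 s TTL is an illustrative choice.
class QueryCache:
    def __init__(self, run_query, ttl=60.0):
        self.run_query = run_query
        self.ttl = ttl
        self.entries = {}  # query string -> (expiry time, result)

    def get(self, query):
        now = time.monotonic()
        hit = self.entries.get(query)
        if hit and hit[0] > now:
            return hit[1]  # fresh cached result; backend untouched
        result = self.run_query(query)
        self.entries[query] = (now + self.ttl, result)
        return result

calls = []
cache = QueryCache(lambda q: calls.append(q) or len(calls), ttl=60)
cache.get("SELECT ...")
cache.get("SELECT ...")  # served from cache; backend hit only once
```

With the cache absorbing repeated UI reads, a spike in dashboard traffic never translates into extra load on the InfluxDB instance doing ingest and analytics.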
In summary, despite the tongue-in-cheek buildup towards the “ugly,” we are actually very happy with InfluxDB, for its design, features, performance, and reliability. That said, one can always wish for more!
This article was originally posted on Medium