I was pleased to read about the progress of Graylog2, ElasticSearch, Kibana, et al. in the past year. Machine data analysis has been a growing area of interest for some time now, as traditional monitoring and systems management tools aren’t capable of keeping up with All of the Things that make up many modern workloads. And then there are the more general purpose, “big data” platforms like Hadoop along with the new in-memory upstarts sprouting up around the BDAS stack. Right now is a great time to be a data analytics person, because there has never in the history of computing been a richer set of open source tools to work with.
There’s a functional difference between what I call data processing platforms, such as Hadoop and BDAS, and data search presentation layers, such as what you find with the ELK stack (ElasticSearch, Logstash and Kibana). While Hadoop, BDAS, et al. are great for processing extremely large data sets, they’re mostly useful as platforms for people Who Know What They’re Doing (TM), ie. math and science PhDs and analytics groups within larger companies. But really, the search and presentation layers are, to me, where the interesting work is taking place these days: it’s where Joe and Jane programmer and operations person are going to make their mark on their organization. And many of the modern tools for data presentation can take data sets from a number of sources: log data, JSON, various forms of XML, event data piped directly over sockets or some other forwarding mechanism. This is why there’s a burgeoning market around tools that integrate with Hadoop and other platforms.
There’s one aspect of data search presentation layers that has largely gone unmentioned. Everyone tends to focus on the software, and if it’s open source, that gets a strong mention. No one, however, seems to focus on the parts that are most important: data formats, data availability and data reuse. The best part about open source analytics tools is that, by definition, the data outputs must also be openly defined and available for consumption by other tools and platforms. This is in stark contrast to traditional systems management tools and even some modern ones. The most exciting premise of open source tooling in this area is the freedom from the dreaded data roach motel model, where data goes in, but it doesn’t come out unless you pay for the privilege of accessing the data you already own. Recently, I’ve taken to calling it the “skunky data model” and imploring people to “de-skunk their data.”
Last year, the Red Hat Storage folks came up with the tag line of “Liberate Your Information.” Yes, I know, it sounds hokey and like marketing double-speak, but the concept is very real. There are, today, many users, developers and customers trapped in the data roach motel and cannot get out, because they made the (poor) decision to go with a vendor that didn’t have their needs in mind. It would seem that the best way to prevent this outcome is to go with an open source solution, because again, by definition, it is impossible to create an open source solution that creates proprietary data – because the source is open to the world, it would be impossible to hide how the data is indexed, managed, and stored.
In the past, one of the problems is that there simply weren’t a whole lot of choices for would-be customers. Luckily, we now have a wealth of options to choose from. As always, I recommend that those looking for solutions in this area go with a vendor that has their interests at heart. Go with a vendor that will allow you to access your data on your terms. Go with a vendor that gives you the means to fire them if they’re not a good partner for you. I think it’s no exaggeration to say that the only way to guarantee this freedom is to go with an open source solution.