Michael Bironneau, Data Scientist, Open Energi
At Open Energi, we think of our service as an automated, virtual power station. Whenever the electric grid experiences sudden, unforeseen surges in supply or demand, assets under the control of our Dynamic Demand algorithm automatically pick up the slack – just like a power station would but cheaper and cleaner.
To prove that we’ve delivered this service, and to keep it running optimally, we need to analyse large amounts of data relatively quickly. We’re also making our service smarter so that more assets will be able to participate in Dynamic Demand than before. This is where Big Data and Hortonworks Data Platform come in.
Big Data is a phrase that has been floating around companies like Google for the last two decades. It has never really had a precise definition, but when used colloquially it usually means that someone somewhere is running out of space for their data and/or computing power to analyse it. Sometimes it also means that the data is such a mess that it will take a sentient super-robot to make sense of it (thankfully, that’s not our case).
In the context of Dynamic Demand, data is our greatest asset: it tells us when we’ll be able to turn a certain asset on or off without disrupting its primary function, which could be critical to an industrial process. It also allows us to prove to National Grid that an asset we claimed participated in Dynamic Demand actually modified its power consumption to help balance the grid. We want to do more with our data: understand our portfolio of assets better, reduce integration difficulties with assets and accept new assets into our portfolio that don’t meet certain technical requirements. This means running more analyses on live streams of data and integrating many additional datasets together – for us, these problems are Big Data because we can’t cost-effectively tackle them at scale with our current systems.
Hortonworks Data Platform (HDP) is a collection of open source software built on and around Apache Hadoop and designed to deal with Big Data. It is developed partly by large companies like Yahoo and eBay, and partly by the community. Hortonworks’ added value lies in the way the tools are configured to work together as one, and in the support the company provides. They employ core contributors to various projects as well as technical experts, so help is rarely more than a couple of emails away.
One of the most widely known tools in the HDP toolbox is Apache Hive. This tool excels at integrating different types of data and allowing them to be queried as one, spreading the computational cost of the query over as many machines as we can get our hands on. We’re planning on using this for most of our ad-hoc analysis and some batch jobs. Because it is easy to extend with custom logic, we can program Open Energi analytics straight into it. For example, we can call into our Python code, which contains functions we may want to evaluate over various pieces of data, without worrying about how that data is stored or whether it is even coming from a single source. Because we deal with a lot of time series data, we also need to perform resampling operations on a regular basis – these can be painful in regular query languages such as SQL, but Hive’s extensibility makes it a breeze.
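To make the resampling idea concrete, here is a minimal sketch of the kind of Python function that could be plugged into a query. It is not our production code. It assumes rows arrive as tab-separated text lines, which is how Hive's TRANSFORM clause streams rows to an external script, and the column layout (asset id, epoch seconds, power in kW) and the name `resample_mean` are purely illustrative.

```python
from collections import defaultdict


def resample_mean(lines, bucket_seconds=60):
    """Average power readings per asset within fixed-width time buckets.

    Each input line is assumed to look like
    'asset_id<TAB>epoch_seconds<TAB>power_kw' (a hypothetical layout).
    Yields 'asset_id<TAB>bucket_start<TAB>mean_power_kw' lines.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (asset, bucket) -> [total, count]
    for line in lines:
        asset_id, ts, power = line.rstrip("\n").split("\t")
        # Snap the timestamp down to the start of its bucket.
        bucket = int(ts) // bucket_seconds * bucket_seconds
        acc = sums[(asset_id, bucket)]
        acc[0] += float(power)
        acc[1] += 1
    for (asset_id, bucket), (total, count) in sorted(sums.items()):
        yield f"{asset_id}\t{bucket}\t{total / count}"
```

A thin wrapper that feeds `sys.stdin` into this function and prints each result line would be enough for a streaming query to call it. The function itself neither knows nor cares where the rows came from.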
For our low-latency applications, Hortonworks package a piece of software called Apache Storm. This software is designed to run an entire graph of computations on a live stream of data, adding reliability in case parts of the graph fail. For example, when a device sends us a power reading, we can correlate it with Dynamic Demand state, train machine learning models and update a cache powering a live dashboard – all without leaving Storm.
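The shape of such a computation can be sketched in plain Python. This is not the Storm API, and it leaves out everything Storm adds around distribution and fault tolerance. The names (`process_reading`, `RunningMean`, the `device` and `power_kw` fields) are hypothetical, and a running mean stands in for a real machine learning model.

```python
class RunningMean:
    """Stand-in for an incrementally trained model."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental mean: no need to keep past readings around.
        self.n += 1
        self.mean += (x - self.mean) / self.n


def process_reading(reading, dd_state, model, cache):
    """Run one reading through three steps: enrich, learn, publish."""
    # Step 1: correlate the reading with the device's Dynamic Demand state.
    state = dd_state.get(reading["device"], "unknown")
    enriched = dict(reading, dd_state=state)
    # Step 2: update the model with the new observation.
    model.update(enriched["power_kw"])
    # Step 3: refresh the cache entry that a dashboard would read.
    cache[enriched["device"]] = {
        "power_kw": enriched["power_kw"],
        "dd_state": state,
        "mean_kw": model.mean,
    }
    return enriched
```

In a streaming engine each of the three steps would be its own node in the graph, so that a failure in one can be retried without redoing the others.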
Hortonworks take security very seriously and so do we. HDP comes with tools such as Apache Knox and Apache Ranger that deal with questions of who should have access to which piece of data. Even though enterprise-grade security is a relatively new concept in the world of Big Data, HDP has fully caught up: the core systems now support transparently encrypting data in motion and at rest. Central management in Ranger lets us define security policies that comply with our business requirements and the high standards that are expected of us.
For future use, we’re excited about Apache Kylin, a brand new piece of software originally created by eBay. While not yet part of HDP, it is built on the same software ecosystem and can easily be integrated. Kylin allows for a different type of data modelling using metrics (or KPIs) and dimensions (e.g. “client”, “date” or “type of load”). Roughly speaking, the engine stores pre-computed aggregates of the metrics over the space spanned by the various dimensions. For example, suppose a metric is “power consumption” while the dimensions of interest are “time” and “asset type”. Kylin could return answers almost immediately to questions such as “what was the mean power consumption last month for all assets of type ‘bitumen tank’?” At Open Energi we have many KPIs we want to keep track of and drill into – this is something Kylin should be excellent at managing for us.
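The idea of pre-computing aggregates over dimensions can be illustrated with a toy example. This is only a sketch of the general technique, not Kylin's storage format or API, and the function names and sample field names (`month`, `asset_type`, `power_kw`) are assumptions made for the illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, metric):
    """Pre-compute (sum, count) of `metric` for every combination of dimensions."""
    subsets = [
        combo
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]
    cube = defaultdict(lambda: [0.0, 0])
    for row in rows:
        for dims in subsets:
            # One cell per distinct set of (dimension, value) pairs.
            key = frozenset((d, row[d]) for d in dims)
            cell = cube[key]
            cell[0] += row[metric]
            cell[1] += 1
    return dict(cube)


def mean(cube, **filters):
    """Answer a mean query with a single lookup, without touching raw rows."""
    total, count = cube[frozenset(filters.items())]
    return total / count
```

With a cube built from a month of readings, a call such as `mean(cube, month="2015-09", asset_type="bitumen tank")` is a dictionary lookup rather than a scan. The cost is paid up front, and it grows quickly with the number of dimensions, which is why a real engine has to be selective about which combinations it materialises.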
Finally, we’re looking into a new Hortonworks product called Hortonworks DataFlow (HDF). Originally created by the NSA, then contributed to the Apache Software Foundation under the name Apache NiFi, this project does what it says on the tin: it helps create and manage flows of data. It solves many technical problems, such as what to do when one component in the dataflow can’t keep up with the volumes of data it’s receiving, or how to prioritise which data to send at a given time. While our bespoke systems already solve many of these issues, HDF can do more, such as querying data that lives on individual devices without them ever having to send the raw data back. We’re always looking for ways to get more out of ephemeral data that lives on assets in the field but never gets sent back to our database, so we’re looking forward to trying this out.
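The two dataflow problems mentioned above, a component that can’t keep up and deciding which data goes first, can be sketched with a bounded priority buffer. This is a toy illustration of the general idea rather than how NiFi implements it, and the class and method names are hypothetical.

```python
import heapq
import itertools


class BoundedPriorityBuffer:
    """A fixed-capacity buffer that hands out the most urgent item first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []
        # Tie-breaker so items with equal priority leave in arrival order.
        self._counter = itertools.count()

    def offer(self, priority, item):
        """Try to enqueue; return False to signal back-pressure when full."""
        if len(self._heap) >= self.capacity:
            return False
        heapq.heappush(self._heap, (priority, next(self._counter), item))
        return True

    def take(self):
        """Remove and return the item with the lowest priority number."""
        _, _, item = heapq.heappop(self._heap)
        return item
```

A producer that gets `False` back from `offer` knows the downstream side is saturated and can slow down, hold the data locally or drop it, depending on how important that data is.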