Morbi ut sapien. Morbi arcu mauris, suscipit congue, placerat sit amet, suscipit a, ante. Donec aliquet dui ac nunc. Mauris magna quam, aliquet quis, viverra eu, fringilla eget, purus. Donec tristique pretium sem.
To do any sort of business planning, you need forecasting. Our reason for existing is to help our customers drive success from their data, so, around two years ago we began to research adding automated forecasting to our cloud analytics platform.
This did however present a huge challenge. Take the typical lifecycle of a forecasting project. It’s an iterative process of refining a model, generating a forecast and evaluating the results to understand how to further refine the model. This cycle is repeated until you are happy with the results.
Evaluating the results and refining the model are usually manual tasks. They need an analyst with a good understanding of the model and the business or process that generates the data. That’s a pretty specific skillset.
That’s fine if you are creating one forecast for one organisation. We, however, want to scale our business rapidly, taking on lots of different customers. It’s not practical for us to recruit large numbers of highly skilled analysts in a short space of time. So, we need to automate as much of the process as we can.
We looked at a number of solutions. An obvious first choice, as we already use AWS, was the Amazon AI platform SageMaker, which includes DeepAR. DeepAR is a ML forecasting model based on a recurrent neural network with long-short term memory cells allowing it to model trends and seasonality. Importantly for us, the platform offers automatic model tuning. So, you run your data through the model numerous times, and it automatically selects hyperparameters to improve the forecast.
We also looked at Facebook Prophet. Unlike DeepAR, it’s a statistical model, similar to the generalised additive model (GAM). It copes particularly well with trends, seasonality and holidays, each being the additive components for the model. It’s also open source, and available in Python or R, so was easy for us to integrate into our platform.
Finally, we also considered Amazon Forecast. As a fully managed service on AWS, you upload your data and it uses automated machine learning (AutoML) to choose and refine an appropriate model. It includes a growing number of algorithms, including ARIMA, DeepAR+ (an enhancement of DeepAR), Prophet, ETS and NPTS.
We began the project using real world data from one of our existing customers, a jewellery retailer. We chose them primarily because they were enthusiastic to take part, but also their data was particularly challenging to forecast. The nature of their sales means the data is noisy, outliers are often high value sales that occur randomly, yet account for a large proportion of the value so cannot be discounted. The sales are also highly seasonal, with the 5 weeks leading up to Christmas accounting for as much as 50% of the sales.
We trained the models on three years data, from 2015-2017, then tested the resulting forecasts against the real-world results for 2018. For comparison we had the manual forecasts the company had used at the time, but unfortunately this was only detailed enough for the fairly rudimentary measure of the total difference over 12 months.
On total error over 12 months all the methods we tested significantly outperformed the manual forecast. The best results were generated by Prophet, with a total error of under 300k (out of annual sales over 100 million). This compared to over 15 million error with the manual forecast. However, using more meaningful measures (RMSE and MAE) DeepAR gave us the most accurate results.
In the 9 months since we conducted this research, we’ve integrated automated forecasting into our platform. We initially chose to use Prophet, even though DeepAR was arguably more accurate. Our reasoning was that Prophet was significantly easier to configure, and less resource hungry to train. Given the relatively small increase in accuracy DeepAR gave us, the ease at which we could scale the solution with Prophet won the day. In more recent months, with the added complication of forecasting during the Covid-19 crisis, we've begun using a combination of Prophet and LightGMB, with some success.
It has been fascinating to see how quickly this field has developed in the last year. The proliferation of available models, if anything, has gathered pace. We’ve continued to test different algorithms and techniques, and the level of automation and accuracy of our forecasting increases almost daily. We are already much closer to the goal of completely automating forecasting for industry than we thought possible 12 months ago. Who knows where we’ll be in another 12 months!
Business planning can be a challenge at the best of times, but during a period of unprecedented disruption like the current Covid19 crisis, it’s particularly so. Not only is the disruption on a scale none of us have seen in our lifetime, it’s unpredictable in the extreme. Governments are changing rules with massive repercussions almost weekly (most of us would agree with good reason), and the estimates of when we might return to something like normal vary alarmingly.
So, we’ve been working with some of our customers to build forecasts to guide their business planning during the crisis. We’ve used data from the UK’s Office for National Statistics' Data Science Lab to help model the fall in business activity due to the crisis, and a number of estimates as to when lockdown will end to model the return to normal activity.
The huge advantage of automated forecasting is we can re-generate forecasts as often as we like. So, as the situation develops and new information becomes available, we can adjust our models and re-generate the forecasts. It means our customers can respond quickly and effectively to changing circumstances, which is hugely beneficial in such uncertain times.
We’ve seen first-hand how the current crisis is proving a massive challenge for our customers and how automated forecasting can help. So, as far as our own costs allow, we’re offering a few UK business’s free access to our cloud analytics and forecasting platform for a few months to help ride out the crisis.
Our resources are limited, and we will put in some of our own time and effort to help organisation on board, so we need to limit this offer to a few companies we think we can make the most difference with. The software works best for medium or large organisation who have enough data to make forecasting accurate.
If you think this could help your organisation, please just get in touch at https://inmydata.datapa.com/about/contact-us
Like most people in the IT industry of a similar age (close enough to 50 to see the hairs poking out its nostrils), I’ve seen a lot of change. That said, it’s hard to think of any technology that has the potential to change so many working practices and industries as the current developments in machine learning and data analytics. For an analytics vendor like DataPA, staying at the forefront of such developments is a constant (but exhilarating) challenge.
Bringing fresh, new ideas into the business
It’s one of the reasons we at DataPA often take on students for a summer placement. It not only brings new ideas into the business, but also makes long term recruitment of fresh talent much easier. Our experience this summer was a perfect example of why embracing new talent is always a good idea.
For many customers on inmydata*, our cloud analytics platform, forecasting is a critical requirement for their business. For instance, our retail customers need sales forecasts to set targets and deliver stock planning. So, we scoped out an R&D project to discover whether we could use machine learning to automate forecasting for our customers.
Luckily for us, The DataLab run a fantastic MSc placement project in Scotland for data science, data engineering and analytics that includes a 10 to 12 week industrial placement. Through that program we engaged with the brilliant MBN Academy, who within a week matched us with Callum, a Data Engineering MSc student from Dundee University.
Delivering automated forecasts with improved accuracy
Things progressed well over the summer. For some of us in the business it took some frantic googling to keep up. RNN, LSTM, DeepAR and Prophet!? At least we’d heard of Facebook. There was, however, no arguing with the results. Testing against data from previous years, the models Callum built could predict annual sales within 2% of the actual figure. That compared with a 15% margin of error for the manual forecasts the customer had used in practice. To put that into context, the manual forecast was over £15 million out on sales of just over £100 million, whereas the automated forecast was less than £300,000 out.
Accurate automatic forecasting on demand
Since Callum finished his dissertation with us, we’ve been integrating the models he developed into our cloud platform. We can now deliver accurate forecast models on demand, automatically building rolling forecasts on a weekly basis. It’s a development that not only offers huge opportunity to our existing customers, but also allows us to compete much more aggressively in the crowded analytics market space.
Taking part in DataLab’s MSc Placement Project this summer has been fantastic. MBN Academy made it utterly painless to recruit Callum. In the short time he was with us he was a pleasure to work with and delivered real benefit for both DataPA and our customers.
Find out more about The Data Lab’s MSc programme and how you can get involved in offering a placement opportunity.
* inmydata is an innovative cloud analytics solution. It delivers everything you need to set the right goals for your organisation, and then measure your progress in achieving them.
We’re never short of candidates for “the next big thing” in the IT industry. Not surprising given the rate of change, but even for those of us working in the industry it can be disorientating. One of the latest contenders is "Serverless Cloud Technology". So, what is it, and how the heck can you run any software without a server?
Well, you can’t. The term “serverless” refers to the developers experience of the platform, not the underlying architecture. Serverless cloud computing is simply an evolution of existing cloud technologies.
Ultimately, cloud computing is about delivering Software as a Service (SaaS), online applications such as an email or sales system. Most companies that deliver a SaaS solution don’t own the computers that host the software. Some simply rent servers from a cloud provider. This is the simplest form of cloud hosting, called Infrastructure as a Service (Iaas).
IaaS removes the cost and maintenance burden of the hardware for the developer. However, there’s still plenty of work to provision each machine and the environment to host the application. So, around 2006 cloud service providers began building platforms using a technology called containerisation. Containerisation allows multiple processes to run on a single machine, sharing the hardware resources yet remaining isolated from each other. “Platform as a service” (PaaS) meant developers only paid for the compute resources they needed (rather than whole machines) and could quickly scale those resources as their requirements grew.
Serverless cloud computing, sometimes called “Functions as a Service” (FaaS), takes this abstraction a stage further. Most applications only require compute resources in response to events. For instance, an email application only needs to do something when an email arrives or a user logs in. For large periods of time it’s just sitting around waiting for something to happen. An application on a PaaS platform needs to be permanently loaded into a container, waiting to respond to these events. Generally, a single containerised instance of the application will service many requests, and additional instances will be started to cope during busy periods. However, starting additional containers can take several minutes, so PaaS platforms can be slow to respond to spikes in traffic. On a serverless platform, the developers code is run “on demand” only when these events occur, with the compute resources required also delivered “on demand”. The developer only pays for compute resources when they are being used, not when the application is waiting for something to happen. More importantly, the application can scale virtually instantly to almost any level of demand.
Serverless computing does have its drawbacks. It imposes constraints on the design of the software, for instance all functions must be stateless, and it’s difficult to move applications between different providers. However, it does offer a very powerful new paradigm for cloud computing. One which for many developers can deliver significant cost and performance benefits.
As an analytics vendor based in Edinburgh we are not new to the world of data science. Indeed, DataPA was amongst the first software developers to enter the field of analytics over 15 years ago. In that time, we have delivered software to more than 30 thousand users in more than 300 companies, both as an embedded technology for our technology partners and as a stand-alone application to our direct customers.
Throughout that time, innovation has always been at the core of what we do. In recent years we have been building on the developments in big data processing and machine learning to ensure we remain at the forefront of the huge disruption we see coming for analytics software. Our goal is to use these technologies to automate the role of the business analyst, reducing the cost and failure rate for analytics implementations. Our first commercial product from this research went live with a customer in January, marking the start of a long road of development and innovation.
So, we are hugely excited about the leading role Scotland is playing in the development of data science. The vibrant academic and business environment attracting talent from across the world, with the support of the Scottish Government and great organisations like DataLab are the perfect environment for companies like ours to flourish. For us, DataSummit18 is an ideal opportunity to showcase what we have been working on and learn from the international talent that the event attracts. If you are going to be at DataSummit18, please stop by out stand. We have lots of real world experience to share and are hungry to learn from others in the data science community.
Choosing an embedded analytics solution can be an intimidating task. Get it right, and you’ll add huge value to your application, cementing its place with your existing customers and increasing its saleability for many years to come. Get it wrong, and you’ll waste a huge amount of time, effort and no doubt money. So, we thought we’d put together an article to help guide that decision. It’s biased, obviously, we are after all an embedded analytics vendor. But it will, we hope, ring true for any who have recently gone through the process.
First off, any solution you choose should look great. Analytics is first and foremost about visualising data so it can be understood at a glance. If a vendor can’t get that right, walk away. You should also be able to completely white label it. Embedded analytics is about improving your product, not selling someone else’s.
Second, any embedded analytics solution should be easy for the developer to use and integrate. Seamless integration with the business application needs data to be prepared at the back end, and the visualisation components to be embedded and integrated at the front end. If either of these tasks require the developer to learn a new language, or significant new syntax, it defeats the object of licensing a third-party product.
DataPA OpenAnalytics allows business logic to be expressed in a variety of languages, such as SQL, .NET or OpenEdge ABL. Moreover, our technology allows developers to hook into existing business logic already developed for the business application in any of these languages, to provide data transformation in the analytics layer. After all, why write it twice? We also make sure our training courses and documentation are tailored to specific audiences, so the examples and syntax can be lifted directly into our customers code. We don’t expect our customers to embark on weeks of training before they are comfortable developing with DataPA OpenAnalytics, and we don’t think you should consider a vendor that does.
The power of embedded analytics is providing intelligence within context. An example might be providing a sales person with information about past purchases for a product at a glance, as and when they are making a sale. To deliver this, it’s key that the analytics layer and business application are tightly coupled, with information flowing in both directions between the analytics layer and business application. This means it’s important the analytics engine has a flexible and complete API and component set, so analytics content can be tightly integrated in any client application, be it web, mobile, client, even wearable devices. Above all, it should be easy for the developer to leverage the analytics capabilities on every platform they develop for.
Delivering analytics is about more than just data visualisation. There’s a huge array of extremely impressive open source data visualisation components available today, free and simple use. If you just want to visualise simple data, why pay for a full analytics solution? To add value over and above these free components, an analytics solution must make it easy to draw data from anywhere and then blend, accumulate and transform that data to present it in the visualisation layer. It’s this analytics engine that makes the difference between simple data visualisation that allows for a pretty picture in a demo, and something that will deliver real, practical intelligence to an organisation.
Finally, and perhaps most importantly, make sure the company you choose to provide your embedded analytics combines real expertise in analytics with agility and a drive to innovate. As we’ve discussed previously in this blog, the analytics sector like most software markets is on the verge of huge change driven by the developments in big data and machine learning. Choose the right vendor and this disruption offers huge opportunities for your business and your customers.
As an analytics vendor, we’re always keen to respond to our customers’ requests. So recently, we’ve been working hard to add mapping functionality which we will be releasing to our customers in the next few months. Before that however, we thought it might be useful to look at the role of geographic mapping in dashboards and explore when and how to use them.
Maps are visually engaging and more exciting than a chart, so it’s tempting to assume that if your data is grouped by some geographical measure then you should be plotting it on a map. However, this is rarely the case. Consider the two displays below, both showing UK Regional Sales data. We already know the areas location, so plotting the data on a map doesn’t add anything. What we’re most interested in is comparing the value sold in each area. Both visualizations display this, the map with colour and the chart with the height of the bar, but it is much easier to get an instantly clear comparison from the chart.
In contrast, take the example below that maps average temperature by country. From a quick glance, we can see that the coldest regions are the large land masses to the North, Western Europe is milder than Eastern Europe and Africa is warmest continent overall. In the context of the data we’re looking at, this is all hugely useful information that would not be apparent at a glance with any other representation of the data. This is a clear example of when plotting the data on a map adds to our understanding.
For a more practical example, take the dashboard below. It’s designed to inform the decision as to where a retailer should open their next store. The measure shown on both the map and chart is the population of each state divided by the number of stores, giving a measure of the population per store for each state. Given just the chart, we may choose to open a new store in California. However, the map suggests a different decision. Plotting a circle centred on each state, with the population per store represented by the diameter of the circle, the area surrounding New York contains a high number of overlapping circles. From the map, it’s clear that locating a new store here would potentially cater for more customers than one opened in California.
Indeed, the functionality of the map would also let us progressively zoom in and perhaps choose a specific location such as New Brunswick in New Jersey.
When deciding whether to use a map to represent your data, simply ask yourself the following question; does visually expressing the physical location of the data elements add to our understanding of the data? If not, some other object would likely be a better choice. This is in fact just a specific instance of a wider rule we should apply when designing any dashboard; if a visual element does not add to our understanding of the data, it is just clutter that makes the dashboard harder to understand, and as such should not be there.
If you’d like to find out more about DataPA OpenAnalytics, or our forthcoming support for mapping, please contact us.
Apache Kafka™ is a massively scalable publish and subscribe messaging system. Originally developed by LinkedIn and open sourced in 2011, it is horizontally scalable, fault tolerant and extremely fast. In the last few years its popularity has grown rapidly, and it now provides critical data pipelines for a huge array of companies including internet giants such as LinkedIn, Twitter, Netflix, Spotify, Pinterest, Uber, PayPal and AirBnB.
So, when the innovative and award winning ISP Exa Networks approached us to help deliver a live analytics solution that would consume, analyse and visualise over 100 million messages a day (up to 6.5 thousand a second at peak times) from their Kafka™ feed, it was a challenge we couldn’t turn down.
The goal was to provide analytics for schools who used Exa’s content filtering system SurfProtect®. Information on every web request from every user in over 1200 schools would be sent via Kafka™ to the analytics layer. The resulting dashboards would need to provide each school with a clear overview of the activity on their connection, allowing them to monitor usage and identify users based on rejected searches or requests.
The first task was to devise a way of consuming such a large stream of data efficiently. We realised some time ago that our customers would increasingly want to consume data from novel architectures and an ever-increasing variety of formats and structures. So, we built the Open API query, which allows the rapid development and integration of bespoke data connectors. For Exa, we had a data connector built to efficiently consume the Kafka™ feed within a few days.
The rest of the implementation was straight forward. DataPA OpenAnalytics allows the refreshing and data preparation process for dashboards to be distributed across any network, reducing the load on the web server. In Exa’s case, a single web server, and a single processing server are sufficient to allow the dashboards to be constantly refreshed, so data is never more than a few minutes old. To help balance the process, the schools were distributed amongst 31 dashboards, and filtered to a single school as the user logs in.
The final solution gives each school a dashboard, with data never more than a few minutes old, showing figures that accumulate over the day. Each dashboard allows the school to monitor web traffic and any rejected requests or searches on their connection.
We're really excited about what we delivered for Exa Networks, and think with the versatility and scalability the latest release of DataPA OpenAnalytics offers, we can achieve even more. If you have large amounts of data, whether from Kafka™ or any other data source, and would like to explore the possibility of adding live analytics, please get in touch, we'd love to show you what we can do.
The guiding principal when designing any dashboard should be to ensure your users understand key information at a glance. Ideally, a user taking their first look at your dashboard should be able to understand key information within a few seconds. Achieving this is not rocket science, simply applying a few simple principles to your decision making will transform your dashboards.
First off, make sure your dashboard is focused. If I’m looking at a dashboard that contains information for five different roles in the organisation, I need to filter out or navigate round views for all the others to get to the information that is relevant to me. That’s going to take more than a few seconds. Step one of building a dashboard should be to decide who the dashboard is for and to understand in detail what information is key to their performance. Limit your dashboard to just this information. Remember, five simple dashboards are always more useful than one complex one.
Next, always avoid clutter. The more complex the view you offer the user, the longer it will take for them to glean the information they require. Carefully consider every object and ask yourself “do I really need this? Is the information this object represents unique and important?”. If not, it’s just clutter, get rid of it.
A little more daunting at face value, but simple in practice, is the concept of using visual clues to help the user quickly recognise what they are looking at. There are two principles of design that are particularly useful with dashboards, similarity and proximity. Let’s take similarity first. Say I have a sales dashboard that shows total sales and profit in several different charts, say by sales rep, region and date. Design principles tell us that things that are similar are perceived to be more related than things that are dissimilar. As such, if I make sure that total sales is always plotted in blue and profit in green, the user is likely to recognise these values across the different charts quickly, without having to read the legend. This principle applies to more than just colour. For instance, I may always plot charts that show sales over time as line charts, sales by region as bar and sales by person as column. A second design principle, proximity, tells us things that are close to one another are perceived to be more related than things that are spaced farther apart. Implementing this is simple, make sure you place objects that are related close together, giving the user another visual clue as to their meaning.
The final tip for creating great dashboards is to think about when the data needs to be refreshed, and let this inform the design of your dashboard. An operational dashboard is likely to require real time data, so keep it simple so it updates fast. A strategic dashboard is more likely to be updated periodically, so you can afford to (and often want to) add more detail.
There are obviously many more considerations when building dashboards, not least your choice of visual objects, when and where to prepare data and much more. However, these are more particular decisions that deserve a blog in their own right. My hope is that in following the simple design principles above, you’ll quickly be creating more effective dashboards.