Panoply.io’s CTO and co-founder, Roi Avinoam, is an experienced software technologist who is passionate about all things data. We recently checked in with Roi to get his insight on the world of Big Data. Here’s what he shared, including some timely perspective on how machine learning and natural language processing can be used in Big Data management.
Tell us about the mission behind Panoply.io. How are you hoping to impact the world of Big Data?
The Big Data space is wonderfully complex and constantly evolving. Over the past decade or so, dozens of new technologies have emerged that allow even small companies to dive deep into their data and extract insights that were nearly impossible to obtain in the past. When Hadoop came out, it opened a chest full of goodies: new ideas and applications, from straightforward data science to sophisticated deep learning, AI, predictive analytics and more.
As a technologist, I’ve been deeply involved in setting up big data architectures at several companies and had to learn all of the tiniest details and caveats. With every release of a new technology or set of capabilities, the world grew larger, making more things possible. But with every change, the complexity of tapping into those possibilities grew as well.
Today, it’s nearly impossible to keep track of, and fully utilize, the capabilities available unless you’re dedicated full-time to big data architectures. It has become far too complex. You’d have to get very familiar with the latest Hadoop changes, as well as Kafka, Spark, TensorFlow, Elasticsearch, Redshift, BigQuery and so on. You’d have to stay informed on the latest bug fixes, security discussions and IT best practices, like monitoring, backups and scaling. You’d have to attend conferences, try out different technologies and read a ton of documentation.
For many teams, that’s great. That’s what they do. But for the vast majority of people, Big Data is just a means to an end: learning more from their data. They don’t have the bandwidth or the desire to invest so much in a space outside of their core business.
That’s where Panoply.io comes in. Panoply automates and optimizes all of the tasks involved in keeping your data tidy and well formatted, from system and database administration to data engineering and data science. Instead of hiring teams of engineers and spending months learning and deploying your data architecture, you just use Panoply and get started in minutes.
How can machine learning and natural language processing be applied to Big Data management? What are the advantages of using these tools?
The problem with all of the above is: how do you take a complex problem, one that demands a lot of brain power from high-end engineers, and automate it? Big Data management is unique to each team, and although the Pareto principle applies, handling all of the nuances and requirements of every business can still be enormously complicated.
Instead of forcing our users to configure everything through endless settings screens, we’ve taken a different approach: learning it ourselves. Our algorithms use many AI techniques to imitate the decision-making process of experienced data engineers, in order to deliver an architecture that’s just as advanced without requiring the same number of man-months.
For example, we use statistical analysis to track how our users access their data. Which tables are they most interested in? Which columns? Which data points? How are they joining their data? Which queries are repetitive? What kinds of anomalies exist in the data?
Based on these measures, we can modify the underlying data model, and then use reinforcement learning to improve our algorithms and get better over time. It’s like having a robotic data engineer working 24/7, trying out different things, learning, measuring and iterating until it arrives at the perfect setup. And when it does, it waits until a new set of data or queries comes in and starts all over again.
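The access-pattern tracking described above can be illustrated with a minimal sketch. This is a hypothetical example, not Panoply’s actual code: it counts table references in a query log and flags frequently hit tables as candidates for optimization (sort keys, pre-aggregation and so on).

```python
import re
from collections import Counter

# Hypothetical query log; a real system would read this from the
# warehouse's audit or query-history tables.
query_log = [
    "SELECT user_id, total FROM orders JOIN users ON orders.user_id = users.id",
    "SELECT user_id, total FROM orders WHERE total > 100",
    "SELECT country, COUNT(*) FROM users GROUP BY country",
]

table_counts = Counter()
for query in query_log:
    # Naive extraction of table names after FROM/JOIN; a production
    # system would use a proper SQL parser instead of a regex.
    for table in re.findall(r"(?:FROM|JOIN)\s+(\w+)", query, re.IGNORECASE):
        table_counts[table] += 1

# Tables accessed repeatedly are candidates for data-model changes.
hot_tables = [t for t, n in table_counts.most_common() if n >= 2]
print(hot_tables)  # → ['orders', 'users']
```

From here, an automated system could apply a candidate optimization, measure its effect on query times, and keep or revert it, which is where the reinforcement-learning loop comes in.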
Another example is recognizing logical relationships in the data. Users might have hundreds of semi-structured, sometimes messy data sets from tens of different sources. They’re sometimes completely lost within the data, not knowing what’s up to date or how they can join their data in meaningful ways.
With the power of NLP, we’ve been able to sort through massive data sets, often full of seemingly irrelevant data, and find gems in them. By comparing different columns and values within the data, we can identify data points that share a logical connection. That, combined with the ability to generate automated aggregation views, yields both great insight and a sensible data model that can be used and improved.
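One simple way to compare columns across data sets, sketched below under the assumption of plain value-overlap scoring (Panoply’s actual algorithm is not public): columns whose distinct values overlap heavily are likely join keys or logically related fields.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Two hypothetical columns from different messy sources.
crm_emails = ["a@x.com", "b@x.com", "c@x.com", "d@x.com"]
billing_contacts = ["b@x.com", "c@x.com", "d@x.com", "e@x.com"]

score = jaccard(crm_emails, billing_contacts)
print(round(score, 2))  # 3 shared values out of 5 distinct → 0.6
if score > 0.5:
    print("Columns look related: suggest joining on this pair")
```

Running this pairwise over sampled column values gives a ranked list of likely relationships, which can then seed suggested joins or aggregation views.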
In Part 2, Roi addresses the way businesses manage data, Big Data management challenges and how to grapple with them by building a data-driven culture.
It’s time to put your legacy data to work for you. Plan and launch successful data lake projects with tips and tricks from industry experts – download our newest Building a Data Lake checklist report today!