Ideas, Insights, Innovation at DataWorks 2018 – Advances in Leveraging Hadoop and the Cloud

DataWorks 2018 in San Jose drew 2,100 attendees from 32 countries and introduced a new format, showcasing product demos in the newly conceived Expo Theater and Demopalooza sessions.

John Kreisa, VP of Marketing at Hortonworks, kicked off the DataWorks conference by getting into the spirit of the World Cup and reciting some great statistics from the last tournament: 3.2 billion TV viewers in 2014 and 32.1 million tweets during the final match.

We heard about the innovations in Hadoop 3.0, which will be released as part of HDP 3.0 later in the year. The improved scaling, scheduling, usability, deployment and resource utilization are discussed in some detail in the Hortonworks blog series titled Data Lake 3.0.

Demopalooza featured exciting demos of the new DataPlane services Data Steward Studio, Data Analytics Studio, and Cloudbreak, along with the announcement of Streams Messaging Manager (SMM). SMM will be a Kafka management tool that aims to ‘cure Kafka blindness’ by exposing details on producers, consumer groups, and topics, and how they all relate.

Ideas, Insights, and Innovation

In keeping with the Ideas, Insights, Innovation theme at DataWorks, there were many good discussions on how artificial intelligence and machine learning are transforming the way we do business. Rob Bearden, CEO of Hortonworks, argued that this is the biggest business model transformation since the Industrial Revolution.

Kevin Slavin from The Shed argued in his keynote speech that the best intelligence may come from a ‘human in the loop’ approach that extends human intelligence with machine intelligence.

Praveen Kankariya, CEO of Impetus, made the point that algorithms can be bought, but your data typically cannot. According to him, the share of mainstream firms fearing competition from data-driven upstarts went from 47% to 79% in the past year.

Brian Hopkins, VP and Principal Analyst at Forrester, shared some interesting thoughts. He argued that innovations such as social media have put the customer in charge. Insights-driven firms are growing at 27% a year, compared to annual GDP growth of 3.5%. He also estimated that the pace of business is now 2 to 3 times faster than it was 5 years ago.

The Cloud

Leveraging the Cloud is a crucial step in this journey. Brian Hopkins and Arun Murthy, co-founder and CPO of Hortonworks, talked about evolving the Data Lake into a Data Fabric with shared metadata, governance, and security policies.

Kevin Bates, VP of Enterprise Execution at Fannie Mae, presented a session at DataWorks on their reasons for moving to the cloud. He noted that artificial intelligence and machine learning are empowering new ideas. Using the cloud allowed him to give the team a wide choice of tools without increasing the complexity of managing internal IT. Brian Hopkins had called this ‘going from DevOps to NoOps’: letting someone else operate the infrastructure. The cloud also gave Fannie Mae opportunities to share data among projects that needed the same information processed in a similar way. This reduced redundancies, drove efficiency, and enabled them to create a well-organized, fully governed data lake in the cloud.

Kevin Bates also pointed out that this strategy is new and evolving. His advice for the audience at DataWorks was to bring in partners who can think end to end. At Syncsort we’ve also seen the importance of planning ahead in our customer deployments. During her interview on theCUBE, Syncsort CTO Dr. Tendü Yoğurtçu talked about the value Syncsort brings to enterprises by optimizing existing infrastructure, assuring data security and availability, and advancing the data by integrating it into next-generation analytics platforms. As we see our customers embark on their cloud journeys, future-proofing their applications is a big concern.

Syncsort CTO on theCUBE talks about data-driven trends including cloud, data governance, AI and machine learning.

A Real World Challenge

A good example of this concern was an anti-money-laundering use case presented in a DataWorks session by Dr. Yoğurtçu. A global bank wanted to leverage machine learning to create a high-performing, scalable solution to comply with anti-money-laundering regulations. The solution needed to be cloud-ready since the bank plans to move to a hybrid system.

One of the challenges was consolidating scattered and difficult-to-access datasets, including mainframe data. Syncsort’s DMX-h™ made it easy to ingest data in bulk, and was unobtrusive on the data source systems. Any required conversions were performed during ingestion. Ian Downard from MapR pointed out in his session that there is a high turnover rate for data scientists and machine learning specialists. Part of their frustration is having to spend time preparing data. To delight them, you need a good strategy for data preparation.

Another challenge faced by the bank was around cleansing the data. If the training algorithms were using bad data, the models would make incorrect predictions. Syncsort’s Trillium™ Quality for Big Data leveraged the Hadoop cluster to perform data cleansing at scale.

Entity resolution and customer identification are crucial for detecting attempts to obfuscate fraudulent transactions. This was accomplished with the sophisticated multi-field matching algorithms in Trillium Quality for Big Data.
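To make the idea of multi-field matching more concrete, here is a minimal sketch in Python. It is not how Trillium Quality for Big Data works internally; the field names, weights, and threshold are hypothetical, and the string similarity comes from Python's standard `difflib` module rather than any Syncsort algorithm. It simply shows how per-field similarities can be combined into one score that decides whether two records likely refer to the same customer.

```python
from difflib import SequenceMatcher

# Hypothetical field weights chosen for illustration only.
FIELD_WEIGHTS = {"name": 0.5, "address": 0.3, "birth_date": 0.2}


def normalize(value):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(str(value).lower().split())


def field_similarity(a, b):
    """Return a similarity in [0, 1]; a missing value never counts as a match."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def match_score(rec_a, rec_b, weights=FIELD_WEIGHTS):
    """Weighted average of the per-field similarities."""
    total = sum(weights.values())
    weighted = sum(
        w * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, w in weights.items()
    )
    return weighted / total


def is_same_entity(rec_a, rec_b, threshold=0.85):
    """Treat two records as one entity when the score reaches the threshold."""
    return match_score(rec_a, rec_b) >= threshold
```

Calling `is_same_entity` on two records that differ only in capitalization or spacing returns `True`, while records with unrelated values fall well below the threshold. In practice, the weights and threshold would need to be tuned against labeled examples.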

Fraud detection needs to happen in real-time. Syncsort’s resilient Change Data Capture capabilities made sure that all data was kept fresh, regardless of whether the data was persisted in Hive, or streaming through Kafka.
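For readers new to Change Data Capture, the sketch below illustrates the general idea in Python: a stream of change events is replayed against a target so that it stays in sync with the source. The event layout (`op`, `key`, `row`) is a hypothetical format invented for this example and is not Syncsort's; a real pipeline would read such events from a source log or a Kafka topic and would also need to handle ordering and failures.

```python
def apply_change(table, event):
    """Apply one change event to an in-memory replica keyed by primary key."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        # Inserts and updates both overwrite the row stored under the key.
        table[key] = event["row"]
    elif op == "delete":
        # Deleting a key that is already absent is treated as a no-op.
        table.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")
    return table


def replay(events, table=None):
    """Replay change events in order and return the resulting replica."""
    table = {} if table is None else table
    for event in events:
        apply_change(table, event)
    return table
```

Because each event carries the full row, applying the same insert or update twice leaves the replica unchanged, which is one simple way to tolerate redelivered events.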

Lineage was the last piece of the puzzle. Syncsort publishes lineage to Apache Atlas and Cloudera Navigator, and also makes it available through REST APIs. This allowed the bank to track the provenance of the data used in the models.

By using Syncsort’s high-performing capabilities, the bank was able to develop a solution that was future-proof and highly scalable. The data flow was developed once and can be deployed anywhere: on premise or in the cloud. ETL and quality operations can run on Spark or MapReduce, with no changes required. There was also no need for any coding. As Arun Murthy highlighted in his keynote, there is great value in having identical deployments on premise and in the cloud.

For more, make sure to check out our webcast from Dr. Tendü Yoğurtçu on Data Quality and Lineage.
