Challenges with VSAM Data – Change Data Capture (CDC) Solutions and Benefits
Mainframes are the backbone of many of the world’s largest enterprise operations. To take advantage of the latest capabilities in data science and machine learning, mainframe data needs to be made available on new-generation platforms such as Hadoop and Kafka.
A while back, Syncsort’s VP of Data Integration R&D, Fernanda Tavares, wrote a blog post, Overcoming Technical Challenges in Large Mainframe to Hadoop Implementations, in which she explains how Syncsort leads the way in helping enterprises leverage their mainframe data in Hadoop. This article expands on that topic by presenting a new feature that helps refresh the Hadoop data lake with data from growing mainframe data sets faster and more efficiently.
While more and more mainframe applications use an RDBMS such as Db2 as their data repository, data set (non-RDBMS) repositories are still prevalent in both legacy and new mainframe applications. Because data in data sets cannot always be queried and analyzed as quickly and easily as data in an RDBMS, there is a growing need to keep these data sets in sync with a system where queries and analytics can be performed quickly, easily, and cost-effectively. Recognizing this need, Syncsort has built Connect CDC, a Change Data Capture (CDC) add-on to its flagship Big Data integration tool, Connect for Big Data. Connect CDC captures data changes in near-real time from IBM Db2 for z/OS and VSAM sources. The changes can then be applied to Hive and Impala, or stored in HDFS or the cloud in different file formats for further processing. This article focuses on the challenges with VSAM data on the mainframe and the solutions and benefits offered by Connect CDC.
VSAM (Virtual Storage Access Method) is undisputedly the most used data set type in mainframe enterprise applications. The term VSAM refers both to an access method and to a data set type. As an access method, it provides a very efficient, high-performance, and sophisticated mechanism to manage records on disk. As a data set type, it can exist in four different organization schemes, also known as VSAM types:
- Key-Sequenced Data Set (KSDS), the most commonly used type, where records are indexed by key and can be retrieved, inserted, updated, or deleted by key value.
- Entry-Sequenced Data Set (ESDS), where records are kept in sequential order and accessed as such.
- Relative Record Data Set (RRDS), where record numbers are used as keys to access records.
- Linear Data Set (LDS), a byte-stream data set in a traditional z/OS file. It is rarely used in applications.
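To make the difference between the first three organizations concrete, here is a minimal Python sketch that models them as in-memory structures. The class and method names are hypothetical and chosen only for illustration; this is not how VSAM is implemented or accessed on z/OS.

```python
# Toy in-memory models of three VSAM organizations (illustration only).

class Ksds:
    """Key-sequenced: records are addressed by key value."""

    def __init__(self):
        self._records = {}

    def insert(self, key, record):
        if key in self._records:
            raise KeyError(f"duplicate key {key!r}")
        self._records[key] = record

    def read(self, key):
        return self._records[key]

    def update(self, key, record):
        if key not in self._records:
            raise KeyError(key)
        self._records[key] = record

    def delete(self, key):
        del self._records[key]

    def browse(self):
        # Sequential access returns records in ascending key order.
        return [self._records[k] for k in sorted(self._records)]


class Esds:
    """Entry-sequenced: records are appended and read in arrival order."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def browse(self):
        return list(self._records)


class Rrds:
    """Relative-record: slots addressed by a 1-based record number."""

    def __init__(self, slots):
        self._slots = [None] * slots

    def _index(self, rrn):
        if not 1 <= rrn <= len(self._slots):
            raise IndexError(f"record number {rrn} out of range")
        return rrn - 1

    def write(self, rrn, record):
        self._slots[self._index(rrn)] = record

    def read(self, rrn):
        return self._slots[self._index(rrn)]
```

The point of the sketch is the addressing model: a KSDS is addressed by key, an ESDS by arrival order, and an RRDS by slot number.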
Challenges with VSAM data
In addition to serving as backend storage for many enterprise mainframe batch applications, VSAM data sets are also very commonly used as backend storage for CICS applications. CICS (Customer Information Control System) is IBM’s z/OS transaction processing subsystem, which provides a transaction service for running applications online. CICS applications process very large volumes of commercial transactions every day, including bank and ATM transactions, so the amount of data landing in VSAM data sets keeps growing. However good the storage capabilities that VSAM data sets provide to mainframe applications, they also come with some challenges:
- They still occupy precious disk storage on the mainframe; archiving them to tape is not always desirable given the I/O speed of tape.
- They cannot easily be queried like data in RDBMS systems using a query language like SQL.
- Applications running on the mainframe that perform analytics or report from VSAM data sets can consume precious CPU cycles and increase operational costs.
- VSAM data sets are often not normalized. If records in a VSAM data set were to be migrated to a normalized database, dozens or even hundreds of tables would have to be created from one VSAM data set.
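As an illustration of the last point, the sketch below takes one denormalized record with a repeating group and splits it into rows for two normalized tables. The record layout, field names, and table split are assumptions made up for this example; real layouts come from the application's COBOL copybooks and are usually far more complex.

```python
def normalize(record):
    """Split one denormalized record into a parent row and child rows."""
    customer_row = {
        "customer_id": record["customer_id"],
        "name": record["name"],
    }
    # Each entry of the repeating group becomes its own row, linked back
    # to the parent by customer_id.
    account_rows = [
        {"customer_id": record["customer_id"], "account_seq": seq, **account}
        for seq, account in enumerate(record["accounts"], start=1)
    ]
    return customer_row, account_rows


# Hypothetical record, as it might look after decoding a repeating group.
record = {
    "customer_id": "C001",
    "name": "A. Smith",
    "accounts": [
        {"account_no": "111", "balance": 250},
        {"account_no": "222", "balance": 75},
    ],
}
customer_row, account_rows = normalize(record)
```

Here a single record yields one customer row and two account rows. A record with many repeating groups or redefined areas would fan out into correspondingly more tables.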
For the purpose of this article, all Connect CDC VSAM features and benefits discussed here also apply to IAM data sets. IAM (Innovation Access Method) is a reliable, high-performance indexed access method alternative to VSAM. It implements the VSAM API and supports KSDS, ESDS, RRDS, and alternate indexes. Like VSAM data sets, IAM data sets can be updated by batch applications and CICS.
Connect CDC Solutions and Benefits
Connect CDC can capture record changes made to VSAM data sets in near-real time, whether they are managed by CICS or updated by a batch application. Changes to CICS-managed VSAM data sets are always captured in real time. When changes are applied to VSAM data sets by batch applications, CICS VR (VSAM Recovery) can be used to capture the changes in near-real time; otherwise, the changes can be captured on demand using a diff utility provided as part of the Connect CDC installation. In all cases, Connect CDC uses the z/OS system logger facility to keep track of the changes. This works in the following way:
- VSAM data sets are created or altered with LOGREPLICATE enabled. A log stream is assigned to the VSAM data set.
- In CICS, assuming a VSAM-backed CICS application, all associated VSAM data sets have a name entry in the FCT (File Control Table) that can be up to 8 bytes long.
- In that case, when CICS operates on a record, the log stream associated with the VSAM data set also logs the record and what happened to it, using the FCT name as the identifier of the source of the change.
- For batch applications, as in the CICS case, LOGREPLICATE is enabled and a log stream is assigned to the VSAM data set.
- When a batch application operates on a record, the log stream associated with the VSAM data set also logs the record and what happened to it, using the DDNAME as the identifier of the source of the change.
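The logging flow above can be pictured with a small Python sketch: each logged change carries the source identifier (the FCT name for CICS or the DDNAME for batch), the operation, and the record, and a consumer replays the changes in order against a replica. The `ChangeRecord` fields and the `apply_changes` function are hypothetical names for illustration. The article does not describe the actual log stream record format or the Connect CDC interfaces, so none of this should be read as either.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeRecord:
    """One logged change (hypothetical layout, not the real log format)."""
    source: str                    # FCT name (CICS) or DDNAME (batch)
    operation: str                 # "insert", "update" or "delete"
    key: str                       # record key within the data set
    image: Optional[bytes] = None  # record content after the change

    def __post_init__(self):
        # FCT names are up to 8 bytes long; DDNAMEs are also at most 8 characters.
        if not 1 <= len(self.source) <= 8:
            raise ValueError("source identifier must be 1 to 8 characters")


def apply_changes(replica, changes):
    """Replay changes, in log order, against a dict acting as the replica."""
    for change in changes:
        if change.operation in ("insert", "update"):
            replica[change.key] = change.image
        elif change.operation == "delete":
            replica.pop(change.key, None)
        else:
            raise ValueError(f"unknown operation {change.operation!r}")
    return replica


log = [
    ChangeRecord("ACCTFILE", "insert", "A100", b"balance=10"),
    ChangeRecord("ACCTFILE", "update", "A100", b"balance=25"),
    ChangeRecord("BATCHDD1", "insert", "B200", b"balance=5"),
    ChangeRecord("BATCHDD1", "delete", "B200"),
]
replica = apply_changes({}, log)
```

Replaying in log order matters: the replica ends up with only the latest image of A100, and B200 is gone because its delete follows its insert.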
Connect CDC Diff Utility
- Two VSAM data sets are required, a base data set and a changing data set.
- The utility is run on the two data sets to compare them using a provided key. The differences are written to a specified log stream.
- This approach works with most VSAM organization schemes.
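The idea behind a key-based comparison can be sketched in a few lines of Python. This is only an illustration of the general technique, assuming both data sets have already been read into dictionaries keyed by the comparison key. The function name and the output tuples are hypothetical and do not reflect the actual Connect CDC utility or its log stream output.

```python
def diff_by_key(base, changing):
    """Compare two keyed snapshots and return the changes between them.

    base, changing: dicts mapping record key -> record content.
    Returns a list of (operation, key, record) tuples in key order.
    """
    changes = []
    for key in sorted(base.keys() | changing.keys()):
        if key not in base:
            changes.append(("insert", key, changing[key]))
        elif key not in changing:
            changes.append(("delete", key, base[key]))
        elif base[key] != changing[key]:
            changes.append(("update", key, changing[key]))
        # Identical records produce no change entry.
    return changes


base = {"A100": b"balance=10", "B200": b"balance=5", "C300": b"balance=7"}
changing = {"A100": b"balance=25", "C300": b"balance=7", "D400": b"balance=1"}
changes = diff_by_key(base, changing)
```

In this example only three of the four distinct keys produce a change entry. The unchanged record C300 is skipped, which is what keeps the captured volume proportional to the changes rather than to the size of the data set.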
Some of the benefits Connect CDC can offer to enterprises include:
- Near-real-time replication of VSAM data sets to Hadoop data lakes.
- Data can be kept unchanged in HDFS or the cloud, where it can be archived or further processed.
- Data can be loaded to Hive or Impala databases and kept in sync with the VSAM data on the mainframe.
- Using the Connect for Big Data capability to process complex COBOL copybook layouts, captured VSAM data set records can be cleaned and transformed on the fly, then made available in file formats like Tableau, Apache Avro, and Apache Parquet, or in Hive or Impala, for analytics, data science, and machine learning processing.
- Captured VSAM data can be pushed to streaming platforms like Kafka and MapR Streams.
- Reduce replication time and mainframe resource utilization by transferring only records that have been changed, instead of the full data set.
- Avoid the cost and uncertainty of converting or migrating existing VSAM-backed applications to Db2 or other RDBMS systems just to gain RDBMS-like query features.
For more information, watch our webcast, Engineering Machine Learning Data Pipelines Series: Streaming New Data as It Changes.