Excerpt from blog originally posted on the “Big Data Page by Paige”.
In the age of businesses with data that lives on dozens or even hundreds of servers, expecting transactional integrity and data consistency and currency are old-fashioned notions. On Hadoop, you just have to settle for the new NoSQL standard of BASE and eventual consistency. That’s what they say. But, as usual, “they” are wrong. Not all Hadoop users have to drop ACID.
Having spent a pleasant day at the Hortonworks roadshow in Houston recently, I topped it off with a fun chat with Owen O’Malley, a Hortonworks co-founder and architect. We started with some mutual geekiness over the references in his speeches to “one platform to rule them all” and how he came to name ORC files with that particular acronym (Optimized Row Columnar, officially.) Then, we moved on to discussing the changing face of programming, computers, data storage and the dizzyingly rapid progress we’ve both seen in our lifetimes. In the end, we landed on the fact that ACID is becoming a high priority in the SQL in Hadoop world now, where it was virtually unheard of a short time ago.
Hadoop users have all had to give up ACID and settle for the new standard, BASE, as a general rule, but like so many things in the data wrangling industry, that’s changing fast. This may come as a shock to a lot of current Hadoop users and database users considering making the switch to Hadoop, but using Hadoop doesn’t mean you have to give up your ACID habit.
Where did this ACID come from?
ACID is a database acronym that lots of folks use, but even the folks who use it a lot don’t necessarily know the origin of it, so here’s a very simplified definition. (Note from the friendly, neighborhood acronym translator: There are plenty of more in-depth definitions of ACID out there. If you want more detail, Google is your friend. Although, if you don’t want your mind completely blown, you might want to search “ACID database,” not just ACID.)
Atomicity: All parts of a transaction are treated as a single action. All are completed or none are.
Consistency: Transactions follow the rules and restrictions of the database. So, no transaction creates an invalid data state.
Isolation: No incomplete transaction can affect another incomplete transaction.
Durability: Once a transaction is committed, it is done and will persist, even if there is a system failure.
ACID is one of those remnants of the 70’s relational database revolution that we don’t really want to see go the way of headbands and bell-bottom jeans. ACID compliance means, in the tie-dyed words of Owen O’Malley, that a database provides “consistent views of changing data.” ACID created the mind-altering concept of transactional integrity that made relational databases the revolution of the 20th century when it came to data management.
Who cares if Hadoop doesn’t do ACID?
ACID makes the standard CRUD (Create, Retrieve, Update, Delete) operations of a database happen in a predictable, traceable way. If you want to be able to query a historical view of data, see where the data stood last week or last month, you need ACID. If you want to track changes made on a particular row or column over time, when they were made, by who, etc., that’s ACID. If you want data like people’s email addresses for instance, to stay current, and always get the correct current address when you query, that’s ACID. If you have to delete old records past a certain date for compliance to a policy or law, you need ACID. If some data was inserted inaccurately and you want to be able to update with corrected data, ACID. If you need to treat multiple changes as one action, so for instance, money can’t be deducted from one account unless it is added to the other, ACID.
Basically, if you’re accustomed to inserting, updating and deleting data as it changes, and having the data behave predictably and reliably, you’ve been doing ACID.
Hadoop is all about the BASE
In the age of big data, ACID just hasn’t been hip. Most NoSQL and Hadoop data stores don’t do ACID. They work on a principle called BASE. (This is another obscure database acronym, so here’s another quickie definition.)
Basically Available: Even if a compute unit fails, a node in a cluster for example, all data will still be available for queries.
Soft state: The data state can change over time, even without any additional data changes being made. This is because of eventual consistency.
Eventual consistency is really the core of the BASE concept. The trouble with trying to maintain changing data in a cluster-based data storage system is that data is replicated across multiple locations. A change that is made in one place may take a while to propagate to another place. So, if two people send a query at the same time and hit two different replicated versions of the data, they may get two different answers. Eventually, the data will be replicated across all copies, and the data, assuming no other changes are made in the meantime, will then be consistent. This is called “eventual consistency.”
This concept is why BASE is considered the polar opposite of ACID. Eventual consistency assumes that data will reach a consistent, undisturbed resting state. The thing about data, though, in general, is that it never rests. It changes constantly. While eventual consistency tries to catch up, new data changes are highly likely to impact the system. This means that NoSQL databases often find themselves in that soft state where the data shifts and moves, never becoming stable.
For extremely high volume, low change, non-transactional data systems where transactional integrity isn’t really where it’s at, this works fine. It allows your data to scale without constantly checking to make sure each change passes a bunch of rules from the man. It frees your data from a lot of restrictions and gives it more freedom to grow and find itself.
But this comes at a price.
Eventual consistency fails the ACID test
In a business setting, if two people send the same query to the same data at the same time, they will expect to get the same answer. If two people ask the same question of the data and get two different answers, which answer is correct? The general reaction of humans to this situation is to not trust either answer. Then, they throw the query at the data again, and even though they haven’t changed anything, the soft state concept means that they might get yet a third answer. This has a high impact on trust.
Businesses that adopt the data lake concept have Hadoop clusters that are essentially an ever-changing dumping ground for new data. Even if only certain data is expected to be queried in a SQL fashion, that data is highly unlikely to remain static. If the data is constantly changing and being replicated out like drips of liquid with ripples working their way outward, there is never a time when the data can reach that settled, ideally consistent state. There never comes a time when the query results can be considered definitive.
ACID compliance on a cluster-based data management system means the end of “eventual consistency” and the return of data query results that are simply consistent.
What Hadoop technologies do ACID?
HBase/Hive – O’Malley leads the charge for ACID compliance on Hive and HBase at Hortonworks. That psychedelic slide I stole (with O’Malley’s permission) was from a presentation he did at Hadoop Summit 2014, Adding ACID Updates to Hive. His colleague, Alan Gates did a presentation at Strata + Hadoop World last month in San Jose, Hive 0.14 Does ACID. Those slides give a pretty good state-of-the-union on Hive with HBase. So, go and read them. Right now, please. I’ll wait. Notice, on Gates’ third slide, “Do or Do Not, There is NO Try.” I love how that neatly and geekily summarizes the essence of ACID.
One of his other slides very emphatically says, “Not OLTP!” (Acronym translation service: On-Line Transactional Processing.) HBase and Hive are not meant to run transactional, operational day-to-day systems, such as POS (Point Of Sale), or ERP (Enterprise Resource Planning). The insert, update, and delete capabilities are intended to keep data current and queries consistent, not to make HBase a new Oracle on steroids.
So, that’s what Hive on HBase isn’t good for. What IS it good for? It’s absolutely awesome for time series and streaming data sets. HBase can ingest data from the fire hose like nobody’s business and store it in a Hadoop cluster for as long as you need it. I’m not saying that’s the only thing it’s good for. HBase and Hive are very versatile systems. But persisting streaming data and doing historical analysis is what it really knocks out of the park from my experience.
If you thought you couldn’t have nice, consistent, current queries on a massive, high volume, time series data set, you’re on crack. Hive is on ACID.
So, Using Hadoop Doesn’t Mean Dropping ACID?
No, you have options. If BASE is your scene, there are a lot of good data storage and management technologies in the Hadoop ecosystem. If you need transactional integrity or you just need a consistent view of your changing data, even though it’s Hadoop elephant-sized, that option is already here and growing in maturity and diversity every day. The Hadoop ecosystem can make sure you don’t have a bad trip.
Editor’s Note: Paige and Owen spoke again at the Hadoop Summit in June 2016, and we posted a two-part blog on their discussion there, the first one on the early origins of Hadoop, and the reasons why ORC files were invented, the second on some surprising news about the origins of Spark, Tez, and the stunning performance you can get from Hive with ORC and the new LLAP technology.