Kate Matsudaira on the Value of Compression
Kate Matsudaira has worked at Sun, Microsoft and Amazon. She worked on Big Data projects at PayPal and elsewhere, and is currently founder of the startup popforms (www.popforms.com). Matsudaira recently gave a talk for the Association for Computing Machinery titled “Big Data without Big Database – Extreme In-Memory Caching for Better Performance.” Despite the title, Matsudaira’s message was just as relevant for Big Data systems as for Not-So Big Data systems.
Back to the Future
Kate Matsudaira was a pre-teen when Back to the Future appeared, but the software engineer and entrepreneur seems to have understood that basic principles like compression stand the test of time, despite Moore’s Law and ever-larger secondary storage devices.
In the webinar, she outlines a number of design choices facing developers today. She suggests that some common solutions to typical performance problems will prove inadequate. By examining several specific examples of how different types of data ought to be managed, Matsudaira offers a strong case for reviving the topic of compression.
A Brief History of Compression
A software historian could tell much of the history of computing simply by following the compression narrative. In the early days of computing, when memory was more precious than a retweet from @MileyCyrus, Syncsort used its compression savvy to wrest some software utilities revenue from IBM.
Today both disk and main memory are available in capacities that would once have seemed unimaginable. This would seem to imply that compression matters far less than it did when a slow-moving tape reel had to be spun to retrieve archival data and the IBM model 1311 disk drive (c. 1961) could store only 2MB.
So why is Matsudaira worried about compression for her applications? Is compression suddenly important again?
Assess, then Compress
Matsudaira’s catch phrase for this talk could well have been, “To Compress, First Assess.” Translation: get to know your data, especially data that can be separated into reference vs. transactional usage. By “reference data” she means relatively static data, such as catalog metadata and geolocation data, particularly data under the system owner’s control rather than the user’s. Reference data, she argues, must be fast, ideally memory-resident, yet most designers treat it no differently from transactional data.
Big Cache: Not Big Enough
She is also skeptical that “Big Cache” solutions like memcached(b), ElastiCache and Oracle Coherence are adequate to overcome Big Data performance problems.
She cites a number of potential problems or resource demands these solutions involve:
- Additional hardware
- Additional configuration complexity
- Additional monitoring
- Additional network hop
- Slow scanning
- Additional serialization
Matsudaira is similarly skeptical that NoSQL solutions such as MongoDB or Redis would be adequate. “They are fast if everything fits into memory,” she said, then asks, “Can you keep it in memory yourself?” Stated differently, she wonders whether earlier optimization through compression can improve efficiency.
Domain-Driven Design Meets PayPal Price History
To answer this question, she turns to the domain (or “model”) layer of software design (Domain-Driven Design, Eric Evans, 2003). Properly designed, she says, this layer makes it possible to maintain independent hierarchies of reference data, optimize collections and save space through compression, e.g., by using compact immutable maps that need only one-quarter the space of java.util.HashMap.
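The talk does not publish an implementation, but the space saving of a compact immutable map is easy to picture. One common technique, sketched below as an assumption rather than Matsudaira’s actual code, backs the map with sorted parallel arrays and binary search, avoiding the Entry objects, hash buckets and empty slots that account for much of java.util.HashMap’s per-entry overhead:

```java
import java.util.Arrays;

// Hypothetical sketch of a compact immutable map: keys live in one sorted
// array, values in a parallel array, and lookups use binary search. There
// are no per-entry wrapper objects and no unused bucket slots.
public final class CompactMap {
    private final String[] keys;   // must be sorted ascending
    private final String[] values; // values[i] corresponds to keys[i]

    public CompactMap(String[] sortedKeys, String[] values) {
        this.keys = sortedKeys;
        this.values = values;
    }

    public String get(String key) {
        int i = Arrays.binarySearch(keys, key); // O(log n) lookup
        return i >= 0 ? values[i] : null;
    }

    public int size() {
        return keys.length;
    }
}
```

The trade-off is classic: lookups cost O(log n) instead of O(1), and the structure must be rebuilt to change it, which is exactly why it suits static reference data rather than transactional data.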
Reduce Memory Footprint
A reduced memory footprint is also possible with numeric data, especially once you know your data well. Matsudaira works through a PayPal price history use case to make this point. She also recommends compressing text in byte arrays by using the minimum character set encoding the data requires, shared-prefix methods, case-insensitive storage, compressed pointers and simple, space-stingy encoding schemes where possible.
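Two of these ideas can be combined in a short sketch. The code below is an illustration of the general techniques, not code from the talk: when text is known to be ASCII, storing it as byte arrays halves the cost of Java’s two-byte chars, and a shared-prefix method (front coding over a sorted list) stores only the suffix each entry does not share with its predecessor:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: front coding for a sorted array of ASCII strings.
// Each encoded entry is [sharedPrefixLength, suffix bytes...], where the
// prefix is shared with the previous entry in sorted order.
public final class FrontCoder {
    public static byte[][] encode(String[] sorted) {
        byte[][] out = new byte[sorted.length][];
        String prev = "";
        for (int i = 0; i < sorted.length; i++) {
            String s = sorted[i];
            int shared = sharedPrefix(prev, s);
            byte[] suffix = s.substring(shared).getBytes(StandardCharsets.US_ASCII);
            byte[] entry = new byte[suffix.length + 1];
            entry[0] = (byte) shared; // assumes shared prefix length < 128
            System.arraycopy(suffix, 0, entry, 1, suffix.length);
            out[i] = entry;
            prev = s;
        }
        return out;
    }

    // Rebuild entry at `index` by replaying prefixes from the start.
    public static String decode(byte[][] enc, int index) {
        String prev = "";
        for (int i = 0; i <= index; i++) {
            int shared = enc[i][0];
            String suffix = new String(enc[i], 1, enc[i].length - 1,
                                       StandardCharsets.US_ASCII);
            prev = prev.substring(0, shared) + suffix;
        }
        return prev;
    }

    private static int sharedPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }
}
```

For catalog-style reference data full of near-duplicate keys, the shared prefixes are long, so the suffixes that actually get stored are short, which is where the savings come from.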
Improve Cache Loading
Matsudaira and colleague Leon Stein (who worked on all these ideas with her) also work through use cases where cache loading can be improved. They believe loading works best when the strategy chosen reflects a thorough understanding of how an application will behave and which data is most time-sensitive.
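One way to act on that understanding, sketched here as an assumption rather than as their design, is a two-tier loading strategy: reference data known to be hot is loaded eagerly at startup, while everything else is loaded lazily on first access. The loader function and keys below are purely illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of tiered cache loading: hot keys are warmed eagerly
// at construction time; cold keys fall back to lazy loading on first access.
public final class TieredLoadingCache<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final Function<K, V> loader;

    public TieredLoadingCache(Function<K, V> loader, Iterable<K> hotKeys) {
        this.loader = loader;
        for (K k : hotKeys) {
            cache.put(k, loader.apply(k)); // eager warm-up of time-sensitive data
        }
    }

    public V get(K key) {
        return cache.computeIfAbsent(key, loader); // lazy load for the rest
    }
}
```

The point of the split is that the expensive loads for time-sensitive data happen before the first user request arrives, while rarely used data never occupies memory until someone actually asks for it.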
Smaller is Beautiful
Matsudaira and Stein are not saying that Big Data won’t be Big, or that Big Data tools like Hadoop have no place in contemporary application development. Instead, they are applying a tried and true design principle, compression, to specific scenarios where the payoff is great. After all, some 40 years after its original insight, Syncsort’s technology is still quietly at work, using technologies like Hadoop to extract, transform and load (ETL) Big Data in a compressed state, and exploiting sorted data’s natural compressibility to reduce disk I/O, memory and CPU consumption.
Matsudaira’s popforms startup features an applications genre called “leadership software.” Whether this genre takes hold is anyone’s guess. But if Kate Matsudaira’s compression insights are any indication, she is just as likely to be remembered as a leadership role model for women considering a career in engineering.
Software engineer Kate Matsudaira offers compression tips drawn from better data intelligence and domain-driven design.
Image Credits: Paul Watts licensed under Creative Commons.
Mark Underwood writes about knowledge engineering and Big Data privacy and security.