Time for Temporal Big Data
“When did Doris arrive in Los Angeles?”
Before answering, you might want to consider who or what is doing the asking.
- If you’re based in Perth, the concept of “yesterday” and “tomorrow” means pretty much the same thing as it does in Milwaukee, but the answer to “What happened yesterday?” may not be the same.
- A typical person won’t answer in seconds, but the computers on board the Boeing 767 Doris flew recorded the landing time as a range, measured in milliseconds.
- Husband Richard traveling in Berlin was not interested in when she arrived at LAX, but when Doris arrived at their home in Woodland Hills.
- Richard and Doris have smart house technology that logs when their alarm system is disarmed, but it uses an early version of the alarm software that only stores the nearest hour.
Many specialized domains exhibit additional nuance about time and events. “When was your last physical examination?” is typically answered with only a year. “When was her last EKG?” could also be answered by year, unless the patient is hospitalized in critical care, in which case the anticipated answer would be a time of day – typically omitting seconds.
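The mismatch of granularities is easy to sketch in a few lines of Python. The timestamp, and which format belongs to which observer, are invented for illustration:

```python
from datetime import datetime, timezone

def render(ts: datetime, granularity: str) -> str:
    """Render the same instant at the precision each observer cares about."""
    formats = {
        "millisecond": "%Y-%m-%d %H:%M:%S.%f",  # avionics log
        "minute": "%Y-%m-%d %H:%M",             # EKG in critical care
        "hour": "%Y-%m-%d %H:00",               # early alarm-system firmware
        "year": "%Y",                           # "last physical exam"
    }
    return ts.strftime(formats[granularity])

# One hypothetical arrival instant, recorded at four different precisions.
arrival = datetime(2015, 6, 4, 14, 32, 7, 415000, tzinfo=timezone.utc)
for g in ("millisecond", "minute", "hour", "year"):
    print(g, "->", render(arrival, g))
```

Four systems can all be “correct” about the same event while storing four different values – which is exactly what makes joining or comparing them later so treacherous.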
Computer Time circa 1983
Credit: Joe Haupt | Flickr
Geospatial reasoning has additional challenges. Suppose you’re designing a safety system for an aircraft carrier or a huge cargo ship. Whether it is navigating through bodies of water, avoiding obstacles, or steering around other ships, the spatial objects involved are described in interval terms – such as the length of a bay, or the dimensions of another ship.
Folks in artificial intelligence have worried about these issues for decades. Big Data? It’s your turn.
Common Reasoning about Pesky Time Data
Reasoning about time requires so-called “common knowledge,” which can be notoriously difficult to implement in software. For example, the concept of “yesterday” could be implemented rigidly as the span beginning a microsecond after midnight. But consider the added complexities. People somehow manage to incorporate a subtle understanding of time when using expressions like “visit,” “vacation,” “birthday,” “coincidence,” “wait awhile,” “until” and “after.”
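Even the “rigid” implementation of yesterday is relative to the observer. A minimal sketch, using fixed UTC offsets to stand in for Perth and Milwaukee (daylight-saving rules ignored):

```python
from datetime import datetime, timedelta, timezone

PERTH = timezone(timedelta(hours=8))       # fixed offsets for illustration only
MILWAUKEE = timezone(timedelta(hours=-5))

def yesterday(now: datetime) -> tuple:
    """'Yesterday' as a half-open interval [start, end) in now's own time zone."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=1), midnight

# One moment in time, asked about from two places at the same wall-clock instant.
instant = datetime(2015, 6, 4, 10, 0, tzinfo=timezone.utc)

p_start, p_end = yesterday(datetime(2015, 6, 5, 1, 0, tzinfo=PERTH))
m_start, m_end = yesterday(datetime(2015, 6, 4, 12, 0, tzinfo=MILWAUKEE))

print(p_start <= instant < p_end)  # True  -- already "yesterday" in Perth
print(m_start <= instant < m_end)  # False -- still "today" in Milwaukee
```

The concept of “yesterday” means the same thing in both cities; which events it contains does not.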
Getting software to mirror this degree of flexibility and understanding of time remains one of the largest barriers to reasoning about large databases, if not The Singularity.
In the widely used Microsoft Data Warehouse Toolkit (Joy Mundy, Warren Thornthwaite, and Ralph Kimball, Wiley, 2011), time data in Microsoft Analysis Services receives only a couple of paragraphs. The gist of the help given is “identify the core date dimension, and then specify which attribute refers to year, quarter, month and so on.” Beyond this brief reference to a built-in wizard whose purpose is fairly transparent, you’ll find no entry in the index for “temporal.”
In the otherwise thoughtfully written MapReduce Design Patterns (Donald Miner & Adam Shook, O’Reilly 2013), there is no index entry for temporal data. Yet the first temporal data element appears on page xiii of the Preface – even before the book launches into its Big Data design patterns.
Impaled by Point in Time
“Point in Time” refers to the perspective that a report, query or analytics prediction is valid only at the point in time at which it is executed. Most systems today are incapable of reproducing results for a given point in time because some, or perhaps all, of the data changes over time.
Consider the example of a regional sales report. Jay Pritchett wants to know which agents assigned to his pseudo-geographic regions have generated the highest-margin closet sales. What he didn’t take into account was that he had reorganized the regions during the year, and reassigned some of the agents to different regions. When it came time to dole out commissions, he discovered that his paper report produced in March couldn’t be reproduced in September.
Often this issue arises with slowly changing dimensions, but in some databases many dimensions change with unanticipated frequency, complicating the analytics. Big Data velocity and volume, spurred on by device streams from the Internet of Things, will only amplify the problem.
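Jay’s March report becomes reproducible once the agent-to-region assignments are effective-dated, in the spirit of a Type 2 slowly changing dimension. The names, regions and dates below are hypothetical:

```python
from datetime import date

# Effective-dated assignments: each row carries the half-open interval
# [valid_from, valid_to) during which it was true.
assignments = [
    ("Phil",   "West",  date(2015, 1, 1), date(2015, 7, 1)),
    ("Phil",   "North", date(2015, 7, 1), date(9999, 12, 31)),  # mid-year reorg
    ("Gloria", "West",  date(2015, 1, 1), date(9999, 12, 31)),
]

def region_as_of(agent: str, as_of: date) -> str:
    """Reproduce the answer as it stood on a given date."""
    for a, region, start, end in assignments:
        if a == agent and start <= as_of < end:
            return region
    raise LookupError(f"no assignment for {agent} on {as_of}")

print(region_as_of("Phil", date(2015, 3, 15)))  # West  -- the March report
print(region_as_of("Phil", date(2015, 9, 15)))  # North -- the September rerun
```

Both answers are correct; the point-in-time parameter is what reconciles them.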
It takes some persuasion, but eventually most analysts understand that it’s time to address temporal data in their data models – somehow.
Traditional database representations distinguish “valid time,” the period during which a fact is true in the real world, from “transaction time,” the period during which that fact is recorded as true in the database or warehouse.
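A record carrying both dimensions is called bitemporal. A minimal sketch, with invented dates (in reality Doris moved on June 1, but the warehouse only learned of it on June 10):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

FOREVER = date(9999, 12, 31)

@dataclass
class BitemporalFact:
    value: str
    valid_from: date      # when it was true in the real world
    valid_to: date
    recorded_from: date   # when the warehouse believed it
    recorded_to: date

history = [
    BitemporalFact("Chicago", date(2010, 1, 1), date(2015, 6, 1),
                   date(2010, 1, 5), FOREVER),
    BitemporalFact("Woodland Hills", date(2015, 6, 1), FOREVER,
                   date(2015, 6, 10), FOREVER),
]

def as_known_on(valid: date, known: date) -> Optional[str]:
    """What did the warehouse say was true at `valid`, as of `known`?"""
    for f in history:
        if (f.valid_from <= valid < f.valid_to
                and f.recorded_from <= known < f.recorded_to):
            return f.value
    return None

print(as_known_on(date(2015, 6, 5), date(2015, 6, 6)))   # None -- not yet recorded
print(as_known_on(date(2015, 6, 5), date(2015, 6, 15)))  # Woodland Hills
```

The same valid-time question yields different answers depending on when it is asked – which is precisely the reproducibility problem described above.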
There is some work on temporal reasoning in computation. It’s not generally available out of the box for relational or NoSQL database systems, but Allen’s interval calculus provides one means for reasoning about time intervals. In proposing an alternate approach, Chawda et al. observed that “current approaches developed for handling join queries in real-valued data cannot be directly used to handle interval joins.”
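Allen’s calculus defines thirteen possible relations between two intervals, every one decidable by endpoint comparisons. A minimal sketch, with intervals represented as simple (start, end) pairs:

```python
def allen_relation(a, b):
    """Classify the Allen relation between intervals a and b (start, end pairs)."""
    a1, a2 = a
    b1, b2 = b
    if a2 < b1:  return "before"
    if b2 < a1:  return "after"
    if a2 == b1: return "meets"
    if b2 == a1: return "met-by"
    if a1 == b1 and a2 == b2: return "equal"
    if a1 == b1: return "starts" if a2 < b2 else "started-by"
    if a2 == b2: return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2: return "during"
    if a1 < b1 and b2 < a2: return "contains"
    return "overlaps" if a1 < b1 else "overlapped-by"

print(allen_relation((1, 5), (5, 9)))  # meets
print(allen_relation((1, 5), (3, 9)))  # overlaps
print(allen_relation((2, 4), (1, 9)))  # during
```

An interval join is then a matter of pairing rows whose intervals satisfy one of these relations – and, as Chawda et al. note, doing that efficiently at scale is where the real work lies.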
Human-Readable Time is a Concept, Not an Absolute
Credit: ITDP | Flickr
In the book Managing Time in Relational Databases (Tom Johnston and Randall Weis, Morgan Kaufmann, 2010), the authors intensify the debate, arguing that
- The relational model was not designed to handle time and that most ad hoc solutions are “jury-rigs”
- The workaround of adding effective dates, or surrogates representing time, to primary and foreign keys creates overhead to maintain referential integrity
- Because there is no native “period” type, the DBMS allows for overlapping (invalid) time periods
- Staging areas, when used to address temporal data aggregation or ETL, require duplicated schemas and tend to have weaker master data management and editing processes
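The third point is easy to demonstrate: with no native period type, nothing stops two validity rows for the same key from overlapping, so the check must be written by hand. A sketch, with hypothetical row data and numeric stand-ins for dates:

```python
def find_overlaps(rows):
    """rows: (key, start, end) with half-open [start, end). Return offending pairs."""
    by_key = {}
    for key, start, end in rows:
        by_key.setdefault(key, []).append((start, end))
    overlaps = []
    for key, periods in sorted(by_key.items()):
        periods.sort()
        for (s1, e1), (s2, e2) in zip(periods, periods[1:]):
            if s2 < e1:  # next period starts before the previous one ends
                overlaps.append((key, (s1, e1), (s2, e2)))
    return overlaps

rows = [
    ("cust-1", 10, 20),
    ("cust-1", 15, 30),  # overlaps the previous row -- an invalid history
    ("cust-2", 10, 20),
    ("cust-2", 20, 25),  # meets, but does not overlap -- a valid history
]
print(find_overlaps(rows))  # [('cust-1', (10, 20), (15, 30))]
```

A DBMS with a first-class period type could enforce this as a constraint; without one, every application must remember to run a check like this itself.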
Taking Not-so-Sweet Time
Time data is somewhat of a pesky nuisance that most developers wrestle with but may never fully vanquish. In SQL, one of the most common error conditions is date conversion failure. In Big Data ETL applications like Syncsort DMX-h, casual inspection may suggest the time data is perfectly acceptable, yet it may not comply with the format expected by the target repository – hence the thrown exception.
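Whatever the ETL tool, the hazard is the same: values that look like dates but fail strict conversion. A sketch of the defensive parsing an ETL step might do before loading into a strict target schema; the formats and log strings are invented:

```python
from datetime import datetime

# Candidate formats seen (hypothetically) across source systems.
CANDIDATE_FORMATS = ["%Y-%m-%d %H:%M:%S", "%m/%d/%Y %H:%M", "%d-%b-%Y"]

def parse_or_reject(raw: str):
    """Try each known format; return None so bad rows can be routed to a
    reject file instead of throwing an exception mid-load."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    return None

for raw in ["2015-06-04 14:32:07", "06/04/2015 14:32", "04-Jun-2015", "4th of June"]:
    print(raw, "->", parse_or_reject(raw))
```

The first three strings all “look fine” to a casual inspection, yet each needs a different conversion rule; the fourth is perfectly clear to a human and hopeless to strptime.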
To see how you can manage temporal data with Syncsort’s ETL, start with a trial version of DMX-h and load up some log data from multiple devices and applications into Hadoop. There’ll be plenty of timestamps, dates, date ranges, YTD, QTD and time-bounded events.
Not all of this data will play nicely with your apps. Which is perfect timing for a lesson in temporal data.
B. Chawda, H. Gupta, S. Negi, T. A. Faruquie, L. V. Subramaniam, and M. K. Mohania, “Processing Interval Joins on Map-Reduce,” in EDBT, 2014, pp. 463–474.