Telecommunications has changed a lot since the breakup of the AT&T monopoly; the industry is now fiercely competitive. But one thing hasn’t changed: telecommunications remains a capital-intensive business with a big appetite for data.

Telecom Big Data is Nothing New

Big Data is not new to telecom. Information about calls, in the form of call-detail records (CDRs), has been collected for monitoring and billing purposes for decades. High-speed performance has always been a requirement for telecom, not a nicety; as a result, telecoms were among the first to deploy in-memory databases, back in the 1980s.
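To make the raw material concrete, here is a minimal sketch of a CDR as a data structure. The field names and pipe-delimited layout are hypothetical; real CDR formats vary by switch vendor and standard.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CallDetailRecord:
    """Minimal illustrative CDR; real layouts vary by vendor."""
    caller: str          # originating number
    callee: str          # terminating number
    start: datetime      # call setup time
    duration_sec: int    # billable duration in seconds
    cell_id: str         # originating cell/switch identifier

def parse_cdr(line: str) -> CallDetailRecord:
    """Parse one pipe-delimited CDR line (hypothetical format)."""
    caller, callee, start, duration, cell = line.rstrip("\n").split("|")
    return CallDetailRecord(caller, callee,
                            datetime.fromisoformat(start),
                            int(duration), cell)

# Example: parse_cdr("15551230001|15551230002|2014-03-01T10:15:00|62|NYC-042")
```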

The scale of these collections became clear when the NSA’s MAINWAY database came to light in 2006. According to estimates made at the time, MAINWAY contained 1.9 trillion records; a given carrier probably stored even more, depending on how much history it maintained. A 2003 listing of large databases included AT&T’s Security Call Analysis and Management Platform (SCAMP), which at the time was keeping two years of call data – 26.3 TB – running on AT&T’s Daytona database. A dated report on Daytona, which called it a “data management system, not a database,” indicated that in 2007 Daytona was supporting 2.8 trillion records.

In 2007, it was becoming clear that the 2005 Sprint-Nextel merger was going to be a financial calamity for Sprint. What else has changed in the telco landscape over the past seven years or so?

That was Then. Now Cometh the Petabyte.

As T-Mobile’s Christina Twiford said of the company’s IBM project in 2012, data volumes grow quickly. The Netezza platform she described had to scale from 100 TB to 2 petabytes. It loaded 17 billion records daily, was supported by 150,000 ETL or ELT processes, and gave 1,300 customers access through a web interface.

“Hadoop is exciting in that you can throw in the structured and the unstructured data for advanced analytics. . . And you can process streams of data without ‘landing’ the data. . . Telcos are all over M2M or the Internet of Things,” she said.

Tim Eckard of Zaloni presented some telecom use cases at a 2013 Telecom Analytics conference in Atlanta. One unnamed wireless operator had a Central Data Mediation Archive used for e-discovery, billing disputes and the like. The archive totaled about 1 PB, growing at 12 TB/day in 2012; the rate doubled to 25 TB/day (5 billion records) in 2013. According to Eckard, the customer was able to move to a Hadoop environment for ¼ the cost of moving to a traditional (relational?) database setting.

Pervasive described similar results with a smaller telco whose Call Descriptor Records were captured with Pervasive’s DataRush. Telco switches dump data every five minutes, at hundreds of thousands of records per dump, so the telco’s objective was to increase processing speed by an order of magnitude. This was essential to improve decision processes, real-time profit-margin measures and operational analysis.
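A minimal sketch of that batch pattern, assuming the hypothetical pipe-delimited layout above and using Python’s standard multiprocessing pool: one straightforward route to an order-of-magnitude speedup is simply processing many dump files in parallel.

```python
import glob
from multiprocessing import Pool

def summarize_dump(path):
    """Aggregate one 5-minute switch dump: total calls and minutes."""
    calls, seconds = 0, 0
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            calls += 1
            seconds += int(fields[3])   # duration field, per the sketch above
    return path, calls, seconds / 60.0

if __name__ == "__main__":
    dumps = glob.glob("/data/switch_dumps/*.cdr")   # hypothetical drop directory
    with Pool(processes=8) as pool:                 # roughly one worker per core
        for path, calls, minutes in pool.map(summarize_dump, dumps):
            print(f"{path}: {calls} calls, {minutes:.0f} minutes")
```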

IBM’s Robert Segat likewise suggested that telcos would be using analytics for purchasing and churn propensities. The challenges telcos face – disruptive competitors, LTE/4G, increased regulation, support for streamed data – are each opportunities for Hadoop and Hadoop-like platforms. In particular, wireless data is expected to grow by 50X over the next decade. There are also 30 billion RFID sensors, customer social-network information and a greater variety of traffic types – especially video.

What, Me Archive?

According to Hortonworks, its Hadoop distro is in use at seven US carriers. Despite the volume – around 10 million messages per second, according to BITKOM’s Urbanski in Gigaom – a telco can retain six months of messages for later troubleshooting. This is both an operational and a competitive advantage if it reduces churn.
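A back-of-envelope calculation shows why six months at that rate is a Big Data problem; the average message size here is an assumption:

```python
msgs_per_sec = 10_000_000            # Urbanski's figure
seconds = 182 * 24 * 3600            # roughly six months
bytes_per_msg = 100                  # assumed average message size

total_msgs = msgs_per_sec * seconds              # ~1.6e14 messages
total_pb = total_msgs * bytes_per_msg / 1e15     # ~15.7 PB before compression
print(f"{total_msgs:.2e} messages, ~{total_pb:.1f} PB")
```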


Leading edge solutions include using Hadoop data to allocate bandwidth in real time. This opportunity is appealing to telcos because service level agreements may require dynamic adjustments to account for malware, bandwidth spikes and QoS degradation. Such solutions likely involve all facets of Big Data: volume, velocity and variety of data sources.
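A toy sketch of the kind of rule such a system might apply; the thresholds, metrics and scaling factors are invented for illustration, and a production system would act on streaming analytics over far richer telemetry:

```python
def adjust_allocation(current_mbps, latency_ms, packet_loss, malware_flag,
                      sla_latency_ms=50.0, sla_loss=0.01):
    """Return a new bandwidth allocation given observed QoS metrics."""
    if malware_flag:
        return current_mbps * 0.5     # throttle suspect traffic
    if latency_ms > sla_latency_ms or packet_loss > sla_loss:
        return current_mbps * 1.2     # grant headroom to stay within the SLA
    return current_mbps

# adjust_allocation(100.0, latency_ms=72.0, packet_loss=0.002, malware_flag=False)
# -> 120.0
```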

Forrester’s Brian Hopkins identified big data for real-time analytics as the second “most revolutionary technology” in a 2013 survey of enterprise architects.

This optimistic picture clearly omits some of the detailed steps needed to fully deploy a telco Hadoop application. Far from being greenfield applications, most telco Big Data deployments will be bigger, faster versions of existing pipelines, drawing on more data sources. But how is all that data to be loaded into Hadoop?

Expectations for data pipeline speed are high, and ETL toolkits must play nicely with the Hadoop ecosystem: proponents in telco, and elsewhere for that matter, expect both developer ease of use and strong execution-time performance. Convenience offerings such as Syncsort’s Mainframe Offload allow Hadoop to coexist with legacy mainframe artifacts like COBOL copybooks and JCL – without the need for additional software on the mainframe. For telcos considering AWS, projects can be rapidly provisioned in EC2 and connected to ETL data sources in RDBMS, mainframe, HDFS, SF.com, Redshift or S3.
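As a sketch of one such pipeline step, here is a minimal PySpark job that extracts raw CDRs from S3, filters them, and loads columnar Parquet into HDFS. The bucket, paths and column positions are hypothetical, and PySpark is only one of several toolkits that could fill this role:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cdr-etl-sketch")
         .getOrCreate())

# Extract: read raw pipe-delimited CDRs landed in S3 (hypothetical bucket).
raw = spark.read.csv("s3a://example-telco-landing/cdrs/2014/03/*",
                     sep="|", inferSchema=True)

# Transform: keep completed calls only (_c3 = duration in this sketch).
completed = raw.filter(raw["_c3"] > 0)

# Load: write columnar Parquet into HDFS for downstream analytics.
completed.write.mode("overwrite").parquet("hdfs:///warehouse/cdr/2014/03")
```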

From CDR to CSR

If your phone rings a few seconds after your next dropped wireless call, it may well be a customer service agent with access to real time analytics to aid in troubleshooting the problem.

Assuming you’ve paid your bill. They’ll know that, too.

Mark Underwood writes about knowledge engineering, Big Data security and privacy.

Twitter @knowlengr LinkedIn


If a picture is worth a thousand words, is a picture that’s a thousand times bigger better still?

These and other questions are being asked by user experience designers as the tidal wave that is Big Data rolls into the human-computer interaction space.

In recent years, several IPOs reflecting increased demand for business intelligence software demonstrated the need to put data management in the hands of analysts, not only database admins. These IPOs included Qlik Technologies (2010, current market cap $2.4B), Splunk (2012, $8.2B) and Tableau (2013, $4.7B). Earlier generations of BI software provided visualization tools, but Tableau, QlikView and Splunk were part of a crop promising to extend business visualization to a wider audience with fewer intervening steps.

ETL provider Syncsort, Cloudera and Tableau discuss these and other issues in a March 2014 presentation.

Data as the New “Soil”

David McCandless asserts in his 2010 TED talk that “Data is the new soil,” and that visualizations are flowers derived from plants tilled in this ground. He believes some analysts – even untrained ones – have a sort of “dormant design literacy” about visualizations. He says that by ingesting a steady diet of web visualizations, users have unintentionally developed expectations about what they see and how they want to see it. In the dense information jungle, he argues, visualization is a relief, “a clearing.”

Marching a bit out of step with Big Data has been Big Software, which has also seen a rise of visualization tools. Software by Irise, reportedly in use at 500 of the Fortune 1000, relies on visualization to improve efficiency and transparency in the software development life cycle. Part of what makes Irise attractive to Big Data developers is its gateways to fast-growing repositories like HP Quality Center and IBM Rational.

Also growing from this “soil” are visualizations that bridge data and knowledge. Think Visual Studio on steroids, capable of navigating both data and semantic space. What’s really needed, wrote Bret Victor in 2011, is a systematic approach to interactive visualization, a capability to navigate “up and down the ladder of abstraction.”

Look Up! It’s Cloud Viz.

Most of the familiar names in BI have roots in client/server architectures, though most offer some sort of cloud version by now. Products like Birst Visualizer, ChartIO, RoamBI and ClearStory are newer cloud-based offerings. BigML, aimed at DIY predictive analytics from the cloud, includes a visualization component. Ayasdi, for its part, uses “topological data analysis” to highlight the underlying geometric shapes of data and allow real-time interaction that produces immediate insights. Microsoft’s Power BI offers data visualization capabilities supporting million-row tables for Office 365 users.

Microsoft Power BI Demonstration (Screenshot)

“Show Me” (Don’t Tell Me)

Think your data challenge is big? Please, to borrow from the Rush song of the same name, show me, but don’t tell me. Big Data visualization is about saying more with less.

Research is underway to explore new ways to visualize big data. Here is one example.

A report by Aditi Majumder and Behzad Sajadi demonstrates how multi-projector displays can deliver visualizations for uses such as immersive simulation or scientific data visualization. They have augmented their displays with gestural interfaces that allow one or more users to interact with the displays in real time, performing tasks such as navigating 3D models. The research also demonstrated a capability to light arbitrary objects so they act as screens – using even everyday items as “the visualization medium.”

Show Me a Story

There’s more to today’s visualizations than flipboards and hi-res maps. In Business Storytelling for Dummies (Wiley 2014), Karen Dietz and Lori Silverman tell us that visualizations don’t have to tell a story. But, they say, “if you want to use data and data visualization to move people to action, then melding story and data together is essential.” They offer prospective storytellers a few guidelines:

  • Show, don’t tell. Try removing text to see if the visuals still work.
  • Be clear about what you want to accomplish. Dietz and Silverman recommend revisiting the knowledge hierarchy associated with the concepts.
  • Be wary of eye candy. A pleasant-looking visualization doesn’t guarantee that viewers will process the intended message.
  • Slickness and flashiness aren’t necessary. Beware of “thinking that the data is what needs emphasis, when the real key to success is the story about the data” (p. 155).

Visualization for Four V’s

Some aspects of good visualization will remain unchanged from the pre-Big Data buzzword era. So what is new? What will need to change by virtue of increased volume, velocity, variety and veracity?

The impact of Big Data will be felt in unanticipated realms. A recent New York Times story identified Big Data genealogy as leading to potential insights into disease inheritability and untold family connections to major historical events. Author A.J. Jacobs explained:

Last year, Yaniv Erlich, a fellow at the Whitehead Institute at M.I.T., presented preliminary results of his project FamiLinx, which uses Big Data from Geni’s tree to track the distribution of traits. His work has yielded a fascinating picture of human migration.

Even a modest homegrown family tree can be difficult to print. Splicing together multiple pages is the norm for DIY genealogy. A fully interlinked tree that connects to the trees created by hundreds or thousands of others demands a scalability that earlier UX design simply lacked.

Beyond the Visionary

Big Data may not lead to bigger stories. Analysts may discover that some old methods – bar charts, line graphs, heat maps – work as well as ever. But a big dose of any one of the V’s – especially velocity – can change all that. Big Data visualization will have to change accordingly.

And in many domains, animation and 3D may need to become as commonplace as bar charts.

Mark Underwood writes about knowledge engineering, Big Data security and privacy @knowlengr LinkedIn.


It was only a matter of minutes after Malaysia Airlines flight MH370 went missing that questions began to be asked about the data. How could a Boeing 777 airliner disappear without a digital trace? What about the data the airplane collected? What about the tracking data? The transponders? The satellite “pings”? What could the cell phones’ GPS receivers tell us?

It was assumed that the quickest path to answering questions about the mysterious flight disappearance was to decipher the big data digital stream that the plane must have created.

Read my complete post on The Wharton IGEL Blog.

Follow Gary Survis on Google+



Dr. Vincent Granville is Co-Founder of Data Science Central and author of the upcoming text, Developing Analytical Talent: Becoming a Data Scientist.

Q. On the Analytic Bridge and Data Science sites, there is comparatively sparse reference to ETL. Does this reflect slow migration to platforms that support big data analytics when compared to greenfield big data initiatives?

I think ETL is not considered analytics by traditional analytics practitioners (statisticians). It is regarded as data plumbing and software engineering/architecture, or computer science at best. However, it is a critical piece of big data with its own analytics challenges (yield optimization, server-uptime optimization, selection of metrics, load balancing, etc.). At Data Science Central, we are getting more involved with ETL, especially in NoSQL environments (Hadoop-style frameworks). I am also currently working on a paper about new, massive data-compression algorithms for rows/columns datasets, as well as hierarchical summary databases.
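As a toy illustration of why columnar layouts invite compression (plain run-length encoding here, not the new algorithms Granville mentions), repeated values in a sorted column collapse to (value, count) pairs:

```python
from itertools import groupby

def rle_encode(column):
    """Run-length encode a column: ['CA','CA','NY'] -> [('CA', 2), ('NY', 1)]."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

states = ["CA"] * 40_000 + ["NY"] * 25_000 + ["TX"] * 35_000
print(rle_encode(states))   # [('CA', 40000), ('NY', 25000), ('TX', 35000)]
```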

Q. Some observers believe there is a heightened potential for misleading or erroneous inference-making with Big Data. Do you agree?

Yes. When you are looking for millions of patterns in millions of data buckets, you are bound to find spurious correlations: what look like strong signals (using traditional pattern-recognition algorithms) are likely to be just plain noise. Still, big data contains more information than smaller data; you need the right methodology to identify the real signals and insights. The issue is discussed in my article, “The Curse of Big Data.”
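A quick simulation (not from the article; the numbers are arbitrary) makes the point: correlate enough pure-noise series and some pairs will look strongly related by chance alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n_series, n_points = 1000, 50
data = rng.standard_normal((n_series, n_points))    # pure noise, no signal

corr = np.corrcoef(data)                            # 1000 x 1000 correlations
upper = corr[np.triu_indices(n_series, k=1)]        # ~500k unique pairs
strong = np.sum(np.abs(upper) > 0.5)

# A few hundred "strong" correlations typically appear, all spurious.
print(f"{strong} pairs with |r| > 0.5 out of {upper.size}")
```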

Q. Where do you think we are with Big Data along the Gartner hype cycle? (Feel free to amend their construct as you see fit.)

I think it depends on the user. For the 1% consuming 99% of all data, we are close to the “plateau of productivity.” For the 99% consuming 1% of all data, we are barely beyond the “trough of disillusionment.”

Q. What standards – formal or de facto – are you watching most closely?

I am looking at what universities and organizations offer in terms of training and certifications. A number of organizations, including ourselves, offer non-standard training and certification: online, on-demand, low-cost, non-academic, practical. It will be interesting to see how this training ecosystem evolves.

Q. Which computing disciplines that existed 5-10 years ago do you feel will transition most readily to today’s data science frameworks?

I don’t think any particular discipline is better positioned. First, the boundaries are fuzzy: machine learning is statistics (clustering), data mining and AI are machine learning, data engineering is computer science. Second, data science is a hybrid, cross-discipline science. It was created precisely as a reaction against silos – computer science, business analytics, statistics, operations research and so on.

Q. In the interview on your web site with Sir David Cox, he says, “My intrinsic feeling is that more fundamental progress is more likely to be made by very focused, relatively small scale, intensive investigations than collecting millions of bits of information on millions of people, for example. It’s possible to collect such large data now, but it depends on the quality, which may be very high or not, and if it is not, what do you do about it?” What is Cox saying to communities like the American Society for Quality and others who would like to positively influence Big Data quality?

I just reposted the interview; I am not sure who interviewed David Cox in the original StatisticsViews.com article. I have two reactions to David’s comment. First, many organizations faced with big data do a poor job of extracting insights with good ROI. Maybe they are not analytically mature enough to deal with big data; in this sense, David is right. But I believe the future is in developing methodology for very big data, and it does not need to be expensive – see my article “Big Data is Cheap and Easy.” Big Data is very different from traditional data. Mathematicians found that some number-theory conjectures, true for billions of numbers, did not hold for all numbers once they had enough theoretical knowledge or computing power to run really big tests; likewise, I expect data practitioners to make similar discoveries about big data.

Q. Are what Cox calls “small, carefully planned investigations” orthogonal to self-service tools like Tableau or QlikView, which can operate in a hypothesis vacuum?

I’m sure Cox is talking about experimental design, a branch of statistics. I believe model-free, data-driven, assumption-free testing has a lot of potential. Much of the criticism of traditional statistics concerns testing one hypothesis against another, when in fact you would like to find out which hypotheses, out of 20, seem to provide the most reliable causal explanations.
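One standard way to formalize that many-hypothesis setting – offered here as a sketch, not necessarily the method Cox or Granville has in mind – is the Benjamini-Hochberg procedure, which ranks candidate hypotheses by p-value and controls the false discovery rate across all of them at once:

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses kept as discoveries at FDR level q."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                     # rank p-values ascending
    thresholds = q * np.arange(1, m + 1) / m  # BH line: q * k / m
    below = p[order] <= thresholds
    if not below.any():
        return np.array([], dtype=int)        # nothing survives
    cutoff = np.nonzero(below)[0].max()       # largest rank under the line
    return order[:cutoff + 1]

# Twenty candidate explanations, two of them genuinely strong:
rng = np.random.default_rng(42)
p_vals = np.concatenate([[0.0005, 0.003], rng.uniform(0.05, 1.0, 18)])
print(benjamini_hochberg(p_vals))             # -> [0 1]
```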

Q. What changes, if any, would you like to see in the data science curriculum to improve data visualization?

Data videos, as described in my most recent book (Developing Analytical Talent: Becoming a Data Scientist, Wiley, April 2014), and the ability to put more dimensions into what is, at the end of the day, a two-dimensional chart. You can do this by playing with a full spectrum of symbols, dot sizes, colors and palettes. But it comes with a price: training decision executives to quickly interpret more sophisticated graphs. In any case, simplicity is paramount in any visualization, or your end users will eventually misinterpret your charts or stop looking at them.
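A minimal matplotlib sketch of the idea, using synthetic data: a two-dimensional scatter that carries two extra dimensions through dot size and color.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 200
x, y = rng.random(n), rng.random(n)     # dimensions 1 and 2: position
magnitude = rng.random(n) * 300         # dimension 3: dot size
category = rng.random(n)                # dimension 4: color

fig, ax = plt.subplots()
pts = ax.scatter(x, y, s=magnitude, c=category, cmap="viridis", alpha=0.6)
fig.colorbar(pts, ax=ax, label="dimension 4")
ax.set_xlabel("dimension 1")
ax.set_ylabel("dimension 2")
plt.show()
```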

Q. Those in the non-academic, media side of the data business extol the virtues of narratives and storytelling. They’re not always clear, though, about what data to use and when. How should they proceed in curating data-driven, data-annotated stories?

I like good infographics. Cartoons might also have the potential to quickly convey an important concept or finding, though as of today I know of no one communicating insights via cartoons.

Q. The forums you have fostered have been around awhile now. How has the population in the forums changed over time?

The population has become more purely data-oriented over the last two years, with a shift from pure analytics to computer science, data engineering and architecture. In some sense, it is now more mainstream, even attracting IT people.

Q. What trends are you watching most keenly?

I am looking at the types of technologies that are getting more popular. While everybody was and still is talking about NoSQL, people realize that Map-Reduce can’t efficiently solve everything, and that there are solutions such as graph databases (technically also NoSQL) to address spatial problems – weather prediction, for instance. I also see a number of statisticians and business analysts who either want to be more involved in data science or are aggressively protecting their turf against any kind of change, positive or negative. Some practitioners are opposed to a hybrid, cross-discipline role such as data scientist, and claim that well-defined roles (I call them silos) are better. I believe start-ups are more open to data science and automation as a way to compete against bigger companies – for instance, building a crowdsourcing application that predicts the price of any standard medical procedure in any hospital in the world, while big healthcare in the US has been unable to provide any kind of pricing to patients asking about the cost of a procedure.

Q. Do you have a few favorite software tools that you feel are indispensable?

I have been using open source for a long time: many UNIX tools, then Perl for text-processing-intensive applications (prototyping and development). Python (check out the pandas library for data analysis) and R (which I like for visualizations) are now more popular and integrate better in a team environment. I have used Syncsort to perform large sorts on credit card transactions while working at Visa. PuTTY (Telnet), FileZilla (FTP), Snip (for screenshots), web crawlers (I write mine in Perl), the Cygnus environment (a kind of UNIX for Windows), Excel, and knowledge of SQL, HTML, XML, JavaScript, a programming language such as C, UNIX, and R are all indispensable.

Mark Underwood writes about knowledge engineering, Big Data security and privacy @knowlengr LinkedIn.
