The AWS (Amazon Web Services) Marketplace is an online store that lets customers find, procure, and use software, data, and services that run on Amazon Elastic Compute Cloud (EC2). In December 2008, AWS launched Public Data Sets on AWS, a central repository of public data sets accessible at no charge. As with all AWS services, users pay only for the storage and computing services they actually use.
Before the AWS Marketplace launched Public Data Sets, anyone wanting to access or process large data sets (like genome data) had to spend considerable time finding, downloading, customizing, and analyzing the data. Public Data Sets let anyone access and analyze this data using EC2 services or Amazon Elastic MapReduce (EMR) services, which are hosted Hadoop clusters.
The goal of Public Data Sets was to enable innovation by lowering the barriers to entry for big data processing, so that those without the resources to access and process huge data sets could compete more effectively in a marketplace of innovation and ideas. With Public Data Sets, the AWS Marketplace helps innovators focus on research rather than the complexities of managing computational infrastructure.
How Do AWS Marketplace Public Data Sets Work?
Public Data Sets come in one of two formats: Amazon Elastic Block Store (EBS) or Amazon Simple Storage Service (S3). EBS provides block-level storage for use with EC2 that can be scaled up or down quickly. It may be used in mission-critical applications like Microsoft SharePoint, for example. Amazon S3 is designed for web-scale computing and provides a simple web services interface for storing and retrieving data anytime and anywhere on the web. It is well suited to content storage and distribution.
Accessing data in EBS format involves signing up for an AWS account, launching an Amazon EC2 instance, and creating an Amazon EBS volume using a Snapshot ID obtained from the catalog of Public Data Sets. One way to launch an EC2 instance and create an Amazon EBS volume is with a Firefox plugin called ElasticFox.
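The same EBS steps can also be scripted. The following is a minimal sketch, assuming the boto3 SDK for Python rather than the ElasticFox plugin mentioned above, and assuming an EC2 instance is already running. The helper names (volume_request, create_and_attach), the device name, and every ID passed in are hypothetical placeholders; the real Snapshot ID comes from the Public Data Sets catalog.

```python
def volume_request(snapshot_id: str, availability_zone: str) -> dict:
    """Build the arguments for creating an EBS volume from a snapshot.

    The volume has to live in the same Availability Zone as the EC2
    instance it will be attached to.
    """
    if not snapshot_id.startswith("snap-"):
        raise ValueError(f"not a snapshot ID: {snapshot_id!r}")
    return {"SnapshotId": snapshot_id, "AvailabilityZone": availability_zone}


def create_and_attach(snapshot_id: str, instance_id: str,
                      availability_zone: str, region: str,
                      device: str = "/dev/sdf") -> str:
    """Create a volume from a snapshot and attach it to a running instance.

    Requires boto3 and configured AWS credentials. The device name is an
    assumption; pick one that suits the instance's operating system.
    """
    import boto3  # imported here so the helper above works without the SDK

    ec2 = boto3.client("ec2", region_name=region)

    # Step 1: create the volume from the data set's snapshot.
    volume = ec2.create_volume(**volume_request(snapshot_id, availability_zone))
    volume_id = volume["VolumeId"]

    # Step 2: wait until the new volume is ready to be attached.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Step 3: attach it to the instance so it can be mounted and read.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return volume_id
```

Once the volume is attached, it still has to be mounted inside the instance before the data can be read. Charges for the instance and the volume apply for as long as they exist.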
Data hosted in S3 can be accessed in several ways. Users can issue a simple HTTP request, use the AWS Command Line Tools and SDKs, download the data to an Amazon EC2 instance, or process it in place with Amazon Elastic MapReduce.
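As an illustration of the simple HTTP route, here is a minimal sketch using only the Python standard library. It assumes the object is publicly readable and reachable through S3's virtual-hosted-style URL form; the function names and the bucket and key in the usage note are hypothetical, and some data sets may require a region-specific endpoint or signed requests instead.

```python
from urllib.parse import quote
from urllib.request import urlopen


def public_object_url(bucket: str, key: str) -> str:
    """Return a virtual-hosted-style HTTPS URL for a public S3 object."""
    # Percent-encode the key but keep "/" so the folder-like structure survives.
    return f"https://{bucket}.s3.amazonaws.com/{quote(key, safe='/')}"


def fetch_prefix(bucket: str, key: str, num_bytes: int = 1024) -> bytes:
    """Download only the first num_bytes of an object, as a quick peek."""
    with urlopen(public_object_url(bucket, key)) as response:
        return response.read(num_bytes)
```

A call such as fetch_prefix("example-public-dataset", "weather/1929.csv") would return the first kilobyte of that file, which is handy for checking a schema before committing to a full download or an Elastic MapReduce job.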
How Are Public Data Sets Added?
Anyone interested in making data sets freely available can submit an application for consideration in Public Data Sets. Once an application is submitted, the AWS Marketplace team reviews it to determine if the data set is a good fit. The submitter must have the right to make the data available freely, and once a data set is selected for inclusion, the applicant must provide a description of the data set, its schema, and sample code showing how it might be analyzed.
Interesting Examples of AWS Public Data Sets
Public Data Sets in the AWS Marketplace cover a vast range of topics and are available for anyone to use. Here is a very small sampling:
• Sloan Digital Sky Survey – a map of one-quarter of the entire sky in detail, including positions and absolute brightness of hundreds of millions of celestial objects. Data also includes measured distances to more than one million galaxies and quasars.
• Daily Global Weather Measurements, 1929-2009 – eighty years’ worth of daily weather measurements such as temperature, wind speed, and humidity, collected from over 9,000 weather stations around the world.
• Material Safety Data Sheets – a collection of over 230,000 material safety data sheets in plain text format. Information on chemical components, storage, handling, first aid measures, and more is included.
AWS Marketplace Public Data Sets also include all the current facts and assertions in the Freebase open database, covering millions of topics.
Syncsort’s Ironcluster and AWS Data Sets
If you need to process massive amounts of data but don’t have the technical resources required, Syncsort’s Ironcluster, along with the AWS Marketplace, can help. Amazon’s Elastic MapReduce lets organizations set up and operate Hadoop clusters in the cloud, which is significantly easier than doing it on-site. But even with Elastic MapReduce, there’s a fairly steep learning curve and considerable manual coding involved.
Ironcluster, however, flattens that learning curve, letting you design Elastic MapReduce jobs graphically, without writing code. You can create data flows like joins, web log aggregations, and change data capture (CDC) quickly using a library of Use Case Accelerators. Ironcluster can also connect to data sources other than the AWS Marketplace, including mainframe, HDFS, and Salesforce.com data. And cost-effective scaling is available if you need more nodes.
The AWS Marketplace offers Public Data Sets in addition to the private data sets that organizations store on S3 or EBS. Syncsort’s Ironcluster lets organizations put that data to work in innovative, powerful ways, without the headaches of coding and on-site provisioning.