aq_pp, AWS, Big Data Trends, Blog Archive, Open Data, Public Data Sets, Use Case / 20 June 2016 / Dorothy

Open Data and Essentia

Open data as a term is relatively new, but the concept is not. The idea that information should be freely available for unrestricted use has been around for a while, but it didn’t really take off until the rise of the Internet made it feasible to share data quickly and globally. Add in the recent popularity of big data, and it makes sense that public datasets are on the rise as well. Enormous amounts of valuable data, including everything from climate projections to genome sequences, have been made available by the organizations that own them and are now free to download from sources such as Amazon Public Data Sets and Data.gov. Researchers, businesses, and citizens around the world now have access to data that would otherwise be extremely expensive and time-consuming to collect. The challenge that follows is how these researchers and businesses are going to handle such large datasets, including their storage, organization, and analysis.

Essentia is uniquely equipped to do all of these things and more. Data in various shapes and forms can be easily categorized, and from there quickly preprocessed into the desired format. Essentia’s flexibility is well suited to handle all the different types of files that appear in public datasets, and its powerful analytics engine is extremely useful in drawing meaningful conclusions from the cleaned, organized data. Read on to see some examples of how Essentia can be used with open data.

 

Figure: Average flight delay time, in minutes, for three major US carriers

Aviation statistics:

One of the largest AWS public datasets is US Transportation, a collection of statistics and other information on different modes of transportation within the United States. The AWS data spans 1988-2008. I organized and analyzed it in Essentia, looking at measures such as average delay time and the proportion of flights delayed by more than 15 minutes. Essentia’s aq_udb tools make it really easy to process and extract the desired data. Because they use stream-based processing, values like totals and averages can be computed by aggregating information from hundreds of large files, all in one command. The US Department of Transportation website has more recent data available as well, and even though the newer data was formatted differently from the data stored in AWS, Essentia made it easy to extract the same information from each and combine the two for the analysis shown here.
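To make the idea of stream-based aggregation concrete, here is a minimal sketch in plain Python. It is not Essentia's aq_udb syntax and not the script used for the analysis above; it only illustrates reading many files row by row and keeping running totals per carrier. The function name and the column names `UniqueCarrier` and `ArrDelay` are assumptions for illustration, and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def aggregate_delays(streams, carrier_col="UniqueCarrier", delay_col="ArrDelay",
                     threshold=15.0):
    """Stream CSV rows from several files and aggregate delay stats per carrier.

    Column names are assumed for this sketch; adjust them to the real files.
    """
    # carrier -> [flight count, summed delay, flights delayed beyond threshold]
    totals = defaultdict(lambda: [0, 0.0, 0])
    for stream in streams:
        for row in csv.DictReader(stream):
            try:
                carrier = row[carrier_col]
                delay = float(row[delay_col])
            except (KeyError, TypeError, ValueError):
                continue  # skip rows with a missing or non-numeric delay
            acc = totals[carrier]
            acc[0] += 1
            acc[1] += delay
            if delay > threshold:
                acc[2] += 1
    return {
        carrier: {
            "flights": n,
            "avg_delay": total / n,
            "share_delayed": late / n,
        }
        for carrier, (n, total, late) in totals.items()
    }


# Made-up rows standing in for two yearly files.
file_a = io.StringIO("UniqueCarrier,ArrDelay\nAA,10\nAA,30\nUA,NA\nUA,20\n")
file_b = io.StringIO("UniqueCarrier,ArrDelay\nAA,-4\nUA,0\n")
for carrier, stats in sorted(aggregate_delays([file_a, file_b]).items()):
    print(carrier, stats)
```

Only the running totals are held in memory, so the same loop works whether it is fed two small files or hundreds of large ones.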

 

Figure: Number of border crossings by state, shown for the seven states with the highest volumes of border crossings

Border crossings:

Also taken from the US Transportation dataset in AWS, border crossing data is provided as daily records for each entry point in each state, and also includes the mode of transportation for each traveler. As with the aviation data, I originally analyzed the data available in AWS, which covered 1997-2007. Then, after downloading the most recent information available from the US Department of Transportation website and uploading it into our own S3 bucket, I processed the new data with Essentia as well, producing the updated analysis above.
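The ranking behind a chart like this one boils down to summing crossings per state and keeping the largest totals. The following is a small Python sketch of that step, not the Essentia commands I actually ran; the function name, the column names `State` and `Value`, and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def top_states(stream, n=7, state_col="State", count_col="Value"):
    """Sum crossings per state and return the n states with the highest totals."""
    totals = Counter()
    for row in csv.DictReader(stream):
        try:
            state = row[state_col]
            crossings = int(row[count_col])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric count
        totals[state] += crossings
    return totals.most_common(n)


# Made-up records standing in for per-entry-point rows.
sample = io.StringIO(
    "State,Value\nTexas,100\nCalifornia,80\nTexas,50\nArizona,30\nCalifornia,x\n"
)
print(top_states(sample, n=2))
```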

 

Figure: Annual average hourly compensation for each industry

Figure: Rate of change of annual average hourly compensation for each industry

Labor statistics:

US Labor Statistics is another AWS public dataset, containing data on everything from inflation to workplace injuries. Processing this data would usually be fairly involved, especially since a lot of it is in text, PDF, and other file types that aren’t very Excel-friendly. However, I was able to use the aq_pp preprocessing commands to organize, clean, and export this information in an analysis-friendly form (the analysis is shown in the first figure above). Doing so also made it a lot easier to do some simple regression calculations, and the slope of each regression line can be seen in the second figure.
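For readers unfamiliar with the regression step, the slope of a simple least-squares line is the covariance of the two variables divided by the variance of the x values. The sketch below shows that calculation in plain Python, assuming one value per year for a single industry; the function name and the numbers are made up for illustration and are not taken from the dataset.

```python
def regression_slope(xs, ys):
    """Return the ordinary least-squares slope of ys against xs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Made-up yearly averages (dollars per hour) for one hypothetical industry.
years = [2000, 2001, 2002, 2003]
hourly_compensation = [20.0, 20.5, 21.0, 21.5]
print(regression_slope(years, hourly_compensation))
```

The result is in dollars per hour per year, which is how the rate of change in the second figure can be read.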

These are just a few examples of how Essentia can be used to efficiently work with open data. Essentia makes it easy to work with different kinds of information, and is capable of doing so on a large scale not accommodated by most data analysis tools. As the open data community continues to grow, it will become increasingly important to be able to work with the wealth of data that is quickly becoming publicly available. As you can see, Essentia is exceptionally well-equipped to do so.

For more information and sample scripts used in the examples above, go to our git repository.
