A few weeks ago, I wrote about the recent rise in popularity of open data, and how these public data sets can be easily processed with Essentia. All the examples in that post are based on Amazon’s AWS Public Data Sets, which are (for the most part) large databases put together by organizations for public access and use. However, because the AWS data sets are voluntarily published by each organization, many are not regularly updated. The US Transportation database available on AWS provides aviation records and statistics from 1988 to 2008; more recent data (through April 2016) can be found on the US Department of Transportation’s website, but in a format different from that of the data on AWS. Other open data are not prepackaged at all: the US Census Bureau, for example, has information on state tax collections from 1992 to 2014, but its website separates recent data from historical data, and visitors can only view one year at a time. Furthermore, while tables for recent years can be downloaded as CSV or Excel workbooks, older tables are only available as Excel files. How do these issues affect people seeking to work with open data? Added to the complexity of processing large amounts of data is the challenge of first collecting all of the available files, then processing each one (separately, if they come in different file types and data formats) before putting everything together. Read on to see how Essentia rises to the occasion.
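The collect-then-merge problem described above can be sketched in a few lines of plain Python. This is a toy illustration, not Essentia itself: the file contents, delimiters, and column names below are invented stand-ins for the mixed yearly formats a site like the Census Bureau’s might serve.

```python
import csv
import io

# Two toy "downloads" standing in for files published in different eras:
# an older pipe-delimited layout and a newer comma-delimited one.
# Both the data and the headers are invented for this example.
old_format = "Year|State|Total Taxes\n1992|CA|45000\n"
new_format = "state,year,total_taxes\nCA,2014,138000\n"

def normalize(text, delimiter, field_map):
    """Read one delimited file and rename its columns to a common schema."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [{field_map[k]: v for k, v in row.items()} for row in reader]

# Map each file's headers onto one shared schema before merging.
records = []
records += normalize(old_format, "|",
                     {"Year": "year", "State": "state", "Total Taxes": "total_taxes"})
records += normalize(new_format, ",",
                     {"state": "state", "year": "year", "total_taxes": "total_taxes"})

# With every file reduced to the same schema, merging is just concatenation.
records.sort(key=lambda r: r["year"])
```

The point of the sketch is the shape of the work: each source format needs its own small normalization step before anything can be combined, and that per-format step is exactly what multiplies as the number of file types grows.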
Open data as a term is relatively new, but the concept is not. The idea that information should be freely available for unrestricted use has been around for a while, but didn’t really take off until the rise of the Internet made it feasible to share data quickly and globally. Add in the recent popularity of big data, and it makes sense that public datasets are on the rise as well. Enormous amounts of valuable data, covering everything from climate projections to genome sequences, have been made available by the organizations that own them and are now free to download on Amazon Public Data Sets, Data.gov, and more. The possibilities are endless, as researchers, businesses, and citizens from around the world gain access to data that would otherwise be extremely expensive and time-consuming to collect. The challenge that follows is how these researchers and businesses will handle such large datasets: storing, organizing, and analyzing them.
The wealth of information in Apache logs is astounding, but that information is often buried in hard-to-find files, riddled with errors, and difficult to extract. Essentia easily handles the first two issues with the Essentia Scanner and Essentia Preprocessor. To extract more specialized data from Apache logs and other sources, Essentia also allows easy creation and integration of custom modules to supplement its analysis.
Apache server logs present an important opportunity, with a multitude of valuable insights to be gained, but they are typically buried in S3 directories alongside many other logs in entirely different formats. Not only must the correct logs be extracted from their datastore, but they must also be converted into a format that can be properly analyzed.
This is where Essentia comes in. First, we scan the S3 directory to select exactly the access logs we want to analyze. Then we use the Essentia Log Converter to convert these access logs, on the fly, into a form readable by our Preprocessor (i.e., a singly-delimited format).
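To make the conversion step concrete, here is a minimal Python sketch of what turning an access log into a singly-delimited record involves. This is not the Essentia Log Converter; it parses the standard Apache Combined Log Format with a regular expression, and the sample line is invented for illustration.

```python
import re

# Regex for the Apache Combined Log Format: client IP, timestamp,
# request line, status, response size, referer, and user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def to_tsv(line):
    """Convert one access-log line into a singly tab-delimited record."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None  # skip lines that are not access-log entries
    return "\t".join(m.group("ip", "time", "method", "path",
                             "status", "size", "agent"))

# An invented sample line in Combined Log Format.
sample = ('203.0.113.7 - - [10/Oct/2016:13:55:36 -0700] '
          '"GET /index.html HTTP/1.1" 200 2326 '
          '"http://example.com/" "Mozilla/5.0"')
row = to_tsv(sample)
```

Once every line is a flat, tab-delimited record with a fixed column order, downstream tools can treat the log like any other delimited table.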
In one step, we ignore the irrelevant columns in the Apache logs so we can focus on processing only the most relevant data. Then we use a custom C module to extend Essentia’s analysis and extract location and system information from the users’ IP addresses.
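The enrichment idea behind that custom module can be illustrated with a short Python sketch. To be clear about assumptions: the post’s actual module is written in C and would consult a real IP-to-location database; the prefix table and user-agent keywords below are invented toy data that only show the lookup pattern.

```python
# Toy lookup table mapping dotted network prefixes to locations.
# A real geolocation module would consult a full IP database;
# these entries (in IANA documentation ranges) are invented.
PREFIX_LOCATIONS = {
    "203.0.113": "Example City A",
    "198.51.100": "Example City B",
}

def locate(ip):
    """Return a location for the longest matching dotted prefix, if any."""
    parts = ip.split(".")
    for length in (3, 2, 1):
        prefix = ".".join(parts[:length])
        if prefix in PREFIX_LOCATIONS:
            return PREFIX_LOCATIONS[prefix]
    return "unknown"

def os_from_agent(agent):
    """Pull a coarse operating-system name out of a user-agent string."""
    for name in ("Windows", "Mac OS X", "Linux", "Android", "iPhone"):
        if name in agent:
            return name
    return "other"

loc = locate("203.0.113.7")
system = os_from_agent("Mozilla/5.0 (Windows NT 10.0; Win64)")
```

Each log record gains two derived columns this way, turning raw IPs and user-agent strings into fields you can actually group and count on.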
Information is everywhere, and people are starting to realize the benefits of using it. Unfortunately, this information is often spread across many different sets of files and stored in a variety of places. Finding all of this data and merging it into one complete dataset that’s ready for analysis is a difficult and complicated task. We rose to this challenge and created Essentia to make the process quick, easy, and efficient.
By simply telling the Essentia Scanner where your data is located, you can immediately start to categorize your files so that you can select exactly the data you need. Then you can stream this data into the Essentia Preprocessor where it can be combined in a variety of ways to make sure you get the entire set of data that you’re looking for.
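The scan-and-categorize step can be pictured with a small Python sketch. This is only a conceptual stand-in for the Essentia Scanner, using pattern matching over a file listing; the directory names, file names, and rule patterns are all invented for the example.

```python
import fnmatch

# A toy directory listing standing in for a bucket inventory;
# the paths are invented for this example.
listing = [
    "logs/2016/06/access-0601.log.gz",
    "logs/2016/06/error-0601.log",
    "logs/2016/07/access-0701.log.gz",
    "exports/taxes-2014.csv",
]

# "Categorize" the files: attach a rule name to each glob pattern so
# you can later select exactly the category of data you need.
rules = {
    "access": "logs/*/*/access-*.log.gz",
    "errors": "logs/*/*/error-*.log",
}

categories = {
    name: [f for f in listing if fnmatch.fnmatch(f, pattern)]
    for name, pattern in rules.items()
}

# Downstream, only the files in the chosen category would be streamed
# into the preprocessing step and combined.
selected = categories["access"]
```

Separating "find and label the files" from "process them" is the design point here: once files carry category labels, selecting the right subset is trivial no matter how cluttered the store is.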