A few weeks ago, I wrote about the recent rise in popularity of open data, and how these public data sets can be easily processed with Essentia. All the examples in that post are based on Amazon’s AWS Public Data Sets, which are (for the most part) large databases put together by organizations for public access and use. However, because the AWS data sets are voluntarily published by each organization, many are not regularly updated. In the US Transportation database available on AWS, aviation records and statistics are provided from 1988 to 2008. More recent data (through April 2016) can be found on the US Department of Transportation’s website, but in a format different from that of the data provided on AWS. Other open data are not prepackaged at all: for example, the US Census Bureau has information on state tax collections from 1992 to 2014, but on the website, recent data is separated from historical data, and from there, visitors can only view data for one year at a time. Furthermore, while tables for recent years can be downloaded as CSV or Excel workbooks, older tables are only available as Excel files. How do these issues affect people seeking to work with open data? Added to the complexity of processing large amounts of data is the challenge of first collecting all of the available files, then processing each one (separately, if they come in different file types and data formats) before putting everything together. Read on to see how Essentia rises to the occasion.
This dataset comprises annual, seasonal, monthly, daily, and hourly data for a number of climate normals, including temperature, wind vector direction, dew point, and more. I chose monthly average temperature data for this example and preprocessed the file using aq_pp. This selection was a little trickier to work with than the other data sets I’ve used, mostly because the temperature data was formatted as follows: “820C” is 82.0 degrees Fahrenheit, and the “C” flag indicates completeness (all 30 years were accounted for in calculating this value). For analysis, I wanted just the “82.0”, and I found the MapFrom and MapTo options perfect for this task. I added an option for each column to my preprocessing aq_pp command, using the RT mapping syntax like so: -map Jan "%%var%%%%var2:0-1%%%?" "%%var%%.%%var2%%". In the first RT expression, the trailing "?" catches the flag letter, while "var" and "var2" together match the remaining three-digit number, with "var2"'s length requirement ensuring that "var" takes exactly the first two digits. This maps January’s entries from their original form to the new form, and I reused the expression for each month’s column. The result was a light, clean CSV file, filtered for the station IDs in the cities I was interested in, with column labels corresponding to each month (omitted in the raw data file) and every entry converted to the form “82.0”. I used Excel to generate the graph above.
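For readers who want the transformation spelled out, here is a rough plain-Python equivalent of that mapping step (not the aq_pp syntax itself). The station ID and monthly values are made-up samples, and the regex assumes three-digit values with an optional trailing flag letter, matching the format described above.

```python
import re

# Hypothetical sample row: a station ID followed by monthly values
# like "820C" (82.0 degrees Fahrenheit, completeness flag "C").
row = ["USW00023188", "820C", "835S", "861R"]

def strip_flag(value):
    """Mimic the mapping: '820C' -> '82.0'. Capture the first two digits
    and the third digit separately, allow an optional flag letter, and
    rejoin the digits with a decimal point."""
    m = re.fullmatch(r"(\d{2})(\d)[A-Z]?", value)
    return f"{m.group(1)}.{m.group(2)}" if m else value

cleaned = [row[0]] + [strip_flag(v) for v in row[1:]]
```

In aq_pp this runs as one option per column inside the streaming preprocess step; the sketch above only illustrates the pattern logic on a single row.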
NASA’s SSE web portal lists download links for four surface meteorology and solar energy parameters: radiation, climate, wind, and solar energy. Radiation and climate measurements are monthly and annual averages taken over 22 years (July 1983-June 2005), wind speed is given as monthly and annual averages taken over 10 years (July 1983-June 1993), and solar geometry observations are provided for the “monthly average day”. In this example, I examine the following radiation parameter: clear sky insolation incident on a horizontal surface (kWh/m^2/day). A common feature in the SSE text files is several lines of metadata above each table. With the aq_pp preprocessing tool, it’s easy to use the input file attribute (i.e., “aq_pp -f,+15,eok …”) to skip all of these. From there, I indicated the space delimiter, and Essentia did the rest. For the graph above, I picked only the coordinates for the cities I was interested in, but my original cleaned CSV file included every coordinate from the raw data.
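The skip-and-split step that aq_pp performs here can be mimicked in a few lines of plain Python. This is a minimal sketch, not Essentia itself: the two metadata lines and the data row are made-up stand-ins for a real SSE file, which carries more header lines (hence the +15 in the aq_pp attribute).

```python
import io

# Mocked-up stand-in for an SSE text file: metadata lines above a
# whitespace-delimited table of (latitude, longitude, monthly values).
raw = io.StringIO(
    "NASA Surface meteorology and Solar Energy\n"
    "Parameter: clear sky insolation (kWh/m^2/day)\n"
    "-90 0 6.14 5.30 5.94\n"
)

SKIP = 2  # the real files needed 15 skipped lines (-f,+15,eok)
rows = []
for i, line in enumerate(raw):
    if i < SKIP:
        continue           # drop the metadata lines
    rows.append(line.split())  # split on the space delimiter
```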
On the SEC website, visitors can download market metrics, trade distributions, earnings reports for publicly traded companies, and a host of other information. In the spreads tables, the following is recorded for each ticker: trade observations, trade volume, and spread for TWQ, TWRQ, VWES, VWRES, DVWES, and DVWRES. I only wanted to look at trade volume, so I extracted just the ticker name and trade volume fields and filtered for the tickers of the companies I was interested in: Facebook (FB), Intel (INTC), and Apple (AAPL). After collecting these into a single CSV, I used Excel’s Autofill function to fill in the dates and create the time series above. It’s worth noting that the SEC data is organized by year: each year’s zipped folder contains one file for each day the stock market was active, and the dates are not included in the files themselves but are instead indicated in the file names. This makes Essentia’s Date Regex function very convenient for cataloging the records by date when creating the category. Doing so allows streaming and querying within specified date ranges, which could prove useful for taking a closer look at a shorter span of time, such as weeks or months in 2013.
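In plain Python, pulling a date out of a file name looks something like the sketch below, which is the same idea Essentia’s Date Regex applies when cataloging. The file-name pattern (spreads_YYYYMMDD.csv) is a hypothetical stand-in, since the actual SEC naming scheme isn’t shown here.

```python
import re
from datetime import date

def date_from_name(filename):
    """Extract an 8-digit YYYYMMDD date from a file name, if present.
    The naming pattern is assumed for illustration."""
    m = re.search(r"(\d{4})(\d{2})(\d{2})", filename)
    if m is None:
        return None
    return date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
```

With each record stamped this way, a query can be restricted to a date range simply by comparing the extracted dates.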
The SSA has made available on data.gov a comprehensive CSV file that records disability claim data for each fiscal year. This file contains roughly 30 columns, indicating the file name, region code, state code, favorable adult determinations, eligible child population filing rate, and numerous other statistics. The records are ordered by region code (alphabetically), then state code, then fiscal year. This, coupled with all the measurements that can’t really be graphed together (populations, percentages, determinations), makes even the most rudimentary data analysis fairly inconvenient. I used Essentia to extract just the fields I wanted: state, adult population, and percent of the adult population receiving disability benefits. With the region code field removed, Essentia preserved the original ordering by state and then by year; I then used an Excel pivot table to sort by size of adult population and select the states to include in my analysis. At this point, the data was clean and minimal enough for Excel to automatically recognize the kind of line chart I wanted to generate, shown above. Open data generally includes a ton of information because it is intended to be accessible and useful to anyone, but this often means that most users don’t need everything. Using Essentia, it took me less than five minutes to extract the data I wanted, making the analysis process both faster and easier.
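The column-extraction step can be sketched with Python’s csv module. This is only an illustration of the idea, not Essentia’s implementation; the column names and rows below are hypothetical stand-ins for the real SSA fields.

```python
import csv
import io

# Two made-up rows standing in for the ~30-column SSA file.
raw = io.StringIO(
    "region_code,state,adult_pop,pct_receiving,other\n"
    "A,CA,30000000,4.1,x\n"
    "A,TX,20000000,5.2,y\n"
)

# Keep only the three fields of interest, preserving row order.
keep = ("state", "adult_pop", "pct_receiving")
slim = [{k: row[k] for k in keep} for row in csv.DictReader(raw)]
```

Because the reader streams rows in file order, dropping the region code column leaves the state-then-year ordering intact, just as described above.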
The National Oceanic and Atmospheric Administration keeps track of severe weather events, recording the coordinates, cell ID, azimuth, shear, and numerous other variables for each. I wasn’t really interested in creating a time series for this data set, since only the year is given for each event. Instead, I extracted the coordinates and shear of each event, regardless of date, and plugged the contents of my resulting CSV file into BatchGeo, a handy little site. BatchGeo generates a KML file ready for use with Google Earth (see my screenshot above), and also emails users links to the maps they’ve created so that they can edit them later. Once in Google Earth, it’s pretty interesting to look around the world and see where the most severe weather events are recorded; you can also zoom in on each location to see the shear recorded there.
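For the curious, the kind of KML that BatchGeo produces can be approximated by hand: one Placemark per event. The (latitude, longitude, shear) rows below are made-up samples, and this is a bare-bones sketch rather than BatchGeo’s actual output.

```python
# Made-up (lat, lon, shear) rows standing in for the extracted CSV.
events = [("35.22", "-97.44", "45"), ("41.88", "-87.63", "38")]

# KML coordinates are ordered longitude,latitude,altitude.
placemarks = "".join(
    f"<Placemark><name>shear {shear}</name>"
    f"<Point><coordinates>{lon},{lat},0</coordinates></Point></Placemark>"
    for lat, lon, shear in events
)
kml = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
    f"{placemarks}</Document></kml>"
)
```

Saved with a .kml extension, a file like this opens directly in Google Earth.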
The increasing ubiquity of open data sources brings with it increased variability in where information is made available and in if and how it is compressed, organized, and formatted. Data scientists and business analysts alike will find that many data management tools can only analyze certain formats or decompress certain archive types. Essentia’s flexible, powerful functions make it the perfect tool for wrangling any kind of data: it integrates seamlessly into the data analysis process by doing exactly what’s needed and then getting out of the way.