Open Data and Essentia (Part 2)

A few weeks ago, I wrote about the recent rise in popularity of open data, and how these public data sets can be easily processed with Essentia. All the examples in that post are based on Amazon’s AWS Public Data Sets, which are (for the most part) large databases put together by organizations for public access and use. However, because the AWS data sets are voluntarily published by each organization, many are not regularly updated. In the US Transportation database available on AWS, aviation records and statistics are provided from 1988 to 2008. More recent data (through April 2016) can be found on the US Department of Transportation’s website, but in a format different from that of the data provided on AWS. Other open data are not prepackaged at all: for example, the US Census Bureau has information on state tax collections from 1992 to 2014, but on its website, recent data is separated from historical data, and visitors can only view data for one year at a time. Furthermore, while tables for recent years can be downloaded as CSV files or Excel workbooks, older tables are only available as Excel files.

How do these issues affect people seeking to work with open data? Added to the complexity of processing large amounts of data is the challenge of first collecting all of the available files, then processing each one (separately, if they come in different file types and data formats) before putting everything together. Read on to see how Essentia rises to the occasion.

Read more

Hadoop vs Essentia: Media Overlap Calculation

At AuriQ, we use Essentia to analyze digital marketing logs for a diverse set of clients. Some are interested in traditional attribution scores, others in lift metrics, and others in brand exposure.

The problem is the following: we want users to see an ad on as many different websites as possible. But which groups of sites are most commonly visited together? For simplicity, let’s focus on site pairs: count the unique users who visit both site A and site B, both site A and site C, and so on, then rank these pairs in order of decreasing unique user count. We can then focus ad spend on the highest-ranked pairs of sites.
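The pair-counting logic described above can be sketched in plain Python (this is an illustrative example, not the Essentia or Hadoop implementation; the log data is made up for the demo):

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical visit log: (user_id, site) records.
visits = [
    ("u1", "siteA"), ("u1", "siteB"),
    ("u2", "siteA"), ("u2", "siteB"), ("u2", "siteC"),
    ("u3", "siteB"), ("u3", "siteC"),
]

# Collect the set of distinct sites each user visited,
# so repeat visits to the same site count only once.
sites_by_user = defaultdict(set)
for user, site in visits:
    sites_by_user[user].add(site)

# For each user, tally every unordered pair of sites they visited.
# Sorting makes ("siteA", "siteB") and ("siteB", "siteA") the same key.
pair_counts = Counter()
for sites in sites_by_user.values():
    pair_counts.update(combinations(sorted(sites), 2))

# Rank pairs by decreasing unique-user count.
ranking = pair_counts.most_common()
for pair, count in ranking:
    print(pair, count)
```

Note that the number of pairs grows quadratically with the number of sites a user visits, which is exactly why doing this at scale over raw logs becomes a distributed-computing problem.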

Read more