When it comes to big data, compression is key.
However, many popular analysis tools can’t handle Zip-compressed files. Each Zip file must be converted to another compression type (such as Gzip) before these tools can analyze it. This incurs a high cost: the extra files must be stored, and the conversion itself takes a long time. That is, unless you’re using Essentia.
Essentia dramatically improves on this through its native support for Zip and Gzip compression and its ability to unzip Zip files as a stream and recompress them into Gzip format. You can therefore select exactly the Zip files you need from wherever your data is stored using the Essentia Scanner, convert them into Gzip format on the fly, and then output them wherever you want. They can be sent directly into Redshift or other analysis tools, saved to file, or sent to S3 for later loading.
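To make the idea of converting Zip to Gzip on the fly concrete, here is a minimal sketch in plain Python. It is not Essentia's implementation or command syntax; the function name `zip_to_gzip` and its parameters are assumptions made for illustration. Each archive member is copied chunk by chunk into its own Gzip file, so no uncompressed intermediate copy is written to disk.

```python
import gzip
import shutil
import zipfile
from pathlib import Path


def zip_to_gzip(zip_path, out_dir, chunk_size=1 << 20):
    """Re-compress every file inside a Zip archive as its own .gz file.

    Hypothetical helper for illustration only. Data flows from the Zip
    member to the Gzip writer in chunks of `chunk_size` bytes.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            # Keep only the base name; members with the same base name
            # in different folders would overwrite each other.
            target = out_dir / (Path(info.filename).name + ".gz")
            with archive.open(info) as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst, chunk_size)
            written.append(target)
    return written
```

Note that Python's `zipfile` needs a seekable input, so this sketch reads from a local file rather than from a network stream.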
The wealth of information in Apache logs is astounding, but this information can be buried in hard-to-find files, riddled with errors, and difficult to extract. Essentia easily handles the first two issues using the Essentia Scanner and Essentia Preprocessor. To extract more specialized data from Apache logs and other forms of data, Essentia allows easy creation and integration of custom modules to supplement its analysis.
It’s simple: big data means lots of logs, and these logs tend to be disordered and very hard to tell apart. If the desired log files are in different directories, and particularly if other log files sit alongside them in those directories, it can be necessary to specify each file and its path by name. This is an incredibly messy and time-consuming process that Essentia remedies.
With the Essentia Scanner you simply point to your datastore (where your files are being stored) and the list of filenames is stored in a database file. You can then easily explore how these files are organized and categorize them into the segments you want.
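The scan-then-categorize workflow can be illustrated with a short Python sketch. This is not the Essentia Scanner itself: the function `categorize`, the category names, and the glob patterns are all hypothetical, and the catalog is kept in memory rather than in a database file. It walks a directory tree once and groups every file under the first pattern its name matches.

```python
import fnmatch
import os


def categorize(root, categories):
    """Walk `root` and group relative file paths by glob pattern.

    `categories` maps a category name to a filename pattern, e.g.
    {"access": "access_*.log"}. A file goes to the first category whose
    pattern matches; anything else lands in "uncategorized".
    """
    catalog = {name: [] for name in categories}
    catalog["uncategorized"] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            rel = os.path.relpath(os.path.join(dirpath, filename), root)
            for name, pattern in categories.items():
                if fnmatch.fnmatch(filename, pattern):
                    catalog[name].append(rel)
                    break
            else:
                # No pattern matched this file.
                catalog["uncategorized"].append(rel)
    return catalog
```

Once files are grouped this way, later steps can refer to a category by name instead of listing individual paths.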
Apache server logs offer a multitude of valuable insights, but they are typically buried in S3 directories alongside many other logs in entirely different formats. Not only must the correct logs be extracted from their datastore, they must also be converted into a format that can be properly analyzed.
This is where Essentia comes in. First we scan the S3 directory to select exactly the access logs we want to analyze. Then we use the Essentia Log Converter to convert these access logs, on the fly, into a form readable by our Preprocessor (i.e., a singly delimited format).
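As an illustration of what converting access logs into a singly delimited format involves, here is a minimal Python sketch. It is not the Essentia Log Converter. It assumes the logs use Apache's "combined" format, and the names `COMBINED`, `FIELDS`, and `convert` are hypothetical. Each line is split into its fields and rewritten with a single delimiter; lines that do not match are counted and skipped.

```python
import csv
import io
import re

# Assumed input: Apache "combined" log format, i.e.
# host ident user [time] "request" status bytes "referer" "user-agent"
# Limitation: quoted fields containing escaped quotes (\") are not handled.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)
FIELDS = ["host", "ident", "user", "time", "request",
          "status", "bytes", "referer", "agent"]


def convert(lines, out, delimiter=","):
    """Write each parseable log line to `out` as one delimited row.

    Returns the number of lines that did not match the expected format.
    """
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    skipped = 0
    for line in lines:
        match = COMBINED.match(line)
        if match is None:
            skipped += 1
            continue
        writer.writerow([match.group(name) for name in FIELDS])
    return skipped


sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
          '"http://www.example.com/start.html" '
          '"Mozilla/4.08 [en] (Win98; I ;Nav)"')
buffer = io.StringIO()
skipped = convert([sample, "not a log line"], buffer)
print(buffer.getvalue(), end="")
```

The `csv` writer quotes any field that happens to contain the delimiter, so the output stays unambiguous even when a user-agent string includes a comma.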
In one step we ignore the irrelevant columns in the Apache logs so we can focus on processing only the most relevant data. Then we use a custom C module to bolster Essentia’s analysis and extract location and system information from the users’ IP addresses.
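The two steps above, dropping irrelevant columns and enriching each row from the IP address, can be sketched as follows. This is a Python stand-in for illustration, not the custom C module, and it covers only the location part. The lookup table `TOY_RANGES`, its labels, and the helper names are hypothetical; the table uses address ranges reserved for documentation, and a real module would consult an IP geolocation database instead.

```python
import ipaddress

# Hypothetical lookup table using ranges reserved for documentation
# (RFC 5737). The labels are placeholders, not real locations.
TOY_RANGES = [
    (ipaddress.ip_network("192.0.2.0/24"), "example-region-a"),
    (ipaddress.ip_network("198.51.100.0/24"), "example-region-b"),
]


def locate(ip_text):
    """Return the label of the first range containing `ip_text`."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return "invalid"
    for network, label in TOY_RANGES:
        if ip in network:
            return label
    return "unknown"


def project_and_enrich(rows, keep=(0, 3, 4, 5), ip_column=0):
    """Keep only the columns in `keep` and append a location label.

    With the nine combined-format fields, the default keeps host, time,
    request, and status.
    """
    for row in rows:
        kept = [row[i] for i in keep]
        yield kept + [locate(row[ip_column])]
```

Because `project_and_enrich` is a generator, rows are processed one at a time and the full log never has to fit in memory.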