Introduction To Marketing Technology Course

How to query Hadoop to get counts

$100,000

One Hundred Thousand Dollars.

Just sitting there in plain sight – waiting to be picked up.

It is yours, if you only have the courage and wisdom to pick it up. No one is stopping you! There is no resistance.

“What is the catch?” you may ask.

There is none, really! It is an opportunity equally available to everyone, if only they grab it! This opportunity is the skill of querying Big Data to get counts.

I am a software engineer with 20 years of experience. My salary has risen steadily, in steps, and I can attribute every step of increase to a key skill or action. Six years ago, I decided to learn how to query Hadoop. Within a year, there was a real, highly visible project that needed this skill. I was prepared. I ran the counts daily for a week, basically executing the same query every day and sending the results in an email. The project was a success, I got a good performance review, and my salary increased by $20,000. That increment has stayed for the last 5 years (there have been other increments, but they act on the new base of $then_salary + $20,000). So, in five years, this one skill has made me $100,000, and it is still counting!

=======

A Google search reveals that Big Data engineer salaries are disproportionately high compared to other software domains. Yet the skill needed is closer to the skill needed for valet parking than for automobile engineering. The engineering skills needed to build a Honda Civic or a Tesla Model 3 take years to acquire, yet the skills needed to park them can be picked up in a day! The weird reality of Big Data is that the valet parking pays better than the engineering.

Today is the day we will learn to valet park Big Data.

=======

The skill you want to learn is a simple one. Continuing from yesterday’s example, suppose we have sent a few million emails, all containing a unique 1×1 invisible pixel. We are interested in knowing how many people have opened the emails so far.

Typically, CRM companies already have the infrastructure to copy the access logs, parse them, and store them on a Big Data system, be it on Amazon or a Hadoop cluster. For our purposes, we can imagine that all the data are available to us in the form of a giant spreadsheet with billions of rows and hundreds of columns. We will be querying it with a language called Pig.

Our job, then, is to do the following (the sketch after this list puts the steps together):

  1. Specify which data set we will use (LOAD statement)
  2. Pick the rows we are interested in (FILTER statement)
  3. Pick the columns we are interested in (GENERATE statement)
  4. Deduplicate the rows (DISTINCT statement)
  5. Count the rows (GROUP and COUNT statements)
  6. Store the result (STORE statement)
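
Here is what that might look like for a hypothetical table of page views; the table and column names are made up for illustration, but the shape is the one every counting script shares:

  • A = LOAD 'web.page_views' USING org.apache.hive.hcatalog.pig.HCatLoader(); -- hypothetical table
  • B = FILTER A BY page == 'homepage'; -- hypothetical column and value
  • C = FOREACH B GENERATE visitor_id; -- hypothetical column
  • D = DISTINCT C; -- one row per visitor
  • E = GROUP D ALL; -- gather all rows into a single bag
  • F = FOREACH E GENERATE COUNT(D); -- count the bag
  • STORE F INTO 'homepage_visitors'; -- an output directory on HDFS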

Usually, someone at the company has already done this for counts based on different criteria. So things are even better: you actually start with working code, a recipe that you only need to modify.

You need to modify only which columns you are interested in and which rows you are interested in. The rest stays as is!

Typically, someone will already have told you, in an email, what those columns and rows are. So you are really copying field names from that email and pasting them into the Pig script.
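
For instance, to adapt the page-view sketch above to count visitors who reached the checkout page after May 1, you would paste the new field names into just two lines (the column names, again, are made up):

  • B = FILTER A BY view_date > '20180501' AND page == 'checkout'; -- hypothetical columns
  • C = FOREACH B GENERATE visitor_id; -- unchanged in this case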

=======

Suppose the image that we embedded in the email for one particular user was:

<img src="//geo.query.yahoo.com/v1/public/yql?yhlVer=2&yhlClient=rapid&yhlS=1197791690&yhlCT=2&yhlBTMS=1524070779696" style="position:absolute; left:-9999px;" alt="" />

Notice that there are a number of key-value pairs encoded in the URL: yhlVer, yhlClient, yhlS, yhlCT, and yhlBTMS. These translate into different column names in the Hadoop spreadsheet. Here, yhlS=1197791690 is the email ID that our script will filter on, and yhlBTMS=1524070779696 is a millisecond timestamp; the exact parameter-to-column mapping is organization-specific.

Our Pig script would look like this (COUNT in Pig operates on a bag of rows, which is why we first GROUP all the distinct rows into a single bag):

  • A = LOAD 'Email.Daily' USING org.apache.hive.hcatalog.pig.HCatLoader(); -- the data set
  • B = FILTER A BY timestamp > '20180401' AND email_id == '1197791690'; -- the rows
  • C = FOREACH B GENERATE user_id; -- the columns
  • D = DISTINCT C; -- deduplicate users
  • E = GROUP D ALL; -- one bag holding every row
  • F = FOREACH E GENERATE COUNT(D); -- count it
  • STORE F INTO 'my_output';
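
Note that Pig is lazy: the statements defining A through F only describe the plan, and the actual MapReduce job is not launched until the STORE statement at the end.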

If this Pig script is stored on a gateway host under the name count_emails.pig, then you will log in to the gateway host and run the command

  • pig count_emails.pig >foo.out 2>foo.err
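
Pig writes its progress and error logs to stderr, so foo.err is the file to check while the job runs, or if anything goes wrong.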

When the job completes, the result will be sitting in HDFS under the my_output directory, as one or more part files. You will read the output by running the following command

  • hadoop fs -text 'my_output/part*'

You will see the count – 42 – and report that to your stakeholders in a thrilling email!

=======

There are other ways of querying Big Data. You could use Hive, which gives you an SQL interface over the same tables.
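
For instance, assuming the same Email.Daily table is visible to Hive (it is in our setup, since Pig read it through HCatalog, which sits on Hive's metastore), the whole Pig script above could collapse into a single query along these lines (the backticks are needed because timestamp is a reserved word in Hive):

  • SELECT COUNT(DISTINCT user_id) FROM Email.Daily WHERE `timestamp` > '20180401' AND email_id = '1197791690';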

There is also a web UI called Hue that provides a graphical way for you to query Big Data with either Pig or Hive.

These are all organization-specific details. The core skill is the same. And it is yours for the grabbing! Just get one query working, and fearlessly update your resume as a Big Data engineer! Good luck!
