Hive or Pig?

May 17, 2016 · May 2, 2016 · Garshita Gupta

Which one of them is your favorite – Hive or Pig? What do you prefer to work with? People are often confused about when to use Hive and when to use Pig. And while in most cases either of them can be used, the question that arises is why both of them exist in…

Tuesday Big Data Series

Understanding Pig Data Model

May 10, 2016 · May 2, 2016 · Garshita Gupta

Pig has a simple yet rich data model which consists of the following four types:

Atom

An atom consists of a single atomic value, which can be a string or a number. Examples – ‘tom’ or 2

Tuple

A tuple is a sequence of fields, each of which can be of any datatype. Examples – (‘tom’, ‘california’) or…

Tuesday Big Data Series

Understanding HDFS quotas

May 3, 2016 · April 5, 2016 · Garshita Gupta

Every Hadoop system has a Hadoop administrator and Hadoop users/developers. The administrator is responsible for deployment and maintenance of the entire infrastructure. This includes cluster availability, file system management, security, installation of the latest updates, and everything else needed to keep the system up and running. The administrator is also responsible for…

Tuesday Big Data Series

HDFS: Filesystem Metadata and how it persists

April 26, 2016 · April 1, 2016 · Garshita Gupta

In the post that explains the HDFS architecture, we saw that the HDFS namespace is stored and maintained by the NameNode. What is the HDFS namespace? The namespace is a hierarchy of directories, files and blocks in HDFS. It supports file system operations such as creation, modification, deletion and listing of files and directories. One of the important features of HDFS is…

Tuesday Big Data Series

Understanding NameNode and DataNode in HDFS

April 19, 2016 · April 1, 2016 · Garshita Gupta

HDFS has a master/slave architecture and is made up of two kinds of nodes: the NameNode (which acts as the master) and DataNodes (which act as slaves). NameNode and DataNode are pieces of software designed to run on commodity machines, which typically run a GNU/Linux operating system (OS). HDFS is built using the Java language,…

Tuesday Big Data Series

HDFS and its features that make it so awesome!

April 12, 2016 · April 1, 2016 · Garshita Gupta

HDFS, the Hadoop Distributed File System, is a Java-based filesystem for storing large volumes of data in the Hadoop framework. When we think of dealing with enormous amounts of data, the first thing that comes to mind is where we store this data and how we store it. We know that every single bit…

Tuesday Big Data Series

Data Storage – Not a challenge anymore

April 5, 2016 · March 11, 2016 · Garshita Gupta

In this new world of Big Data, you can never fall short of data. Marketers these days have access to more data than ever. They can use this data to learn about their customers and fine-tune their marketing to appeal to them strongly. Data is collected from every interaction on the web. For instance,…

Tuesday Big Data Series

Unexpected Failures: Don’t let the boat sink!

March 29, 2016 · March 11, 2016 · Garshita Gupta


Did you know?

A quick search on Google shows that a page load slowdown of just one second could cost Amazon $1.6 billion in sales each year. Google could lose 8 million searches per day if it slowed down its search results by four tenths of a second.

Outages, unexpected system failures, downtime, and hardware or software failures are a nightmare for any IT organization. A failure of an application or a system can result in data loss or in temporary or permanent loss of a service. This can affect existing customers, interfere with marketing campaigns, cause loss of revenue and damage the brand. We don’t want that. Do we?

To keep businesses from being impacted, it is important to understand the different types of failures, what causes them and how they can be prevented.

To begin with, what are the common causes of failures?

Hardware Failures

Machines or hosts may fail. A distributed application that depends on a single machine is at higher risk: when that one component fails, the entire setup can fail with it.

Software Failures

Bugs in software can affect the functioning of a single component or of the entire system.

For example, a new patch may have some bugs in the code that can cause failure of an existing functionality.

Upgrades or Maintenance

As technology moves at a fast pace, we need to keep our software and hardware up-to-date. This requires regular and frequent upgrades to existing systems. Such maintenance or upgrades can result in an outage and bring the system to a halt.

Human Errors

Let’s accept it. All of us make mistakes. Errors or mishaps caused by people cannot be avoided entirely, and they can also cause downtime.

To make our systems highly fault-tolerant and available at all times, we need to be resilient to such failures. Adopting a few practices while designing the architecture of the system can prove helpful.

Removing single points of failure

This ensures that the system continues to run even if one of the machines fails. Hadoop is built to tolerate such failures. Failure of a slave, or DataNode, in a Hadoop system does not affect the application: the system keeps running even though the affected node is not functional. However, failure of the master node, or NameNode, may result in some downtime, although Hadoop has mechanisms to recover from this failure quickly.

Making the system robust to software bugs/errors

Hadoop is designed to tolerate some software bugs. With some error handling implemented in the code, applications can be made robust.

Incorporate rolling upgrades

To keep software and hardware up-to-date, they need to be upgraded frequently to include new features. Rolling upgrades do this one part of the system at a time, without bringing the whole system to a halt. Hadoop supports rolling upgrades.
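The idea behind a rolling upgrade can be sketched in a few lines of Python. This is only a toy illustration, not Hadoop's actual upgrade procedure: the `rolling_upgrade` function, the node dictionaries and the version strings are all hypothetical names made up for this example.

```python
def rolling_upgrade(nodes, new_version):
    """Upgrade nodes one at a time so the remaining nodes keep serving."""
    min_in_service = len(nodes)
    for node in nodes:
        node["in_service"] = False      # take only this node out of rotation
        in_service = sum(n["in_service"] for n in nodes)
        min_in_service = min(min_in_service, in_service)
        node["version"] = new_version   # stand-in for the real upgrade step
        node["in_service"] = True       # bring it back before touching the next
    return min_in_service


# Hypothetical four-node cluster, all starting on version "1.0".
cluster = [
    {"name": f"node{i}", "version": "1.0", "in_service": True}
    for i in range(4)
]
lowest = rolling_upgrade(cluster, "2.0")
print(lowest)  # 3
```

At no point is more than one node out of service, so the cluster as a whole never goes down during the upgrade.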

Replication of data

One of the important features of Hadoop is that it replicates the stored data, keeping three copies of it on different hosts. ‘Three’ is the default number, which can be modified by the Hadoop administrator if required.

If one of the hosts that contain the data fails, or if the data is lost completely on that node, it can be recovered from another node where the data has been replicated. The best part about this feature is that Hadoop handles both the replication and the switch to a node that holds a replica internally.
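A minimal Python sketch of this behaviour is shown here. It is a toy model under stated assumptions, not HDFS code: `place_replicas` and `read_block` are hypothetical helpers, the host names are invented, and the random placement ignores the placement rules a real cluster applies.

```python
import random

REPLICATION_FACTOR = 3  # the default mentioned above; an administrator can change it


def place_replicas(hosts, factor=REPLICATION_FACTOR):
    """Pick `factor` distinct hosts to hold copies of a block (toy placement)."""
    return random.sample(hosts, factor)


def read_block(block_id, replica_hosts, live_hosts):
    """Return the first live host that holds the block, skipping failed ones."""
    for host in replica_hosts:
        if host in live_hosts:
            return host
    raise IOError(f"all replicas of {block_id} are unavailable")


hosts = ["host1", "host2", "host3", "host4", "host5"]
replicas = place_replicas(hosts)

# Simulate the failure of the first host that holds a copy.
live = set(hosts) - {replicas[0]}
served_by = read_block("block-42", replicas, live)
print(served_by in replicas[1:])  # True
```

The caller of `read_block` never has to know that a host failed, which mirrors how the switch to a replica is hidden from the user.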

Reproducing the computation

In Hadoop systems, if a node performs slowly or does not respond in time, the system starts the same task on another working node. This happens silently, and the user is unaware of such failures. The result from whichever node completes the task first is accepted, and the duplicate tasks on slow or dead machines are then killed.
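The same pattern can be sketched with Python's standard `concurrent.futures` module. This is an illustration of the idea only, not how Hadoop implements it: `run_task`, `speculative_run`, the node names, the delays and the `patience` threshold are all assumptions made for the example.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_task(node, delay, value):
    """Pretend to compute value * 2 on a node; delay models how slow it is."""
    time.sleep(delay)
    return node, value * 2


def speculative_run(value, primary, backup, patience):
    """Start a backup attempt if the primary is slow; keep the first result."""
    pool = ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(run_task, primary[0], primary[1], value)]
    done, _ = wait(futures, timeout=patience)
    if not done:
        # The primary did not answer in time: launch the same task elsewhere.
        futures.append(pool.submit(run_task, backup[0], backup[1], value))
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
    # Python threads cannot be killed, so the slow attempt is simply ignored.
    pool.shutdown(wait=False)
    return next(iter(done)).result()


winner, result = speculative_run(
    21, primary=("slow-node", 0.5), backup=("fast-node", 0.01), patience=0.05
)
print(winner, result)  # fast-node 42
```

Unlike a real cluster, this sketch cannot kill the slow attempt; it just stops waiting for it and discards whatever it eventually returns.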

Support faster restarts

Faster restarts can reduce downtime, even when they cannot avoid it altogether.

Failures are unexpected and unpredictable. That is why they cannot be completely avoided, despite technological advancements. The features listed above are just some of those that make Hadoop resilient to failures. A number of efforts are under way to add improvements and new functionality to newer versions of Hadoop. Businesses should build and use systems that are highly available and fault-tolerant, and that recover from errors quickly with little or no damage. Whatever happens, don’t let the boat sink!

Tuesday Big Data Series

Endless Data, Endless Opportunities

March 22, 2016 · March 11, 2016 · Garshita Gupta

Given the volume and variety of data available in the world today, it is important that we analyze this data irrespective of its source and format, and decide which elements of the data would help in creating an effective marketing solution. In this age of…

Tuesday Big Data Series · Wednesday Marketing Series

Parallel Processing, Faster Targeting

March 15, 2016 · March 11, 2016 · Garshita Gupta

So, we have all the data collected from various sources such as web logs, social media sites, user-based interaction logs, etc. Marketers (or businesses) want to turn this real and growing data into an opportunity. We want to be able to – Study or analyze this raw data to extract information and gain…

MarketEng

… marketing engineered

Category: Tuesday Big Data Series
