Thursday Management Series

The Scrum Workflow – A chain of events!

In the previous posts in the Thursday Management Series, we gave a brief overview of Scrum, went through some commonly used Scrum terms, and discussed in detail the roles of the Product Owner and the Scrum Master. I hope these articles have cleared the picture of… Continue reading The Scrum Workflow – A chain of events!

Tuesday Big Data Series

Unexpected Failures: Don’t let the boat sink!

'As far as we can tell, the system went down because someone stepped on a crack in the sidewalk.'


Did you know?

A quick search on Google shows that a page-load slowdown of just one second could cost Amazon $1.6 billion in sales each year. Google could lose 8 million searches per day if it slowed down its search results by four tenths of a second.

Outages, unexpected system failures, downtime, and hardware or software failures are a nightmare for any IT organization. The failure of an application or a system can result in data loss or in the temporary or permanent loss of a service. This can affect existing customers, interfere with marketing campaigns, cause loss of revenue and damage the brand. We don’t want that, do we?

To keep businesses from being affected, it is important to understand the different types of failures, what causes them and how they can be prevented.

To begin with, what are the common causes of failures?

  1. Hardware Failures

Machines or hosts may fail. A distributed application that depends on a single machine is at higher risk, because the failure of that one component can bring down the entire setup.

  2. Software Failures

Bugs in software can affect a single component or the system as a whole.

For example, a new patch may contain bugs that break existing functionality.

  3. Upgrades or Maintenance

Technology moves at a fast pace, so software and hardware need to be kept up-to-date. This requires regular and frequent upgrades to existing systems. Such maintenance or upgrades can result in an outage and bring the system to a halt.

  4. Human Errors

Let’s accept it: all of us make mistakes. Errors or mishaps caused by people cannot be avoided entirely, and they can also result in downtime.

To make our systems highly fault-tolerant and available at all times, we need to make them resilient to such failures. Adopting a few practices while designing the system architecture can help.

  1. Removing single points of failure

This ensures that the system continues to run even if one of the machines fails. Hadoop is built to tolerate such failures. The failure of a slave node (a DataNode) in a Hadoop cluster does not stop the application; the system keeps running even though the affected node is down. However, failure of the master node (the NameNode) may result in some downtime, although Hadoop has mechanisms to recover from this failure quickly.

  2. Making the system robust to software bugs/errors

Hadoop is designed to tolerate some software bugs. Applications can be made more robust by implementing error handling in their own code.
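As a rough illustration of application-level error handling, here is a minimal Python sketch that retries a failing operation a few times before giving up. It is a generic pattern, not part of Hadoop; the names `call_with_retries` and `flaky_read` are made up for this example.

```python
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run operation(); on failure, wait and retry up to max_attempts times."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as error:  # deliberately broad for this sketch
            last_error = error
            if attempt < max_attempts:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** (attempt - 1))
    raise last_error


# Hypothetical flaky operation: fails twice, then succeeds.
attempts = {"count": 0}


def flaky_read():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise IOError("transient failure")
    return "payload"


# The sleep is replaced by a no-op here so that the example runs instantly.
result = call_with_retries(flaky_read, sleep=lambda seconds: None)
print(result, attempts["count"])
```

Retrying only makes sense for transient errors; a real application would catch narrower exception types and log each failed attempt.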

  3. Incorporating rolling upgrades

Software and hardware need to be upgraded frequently to stay up-to-date and to include new features. Rolling upgrades, in which nodes are upgraded a few at a time while the rest keep serving, do this without bringing the system to a halt. Hadoop supports rolling upgrades.
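The idea behind a rolling upgrade can be sketched in a few lines of Python. This is only a toy simulation under simple assumptions (every node starts in service, and the dictionary fields `serving` and `version` are invented for the example); it does not use any Hadoop tooling.

```python
def rolling_upgrade(nodes, new_version, min_serving):
    """Upgrade nodes one at a time while keeping at least min_serving in service.

    Assumes every node starts in service. Returns the lowest number of nodes
    that were serving at any point during the upgrade.
    """
    lowest_serving = len(nodes)
    for node in nodes:
        serving_now = sum(1 for n in nodes if n["serving"])
        if serving_now - 1 < min_serving:
            raise RuntimeError("too few nodes in service to upgrade safely")
        node["serving"] = False            # take the node out of rotation
        lowest_serving = min(lowest_serving, serving_now - 1)
        node["version"] = new_version      # stand-in for the real upgrade step
        node["serving"] = True             # put the upgraded node back
    return lowest_serving


# Hypothetical three-node cluster running version 1.0.
cluster = [
    {"name": f"node-{i}", "version": "1.0", "serving": True}
    for i in range(1, 4)
]
lowest = rolling_upgrade(cluster, "2.0", min_serving=2)
print(lowest, [node["version"] for node in cluster])
```

Because only one node is out of rotation at any moment, the cluster keeps serving throughout the upgrade.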

  4. Replication of data

One of the important features of Hadoop is that it replicates stored data: by default, it keeps three copies of each block of data on different hosts. ‘Three’ is the default replication factor, which the Hadoop administrator can change if required.

If one of the hosts that contain the data fails, or if the data on that node is lost completely, it can be recovered from another node that holds a replica. The best part is that Hadoop handles both the replication and the switch to a node holding a replica internally.
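To make the idea concrete, here is a small Python simulation of replica placement and of a read that survives a host failure. It is a simplified sketch, not HDFS code: real HDFS placement also takes racks into account, and the helper names `place_replicas` and `read_block` are invented for this example. (In HDFS itself, the replication factor is set through the `dfs.replication` property.)

```python
import random


def place_replicas(hosts, replication=3, rng=random):
    """Pick `replication` distinct hosts to hold copies of one block."""
    if replication > len(hosts):
        raise ValueError("not enough hosts for the requested replication factor")
    return rng.sample(hosts, replication)


def read_block(block_id, placement, live_hosts):
    """Return the first live host that holds a replica of the block."""
    for host in placement[block_id]:
        if host in live_hosts:
            return host
    raise IOError(f"all replicas of {block_id} are unavailable")


# Hypothetical five-host cluster storing a single block.
hosts = ["host-a", "host-b", "host-c", "host-d", "host-e"]
placement = {"block-1": place_replicas(hosts, rng=random.Random(42))}

# Simulate the failure of the first host that holds the block.
failed_host = placement["block-1"][0]
live_hosts = set(hosts) - {failed_host}

source = read_block("block-1", placement, live_hosts)
print(f"{failed_host} failed; block-1 was read from {source}")
```

With three replicas, the block stays readable until all three hosts holding it are down at the same time.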

  5. Reproducing the computation

In Hadoop, if a node performs slowly or does not respond in time, the system starts the same task on another working node. This feature is known as speculative execution. It happens transparently, and the user is unaware of such failures. The result from the node that completes the task first is accepted, and the copies of the task still running on slow or dead machines are then killed.
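The behaviour described above can be imitated with Python threads. The sketch below is an illustration only and makes several assumptions: the worker names and delays are invented, `run_with_backup` is a hypothetical helper, and the slow thread is simply ignored rather than killed, because Python threads cannot be terminated from outside.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

# Hypothetical per-worker delays in seconds, used to simulate a slow node.
DELAYS = {"slow-node": 0.5, "fast-node": 0.01}


def simulated_task(worker):
    time.sleep(DELAYS[worker])
    return f"result computed on {worker}"


def run_with_backup(task, primary, backup, patience):
    """Run task on primary; if it takes longer than `patience` seconds,
    start the same task on backup and accept whichever finishes first."""
    pool = ThreadPoolExecutor(max_workers=2)
    futures = {pool.submit(task, primary): primary}
    done, _ = wait(futures, timeout=patience)
    if not done:
        # The primary is too slow: launch a duplicate attempt on the backup.
        futures[pool.submit(task, backup)] = backup
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)  # do not block on the slower attempt
    winner = next(iter(done))
    return futures[winner], winner.result()


worker, output = run_with_backup(
    simulated_task, "slow-node", "fast-node", patience=0.1
)
print(worker, "->", output)
```

This only works safely when running the task twice has no side effects, which is why the approach suits tasks that compute a result from their input and nothing else.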

  6. Supporting faster restarts

Faster restarts can reduce downtime, even if they cannot avoid it.

Failures are unexpected and unpredictable, which is why they cannot be avoided completely despite technological advancements. The features listed above are just some of those that make Hadoop resilient to failures, and efforts are under way to add improvements and further functionality to newer versions of Hadoop. Businesses should build and use systems that are highly available and fault-tolerant and that recover from errors quickly, so that failures cause little or no damage. Whatever happens, don’t let the boat sink!


Monday Technology Series

Best Practices for Responsive Email Design [INFOGRAPHIC]

In the previous post, we saw, with some examples, how the email market has transitioned from desktop to mobile with the emergence of all sorts of new devices. One has to be on top of the game and be able to provide the best user experience to all customers. Although, it’s tough to cater to… Continue reading Best Practices for Responsive Email Design [INFOGRAPHIC]

Friday "Term of the week" Series

Term of the week : Big Data

The term Big Data refers to extremely large data sets that may be analyzed to reveal patterns, trends and human behavior. This large volume of data is collected by tracking user interactions and actions over the web. The data can be structured, semi-structured or unstructured, which is one of its advantages over the traditional databases that expect… Continue reading Term of the week : Big Data

Thursday Management Series

Scrum Terminologies

The following terms are often used in a Scrum process. Since these are universal terms and definitions, they have been taken from wiki. Scrum Team – Comprises the Product Owner, the Scrum Master and the development team. Product Owner – The person responsible for maintaining the product backlog by representing the interests of the stakeholders, and ensuring… Continue reading Scrum Terminologies