Definition: SQL databases are relational databases; NoSQL databases are non-relational or distributed databases. Data storage types: SQL databases have a single way to store data; NoSQL databases have multiple ways to store data (key-value stores, document stores, wide-column stores, graph stores). Data storage models: in SQL databases, data is stored in a table where… Continue reading SQL vs. NoSQL Simplified
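To make the contrast in storage models concrete, here is a minimal sketch using only Python's standard library. The `users` table, the field names, and the plain dictionaries standing in for a document store and a key-value store are illustrative assumptions, not the API of any particular database product.

```python
import json
import sqlite3

# Relational model: a fixed schema, so every row has the same columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO users (id, name, city) VALUES (?, ?, ?)", (1, "Asha", "Pune"))
sql_row = conn.execute("SELECT name, city FROM users WHERE id = ?", (1,)).fetchone()

# Document model (hypothetical in-memory stand-in): each record is a
# self-describing document and may carry fields that other records lack.
document_store = {}
document_store["user:1"] = json.dumps({"name": "Asha", "city": "Pune", "tags": ["admin"]})
doc = json.loads(document_store["user:1"])

# Key-value model (hypothetical in-memory stand-in): an opaque value
# looked up by its key, with no schema at all.
kv_store = {"session:42": "user:1"}
session_owner = kv_store["session:42"]
```

The relational row is constrained by the table definition, while the document gains a `tags` field without any schema change.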
So, what has changed in the past few years? The volume of data has increased tremendously. The kind of data we are dealing with has changed, too. We no longer deal only with plain old text. We have audio, video, images, and other complex data formats that need to be handled. Number of users… Continue reading What is NoSQL?
Apache Spark is another Apache project that offers parallel data processing and can work with Hadoop to develop Big Data applications. It is a fast and general engine for large-scale data processing. Let us look at some of the features of Apache Spark one by one. Real-time processing: unlike MapReduce, Spark can handle… Continue reading 5 awesome features of Apache Spark
Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology. It decouples MapReduce’s resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications. YARN (sometimes called MR2) is an extended and improved version of MR1. It was… Continue reading Term of the Week : YARN
Fair Scheduler is a pluggable scheduler for Hadoop that allows YARN applications to share resources in large clusters fairly. As the name suggests, it allocates resources such that all applications get an equal share. By default, this is done on the basis of memory. But it can be configured to schedule with both memory… Continue reading What is a Fair Scheduler in Hadoop?
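The idea of giving every application an equal share of memory can be sketched as a small max-min fairness routine. This is a simplified illustration, not YARN's actual scheduling code: the function name `fair_share`, the application names, and the memory figures are all hypothetical.

```python
def fair_share(total_memory_mb, demands_mb):
    """Split memory across apps so that each gets an equal share,
    and memory an app does not need is handed to the others."""
    allocation = {app: 0 for app in demands_mb}
    remaining = total_memory_mb
    unsatisfied = dict(demands_mb)

    while unsatisfied and remaining > 0:
        share = remaining / len(unsatisfied)
        # Apps whose full demand fits inside the current equal share.
        satisfied = {a: d for a, d in unsatisfied.items() if d <= share}
        if not satisfied:
            # Everyone wants more than the equal share: split evenly and stop.
            for app in unsatisfied:
                allocation[app] += share
            remaining = 0
            break
        for app, demand in satisfied.items():
            allocation[app] += demand
            remaining -= demand
            del unsatisfied[app]
    return allocation


# Hypothetical cluster with 12000 MB and three applications.
shares = fair_share(12000, {"a": 2000, "b": 8000, "c": 8000})
```

In this example, application `a` needs less than its equal share, so the memory it leaves unused is divided between `b` and `c`.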
Oozie, an open source project, is implemented as a Java web application that runs in a Java servlet container and is distributed under the Apache License 2.0. It is a workflow scheduling system to manage Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. It is integrated with the Hadoop stack, with YARN… Continue reading Term of the Week : Oozie
Apache Sqoop is a tool designed for transferring bulk data between Apache Hadoop and relational databases. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs that can be run multiple times to import updates made to a database since the last import. Imports can also be… Continue reading What is Apache Sqoop?
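The notion of an incremental load, importing only what changed since the last import, can be illustrated with a small standard-library sketch. This is not Sqoop itself or its API: the `orders` table, the `incremental_import` function, and the use of the `id` column as a high-water mark are assumptions made for the example.

```python
import sqlite3


def incremental_import(conn, last_value):
    """Fetch rows whose id is greater than last_value and
    return them together with the new high-water mark."""
    rows = conn.execute(
        "SELECT id, item FROM orders WHERE id > ? ORDER BY id", (last_value,)
    ).fetchall()
    new_last = rows[-1][0] if rows else last_value
    return rows, new_last


# Hypothetical source database with two existing rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "pen"), (2, "ink")])

# First run imports everything; the saved mark makes the job repeatable.
first_batch, mark = incremental_import(source, 0)

# A new row arrives; the second run picks up only that row.
source.execute("INSERT INTO orders VALUES (?, ?)", (3, "pad"))
second_batch, mark = incremental_import(source, mark)
```

Remembering the mark between runs is what lets the same job be run multiple times without re-importing old rows.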
The Apache Tez project is an extensible framework built on top of Apache Hadoop YARN. Data processing that earlier took multiple MR jobs can now run as a single Tez job, which uses a Directed Acyclic Graph (DAG) for data processing. It is used for building high-performance batch and interactive data processing applications.… Continue reading Term of the Week : Apache Tez
The Apache Tez project is an extensible framework built on top of Apache Hadoop YARN. Data processing that earlier took multiple MR jobs can now run as a single Tez job, which uses a Directed Acyclic Graph (DAG) for data processing. It is used for building high-performance batch and interactive data processing applications. It drastically improves… Continue reading What is Apache Tez?
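As a rough illustration of what it means to express a pipeline as one DAG instead of a chain of separate jobs, the sketch below runs tasks in dependency order with Python's standard `graphlib` module (Python 3.9+). It does not use the Tez API; the task names and the `run_dag` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "load_a": set(),
    "load_b": set(),
    "join": {"load_a", "load_b"},
    "aggregate": {"join"},
}


def run_dag(graph, tasks):
    """Run every task once, never before the tasks it depends on."""
    order = []
    for name in TopologicalSorter(graph).static_order():
        tasks[name]()
        order.append(name)
    return order


# Each task here only records that it ran; real tasks would process data.
results = []
tasks = {name: (lambda n=name: results.append(n)) for name in dag}
order = run_dag(dag, tasks)
```

Both loads are guaranteed to finish before `join`, and `aggregate` always runs last, all within one graph rather than several chained jobs.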
Apache Pig is an open-source scripting platform from Apache used for analyzing and processing large data sets. It allows users to write complex MapReduce programs using a simple scripting language called Pig Latin. Pig translates the Pig Latin script into MapReduce so that it can be executed within YARN for access to a… Continue reading Term of the Week : Apache Pig