Hadoop Distributed File System (HDFS)

1. What is HDFS?  

Short Answer

HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to that data.

Files are stored in a redundant fashion across multiple machines to ensure their durability in the face of failures and their high availability to highly parallel applications.

Long Answer

HDFS was based on a paper Google published about their Google File System. The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage, designed to span large clusters of commodity servers.

HDFS runs on top of the existing file systems on each node in a Hadoop cluster. It is not POSIX compliant. It is designed to tolerate a high component failure rate through replication of the data.

Hadoop works best with very large files. The larger the file, the less time Hadoop spends seeking for the next data location on disk, and the more time it runs at the limit of the bandwidth of your disks. Seeks are generally expensive operations that are useful only when you need to analyze a small subset of your dataset. Since Hadoop is designed to run over your entire dataset, it is best to minimize seeks by using large files. Hadoop is designed for streaming or sequential data access rather than random access. Sequential data access means fewer seeks, since Hadoop only seeks to the beginning of each block and begins reading sequentially from there. Hadoop uses blocks to store a file or parts of a file.

A Hadoop block is a file on the underlying filesystem. Since the underlying filesystem stores files as blocks, one Hadoop block may consist of many blocks in the underlying file system. Blocks are large: they default to 64 megabytes each, and most systems run with block sizes of 128 megabytes or larger.
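To make the block model concrete, here is a minimal sketch (plain Python, not Hadoop code) of how many HDFS blocks a file of a given size occupies; the sizes used are illustrative, and the block size is configurable in a real cluster:

```python
# Sketch: how a file is split into HDFS blocks (sizes are illustrative).
# 64 MB was the classic default block size; 128 MB or larger is common.
import math

def num_blocks(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """Return how many HDFS blocks a file of the given size occupies."""
    if file_size_bytes == 0:
        return 0
    return math.ceil(file_size_bytes / block_size_bytes)

# A 1 GB file with 64 MB blocks occupies 16 blocks; only the last
# block of a file may be smaller than the configured block size.
print(num_blocks(1024 * 1024 * 1024))  # 16
print(num_blocks(100 * 1024 * 1024))   # 2 (one 64 MB block + one 36 MB block)
```

Note that a file smaller than one block does not waste a full block of underlying storage; the last (or only) block simply occupies its actual size on disk.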

2. What is Big Data?  

Short Answer

Big Data is nothing but an assortment of data so huge and complex that it becomes very tedious to capture, store, process, retrieve and analyze it with the help of on-hand database management tools or traditional data processing techniques.

Long Answer

We come across data in every possible form, whether through social media sites, sensor networks, digital images or videos, cellphone GPS signals, purchase transaction records, web logs, medical records, archives, military surveillance, e-commerce, or complex scientific research, and it amounts to quintillions of bytes of data! This data is what we call…BIG DATA!

Big data is nothing but an assortment of data so huge and complex that it becomes very tedious to capture, store, process, retrieve and analyze it with on-hand database management tools or traditional data processing techniques. In fact, the concept of “BIG DATA” may vary from company to company depending upon its size, capacity, competence, human resources, techniques and so on. For some companies it may be a cumbersome job to manage a few gigabytes, while for others it may be some terabytes creating a hassle in the entire organization.

3. What do the four V’s of Big Data denote?  

Short Answer

IBM has a nice, simple explanation for the four critical features of big data:

a) Volume – Scale of data
b) Velocity – Analysis of streaming data
c) Variety – Different forms of data
d) Veracity – Uncertainty of data

Long Answer

a) Volume: Big Data is first and foremost characterized by its volume. It could amount to hundreds of terabytes or even petabytes of information. For instance, 15 terabytes of Facebook posts or 400 billion annual medical records could mean Big Data!

b) Velocity: Velocity is the rate at which data flows into a company. Big data requires fast processing, and the time factor plays a very crucial role in several organizations. For instance, processing 2 million records at a stock exchange or evaluating the results of millions of students who applied for competitive exams could mean Big Data!

c) Variety: Big Data may not belong to a specific format. It could be in any form, such as structured or unstructured data, text, images, audio, video, log files, emails, simulations, 3D models, etc. Research shows that a substantial amount of an organization’s data is not numeric; however, such data is equally important for the decision-making process. So, organizations need to think beyond stock records, documents, personnel files, finances, etc.

d) Veracity: Veracity refers to the uncertainty of the available data. Data can sometimes get messy and may be difficult to trust. With many forms of big data, quality and accuracy are difficult to control, as with Twitter posts full of hashtags, abbreviations, typos and colloquial speech. But big data and analytics technology now make it possible to work with these types of data, and the volumes often make up for the lack of quality or accuracy. Due to this uncertainty of data, 1 in 3 business leaders don’t trust the information they use to make decisions.

4. What is Hadoop?  

Short Answer

Hadoop is a framework that allows for distributed processing of large data sets across clusters of commodity computers using a simple programming model.

Long Answer

It is truly said that ‘Necessity is the mother of all inventions’ and ‘Hadoop’ is amongst the finest inventions in the world of Big Data! Hadoop had to be developed sooner or later as there was an acute need of a framework that can handle and process Big Data efficiently.

Technically speaking, Hadoop is an open source software framework that supports data-intensive distributed applications. It is licensed under the Apache v2 license and is therefore generally known as Apache Hadoop. It was developed based on a paper originally written by Google about its MapReduce system and applies concepts of functional programming. It is written in the Java programming language and is a top-level Apache project built and used by a global community of contributors. Hadoop was developed by Doug Cutting and Michael J. Cafarella. And the charming yellow elephant you see is named after Doug’s son’s toy elephant!

In 2002, Doug Cutting created an open source web crawler project. In 2004, Google published its MapReduce and GFS papers. In 2006, Doug Cutting developed the open source MapReduce and HDFS project. In 2008, Yahoo ran a 4,000-node Hadoop cluster and Hadoop won the terabyte sort benchmark. In 2009, Facebook launched SQL support for Hadoop.

5. Why is it important to harness Big Data?  

Data has never been as crucial as it is today. In fact, we can see a transition from the old saying ‘Customer is King’ to ‘Data is King’! This is because, for efficient decision making, it is very important to analyze the right amount and the right type of data! Whether in healthcare, banking, the public sector, pharmaceuticals, or IT, organizations need to look beyond the concrete data stored in their databases and study the intangible data coming from sensors, images, weblogs, etc. In fact, what sets smart organizations apart from others is their ability to scan data effectively to allocate resources properly, increase productivity and inspire innovation!

6. Why is Big Data analysis crucial?  

1. Just like labor and capital, data has become one of the factors of production in almost all the industries.

2. Big data can unveil some really useful and crucial information which can change decision making process entirely to a more fruitful one.

3. Big data makes customer segmentation easier and more visible, enabling the companies to focus on more profitable and loyal customers.

4. Big data can be an important criterion in deciding upon the next line of products and services required by future customers. Thus, companies can follow a proactive approach at every step.

5. The way in which big data is explored and used can directly impact the growth and development of an organization and give it an edge over its competitors! Data-driven strategies are fast becoming the latest trend at the management level!

7. How to Harness Big Data?  

As the name suggests, it is not an easy task to capture, store, process and analyze big data. Harnessing big data is a daunting affair that requires a robust infrastructure and state-of-the-art technology which should take care of the privacy, security, intellectual property, and even liability issues related to big data. Big data will help you answer questions that have been lingering for a long time! It is not the amount of big data that matters most; it is what you are able to do with it that draws the line between the achievers and the losers.

8. Can you give some examples of Big Data?  

There are many real-life examples of Big Data: Facebook generates 500+ terabytes of data per day, the NYSE (New York Stock Exchange) generates about 1 terabyte of new trade data per day, and a jet airliner collects 10 terabytes of sensor data for every 30 minutes of flying time. All these are day-to-day examples of Big Data!

9. How Big is ‘Big Data’?  

With time, data volume is growing exponentially. Earlier we used to talk about megabytes or gigabytes, but the time has arrived when we talk about data volume in terms of terabytes, petabytes and even zettabytes! Global data volume was around 1.8 ZB in 2011 and is expected to reach 7.9 ZB in 2015. It is also estimated that the volume of global information doubles every two years!
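The “doubles every two years” rule is just exponential growth, which can be sketched in a few lines of Python (the starting figure of 1.8 ZB in 2011 comes from the text above; the function name is illustrative):

```python
# Sketch: projecting data volume under a fixed doubling period.
# 1.8 ZB in 2011 is the figure quoted above; ZB = zettabytes.

def projected_volume(start_zb, start_year, target_year, doubling_years=2):
    """Project data volume in zettabytes, assuming it doubles
    every `doubling_years` years."""
    periods = (target_year - start_year) / doubling_years
    return start_zb * 2 ** periods

# Doubling every two years from 1.8 ZB in 2011 gives about 7.2 ZB in 2015,
# in the same ballpark as the 7.9 ZB estimate (which implies slightly
# faster-than-doubling growth over that period).
print(projected_volume(1.8, 2011, 2015))
```

This also shows why such projections explode quickly: the same rule would put the 2021 volume at roughly 32 times the 2011 figure.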

10. How is analysis of Big Data useful for organizations?  

Effective analysis of Big Data provides a lot of business advantage, as organizations learn which areas to focus on and which areas are less important. Big data analysis provides early key indicators that can prevent a company from a huge loss or help it grasp a great opportunity with open arms! A precise analysis of Big Data helps in decision making! For instance, nowadays people rely heavily on Facebook and Twitter before buying any product or service, all thanks to the Big Data explosion.

11. What is Hadoop Map Reduce?  

Short Answer

The Hadoop MapReduce framework is used for processing large data sets in parallel across a Hadoop cluster. Data analysis uses a two-step map and reduce process.

Long Answer

It all started with Google applying the concept of functional programming to solve the problem of managing large amounts of data on the internet. Google named this the ‘MapReduce’ system and described it in a published paper. With the ever-increasing amount of data generated on the web, MapReduce was created in 2004, and Yahoo stepped in to develop Hadoop in order to implement the MapReduce technique. MapReduce helped Google search and index its large quantity of web pages in a matter of seconds, or even a fraction of a second. The key components of MapReduce are the JobTracker, the TaskTrackers and the JobHistoryServer.
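The two-step model is easiest to see with the classic word-count example. The sketch below imitates the map, shuffle and reduce phases in plain Python on a single machine (no Hadoop cluster involved; the function names are illustrative, not Hadoop APIs):

```python
# Sketch: the map/shuffle/reduce model, illustrated with word count.
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big cluster", "big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

In a real cluster the map and reduce functions run on many nodes at once and the shuffle moves data across the network, but the logical contract is exactly this.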

12. What is Apache Pig?  

Apache Pig is another component of Hadoop, used to evaluate huge data sets through a high-level language. In fact, Pig was initiated with the idea of creating and executing commands on Big Data sets. The basic attribute of Pig programs is ‘parallelization’, which helps them manage large data sets. Apache Pig consists of a compiler that generates a series of MapReduce programs and a ‘Pig Latin’ language layer that allows SQL-like queries to be run on distributed data in Hadoop.
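A typical Pig Latin script loads data, groups it, and aggregates each group. The sketch below mimics in plain Python what such a GROUP/COUNT pipeline computes; the records, field names and the Pig script in the comment are invented for illustration, not taken from any real dataset:

```python
# Sketch: what a Pig Latin GROUP/COUNT pipeline computes, mimicked in
# plain Python. In Pig Latin this would look roughly like:
#   logs   = LOAD 'logs' AS (user, url);
#   groups = GROUP logs BY user;
#   counts = FOREACH groups GENERATE group, COUNT(logs);
from collections import Counter

# Hypothetical (user, url) records standing in for a large log file.
logs = [("alice", "/home"), ("bob", "/cart"), ("alice", "/checkout")]

# Group by user and count per group -- the pattern Pig's compiler
# turns into one or more MapReduce jobs.
counts = Counter(user for user, _url in logs)
print(dict(counts))  # {'alice': 2, 'bob': 1}
```

The point of Pig is that the three-line script in the comment scales to terabytes unchanged, because parallelization is handled by the generated MapReduce jobs rather than by the author.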

13. What is Apache Hive?  

As the name suggests, Hive is Hadoop’s data warehouse system. It enables quick data summarization, handles queries, and evaluates huge data sets located in Hadoop’s file systems, while maintaining full support for map/reduce. Another striking feature of Apache Hive is that it provides indexes, such as bitmap indexes, to speed up queries. Apache Hive was originally developed by Facebook, but it is now developed and used by other companies too, including Netflix.

14. What is Apache HCatalog?  

Apache HCatalog is another important component of Apache Hadoop which provides a table and storage management service for data created with Apache Hadoop. HCatalog offers features like a shared schema and data type mechanism, a table abstraction for users, and smooth interoperation with other Hadoop components such as Pig, MapReduce, Streaming, and Hive.

15. What is Apache HBase?  

HBase is short for Hadoop DataBase. HBase is a distributed, column-oriented database that uses HDFS for storage. On one hand it supports batch-style computations using MapReduce, and on the other it handles point queries (random reads). The key components of Apache HBase are the HBase Master and the RegionServers.

16. What is Apache Zookeeper?  

Apache ZooKeeper is another significant part of the Hadoop ecosystem. Its major function is to maintain configuration information and to provide naming, distributed synchronization, and group services, all of which are immensely crucial for various distributed systems. In fact, HBase depends upon ZooKeeper for its functioning.

17. Who are ‘Data Scientists’?  

Data scientists are fast replacing business analysts and data analysts. Data scientists are experts who find solutions to analyze data, combined with good business insight into how to handle a business challenge. Sharp data scientists are involved not only in solving business problems, but also in choosing the relevant issues that can bring value to the organization.

18. Core skills of a data scientist.  

The process of data science includes three stages:
* Data Capture
* Data Analysis
* Presentation

19. Data Capture?  

Programming and Database Skills:
The first step of data mining is to capture the right data. So, to be a data scientist, it is very essential to be familiar with tools and technologies, especially the open source ones like Hadoop, Java, Python, C++, and database technologies like SQL, NoSQL, HBase and so on.

Business Domain and Expertise:
Data differs according to the business. Therefore, understanding the business data needs expertise, which comes only by working in a particular data domain.
For example: Data gathered from the medical field will be entirely different from the data of a retail clothing store.

Data Modeling, Warehouse and Unstructured Data Skills:
Organizations are gathering enormous amounts of data through various sources. Data captured in this fashion is unstructured and needs to be organized before analysis. Therefore, a data scientist has to be proficient in modeling unstructured data.

20. Data Analysis?  

Statistical Tool Skills:
The essential skill of a data scientist is knowing how to use statistical tools like R, Excel, SAS and so on. These tools are required to crunch the captured data and analyze it.

Math Skills:
Computer science knowledge alone is not sufficient to be a data scientist. The data scientist profile requires someone who can understand large-scale machine learning algorithms and programming, while being a proficient statistician. This needs expertise in other scientific and mathematical disciplines apart from computer languages.
