EXPLANATION OF BIG DATA ARCHITECTURE AND ITS TECHNOLOGIES
DEFINITION OF BIG DATA
Big data refers to massive volumes of data that cannot be stored and/or processed within a required time frame using a conventional database management system. The term is typically applied to data measured in petabytes or more (e.g. exabytes or zettabytes), a scale that causes difficulties in storing, analysing and visualizing the data. Its volume outstrips the tools available to store or process it. This data is largely non-transactional and can be user generated or machine generated.
BIG DATA ARCHITECTURE
Big data architecture, the foundation for big data analytics, results from combining big data application tools. These tools and database technologies are put together to achieve high performance, high fault tolerance and scalability. The architecture depends on the tools an organization already has in place and on its data environment.
A big data architecture is designed to handle the ingestion, processing and analysis of data that is too large or complex for traditional database systems. Big data solutions usually involve batch processing of big data sources (at rest), real-time processing of big data (in motion), interactive exploration of big data, and predictive analytics and machine learning.
Most big data architectures include some or all of the following components:
°Data sources: There can be one or more, e.g. static data stores (relational databases), static files produced by applications (web server log files) and real-time data sources (IoT devices).
°Data storage: Data for batch processing operations is usually stored in a distributed file store, known as a data lake, that can hold high volumes of large files in different formats, e.g. Azure Data Lake Store from Microsoft.
°Batch processing: Since the data sets are very large, a big data solution must often process data using long-running jobs to filter, aggregate and prepare the data for analysis. This usually involves reading source files, processing them and writing the output to new files.
°Real-time message ingestion: If the solution includes real-time sources, the architecture must include a way to capture and store real-time messages for stream processing.
°Analytical data store: The solution should prepare data for analysis and serve the processed data in a structured format that can be queried with analytical tools. A Kimball-style relational data warehouse can serve as the data store for these queries.
°Orchestration: Big data solutions often consist of repeated data processing operations that transform source data, load the processed data into an analytical data store, or push the results directly to a report or dashboard. An orchestration technology such as Azure Data Factory, or Apache Oozie and Sqoop, can be used to automate these workflows.
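The batch processing component described above can be illustrated with a minimal sketch. It assumes a made-up log format of "METHOD PATH STATUS" per line; the function name run_batch_job, the sample files and the output file hits.csv are hypothetical and exist only for illustration. A real solution would run a job like this on a distributed engine over files in a data lake rather than on local files.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def run_batch_job(source_files, output_file):
    """Filter, aggregate and write results for a set of log files."""
    hits_per_path = Counter()
    for source in source_files:
        with open(source, encoding="utf-8") as handle:
            for line in handle:
                parts = line.split()
                # Filter: skip malformed lines and unsuccessful requests.
                if len(parts) != 3 or parts[2] != "200":
                    continue
                # Aggregate: count successful requests per URL path.
                hits_per_path[parts[1]] += 1
    # Write the prepared output to a new file, ready for loading
    # into an analytical data store.
    with open(output_file, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["path", "hits"])
        for path, hits in sorted(hits_per_path.items()):
            writer.writerow([path, hits])
    return dict(hits_per_path)


# Hypothetical sample data standing in for web server log files.
work_dir = Path(tempfile.mkdtemp())
log_a = work_dir / "server_a.log"
log_b = work_dir / "server_b.log"
log_a.write_text("GET /home 200\nGET /cart 500\nGET /home 200\n", encoding="utf-8")
log_b.write_text("GET /cart 200\nbroken line\nGET /home 200\n", encoding="utf-8")

summary = run_batch_job([log_a, log_b], work_dir / "hits.csv")
print(summary)
```

The job reads the source files, keeps only the records of interest, aggregates them and writes a new, smaller file, which is the read, process and write pattern that batch processing follows at a much larger scale.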
EXAMPLES OF BIG DATA ARCHITECTURE
°Internet of Things (IoT) architecture
BIG DATA TECHNOLOGIES
Big data technologies are the means by which the challenges of data analytics, visualization and storage are tackled. The problems posed by big data’s volume, variety and velocity call for new technology solutions. The most prominent and widely used big data technology is Hadoop, an open source project of the Apache Software Foundation. This open source library was created with a focus on scalable, reliable, distributed and flexible computing systems that can handle big data. Hadoop has two core components that work hand in hand.
The first is the Hadoop Distributed File System (HDFS), which provides the high-bandwidth access to stored data that big data computing requires.
The second component is a data processing framework known as MapReduce. Based on Google’s search technology, it distributes huge data sets across many servers, each of which processes the part of the overall data set it receives and creates a summary before more traditional analysis tools are used. The distribution step is the “map” and the summary creation is the “reduce”.
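The map and reduce steps can be illustrated with a small word count, the customary introductory example for this model. This is a single-process sketch in plain Python, not Hadoop's actual API; the function names map_phase, shuffle and reduce_phase and the sample input are assumptions made for illustration. On a real cluster the framework runs the map step on many servers in parallel and performs the grouping between the two phases itself.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: summarise all values for one key into a single result."""
    return key, sum(values)


# Hypothetical input splits; on a cluster each would sit on a different server.
splits = ["big data needs big tools", "data tools evolve"]

intermediate = []
for split in splits:
    intermediate.extend(map_phase(split))

word_counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(word_counts)
```

Because each call to map_phase depends only on its own split, the splits can be processed independently, which is what allows the work to be spread across many servers.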
Hadoop and various other big data tools have evolved to solve the challenges faced in the big data environment. These tools can be classified into the following categories:
Data Storage and Management
Examples include NoSQL databases such as MongoDB, CouchDB, Cassandra, HBase and Neo4j, as well as Talend, Apache Hadoop, Apache ZooKeeper, etc.
Data Cleaning
Examples include MS Excel, OpenRefine, etc.
Data Mining
The process of discovering insights in a database. Examples include RapidMiner, Teradata, etc.
NoSQL is a collection of concepts that enable efficient and rapid processing of data sets with a focus on reliability, flexibility, agility and performance. Although the name is sometimes read as short for “non-SQL”, it does not mean that these systems must use a language other than SQL. NoSQL systems may use SQL-like languages as well as other query languages.
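To make the point about query languages concrete, the sketch below expresses the same lookup in two ways: as an SQL query over a fixed-schema table (using Python's built-in sqlite3 module) and as a criteria-based filter over schemaless documents. The find helper and the sample records are hypothetical; plain dictionaries stand in for a document store here, so this shows the query style only and is not any particular NoSQL product's API.

```python
import sqlite3

# Relational side: a fixed schema, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("Ada", "Lagos"), ("Bo", "Accra")],
)
sql_result = [
    row[0]
    for row in conn.execute("SELECT name FROM users WHERE city = ?", ("Lagos",))
]
conn.close()

# Document side: records need not share the same fields.
documents = [
    {"name": "Ada", "city": "Lagos", "devices": ["sensor-1"]},
    {"name": "Bo", "city": "Accra"},
]


def find(collection, criteria):
    """Return documents whose fields match every key/value in criteria."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


doc_result = [doc["name"] for doc in find(documents, {"city": "Lagos"})]
print(sql_result, doc_result)
```

Both queries return the same record; the difference lies in how the data is modelled and how the query is phrased.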