BUS611 Study Guide

Unit 4: Big Data Processing and Cloud Computing

4a. Describe big data and cloud computing

  • What are some of the issues associated with cloud data storage?
  • What is meant by big data, and how is it related to cloud storage?

In many cases, the data we are storing in our DBMS is far too large to be stored in a single location. There are also significant risks associated with storing all of our strategic data in a single location. For this reason, we make use of cloud data storage systems.
 
A cloud system consists of the IT hardware, software, and physical infrastructure that allow cloud service providers to deliver services. The most common services are (1) SaaS (software as a service), (2) PaaS (platform as a service), and (3) IaaS (infrastructure as a service). These services are delivered over a network, typically the public internet. Cloud systems, and the vendors and service providers who support them, must be flexible enough to integrate many different types of technology and systems of different vintages, standards, and vendors. New technologies and systems are constantly being developed, and cloud systems must allow them to be integrated with the older technologies already in use.
 
Cloud services vendors must be able to provide non-proprietary network management solutions to allow for the wide range of technologies that must be integrated into the cloud system.
 
This figure is a conceptualization of a cloud storage system.



To review, see Introduction to Big Data.
 

4b. Identify the advantages and disadvantages of cloud computing

  • What are some of the challenges of managing cloud services?

A distributed database represents multiple interconnected databases spread across several sites connected by a network. Since the databases are all connected, they appear as a single database to the users. The distributed sites are either managed by the host organization or contracted to a cloud services provider. One of the major advantages of the distributed model is the ability to run the database on a variety of different servers and equipment.
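The idea that several connected databases can appear to the user as a single database can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name DistributedCatalog, the customers table, the site names, and the modulo placement rule are all hypothetical, and in-memory SQLite connections stand in for databases at separate sites.

```python
import sqlite3


class DistributedCatalog:
    """Hypothetical facade that makes several site databases look like one."""

    def __init__(self, site_names):
        # Each in-memory SQLite connection stands in for a database at one site.
        self.sites = {name: sqlite3.connect(":memory:") for name in site_names}
        for conn in self.sites.values():
            conn.execute(
                "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)"
            )

    def _site_for(self, customer_id):
        # Placement rule (an assumption): spread rows by id modulo site count.
        names = sorted(self.sites)
        return self.sites[names[customer_id % len(names)]]

    def insert(self, customer_id, name):
        conn = self._site_for(customer_id)
        conn.execute(
            "INSERT INTO customers (id, name) VALUES (?, ?)", (customer_id, name)
        )
        conn.commit()

    def all_customers(self):
        # The user issues one request; the facade gathers rows from every site.
        rows = []
        for conn in self.sites.values():
            rows.extend(conn.execute("SELECT id, name FROM customers").fetchall())
        return sorted(rows)


catalog = DistributedCatalog(["site_a", "site_b", "site_c"])
for cid, name in [(1, "Ada"), (2, "Grace"), (3, "Edsger"), (4, "Barbara")]:
    catalog.insert(cid, name)

print(catalog.all_customers())
```

The caller never names a site. Rows are stored in three separate databases, yet all_customers() returns them as one sorted result, which is the "single database" view described above.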
 
Cloud-based organizations use cloud computing to deliver computing services, including servers, databases, networking, software, analytics, and intelligence, over the cloud (the internet). The cloud offers faster innovation, economic scalability, and flexibility of organizational resources. Organizations rely on cloud computing to perform virtually every function, including analytics. Cloud services providers take great care in siting and designing data centers for optimal performance. A data center is simply a building, a dedicated space within a building, or a group of buildings used to house computer systems and associated components, such as telecommunications and physical storage systems.
 
The primary reasons for using such an arrangement are efficiency, redundancy, and cost savings. By taking advantage of the scale and scope of a cloud services provider, we can get just what we need at a price that matches our needs. We also gain the advantages of extensive redundancy, reliability, and security that such a service provides.
 
Cloud services are not without drawbacks, though. Disaster recovery is something that any system manager must consider, and in cloud computing systems it can be more of a challenge than in a single-site system. This is because cloud service providers (CSPs) must provide services to their customers at all times while operating many data centers, which makes disaster recovery more complicated. Disasters can lead to expensive service disruptions. Two disaster recovery models can be used to guard against failure: the traditional model and the cloud-based service model.
 
Another strategy to address disaster recovery is the use of fault-tolerant architectures. A fault-tolerant system can continue to perform its function and operate even in the presence of failures in some of its parts. The concept can apply to a single computer system, a cluster of data centers managed by a cloud services provider, a link in a network, or any other component that might fail. For example, running multiple instances of a service increases redundancy, resulting in a more fault-tolerant configuration.
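To see why multiple instances make a configuration more fault-tolerant, consider a rough calculation. The sketch below assumes that instances fail independently of one another, which real systems only approximate, and the 99% availability figure is an illustrative value, not one taken from the course material. The function name combined_availability is hypothetical.

```python
def combined_availability(single_availability, replicas):
    """Probability that at least one of `replicas` independent instances is up."""
    if not 0.0 <= single_availability <= 1.0:
        raise ValueError("single_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only if every replica is down at the same time.
    return 1.0 - (1.0 - single_availability) ** replicas


# Illustrative assumption: each instance is available 99% of the time.
for n in (1, 2, 3):
    print(n, round(combined_availability(0.99, n), 6))
```

Under these assumptions, one instance is available 99% of the time, two instances about 99.99%, and three about 99.9999%. Each added instance reduces the chance that all of them fail together.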
 
To review, see Cloud Computing and Business Expectations.
 

4c. Describe the challenges involved with managing big data

  • What are some of the characteristics of big data?
  • What are some of the challenges in managing and processing big data?

Big data typically describes data sets so large or complex that traditional data-processing techniques often prove inadequate.
 
The structure of big data is described by:

  • Volume: amount measured in gigabytes or terabytes
  • Velocity: the rate at which data arrives, from one-time snapshots to high-frequency streams
  • Variety: structured, numeric, alpha, unstructured, text, sound, image or video, genomics
  • Veracity: validation, noise level, deception detection, relevance, ranking
  • Value: the usefulness of the data in supporting decisions that add economic value

 
This figure illustrates these characteristics.


The administration of big data presents many challenges. These challenges revolve around the sheer size of the database, the networks required to move data, the disk resources required to store the data, and the computer processing power required to query or process the data. For these reasons, distributed networks and solutions are often the only options. Even supercomputers are not large enough to do the job on their own, and the amount of disk space required necessitates that resources be spread geographically among several data centers. This is why organizations that use big data resources almost always engage massively scaled cloud service providers.
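A quick calculation shows why the network is one of these challenges. The sketch below estimates how long it would take to move a data set over a single link. The one-petabyte data set and the 10 Gbps link are illustrative assumptions, the function name transfer_days is hypothetical, and the estimate ignores protocol overhead and congestion, so real transfers would take longer.

```python
def transfer_days(data_bytes, link_bits_per_second):
    """Days needed to move data over a link running at its full rated speed."""
    seconds = (data_bytes * 8) / link_bits_per_second
    return seconds / 86_400  # seconds in one day


PETABYTE = 10**15      # decimal petabyte, in bytes (illustrative data set size)
TEN_GBPS = 10 * 10**9  # 10 gigabits per second (illustrative link speed)

print(round(transfer_days(PETABYTE, TEN_GBPS), 2))
```

Under these assumptions the transfer takes a little over nine days, which helps explain why big data is usually processed where it is stored rather than moved to a single site.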
 
It is also worth noting that the type of data that is increasing most rapidly is unstructured data. This data type is characterized by "human information" such as high-definition videos, movies, photos, scientific simulations, financial transactions, phone records, genomic datasets, seismic images, geospatial maps, e-mail, tweets, Facebook data, call-center conversations, mobile phone calls, website clicks, documents, sensor data, telemetry, medical records and images, climatology and weather records, log files, and text. This type of data is not only voluminous, but the diversity of different formats makes it very difficult to manage and store in traditional relational database structures.
 
To review, see Big Data Opportunities and Challenges.
 

4d. Analyze the relationships between cloud computing and big data

  • In what ways are the administration of big data and cloud services related?
  • What role does the network play in provisioning cloud services for big data?

Cloud computing and big data are logically interrelated. Big data, typically stored in a data warehouse, may come from sources both internal and external to the organization. Because of the size and diversity of big data, it is necessary to apply more comprehensive processes to manage it. Cloud systems represent a very attractive option for doing this. The management and administration of these cloud database systems is the database administrator's responsibility.
 
One issue that arises when developing a cloud services relationship to support big data is provisioning. Provisioning involves specifying the range of services the user will require from the cloud services provider to meet the organization's needs. Integrating big data administration with the cloud infrastructure must be a joint effort between the provider of those services and the contract user. Requirements will also change over time, so provision needs to be made for the services to evolve as needs change.
 
Another dimension of provisioning for cloud services involves the network. Corporate data centers tend to be located where the company is, which may not offer the optimum characteristics for a data center. The cloud services provider's data centers, sited for those characteristics, may therefore be some distance from the organization. As a result, network issues such as latency and bandwidth need to be addressed.
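The effect of distance can be estimated with simple arithmetic. The sketch below computes the minimum round-trip delay for a signal traveling through optical fiber, using the common approximation that light travels about 200,000 km per second in fiber. The distances are illustrative assumptions, the function name round_trip_ms is hypothetical, and real delays are higher because of routing and equipment along the path.

```python
FIBER_KM_PER_SECOND = 200_000  # approximate speed of light in optical fiber


def round_trip_ms(distance_km):
    """Minimum round-trip propagation delay over fiber, in milliseconds."""
    one_way_seconds = distance_km / FIBER_KM_PER_SECOND
    return 2 * one_way_seconds * 1000


# Illustrative distances between an organization and a data center.
for km in (50, 1_000, 8_000):
    print(km, "km:", round_trip_ms(km), "ms")
```

A data center 50 km away adds about half a millisecond per round trip, while one 8,000 km away adds about 80 milliseconds. That difference matters for applications that exchange many small requests with the database.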
 
It is also worth noting that large cloud providers that implement geographically dispersed sites worldwide can typically achieve reliability rates sufficient for large-scale operations. These same levels of network reliability might not be achievable by the organization on its own.
 
To review, see Introduction to Big Data.
 

4e. Apply virtualization techniques to provide resources for cloud computing solutions

  • What is virtualization, and how is it accomplished?
  • What are some of the challenges of virtualization?

Network virtualization combines various hardware and software components that deliver network services into a single entity, managed by software, called a virtual network. Network virtualization could include the platform itself or another network resource. We often achieve this virtualization for large database systems by contracting with cloud service providers.
 
In a virtualized network with local autonomy, the individual local nodes are responsible for the administrative functions related to those nodes. This would include local node security, backup and recovery, and concurrency control.
 
Because virtualized systems are not confined geographically, their resources and data are often housed in data centers in several different countries. As a result, the storage and administration of the data is governed by an inconsistent set of laws and regulations.
 
To review, see Compliance and Risk in a Cloud Environment.
 

Unit 4 Vocabulary 

This vocabulary list includes the terms that you will need to know to successfully complete the final exam.

  • data center
  • distributed database
  • fault-tolerant
  • value
  • variety
  • velocity
  • veracity
  • virtualization
  • volume