BUS611 Study Guide

Unit 6: Data on the Internet

6a. Describe the importance of emerging technologies, such as APIs and web crawlers, that allow businesses to operate internet-enabled data management systems and carry them forward in a constantly evolving environment.

  • How is data stored and accessed over the internet?
  • How does a web crawler extract information from websites?

The growth of the internet has led to an increase in e-business and e-commerce. An e-business is any organization that conducts business over the internet. E-commerce is the buying and selling of goods and services, or the transmission of funds, over an electronic network, primarily the internet. Both e-business and e-commerce may occur as business-to-business (B2B), business-to-consumer (B2C), consumer-to-consumer (C2C), and consumer-to-business (C2B).
 
The internet gives us instant access to millions of IP addresses. An Internet Protocol address (IP address) is a numerical label, such as 188.6.7.4, assigned to each device connected to a network that uses the Internet Protocol for communication. It digitally connects us to numerous networks with the click of a key or touch of a screen. These advances have increased internet use for business, and the data collected online is easily used to support business growth.
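
As a concrete illustration, Python's standard ipaddress module can parse and check an address like the one above. This is only a small sketch using the example address from this paragraph:

    import ipaddress

    # Parse the example address from above; ip_address() raises ValueError
    # if the string is not a valid IPv4 or IPv6 address.
    addr = ipaddress.ip_address("188.6.7.4")
    print(addr.version)       # 4 (an IPv4 address)
    print(addr.is_private)    # False: publicly routable, not a private network address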
 
Websites collect and store vast amounts of data on each consumer, and organizations must determine what is relevant and irrelevant to that consumer. Data abstraction is the process of delivering only the necessary information while concealing background details. You have learned that database management systems (DBMS) are built from complex data structures; to improve the user experience and ease interaction on the internet, developers hide these irrelevant internal details from the user. The data collected is then used for marketing and to grow B2B and B2C sales.
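
The idea can be sketched in code: a program exposes only the operations a user needs while keeping the underlying storage hidden. The class and field names below are made up purely for illustration:

    # Hypothetical example: the caller asks for a customer's recent orders
    # without knowing how or where the records are actually stored.
    class OrderHistory:
        def __init__(self):
            # Internal detail hidden from the user: a simple in-memory list
            # standing in for whatever structures the DBMS really uses.
            self._records = [
                {"order_id": 1001, "total": 59.99},
                {"order_id": 1002, "total": 12.50},
            ]

        def recent_orders(self, limit=5):
            # Only the relevant piece of each record is returned.
            return [r["order_id"] for r in self._records[:limit]]

    history = OrderHistory()
    print(history.recent_orders())    # [1001, 1002]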
 
APIs are used to gather information from a wide variety of web sources, and both e-business and e-commerce applications rely on them to collect data from across the web.
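
For example, an application might request data from a web API over HTTP and read back a structured JSON response. The sketch below uses only Python's standard library, and the URL is hypothetical, standing in for whatever API a business actually queries:

    import json
    from urllib.request import urlopen

    # Hypothetical endpoint; a real application would use its provider's API URL.
    url = "https://api.example.com/products?category=books"

    with urlopen(url) as response:
        data = json.load(response)    # parse the JSON body into Python objects

    for product in data.get("products", []):
        print(product.get("name"), product.get("price"))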
 
A web crawler is used to browse web pages for the content of interest and to copy web pages for offline viewing and indexing. This content is often also used to populate a data warehouse.
 
To review, see What is Web Crawler?.


6b. Identify the different constructs of the Internet, such as web-crawling and web-enabled databases, and how they work together

  • What is an API, and how is it used?
  • What is a web crawler, and how is it used?

There are various approaches for extracting web data. Web data extraction is also known as web harvesting, web scraping, and screen scraping. It is commonly done through an Application Programming Interface (API): a set of definitions and protocols for building and integrating application software. APIs let a product or service communicate with other products and services without knowing how those other products have been implemented, which simplifies application development and saves time and money.
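
When a site offers no API, the same information can sometimes be pulled straight out of the page's HTML (screen scraping). The sketch below uses Python's built-in HTMLParser to collect prices from a small, hard-coded HTML snippet; in practice the HTML would first be downloaded from a website, and the class="price" markup is an assumption made for the example:

    from html.parser import HTMLParser

    class PriceScraper(HTMLParser):
        """Collects the text of every element marked with class="price"."""
        def __init__(self):
            super().__init__()
            self._in_price = False
            self.prices = []

        def handle_starttag(self, tag, attrs):
            if ("class", "price") in attrs:
                self._in_price = True

        def handle_endtag(self, tag):
            self._in_price = False

        def handle_data(self, data):
            if self._in_price:
                self.prices.append(data.strip())

    html = ('<ul><li>Notebook <span class="price">$4.99</span></li>'
            '<li>Pen <span class="price">$1.25</span></li></ul>')
    scraper = PriceScraper()
    scraper.feed(html)
    print(scraper.prices)    # ['$4.99', '$1.25']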
 
A web crawler is a script or program that browses the internet in a systematic, automated way. Web crawlers are also known as web spiders or web robots. Many websites, including search engines, use crawling to keep their data up to date. Web crawlers copy web pages so they can be processed later by a search engine, which allows users to find pages quickly when they search. Web crawlers are also sometimes used to extract other forms of information or data from websites.
 
The basic conceptual design of a web crawler is a loop: start from a list of seed URLs, fetch each page, store a copy for indexing, extract the links on the page, and add those links to the list of pages to visit next.
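
A minimal sketch of that loop, using only Python's standard library, is shown below. The seed URL is hypothetical, and a real crawler would also respect robots.txt and limit how fast it makes requests:

    from urllib.parse import urljoin
    from urllib.request import urlopen
    from html.parser import HTMLParser

    class LinkFinder(HTMLParser):
        """Collects the target of every link (<a href="...">) on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(value for name, value in attrs if name == "href" and value)

    def crawl(seed_url, max_pages=10):
        to_visit = [seed_url]      # frontier: URLs waiting to be fetched
        visited = set()
        pages = {}                 # URL -> page text, ready for indexing
        while to_visit and len(pages) < max_pages:
            url = to_visit.pop(0)
            if url in visited:
                continue
            visited.add(url)
            try:
                with urlopen(url) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except OSError:
                continue           # skip pages that cannot be fetched
            pages[url] = html      # keep a copy of the page for indexing
            finder = LinkFinder()
            finder.feed(html)
            # Turn relative links into absolute URLs and queue them for later.
            to_visit.extend(urljoin(url, link) for link in finder.links)
        return pages

    # pages = crawl("https://www.example.com/")    # hypothetical seed URL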


A commonly used language for formatting files that store data on the internet is Extensible Markup Language (XML). XML defines rules for encoding documents in a format designed to be both human-readable and machine-readable.
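
As a small example, the XML below stores two customer records, and Python's standard xml.etree.ElementTree module reads the same structure back into a program. The element and attribute names are made up for illustration:

    import xml.etree.ElementTree as ET

    xml_text = """
    <customers>
      <customer id="1"><name>Ada Lovelace</name><city>London</city></customer>
      <customer id="2"><name>Alan Turing</name><city>Manchester</city></customer>
    </customers>
    """

    root = ET.fromstring(xml_text)
    for customer in root.findall("customer"):
        print(customer.get("id"), customer.findtext("name"), customer.findtext("city"))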
 
To review, see Getting Data from the Web.


6c. Assess how clients are used to execute remote applications and access data

  • How are clients used to spread the computing load of large databases?
  • What are JSON and XML, and how are they used?

Clients are portions of a program that can run on computers separate from the main program. Typically, the main program, such as a distributed relational database, runs on a server, while the client part runs on a smaller machine such as a customer's laptop or as a mobile application.
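
A minimal sketch of this split, assuming the server exposes its database through remote procedure calls: the client below only asks a question and displays the answer, while the query itself runs on the server. The address and procedure name are hypothetical:

    import xmlrpc.client

    # Hypothetical address of the server running the main database program.
    server = xmlrpc.client.ServerProxy("http://db.example.com:8000/")

    # The client stays lightweight: it calls a remote procedure and shows the
    # result, while the query itself executes on the server next to the data.
    # orders = server.get_orders(42)    # hypothetical remote procedure
    # for order in orders:
    #     print(order["order_id"], order["total"])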
 
Websites are probably the most widely used and primary source of "big data". Organizations across every industry use technology to collect consumer data from websites and integrate it into database management systems (DBMS), where it can be stored in a structured format using rows and columns. JSON and XML are two formats often used to move that data around: JSON is lighter than XML, with fewer markup requirements, while XML supports richer document structure such as attributes and formal schemas.
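
To make the comparison concrete, here is the same made-up customer record in JSON, with the equivalent XML shown in a comment; Python's standard json module turns the JSON text back into program objects:

    import json

    json_text = '{"customer": {"id": 1, "name": "Ada Lovelace", "city": "London"}}'

    # The equivalent XML needs opening and closing tags around every value:
    # <customer id="1"><name>Ada Lovelace</name><city>London</city></customer>

    record = json.loads(json_text)
    print(record["customer"]["name"])    # Ada Lovelace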
 
To review, see Getting Data from the Web.


Unit 6 Vocabulary 

This vocabulary list includes the terms that you will need to know to successfully complete the final exam.

  • application programming interface (API)
  • client
  • data abstraction
  • e-business
  • e-commerce
  • IP address
  • JSON
  • web crawler
  • XML