Unit 6: Data on the Internet
The growth of the internet has led to an increase in e-business and e-commerce. An e-business is any organization that conducts business over the internet. E-commerce is the transmission of funds or money over an electronic network, primarily the internet. Both e-business and e-commerce may occur as business-to-business (B2B), business-to-consumer (B2C), consumer-to-consumer (C2C), and consumer-to-business (C2B) transactions.

The internet gives us instant access to millions of IP addresses and digitally connects us to numerous networks with the click of a key or the touch of a screen. As business use of the internet has grown, data has become easy to collect and use for business growth. Websites collect and store vast amounts of data on each consumer, and organizations determine what is relevant and irrelevant to that consumer. Data abstraction is a process that delivers only the necessary information while concealing background details. So far, you have learned that database management systems (DBMS) are made of complex data structures. To improve user experience and ease user interaction on the internet, developers hide irrelevant internal details from the user; this is the definition of data abstraction. The resulting data is used to conduct marketing and increase growth for B2B and B2C sales.

This unit covers data over the internet and its growth, which has been immense in the past few years and is accelerating. The unit also reviews data integration and information retrieval, such as structured queries over the web.
Completing this unit should take you approximately 3 hours.
Upon successful completion of this unit, you will be able to:
- describe the importance of APIs and web crawlers that allow businesses to operate data management systems;
- identify the different constructs of the internet, such as web crawling and web-enabled databases, and how they work together; and
- assess how clients are used to execute remote applications and access data.
6.1: Web Data
Websites are probably the most widely used and primary source of "big data". Organizations across every industry use technology to collect, store, and integrate consumer data sourced from websites in database management systems (DBMS). A relational database management system (RDBMS) is a DBMS designed especially for relational data, which makes it easy to store web data in a structured format using rows and columns.
Web analytics provide organizations with consumer or visitor data, which is used to optimize content based on user interest. Think back to what you learned about SQL in Unit 5 and the importance of that programming language. Since the internet is the most common source of big data, it is important that you continue to develop your SQL programming skills to leverage web data.
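To make this concrete, here is a minimal sketch of storing web data in rows and columns using Python's built-in sqlite3 module. The table name and columns are illustrative assumptions, not taken from the course materials:

```python
# A minimal sketch: storing web analytics data in a relational table
# and querying it with SQL, using Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("""
    CREATE TABLE page_visits (
        visit_id   INTEGER PRIMARY KEY,
        visitor_ip TEXT,
        page_url   TEXT,
        visited_at TEXT
    )
""")
conn.execute(
    "INSERT INTO page_visits (visitor_ip, page_url, visited_at) VALUES (?, ?, ?)",
    ("203.0.113.7", "/products", "2024-01-15T10:30:00"),
)

# A SQL query like those from Unit 5 pulls the web data back out.
for row in conn.execute("SELECT page_url, COUNT(*) FROM page_visits GROUP BY page_url"):
    print(row)
conn.close()
```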
Read this article, paying attention to the various methods to collect data from the internet.
Watch this video on how APIs are used to gather information from a wide variety of web sources. Consider how APIs can also help e-businesses and e-commerce operations gather this kind of data. Then answer these questions: How are APIs used to collect and store consumer data? What are some of the ethical issues associated with this collection and storage of consumer data?
Now, watch this tutorial on the Python web scraper. You do not have to download Python; just watch the tutorial and follow along with the presenter. Pay attention to how the Python API is used to request online user information, cookies, and JavaScript Object Notation (JSON) data.
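If you want to experiment alongside the tutorial, the sketch below shows the same ideas using Python's requests library: sending a request, parsing a JSON response, and reading any cookies the server sets. The httpbin.org test service is assumed here purely for illustration; it is not necessarily the site used in the tutorial:

```python
# A minimal sketch of requesting JSON data from a web API with the
# "requests" library. httpbin.org is a public testing endpoint.
import requests

response = requests.get("https://httpbin.org/json", timeout=10)
response.raise_for_status()       # stop if the server returned an error

data = response.json()            # parse the JSON body into a Python dict
print(data)

# Any cookies set by the server are exposed on the response object
# (this particular endpoint sets none, so the dict may be empty).
print(response.cookies.get_dict())
```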
6.1.1: Approaches to Web Data Abstraction
Advancements in technology have sparked growth in e-business and e-commerce. As a result, web data has become the primary source of "big data". Next, you will learn different data abstraction methodologies for organizational web data.
There are various approaches to abstracting web data. Web data extraction is also known as web harvesting, web scraping, and screen scraping, and it is commonly done through an application programming interface (API).
An API is a software intermediary that allows communication between two or more applications. Remember how an API was used to request online user information? Let's learn a few more approaches to web data abstraction.
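As an illustration of screen scraping, here is a minimal sketch using the requests and BeautifulSoup libraries. The URL is a placeholder; a real scraper would target a specific site and respect its robots.txt and terms of use:

```python
# A minimal screen-scraping sketch: fetch a page and pull out the
# text of its headings using requests and BeautifulSoup.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Extract the text of every top-level heading on the page.
for heading in soup.find_all(["h1", "h2"]):
    print(heading.get_text(strip=True))
```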
Read this article, paying attention to the data abstraction section. What did you learn about the powerful design methodology called data abstraction?
Read this article. Be sure you can explain the approaches for extracting data based on usability.
6.1.2: Applications of Data Abstraction
You learned that abstraction is the process of selecting relevant data from databases. Once you model data using abstraction, the same data can be used with different applications. For example, Java implements abstraction using abstract classes and interfaces.
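The course example is Java, but the same idea carries over to Python, which expresses abstraction through the abc module. The class names below are hypothetical, chosen only to illustrate hiding storage details behind an abstract interface:

```python
# A minimal sketch of data abstraction, analogous to Java's abstract
# classes: callers see only the abstract interface, while the
# concrete storage details stay hidden.
from abc import ABC, abstractmethod

class CustomerStore(ABC):
    @abstractmethod
    def get_customer(self, customer_id: int) -> dict:
        """Return a customer record; how it is stored is hidden."""

class InMemoryCustomerStore(CustomerStore):
    def __init__(self):
        self._rows = {1: {"id": 1, "name": "Ada"}}   # internal detail

    def get_customer(self, customer_id: int) -> dict:
        return self._rows[customer_id]

store: CustomerStore = InMemoryCustomerStore()
print(store.get_customer(1))   # callers never touch _rows directly
```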
Read this article, paying attention to each application used in data abstraction. Be able to explain the purpose of each application and summarize holistic configuration.
6.1.3: Web Crawling
A web crawler is an automated script or program that browses the internet in a methodical, automated way. Web crawlers are also known as web spiders or web robots. Many websites, particularly search engines, use crawling to provide up-to-date data.
Web crawlers copy web pages so a search engine can process and index them later, which allows users to find web pages quickly. Sometimes, web crawlers are used to extract other forms of information or data from websites.
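The sketch below shows, under simplifying assumptions, how a basic crawler works: fetch a page, extract its links, and follow them. It uses the requests and BeautifulSoup libraries; a production crawler would also honor robots.txt, rate-limit itself, and identify itself in its User-Agent header:

```python
# A minimal breadth-first web-crawler sketch.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url: str, max_pages: int = 5) -> None:
    """Crawl breadth-first from seed_url, fetching at most max_pages pages."""
    seen = {seed_url}            # URLs already queued, to avoid revisiting
    queue = deque([seed_url])
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue             # skip pages that fail to load
        fetched += 1
        print("Fetched:", url)
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])   # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)

crawl("https://example.com")
```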
Watch this video on web crawling. Pay attention to the different components of web crawlers: policy, security, identifier, and general purpose.
6.1.4: Legal Issues
You learned about using web crawling to browse the internet and extract data from other sites, a practice made easier by governments passing open data laws. However, there are a few legal concerns associated with web crawling.
Web crawling is legal when you do it for your own purposes; this falls under the fair use doctrine. However, problems and difficulties start if you want to use scraped data for other reasons, particularly commercial purposes. On September 9, 2019, the U.S. Ninth Circuit Court of Appeals ruled that web scraping public websites does not violate the Computer Fraud and Abuse Act (CFAA). Even so, some website owners consider scraping theft because they believe this information is "their own".
During this unit, you learned about web data collected via e-business and e-commerce operations. Because of advancements in technology, websites account for the majority of "big data". You now understand the approaches, methodologies, and applications associated with data abstraction. Remember, web crawling collects data from websites that may then be shared with outside agencies. Because laws and regulations are always changing, you should stay knowledgeable about the country and state open data laws that govern web crawling. Unit 7 will cover data sharing between users and organizations in more detail.
Read this article and pay attention to the Evidence Act of 2019 and the OPEN Government Act of 2019. Compare these acts with the Health Insurance Portability and Accountability Act (HIPAA), Personally Identifiable Information (PII), the Family Educational Rights and Privacy Act (FERPA), and the National Institutes of Health (NIH). What happens when restrictions are applied to data that is already available to the public?
Study Guide: Unit 6
We recommend reviewing this Study Guide before taking the Unit 6 Assessment.
Unit 6 Assessment
Take this assessment to see how well you understood this unit.
- This assessment does not count towards your grade. It is just for practice!
- You will see the correct answers when you submit your answers. Use this to help you study for the final exam!
- You can take this assessment as many times as you want, whenever you want.