4.1 - Ethical and Social Issues Around Data Collection

mg8mer

Introduction

Welcome to the FiveHive article for topic 4.1 of AP CSA!

In this article, we will cover the learning objectives of 4.1 as shown in the AP® Computer Science A Course and Exam Description 2025:

  • 4.1.A Explain the risks to privacy from collecting and storing personal data on computer systems.
  • 4.1.B Explain the importance of recognizing data quality and potential issues when using a data set.
  • 4.1.C Identify an appropriate data set to use in order to solve a problem or answer a specific question.

This unit covers A LOT of content (17 topics)! But the unit title is accurate: we are mostly learning about storing collections of data in Java. Specifically, we will learn about creating arrays, ArrayLists, and 2D arrays; accessing, traversing, and sorting them; recursion; and more!

But of course, when we store data, there are ethical implications in how it is collected and used. That’s why we’re tackling this topic first before diving into the actual programming concepts!

In particular, we’re covering the following: privacy risks from storing personal data, data quality issues including algorithmic bias, and selecting appropriate data sets depending on our goal or task.

Without further ado, let’s jump right in! 

Privacy Risks in Data Collection

You may be thinking: why are there privacy risks when storing data? Isn’t data usually impersonal? That’s not always the case! Many times you’ll find yourself needing to keep track of others’ personal information, including names, addresses, phone numbers, email addresses, Social Security numbers (SSNs), IP addresses, location data, and browsing history. This information can easily be used to identify and profile individuals, leading to consequences such as identity theft and fraud.

That is why your role as a developer is to always safeguard the personal privacy of users.

While not necessarily included in the CED, these means of protecting privacy should always be on your mind when programming:

  • Only collect data when necessary (e.g., don’t ask for SSNs when your app is a simple game).
  • Be transparent about what data you’re collecting and what you’re using it for.
  • ALWAYS MAKE SURE THE USER CONSENTS TO DATA COLLECTION.
  • Use tools such as encryption to hide sensitive information like passwords.
  • Ensure only people with specific credentials can access data.
  • Regularly delete data from databases once it is no longer being used.
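
To make a few of these habits concrete, here is a minimal Java sketch. It is not from the CED, and the PlayerProfile class, its fields, and its methods are hypothetical names made up for illustration. It stores only what a simple game might need, keeps the email only if the user consented, and hands back a masked version instead of the full address.

```java
// Hypothetical example class; the names here are assumptions for illustration.
class PlayerProfile {
    // Only the fields the game actually needs: no address, phone number, or SSN.
    private String username;
    private String email; // stays null unless the user consented

    public PlayerProfile(String username, String email, boolean consented) {
        this.username = username;
        // Keep the email only if the user agreed to data collection.
        if (consented) {
            this.email = email;
        } else {
            this.email = null;
        }
    }

    public String getUsername() {
        return username;
    }

    // Returns a masked email (first character, then ***, then the domain)
    // so the full address is never exposed through this class.
    public String getMaskedEmail() {
        if (email == null) {
            return "(not collected)";
        }
        int at = email.indexOf("@");
        if (at <= 0) {
            return "***";
        }
        return email.substring(0, 1) + "***" + email.substring(at);
    }

    public static void main(String[] args) {
        PlayerProfile p = new PlayerProfile("ada", "ada@example.com", true);
        System.out.println(p.getMaskedEmail()); // prints a***@example.com
    }
}
```

Masking like this is not a substitute for encryption or access control. It only shows the idea of collecting less and exposing less.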

Algorithmic Bias and Data Quality

Algorithmic Bias

Beyond privacy, we must consider data quality and how it affects our programs. Poor data quality leads to incorrect results, and biased data leads to unfair outcomes. This is known as algorithmic bias, which is when a program makes systematic, repeatable errors that produce unfair outcomes for a specific group of users.

Below are several examples of how poor data can lead to bias and error:

  • A hiring algorithm trained on historical hiring data might disadvantage women if the company previously hired mostly men, even if that pattern was discriminatory.
  • Facial recognition systems trained primarily on lighter-skinned individuals often have much higher error rates for people with darker skin.
  • Credit scoring systems relying heavily on ZIP codes may discriminate against people from lower-income neighborhoods, even if they're personally creditworthy.

Yeah… bad data isn’t great, and another one of our responsibilities as programmers is to ensure that our programs use data that is representative of our user base, so that they work well across demographics. Otherwise, when it comes time to draw conclusions from our data, bias will follow.
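
One rough way to spot an unrepresentative data set before using it is to measure how much of the data each group makes up. The sketch below assumes every record carries a group label stored in a String array. The RepresentationCheck class, the share method, and the labels "A" and "B" are all hypothetical.

```java
// Hypothetical helper; assumes each record has a group label.
class RepresentationCheck {
    // Returns the fraction of records whose label equals target.
    // Returns 0.0 for an empty data set to avoid dividing by zero.
    public static double share(String[] labels, String target) {
        if (labels.length == 0) {
            return 0.0;
        }
        int count = 0;
        for (String label : labels) {
            if (label.equals(target)) {
                count++;
            }
        }
        return (double) count / labels.length;
    }

    public static void main(String[] args) {
        String[] labels = {"A", "A", "A", "A", "B"};
        System.out.println(share(labels, "A")); // prints 0.8
        System.out.println(share(labels, "B")); // prints 0.2
    }
}
```

A lopsided share does not prove a program will be biased, but it is a warning sign that the data may not reflect everyone the program will serve.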

Incomplete and Inaccurate Data

While some data sets are somewhat skewed and result in harm to specific groups, sometimes data sets can lead to bad outcomes for everybody! That happens when data sets contain outright incorrect or incomplete data, preventing software from achieving its intended functionality.

Picture this: Let’s say a class of 30 students took an AP CSA Unit 4 test, and 6 or 7 students were absent that day. The next day, the instructor grades all the exams and calculates the class average for documentation purposes… but wait, what about the absent students? That’s the problem: our data is incomplete, so the calculated average may misrepresent the true class average!

But worse yet, the instructor was sleepy while grading some tests, so some grades were a few points off. Now our data is also inaccurate!

Moral of the story: as programmers, we also have to ensure that the data we use is accurate and doesn’t misrepresent the bigger picture. This can be done through input validation, accounting for missing data, and regularly cleaning data to ensure accuracy.
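
Here is a small sketch of what that could look like for the test-score story. It assumes scores are stored in an int array, that valid scores fall between 0 and 100, and that -1 marks an absent student. The ScoreCleaner class, its methods, and the -1 convention are all assumptions made for this example.

```java
// Hypothetical helper; the MISSING marker and 0..100 range are assumptions.
class ScoreCleaner {
    public static final int MISSING = -1; // marks an absent student

    // Input validation: a score is valid only if it is between 0 and 100.
    public static boolean isValid(int score) {
        return score >= 0 && score <= 100;
    }

    // Counts entries that are missing or out of range, so we know how
    // incomplete the data is before trusting any average.
    public static int countInvalid(int[] scores) {
        int count = 0;
        for (int score : scores) {
            if (!isValid(score)) {
                count++;
            }
        }
        return count;
    }

    // Averages only the valid scores; returns -1.0 if there are none.
    public static double averageOfValid(int[] scores) {
        int sum = 0;
        int count = 0;
        for (int score : scores) {
            if (isValid(score)) {
                sum += score;
                count++;
            }
        }
        if (count == 0) {
            return -1.0;
        }
        return (double) sum / count;
    }

    public static void main(String[] args) {
        int[] scores = {90, 80, MISSING, 105, 70};
        System.out.println(averageOfValid(scores)); // prints 80.0
        System.out.println(countInvalid(scores));   // prints 2
    }
}
```

Reporting the number of skipped entries alongside the average matters: an average of 80.0 built from 3 out of 5 scores tells a different story than one built from all 5.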

Selecting Appropriate Data Sets

Not all data sets are suitable for every purpose. The contents of a data set may relate to one specific question or topic, which means they may not be appropriate for answering, or extrapolating information about, a different one.

A data set must be relevant to the question you're trying to answer. Using data collected for one purpose to answer a different question can lead to misleading conclusions. For example, let’s say you want to understand how students perform in AP Computer Science A. Which data set is appropriate?

  • AP CS A exam scores from recent years
  • SAT Math scores (different subject)
  • College CS grades 

Obviously, we want the data set with AP CS A exam scores from recent years, as it directly measures what we’re asking about, while the others measure something else.

That’s it for content. Now let’s practice!

Practice