Yes, Data Can Be Biased

If you had the opportunity to build a model or algorithm from the ground up, what is the first thing that comes to mind: methodology, outcomes, impact? Many people would say you must begin with data. Data is a good place to start, but much like technology, data is not neutral, and it can shape the success and benefits of your model. This is why we hear time and again that equitable data collection is crucial. Everyone needs to be represented, and represented in a manner that accounts for existing problems as well as new ones.

We are witnessing the negative impact of inequitable data collection in real time with the COVID-19 pandemic and recovery. Pandemic to Prosperity, a pandemic-focused data project that examines the most vulnerable communities, took a deep dive into the impact of COVID on southern states, for example. A key finding: “Reports out of Virginia and Alabama explain that data collection problems are actually hampering the states’ ability to receive more vaccinations from the CDC.” This inequity will carry through to vaccine distribution as long as state and federal systems have yet to be seamlessly integrated.

Building a model from the beginning is not an option for many of us. How, then, can we push for more equitable design? One way is through examining data collection. As more entities, public and private, use artificial intelligence (AI) to help solve problems, we all must take into account the data that is not being collected, and who it belongs to. We must look at how equitable data structures, policies, and procedures are crucial to fighting systemic bias and inequities. Inequitable data feeding into imperfect tech systems is unacceptable and can lead to inequitable policies. Algorithmic bias can stem from data bias, which, in turn, exacerbates existing systemic discrimination, particularly in housing, credit, employment, and education. Algorithms and AI “learn” by running an equation repeatedly over different combinations of data, often stemming from one original dataset. Therefore, if that dataset is lacking in representation or rooted in discriminatory practices, it does not matter how many times the data is run through a model: the outcomes will always be biased. Equitable datasets are crucial to ensuring that AI makes better and more equitable decisions.
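To make that point concrete, here is a minimal, hypothetical sketch (our own toy illustration, not drawn from any real system or from the reports cited above): a toy “model” learns approval rates from a synthetic dataset in which one group is both underrepresented and was historically approved less often. However many passes the model makes over that same data, its outputs remain skewed against that group.

```python
# Toy illustration (hypothetical data): a biased dataset produces biased
# outputs no matter how many times the model trains on it.
import random

random.seed(0)

def make_biased_dataset(n=1000):
    """Synthetic historical decisions: group B is underrepresented (~10% of
    records) and was held to a higher bar despite identical scores."""
    data = []
    for _ in range(n):
        group = "A" if random.random() < 0.9 else "B"
        score = random.uniform(0, 1)                      # same score distribution for both groups
        approved = score > (0.5 if group == "A" else 0.8)  # historical bias against group B
        data.append((group, score, approved))
    return data

def train(data, epochs):
    """'Learn' a per-group approval rate; extra epochs only re-read the same data."""
    rates = {}
    for _ in range(epochs):
        for g in ("A", "B"):
            rows = [approved for (grp, _score, approved) in data if grp == g]
            rates[g] = sum(rows) / len(rows)
    return rates

data = make_biased_dataset()
for epochs in (1, 100):
    model = train(data, epochs)
    print(f"epochs={epochs:>3}  learned approval rate  A={model['A']:.2f}  B={model['B']:.2f}")
```

In this sketch, the only way to change the model’s behavior is to change the data it learns from, which is exactly why collection matters.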

Thankfully, there are resources that can help. The Leadership Conference on Civil and Human Rights developed Civil Rights Principles for COVID-19 Vaccine Development and Distribution as a framework for this task. One of those principles is, “[r]obust data collection around vaccine development and distribution must be an instrumental part of our nation’s COVID-19 response…Now, as we enter the vaccination stage of the pandemic, data can be used for help tracking progress in the fight against COVID-19. As populations begin to be vaccinated, data on vaccinations should be analyzed to help us identify gaps and understand how to rectify them.” 

Data inaccuracies are already known; we know which communities are underrepresented in recovery data. State governments and health officials can step in now to course correct. Collecting data that makes all communities visible is the first step, so that states like Virginia and Alabama can receive the vaccine doses they need. Understanding why that data is not being collected is just as important. With a change in data collection processes, state governments can feed more accurate data into the federal vaccine distribution systems, which will lead to more equitable vaccine distribution.

Equitable data collection is also crucial for a full and robust recovery, not just from the pandemic but from the economic consequences it has wrought for so many people. While the COVID-19 pandemic has been disastrous for all, marginalized communities (especially Black, Indigenous, Asian, and Latino communities) bear the brunt of its economic consequences. This is why accurate data is crucial. With the right data, government and the private sector can work together to target resources to the communities most disadvantaged by the pandemic. Equitable data collection is fundamental to ensuring that the COVID recovery process works for more people, and specifically for marginalized communities.

Things will only change when practitioners build the mantra of ‘data is not neutral, nor are the ways we use data’ into their work. After all, it’s only when we consider the impact of data during the collection process that we can help create more equitable systems — something we all must strive for on a daily basis.

Maria Filippelli is a Public Interest Technology Census Fellow at New America.

Bertram Lee is Counsel for Media and Tech at The Leadership Conference on Civil and Human Rights. In his role, he works to advance the interests of marginalized communities in technology and media policy.