Gender Gaps Correlate with Gender Bias in Social Media Word Embeddings

Abstract

Gender status, gender roles, and gender values vary widely across cultures. Anthropology has provided qualitative accounts of economic, cultural, and biological factors that impact social groups, and international organizations have gathered indices and surveys to help quantify gender inequalities in states. Concurrently, machine learning research has recently characterized pervasive gender biases in AI language models, stemming from biases in their textual training data. While these machine biases produce sub-optimal inferences, they may help us characterize and predict statistical gender gaps and gender values in the culture(s) that produced the training text, thereby helping us understand cultural context through big data. This paper presents an approach to (1) construct word embeddings (i.e., vector-based lexical semantics) from a region’s social media, (2) quantify gender bias in those word embeddings, and (3) correlate the biases with survey responses and statistical gender gaps in education, politics, economics, and health. We validate this approach using 2018 Twitter data spanning 143 countries and 51 U.S. territories, 23 international and 7 U.S. gender gap statistics, and seven international survey results from the World Values Survey. Integrating these heterogeneous data across cultures is an important step toward understanding (1) how biases in culture might manifest in machine learning models and (2) how to estimate gender inequality from big data.
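Steps (2) and (3) of the approach can be illustrated with a minimal sketch. It is not the paper's implementation: the bias measure (mean cosine similarity to male terms minus mean cosine similarity to female terms), the function names `cosine`, `gender_bias`, and `pearson`, the two-dimensional vectors, the region names, and the gap values are all assumptions made up for illustration. In practice, step (1) would supply embeddings trained on each region's tweets.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def gender_bias(embeddings, word, male_words, female_words):
    """Assumed bias measure: mean similarity of `word` to male terms
    minus its mean similarity to female terms (positive = male-leaning)."""
    w = embeddings[word]
    male = np.mean([cosine(w, embeddings[m]) for m in male_words])
    female = np.mean([cosine(w, embeddings[f]) for f in female_words])
    return float(male - female)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(xs, ys)[0, 1])

# Toy, made-up 2-D embeddings for three hypothetical regions (not real data).
he, she = np.array([1.0, 0.0]), np.array([0.0, 1.0])
region_embeddings = {
    "region_a": {"he": he, "she": she, "engineer": np.array([1.0, 0.0])},
    "region_b": {"he": he, "she": she, "engineer": np.array([1.0, 1.0])},
    "region_c": {"he": he, "she": she, "engineer": np.array([0.0, 1.0])},
}

# Step (2): one bias score per region for a target word.
biases = [
    gender_bias(emb, "engineer", ["he"], ["she"])
    for emb in region_embeddings.values()
]

# Step (3): correlate the per-region biases with a per-region statistic.
# These gap values are placeholders, not real survey or index figures.
placeholder_gaps = [0.30, 0.25, 0.10]
r = pearson(biases, placeholder_gaps)
print(biases, r)
```

With real data, each region would contribute one bias score per target word or word set, and the correlation would be computed across all regions against each gender gap statistic or survey result.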
