
EMBEDDING MODELS: MEASURING RACISM IN LANGUAGE USING WORD EMBEDDINGS: METHODS AND APPLICATIONS IN SOUTH AFRICAN NEWS

Figshare | Updated: 2026-02-20 | Indexed: 2026-04-28
Download link:
https://figshare.com/articles/dataset/EMBEDDING_MODELS_MEASURING_RACISM_IN_LANGUAGE_USING_WORD_EMBEDDINGS_METHODS_AND_APPLICATIONS_IN_SOUTH_AFRICAN_NEWS/31376707
Description:
Racism persists in language, but its contemporary forms are increasingly subtle, diffuse, and difficult to measure at scale, especially in contexts like post-apartheid South Africa, where explicit racist content may decline while inequalities persist. Word embeddings provide a means to measure racism in language. Yet most embedding-based studies focus on gender in Global North corpora, rely on ad hoc race dimensions, and rarely consider how normativity (e.g., White normativity) shapes the bipolar instrument commonly used to measure bias in embeddings. This research uses word embeddings to study racism in South African news and strengthens the methodological toolkit for constructing race dimensions in word embeddings.

First, it uses Word2Vec embeddings to examine how Black and White categories are associated with socioeconomic stereotypes and health descriptors in South African news articles. The results suggest that poor socioeconomic conditions are generally discussed in contexts associated with Black, and good socioeconomic conditions in contexts associated with White, and that these patterns correspond with human judgments. Health results were weaker, mixed, and corpus-dependent.

Second, it investigates bias at the speaker level using news quotes. It constructs "speaker landscapes" that create a vector representation of each speaker based on the language they use, and it measures the association of this vector with the centroid of topic keywords. The results show that White voices are often quoted in global and technical vaccination discourse, while Black voices are more peripheral and locally framed.

Third, the thesis provides statistical metrics (PairDir and a PCA-based Axis Coherence Score) to assess anchor quality when constructing race dimensions. Name-based anchors, especially sub-Saharan African (SSA) and European names, prove more stable and generalisable than American names, race terms, or geographical categories.

Finally, the research investigates White normativity in race dimensions using the SSA/European categories and embedding models from the previous study. It shows that neutral and valence words tend to cluster closer to the White pole, demonstrating semantic normativity, and that bipolar difference-of-centroids axes can amplify one pole beyond what unipolar measures imply.

Overall, the thesis introduces a computational approach for studying racism and race-based identity bias in South African news, shows evidence of both in news coverage, and provides statistical metrics to improve the construction of race dimensions. It also links theories of White normativity to the geometry of embedding spaces and advances the methodological tools for analysing racism in language using word embeddings.
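
The contrast between a bipolar difference-of-centroids axis and unipolar measures can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the toy three-dimensional vectors, and the use of cosine similarity as the unipolar measure are all assumptions made for illustration. In practice the anchor vectors would come from a trained embedding model such as Word2Vec.

```python
import numpy as np


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def centroid(vectors):
    """Mean of a set of anchor vectors (one row per anchor word)."""
    return np.mean(vectors, axis=0)


def bipolar_score(word_vec, pole_a, pole_b):
    """Cosine between a word and the axis centroid(pole_a) - centroid(pole_b).

    Positive values lean towards pole_a, negative values towards pole_b.
    """
    axis = unit(centroid(pole_a) - centroid(pole_b))
    return float(np.dot(unit(word_vec), axis))


def unipolar_scores(word_vec, pole_a, pole_b):
    """Cosine similarity of a word to each pole centroid, reported separately."""
    w = unit(word_vec)
    sim_a = float(np.dot(w, unit(centroid(pole_a))))
    sim_b = float(np.dot(w, unit(centroid(pole_b))))
    return sim_a, sim_b


# Hypothetical toy anchors; real anchors would be embedding vectors of names or terms.
pole_a = np.array([[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]])
pole_b = np.array([[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]])
word = np.array([0.6, 0.4, 0.5])

bipolar = bipolar_score(word, pole_a, pole_b)
sim_a, sim_b = unipolar_scores(word, pole_a, pole_b)
print(f"bipolar={bipolar:.3f} sim_a={sim_a:.3f} sim_b={sim_b:.3f}")
```

The bipolar score collapses a word's position into a single signed number, so a word that is fairly similar to both poles still receives a clear direction. Reporting the two unipolar similarities side by side keeps that information visible, which is the kind of comparison the abstract refers to.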
Created:
2026-02-20