Extracting Age-Related Stereotypes from Social Media Texts

Kathleen C. Fraser, Svetlana Kiritchenko, Isar Nejadgholi
National Research Council Canada, Ottawa, Canada

In Proceedings of the 13th edition of the Language Resources and Evaluation Conference (LREC), June 2022

Age-related stereotypes are pervasive in our society, and yet have been under-studied in the NLP community. Here, we present a method for extracting age-related stereotypes from English-language Twitter data, generating a corpus of 300,000 over-generalizations about four contemporary generations (baby boomers, generation X, millennials, and generation Z), as well as "old" and "young" people more generally. By employing word-association metrics, semi-supervised topic modelling, and density-based clustering, we uncover many common stereotypes as reported in the media and in the psychological literature, as well as some more novel findings. We also observe trends consistent with the existing literature, namely that definitions of "young" and "old" age appear to be context-dependent, stereotypes for different generations vary across different topics (e.g., work versus family life), and some age-based stereotypes are distinct from generational stereotypes. The method easily extends to other social group labels, and therefore can be used in future work to study stereotypes of different social categories. By better understanding how stereotypes are formed and spread, and by tracking emerging stereotypes, we hope to eventually develop mitigating measures against such biased statements.

[paper] [Query terms] [Topic anchor terms]


A. Curation Rationale

The corpus consists of over 300,000 English sentences, each stating an over-generalization about one of four contemporary generations (baby boomers, generation X, millennials, and generation Z), or about "old" or "young" people more generally. The corpus has been collected to study age-related stereotypes frequently expressed in social media. Studying stereotypes occurring in real social interactions can contribute to our understanding of how stereotypes are formed and spread, and their impact on target groups and inter-group relations. Identifying and monitoring the dynamics of currently pervasive group perceptions is a necessary first step before intervening with educational, counter-narrative, and other mitigating measures.

Stereotypes are over-generalizations about the characteristics of a group of people, such that an individual is assumed to have these characteristics simply based on their perceived membership in the group. Stereotypes can lead to prejudicial behaviour against members of a group, as well as psychological harm. Of particular concern is stereotyping on the basis of protected characteristics, such as race, sex, religion, or age. This corpus is related to age-based stereotyping and ageism. Ageism is widespread in North American society and can lead to age-based bias in the workplace, media representation, and the healthcare system. Furthermore, such stereotypes can become a self-fulfilling prophecy when they are internalized by people who self-identify as older adults, leading to isolation and health decline.

The corpus provides a means to analyze pervasive age-related stereotypes from naturally occurring data on Twitter. It contains spontaneously expressed opinions of thousands of individuals and therefore makes it possible to detect less common or emerging stereotypes, without being limited by a priori expectations of social stereotypes. We collected tweets mentioning six age-related groups: baby boomers, generation X, millennials, generation Z, older adults, and young people. The collection was performed using the Twitter API over a period of three months, from August 20, 2021 to November 20, 2021. As search queries, we selected terms frequently used to refer to the groups (e.g., boomers, gen xers, senior citizens) as well as some common misspellings (e.g., milennials). The full list of the query terms is available above.

A number of filtering steps were applied to the collected tweets to help reduce the amount of irrelevant texts generated by bots as well as ads, news headlines, and promotional campaigns written by organizations. (For full details, see the paper.) We further retrieved only sentences where the target group is a nominal subject of the main or a subordinate clause of the sentence. Sentences where a target group was described with qualifiers referring only to some members of the group (e.g., some, these, several, few) were excluded. Further, we discarded sentences where different target groups (e.g., boomers and millennials) were discussed together.
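The last two exclusion rules above (partial-group qualifiers and sentences mentioning more than one target group) can be illustrated with a minimal sketch. This is not the pipeline used to build the corpus: the term lists are abbreviated and partly hypothetical, the function names are invented for illustration, and the nominal-subject check is omitted because it requires a dependency parser.

```python
import re

# Abbreviated, partly hypothetical term lists for illustration only;
# the actual query terms are given in the linked list.
GROUP_TERMS = {
    "boomers": ["baby boomers", "boomers"],
    "gen_x": ["generation x", "gen xers"],
    "millennials": ["millennials", "milennials"],
    "gen_z": ["generation z", "gen z"],
    "old": ["old people", "senior citizens"],
    "young": ["young people"],
}

# Qualifiers named in the text as referring to only some group members.
QUALIFIERS = ["some", "these", "several", "few"]


def groups_mentioned(sentence):
    """Return the set of target groups whose terms appear in the sentence."""
    text = sentence.lower()
    found = set()
    for group, terms in GROUP_TERMS.items():
        for term in terms:
            if re.search(r"\b" + re.escape(term) + r"\b", text):
                found.add(group)
    return found


def has_partial_qualifier(sentence):
    """True if a group term is directly preceded by a partial qualifier,
    e.g. 'some boomers' or 'several of the millennials'."""
    text = sentence.lower()
    qualifier_alt = "|".join(QUALIFIERS)
    for terms in GROUP_TERMS.values():
        for term in terms:
            pattern = (
                r"\b(?:" + qualifier_alt + r")\s+(?:of\s+)?(?:the\s+)?"
                + re.escape(term) + r"\b"
            )
            if re.search(pattern, text):
                return True
    return False


def keep_sentence(sentence):
    """Keep a sentence only if it mentions exactly one target group
    and does not restrict the claim to part of that group."""
    return len(groups_mentioned(sentence)) == 1 and not has_partial_qualifier(sentence)
```

Under these assumptions, a sentence such as "Boomers are bad with technology." is kept, while "Some boomers are bad with technology." and "Boomers and millennials are always arguing." are discarded. A surface-pattern check like this is only an approximation; a parser-based approach would handle qualifiers separated from the group term by other words.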

B. Language Variety

The data was collected via the Twitter API with the language option set to English; therefore, any variety of English recognized as English by the Twitter language identification tool can be present.

C. Speaker Demographic

No direct demographic information about the speakers is available.

According to Statista, Twitter users worldwide tend to be male and between the ages of 25 and 49. The United States of America has the most users. According to Pew Research Center, Twitter users in the US are younger, more highly educated, and have higher incomes than the general public. However, the present corpus was collected using specific query terms and restricted to English-language tweets. Thus, its user demographics might differ from the general Twitter demographics.

D. Annotator Demographic

There are no annotations.

E. Speech Situation

The corpus was collected between August 20, 2021 and November 20, 2021. The tweets mostly represent informal, spontaneous, asynchronous written language. The intended audience is friends and followers of the user or the general Twitter audience. Each tweet is limited to 280 characters.

F. Text Characteristics

The sentences represent spontaneous expressions of Twitter users about common traits and expected behaviors of members of an age-related group. The sentences state over-generalized views on the groups, treating all members of a group as one entity. Various topics are covered, including, but not limited to, family, work, politics, technology, and health. Only textual information is included in the corpus; images, videos, and URLs were removed.
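As an illustration of the kind of text-only cleanup described above, the following is a minimal sketch of URL stripping. It is not the authors' preprocessing code; the pattern and the function name are assumptions, and it relies on the further assumption that attached media appear in the tweet text as links.

```python
import re

# Hypothetical pattern: matches http(s) links and bare "www." links.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def strip_urls(text):
    """Remove URLs from a tweet and collapse the leftover whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_urls).strip()
```

For instance, under these assumptions, `strip_urls("Millennials love avocado toast https://t.co/abc123")` leaves only the sentence text.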

G. Recording Quality

N/A

H. Other

N/A

I. Provenance Appendix

N/A