PyData Global 2025

Text Mining Orkut’s Community Data with Python: Cultural Memory, Platform Neglect, and Digital Amnesia
2025-12-10, General Track

Orkut was once the emotional and cultural core of Brazil’s internet. Its scraps, testimonials, and communities gave users a way to publicly shape identity, build relationships, and engage with everything from music and religion to politics and humor. When Google shut it down in 2014, most of its data was deleted. What remains today is fragmented and buried in the Wayback Machine.

In this talk, I use Python to recover and analyze limited traces of Orkut’s digital legacy. I scraped thousands of community names from archived HTML using requests and BeautifulSoup, processed them with multilingual sentence embeddings from sentence-transformers, and applied scikit-learn and BERTopic to cluster the data, surface major social themes, and quantify them. These techniques reveal how users created meaning, formed subcultures, and expressed identity through online interactions.
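
To give a flavor of the scraping step, here is a minimal sketch. The snapshot URL and the selector are illustrative placeholders, not the project's exact targets, since archived Orkut pages vary in structure across snapshots:

```python
# Minimal sketch of the scraping step. The snapshot URL and the selector
# are placeholders: archived Orkut pages vary across snapshots, so inspect
# the HTML first to find where community names actually live.
import requests
from bs4 import BeautifulSoup

# Hypothetical Wayback Machine snapshot of an Orkut community page.
SNAPSHOT_URL = (
    "https://web.archive.org/web/20140101000000/"
    "http://www.orkut.com/Main#CommunitySearch"
)

response = requests.get(SNAPSHOT_URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Treat non-empty link text as candidate community names; a real run would
# filter by the tags/classes observed in the archived markup.
community_names = [
    a.get_text(strip=True)
    for a in soup.find_all("a")
    if a.get_text(strip=True)
]

print(f"Extracted {len(community_names)} candidate names")
```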

Alongside the technical walkthrough, I draw on Cory Doctorow’s concept of enshittification, defined as the slow decline of platforms as they shift from serving users to exploiting them. Orkut is a case of enshittification by neglect: its shutdown led not just to the death of a platform, but to the erasure of a generation’s digital memory. According to Google's farewell announcement, over its 10 years of existence, Orkut hosted 51 million communities, 120 million discussion topics, and more than 1 billion interactions, most of which were permanently deleted.

This talk is for Python users interested not only in working with social media text data but also in uncovering the cultural narratives embedded within it. It invites the audience to see datasets as more than technical artifacts, viewing them instead as living records of online social life.


This talk explores how Python can be used to recover and analyze digital traces from a platform that once defined Brazil’s online culture. Orkut, active from 2004 to 2014, hosted millions of communities where users expressed identity, humor, politics, and emotion in public and often poetic ways. When the platform was shut down, nearly all of this user-generated data was deleted. Today, only fragmented pieces remain, preserved in the Wayback Machine.

I present a data analysis project that extracts and categorizes Orkut community names using open-source Python tools. I use requests and BeautifulSoup to scrape data from archived HTML snapshots. I then apply multilingual sentence embeddings from the sentence-transformers library to generate vector representations of the text, followed by clustering techniques using scikit-learn and BERTopic to uncover and quantify recurring social themes.
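
A minimal sketch of the processing step follows. The model name, the toy community names, and the cluster count are assumptions for illustration; the actual project applies BERTopic over thousands of scraped names, where its UMAP/HDBSCAN pipeline has enough data to work with, while the small KMeans baseline below runs on a toy list:

```python
# Minimal sketch: multilingual embeddings + a simple clustering baseline.
# The model name, toy data, and n_clusters are illustrative assumptions;
# the full project applies BERTopic to thousands of community names.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Toy stand-ins for scraped community names (Portuguese, as on Orkut).
community_names = [
    "Eu odeio acordar cedo",       # "I hate waking up early"
    "Eu amo dormir",               # "I love sleeping"
    "Rock nacional brasileiro",    # "Brazilian national rock"
    "Amo música sertaneja",        # "I love sertanejo music"
    "Física quântica",             # "Quantum physics"
    "Matemática para vestibular",  # "Math for college entrance exams"
]

# A multilingual model places Portuguese, Hindi, etc. in one vector space.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(community_names)

# KMeans as a small, runnable baseline; swapping in BERTopic adds
# automatic topic labels on the real, larger dataset.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(embeddings)

for label, name in sorted(zip(labels, community_names)):
    print(label, name)
```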

This technical walkthrough is grounded in a sociological lens. I draw on Cory Doctorow’s concept of enshittification, which describes how platforms degrade as they prioritize value extraction over user experience. Orkut's case illustrates how platform neglect can result not only in product death but also in large-scale cultural erasure. By treating community names as social artifacts, I show how data science can help recover forgotten histories and highlight overlooked communities at the intersection of digital humanities, memorialization, and cultural heritage.

Attendees will gain practical skills in web scraping, multilingual NLP, and unsupervised clustering. The talk also raises broader questions about data loss, platform decay, and the ethical role of data scientists, software engineers, and tech workers in preserving digital memory.

No advanced data science, scraping, text mining, or NLP knowledge is required. The talk is best suited for data scientists and Python developers interested in working with real-world social data and in approaching datasets with both technical rigor and cultural sensitivity, but it is accessible to anyone curious about data science, NLP, and text mining.

Time Breakdown (30 min)
| Time | Section |
| --------- | ------------------------------------------------------------------------------------- |
| 0–4 min | Introduction to Orkut and its cultural role in Brazil and in the Global South |
| 4–7 min | Platform shutdown, data loss, digital memory and neglect |
| 7–10 min | Project overview: goals, ethical framing, and data source (Wayback) |
| 10–15 min | Scraping with requests and BeautifulSoup from archived HTML |
| 15–20 min | Processing: multilingual embeddings with sentence-transformers |
| 20–23 min | Clustering and theme discovery using scikit-learn and BERTopic |
| 23–26 min | Insights: social themes, quantification, and what topic categories mattered to users |
| 26–29 min | Reflection: enshittification, data loss, and cultural preservation |
| 29–30 min | Final remarks and invitation to rethink data as memory + Q&A |

Additional remarks:
1) A GitHub repository containing the scraping scripts, archived HTML files, datasets, and analysis will be shared with attendees.
2) This project was inspired by both personal nostalgia and frustration over the loss of access to my Orkut profile, photos, testimonials, and communities.
3) Besides its overwhelming popularity in Brazil, Orkut also had a strong foothold in other countries across the Global South, such as India and China, reflecting its broad appeal beyond the English-speaking tech centers typically prioritized in platform histories (see the Alexa traffic snapshot below). This context makes the proposal outlined here all the more compelling for a PyData Global audience.

| Country        | Traffic on Mar 31, 2004 | Traffic on Sep 30, 2014 |
| -------------- | ----------------------- | ----------------------- |
| Brazil         | 5.16%                   | 55.5%                   |
| United States  | 51.36%                  | 3.3%                    |
| India          |                         | 18.4%                   |
| China          |                         | 6.4%                    |
| Japan          | 7.74%                   | 2.7%                    |
| Netherlands    | 4.10%                   |                         |
| United Kingdom | 3.72%                   |                         |
| Other          | 27.92%                  | 15.7%                   |
Reference: https://web.archive.org/web/20140109153358/http://www.alexa.com/siteinfo/orkut.com.br

Prior Knowledge Expected:

No

Rodrigo Silva Ferreira is a QA Engineer at Posit, where he contributes to the quality and usability of open-source tools that empower data scientists working in R and Python. He focuses on both manual and automated testing strategies to ensure reliability, performance, and an excellent user experience.

Rodrigo holds a BSc. in Chemistry with minors in Applied Math and Arabic from NYU Abu Dhabi and an MSc. in Analytical Chemistry from the University of Pittsburgh. Multilingual and globally minded, he enjoys working at the intersection of data, science, and technology, especially when it means building tools that help people better understand and navigate a world of increasingly complex data.