Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection
Conference proceeding   Peer reviewed

Shalini Jangra, Suparna De, Nishanth Sastry and Saeed Fadaei
Social Networks Analysis and Mining: 17th International Conference, ASONAM 2025, Vol.16323, pp.3-18
Lecture Notes in Computer Science, 16323
Social Networks Analysis and Mining. ASONAM 2025 (Niagara Falls, Canada, 25/08/2025–28/08/2025)
03/02/2026

Abstract

Keywords: Personal Information Identifiers; Synthetic data; Vulnerable Populations; Privacy Leaks; Large Language Models
Social platforms such as Reddit host networks of communities built around shared interests, with many posts and comments from which one can infer users' Personal Information Identifiers (PIIs). While such self-disclosures can lead to rewarding social interactions, they pose privacy risks and the threat of online harms. Research into the identification and retrieval of such risky self-disclosures of PIIs is hampered by the lack of open-source labeled datasets. Important hindrances to sharing high-quality labeled data include high annotation costs and the privacy risks associated with releasing datasets that contain self-disclosive text, especially when the users include vulnerable populations. To foster reproducible research into PII-revealing text detection, we develop a novel methodology to create synthetic equivalents of PII-revealing data that can be safely shared. Our contributions include a taxonomy of 19 PII-revealing categories for vulnerable populations, and the creation and release of a synthetic PII-labeled multi-text span dataset generated from three text-generation Large Language Models (LLMs), Llama2-7B, Llama3-8B, and zephyr-7b-beta, with sequential instruction prompting to resemble the original Reddit posts. The utility of our methodology for generating this synthetic dataset is evaluated with three metrics. First, we require reproducibility equivalence, i.e., results from training a model on the synthetic data should be comparable to those obtained by training the same model on the original posts. Second, we require that the synthetic data be unlinkable to the original users through common mechanisms such as Google Search. Third, we wish to ensure that the synthetic data is indistinguishable from the original, i.e., trained humans should not be able to tell them apart. We release our dataset and code at https://netsys.surrey.ac.uk/datasets/synthetic-self-disclosure/ to foster reproducible research into PII privacy risks in online social media.
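The reproducibility-equivalence criterion in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the exact-match span F1 metric, the 0.05 tolerance, the category names (`AGE`, `LOCATION`, `HEALTH`), and the toy predictions are all assumptions chosen for illustration. The idea is to score a detector trained on original posts and one trained on synthetic posts against the same gold spans, then check that the scores are close.

```python
from typing import List, Set, Tuple

# A labeled span: (start_char, end_char, pii_category).
# Category names used below are hypothetical placeholders.
Span = Tuple[int, int, str]


def span_f1(gold: List[Set[Span]], pred: List[Set[Span]]) -> float:
    """Micro-averaged exact-match F1 over labeled spans, one set per post."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


def is_equivalent(f1_original: float, f1_synthetic: float,
                  tolerance: float = 0.05) -> bool:
    """Assumed criterion: the two scores differ by at most `tolerance`."""
    return abs(f1_original - f1_synthetic) <= tolerance


# Toy gold annotations for two held-out posts.
gold = [{(0, 5, "AGE"), (10, 18, "LOCATION")}, {(3, 9, "HEALTH")}]
# Hypothetical predictions from a model trained on original posts...
pred_from_original = [{(0, 5, "AGE"), (10, 18, "LOCATION")}, {(3, 9, "HEALTH")}]
# ...and from the same architecture trained on synthetic posts.
pred_from_synthetic = [{(0, 5, "AGE")}, {(3, 9, "HEALTH")}]

f1_orig = span_f1(gold, pred_from_original)
f1_syn = span_f1(gold, pred_from_synthetic)
print(f"original={f1_orig:.2f} synthetic={f1_syn:.2f} "
      f"equivalent={is_equivalent(f1_orig, f1_syn)}")
```

In practice the tolerance and the span-matching rule (exact versus partial overlap) would need to follow whatever the paper itself specifies; the values here only make the comparison concrete.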
Author's Accepted Manuscript; embargo until publication date; CC BY 4.0
Conference website: https://asonam.cpsc.ucalgary.ca/2025/
