Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection

Shalini Jangra; Suparna De; Nishanth Sastry; Saeed Fadaei

doi:10.1007/978-3-032-13821-7_1

Conference proceeding

Peer reviewed

Social Networks Analysis and Mining: 17th International Conference, ASONAM 2025, Vol.16323, pp.3-18

Lecture Notes in Computer Science, 16323

Social Networks Analysis and Mining. ASONAM 2025 (Niagara Falls, Canada, 25/08/2025–28/08/2025)

03/02/2026

DOI: https://doi.org/10.1007/978-3-032-13821-7_1

Abstract

Personal Information Identifiers

Synthetic data

Vulnerable Populations

Privacy Leaks

Large Language Models

Social platforms such as Reddit have a network of communities of shared interests, with a prevalence of posts and comments from which one can infer users' Personal Information Identifiers (PIIs). While such self-disclosures can lead to rewarding social interactions, they pose privacy risks and the threat of online harms. Research into the identification and retrieval of such risky self-disclosures of PIIs is hampered by the lack of open-source labeled datasets. Important hindrances to sharing high-quality labelled data include high annotation costs and privacy risks associated with the release of datasets containing self-disclosive text, especially when users include vulnerable populations. To foster reproducible research into PII-revealing text detection, we develop a novel methodology to create synthetic equivalents of PII-revealing data that can be safely shared. Our contributions include creating a taxonomy of 19 PII-revealing categories for vulnerable populations and the creation and release of a synthetic PII-labeled multi-text span dataset generated from 3 text generation Large Language Models (LLMs), Llama2-7B, Llama3-8B, and zephyr-7b-beta, with sequential instruction prompting to resemble the original Reddit posts. The utility of our methodology to generate this synthetic dataset is evaluated with three metrics: First, we require reproducibility equivalence, i.e., results from training a model on the synthetic data should be comparable to those obtained by training the same models on the original posts. Second, we require that the synthetic data be unlinkable to the original users, through common mechanisms such as Google Search. Third, we wish to ensure that the synthetic data be indistinguishable from the original, i.e., trained humans should not be able to tell them apart. We release our dataset and code at https://netsys.surrey.ac.uk/datasets/synthetic-self-disclosure/ to foster reproducible research into PII privacy risks in online social media.

Files and links (2)

pdf

ASONAM_2025_-_CRV_paper_126424.28 kB

Author's Accepted Manuscript Embargo until publication date CC BY V4.0

url

https://asonam.cpsc.ucalgary.ca/2025/

Event Website Conference website

Details

Title: Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection
Creators: Shalini Jangra (Corresponding Author) - University of Surrey, School of Computer Science & Electronic Engineering
Suparna De (Author) - University of Surrey, School of Computer Science & Electronic Engineering
Nishanth Sastry (Author) - University of Surrey, School of Computer Science & Electronic Engineering
Saeed Fadaei (Author) - University of Surrey, School of Computer Science & Electronic Engineering
Publication Details: Social Networks Analysis and Mining: 17th International Conference, ASONAM 2025, Vol.16323, pp.3-18
Conference: Social Networks Analysis and Mining. ASONAM 2025 (Niagara Falls, Canada, 25/08/2025–28/08/2025)
Series: Lecture Notes in Computer Science; 16323
Publisher: Springer
Number of pages: 16
Publication Date: 03/02/2026
Grants: AP4L: Adaptive PETs to Protect & emPower People during Life Transitions, EP/W032473/1, Engineering and Physical Sciences Research Council (United Kingdom, Swindon) - EPSRC
Identifiers: 991095228902346; WOS:001752227300001
Copyright: © 2026 The Author(s), under exclusive license to Springer Nature Switzerland AG. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising.
Academic Unit: School of Computer Science & Electronic Engineering
Language: English
Resource Type: Conference proceeding
