Abstract
Data marketplaces are expected to become a prominent part of the digital economy: data assets are being generated at an unprecedented rate, and marketplaces have the potential to transform how their value is unlocked. As data volumes grow and manual documentation becomes unsustainable, automated generation of metadata descriptions will become a necessity for data asset discoverability. Semi-automated means are therefore required to lower the barrier to entry for data providers. As an extension of the Data Space concept, marketplaces advertise their assets or products through "offerings", which data providers publish to a catalogue that data consumers can in turn query, making an asset or bundle of assets discoverable. Recent advances in large language models (LLMs) and constrained decoding techniques enable schema-compliant, semi-automated metadata generation, reducing manual overhead and improving discoverability. We propose a schema-aware, edge-optimized LLM pipeline for generating structured descriptions of data asset offerings in the SEDIMARK marketplace and evaluate it on realistic information models.