MapRDD : finer grained resilient distributed dataset for machine learning

Zhenyu Li; Stephen A Jarvis

doi:10.1145/3206333.3206335

Conference proceeding

2018

DOI: https://doi.org/10.1145/3206333.3206335

Abstract

Subject classification: QA76 Electronic computers. Computer science. Computer software

The Resilient Distributed Dataset (RDD) is the core memory abstraction behind the popular data-analytic framework Apache Spark. We present an extension to the Resilient Distributed Dataset for map transformations, which we call MapRDD. It takes advantage of the underlying relations between records in the parent and child datasets to achieve random access to individual records in a partition. The design is complemented by a new MemoryStore, which manages data sampling and data transfers asynchronously. We use the ImageNet dataset to demonstrate that: (I) the initial data loading phase is redundant and can be completely avoided; (II) sampling on the CPU can be entirely overlapped with training on the GPU to achieve near-full occupancy; (III) CPU processing cycles and memory usage can be reduced by more than 90%, allowing other applications to run simultaneously; (IV) constant training step time can be achieved, regardless of the size of the partition, for up to 1.3 million records in our experiments. We expect to obtain the same improvements in other RDD transformations via further research on finer-grained implicit and explicit dataset relations.
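The random-access idea in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation and does not use the Spark API; the class `LazyMapPartition`, the function `decode`, and the `calls` counter are hypothetical names introduced only for illustration. The sketch assumes that a map transformation is one-to-one, so child record i depends only on parent record i and can be computed on demand without materialising the rest of the partition.

```python
import random
from typing import Callable, Generic, List, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class LazyMapPartition(Generic[T, U]):
    """Hypothetical stand-in for one partition of a map-derived dataset."""

    def __init__(self, parent: Sequence[T], fn: Callable[[T], U]) -> None:
        self.parent = parent
        self.fn = fn
        self.calls = 0  # how many records were actually transformed

    def __len__(self) -> int:
        return len(self.parent)

    def __getitem__(self, index: int) -> U:
        # One-to-one relation: child[index] is derived from parent[index] only,
        # so no other record in the partition needs to be touched.
        self.calls += 1
        return self.fn(self.parent[index])

    def sample(self, batch_size: int, rng: random.Random) -> List[U]:
        # Draw distinct record indices and transform just those records.
        indices = rng.sample(range(len(self)), batch_size)
        return [self[i] for i in indices]


def decode(record: int) -> int:
    # Placeholder for an expensive per-record step such as image decoding.
    return record * 2


# A partition of one million parent records; nothing is transformed up front.
partition = LazyMapPartition(range(1_000_000), decode)
batch = partition.sample(32, random.Random(0))
```

In this sketch, drawing a mini-batch of 32 records invokes `decode` 32 times whatever the partition size, which is the property behind a step time that does not grow with the partition. Asynchronous sampling and transfer, which the abstract attributes to the new MemoryStore, are not modelled here.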

Details

Title: MapRDD : finer grained resilient distributed dataset for machine learning
Creators: Zhenyu Li - University of Warwick; Stephen A Jarvis - University of Warwick
Publisher: ACM
Publication Date: 2018
Identifiers: 991103783802346
Academic Unit: President & VC's Office (VC01)
Language: English
Resource Type: Conference proceeding
