[Paper] LCS: An Efficient Data Eviction Strategy for Spark
Published: 2019-06-13


Abstract

  • Classical strategies are not aware of recovery cost, which can degrade system performance.  -->  A cost-aware eviction strategy can reduce the total recovery cost.
  • A strategy named LCS (Least Cost Strategy)  -->  obtains the dependency information between cache data by analyzing the application, and calculates the recovery cost at run time. By predicting how many times cache data will be reused and using that count to weight the recovery cost, LCS always evicts the data that leads to the minimum recovery cost in the future.

Introduction

  • Current eviction strategies:
    • FIFO: focuses on creation time.
    • LRU: focuses on access history to achieve a better hit ratio.
  • Many eviction algorithms take the access history and the cost of cache items into consideration. For Spark, however, the execution logic of the upcoming phase is already known, so access history does not help the eviction strategy.
  • LCS has three steps:
    1. Obtains the dependencies between RDDs by analyzing the application, and predicts how many times each cache partition will be reused.
    2. Collects information during partition creation, and predicts the recovery cost.
    3. Maintains the eviction order using the two pieces of information above, and evicts the partition that incurs the least cost when memory is insufficient.
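
To make the contrast between FIFO, LRU, and a cost-aware policy concrete, here is a minimal sketch. It is not the LCS implementation: the Partition fields, the sample numbers, and the simple "recovery cost times predicted reuses" weighting are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    created_at: int       # logical creation time (what FIFO looks at)
    last_access: int      # logical time of last access (what LRU looks at)
    recovery_cost: float  # hypothetical seconds needed to bring the data back
    future_reuses: int    # hypothetical predicted number of future reads


def fifo_victim(parts):
    """FIFO evicts the partition that was created first."""
    return min(parts, key=lambda p: p.created_at)


def lru_victim(parts):
    """LRU evicts the partition that was accessed least recently."""
    return min(parts, key=lambda p: p.last_access)


def least_cost_victim(parts):
    """Cost-aware choice: evict the partition whose loss is cheapest overall."""
    return min(parts, key=lambda p: p.recovery_cost * p.future_reuses)


parts = [
    Partition("a", created_at=1, last_access=9, recovery_cost=30.0, future_reuses=3),
    Partition("b", created_at=2, last_access=5, recovery_cost=20.0, future_reuses=2),
    Partition("c", created_at=3, last_access=8, recovery_cost=4.0, future_reuses=1),
]

print(fifo_victim(parts).name)        # a
print(lru_victim(parts).name)         # b
print(least_cost_victim(parts).name)  # c
```

With these made-up numbers, FIFO and LRU both evict partitions that are expensive to recover, while the cost-aware policy evicts "c", the one that is cheap to rebuild and read only once more.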

Design and Implementation

Overall Architecture

  • Three necessary steps:
    1. The Analyzer on the driver node analyzes the application using the DAG structures provided by DAGScheduler.
    2. The Collector on each executor node records information about each cache partition during its creation.
    3. Eviction Decision selects the optimal set of cache partitions to evict when the remaining memory for cache storage is insufficient, and decides whether to remove each one from MemoryStore or serialize it to DiskStore.

Analyzer

  • DAG starting points:
    • DFS: files on it can be read directly from local or remote disk;
    • ShuffledRDD: can be generated by fetching remote shuffle data.

  These indicate the longest running path of a task: when all the cache RDDs are missing, the task needs to run from the starting points. (Otherwise it only needs to run part of the path, starting from a cache RDD, by referring to the dependencies between RDDs.)

  • The aim of the Analyzer is to classify cache RDDs and analyze the dependency information between them before each stage runs.
  • The Analyzer runs only on the driver node and transfers its results to the executors when the driver schedules tasks to them.
  • RDDs that need to be unpersisted are pre-registered. By checking whether such an RDD is used in each stage, the Analyzer puts it into the RemovableRDDs list of the last stage that uses it. A removable partition can be evicted directly, so it does not waste memory.
  • The cache RDDs of a stage are classified into:
    • currently running cache RDDs (targetCacheRDDs)
    • RDDs that participate in the current stage (relatedCacheRDDs)
    • other cache RDDs
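
The bookkeeping described above can be sketched as follows. This is only an illustration under stated assumptions: stage_uses (a per-stage list of the cache RDD names each stage reads), the function names, and counting reuse as "number of later stages that read the RDD" are all hypothetical and not taken from the paper or from Spark's API.

```python
from collections import Counter


def remaining_reuses(stage_uses, current_stage):
    """Count, for each cache RDD, how many later stages still read it."""
    counts = Counter()
    for used in stage_uses[current_stage + 1:]:
        counts.update(set(used))
    return counts


def removable_by_stage(stage_uses, to_unpersist):
    """Map each stage index to the pre-registered RDDs whose last use is that stage."""
    removable = {i: [] for i in range(len(stage_uses))}
    for rdd in sorted(to_unpersist):
        users = [i for i, used in enumerate(stage_uses) if rdd in used]
        if users:
            # The RDD becomes removable once its last user has finished.
            removable[users[-1]].append(rdd)
    return removable


# Hypothetical application with four stages.
stage_uses = [["rdd1"], ["rdd1", "rdd2"], ["rdd2", "rdd3"], ["rdd3"]]

print(dict(remaining_reuses(stage_uses, 0)))
print(removable_by_stage(stage_uses, {"rdd1", "rdd2"}))
```

In this toy run, rdd1 lands in the removable list of stage 1 and rdd2 in that of stage 2, so their partitions could be dropped as soon as those stages finish.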

Collector

  • The Collector gathers information about each cache partition while tasks are running.
  • Information that needs to be observed:
    • Create cost: the time spent creating the partition, called Ccreate.
    • Eviction cost: the time it takes to evict a partition from memory, called Ceviction. If the partition is serialized to disk, the eviction cost is the time spent on serializing and writing to disk, denoted Cser. If it is removed directly, the eviction cost is 0.
    • Recovery cost: the time it takes to restore a partition whose data is not found in memory, called Crecovery. If the partition was serialized to disk, the recovery cost is the time spent reading from disk and deserializing, denoted Cdeser. Otherwise, the partition is recomputed from lineage information, and the cost is denoted Crecompute.
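
The cost definitions above can be written down as a small sketch. The PartitionCosts record, the timed helper, and the function names are hypothetical, and how the real Collector measures these times inside an executor is not covered here.

```python
import time
from dataclasses import dataclass


@dataclass
class PartitionCosts:
    c_create: float     # Ccreate: time spent creating the partition
    c_ser: float        # Cser: time to serialize and write to disk
    c_deser: float      # Cdeser: time to read from disk and deserialize
    c_recompute: float  # Crecompute: time to recompute from lineage


def timed(fn, *args):
    """Run fn and return (result, elapsed seconds); one way to observe a cost."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def eviction_cost(costs, to_disk):
    """Ceviction: Cser if the partition is spilled to disk, 0 if simply removed."""
    return costs.c_ser if to_disk else 0.0


def recovery_cost(costs, on_disk):
    """Crecovery: Cdeser if a disk copy exists, otherwise Crecompute."""
    return costs.c_deser if on_disk else costs.c_recompute


costs = PartitionCosts(c_create=12.0, c_ser=2.0, c_deser=1.5, c_recompute=10.0)
print(eviction_cost(costs, to_disk=True), recovery_cost(costs, on_disk=True))
print(eviction_cost(costs, to_disk=False), recovery_cost(costs, on_disk=False))
```

The two print lines show the trade-off: spilling pays Cser now and Cdeser later, while dropping pays nothing now and Crecompute later.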

Eviction Decision

  • Using the information provided by the Collector, each cache partition is assigned a WCPM value:
    WCPM = min(CPM * reus, SPM + DPM * reus)
    CPMrenew = (CPMancestor * sizeancestor + CPM * size) / size
    Here SPM refers to serialization, DPM refers to deserialization, and reus refers to reusability.
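
A minimal sketch of how the two formulas might drive the eviction order is shown below. The function names and the sample values are hypothetical. Reading the min as a choice between recomputing (CPM * reus) and spilling to disk (SPM + DPM * reus) is an interpretation of these notes, not something they state outright.

```python
def wcpm(cpm, spm, dpm, reus):
    """WCPM = min(CPM * reus, SPM + DPM * reus)."""
    return min(cpm * reus, spm + dpm * reus)


def renewed_cpm(cpm, size, cpm_ancestor, size_ancestor):
    """CPMrenew = (CPMancestor * sizeancestor + CPM * size) / size."""
    return (cpm_ancestor * size_ancestor + cpm * size) / size


def cheaper_action(cpm, spm, dpm, reus):
    """Assumed reading of the min: pick whichever branch costs less."""
    return "remove" if cpm * reus <= spm + dpm * reus else "serialize"


def eviction_order(partitions):
    """Sort partition names by ascending WCPM, so the cheapest goes first."""
    return sorted(partitions, key=lambda name: wcpm(*partitions[name]))


# Hypothetical values: name -> (CPM, SPM, DPM, reus)
partitions = {
    "p1": (2.0, 1.0, 0.5, 3),
    "p2": (0.5, 1.0, 0.5, 1),
    "p3": (4.0, 3.0, 1.0, 2),
}

print(eviction_order(partitions))          # ['p2', 'p1', 'p3']
print(cheaper_action(*partitions["p1"]))   # serialize
print(cheaper_action(*partitions["p2"]))   # remove
print(renewed_cpm(2.0, 10, 1.0, 20))       # 4.0
```

Here p2 is evicted first because recomputing it once is cheap, while p1 would be serialized because it is reused three times and reading it back costs less than recomputing it each time.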

Evaluation

Evaluation Environment and Method

  • PR, CC, KMeans algorithms...
  • LCS is compared with LRU and FIFO.

 

Reposted from: https://www.cnblogs.com/wttttt/p/7521916.html
