StorageLevel
StorageLevel describes how an RDD is persisted and addresses the following concerns (each maps to a flag of StorageLevel, as sketched after the list):
- Does RDD use disk?
- How much of RDD is in memory?
- Does RDD use off-heap memory?
- Should an RDD be serialized (while persisting)?
- How many replicas (default: 1) to use (can only be less than 40)?
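Each of these concerns corresponds to a parameter of the StorageLevel factory method in org.apache.spark.storage.StorageLevel. A minimal sketch of building a custom level (the chosen values are purely illustrative):

```scala
import org.apache.spark.storage.StorageLevel

// Answer each concern explicitly:
// disk? yes, memory? yes, off-heap? no, serialized? yes (deserialized = false), 2 replicas.
val customLevel = StorageLevel(
  useDisk = true,
  useMemory = true,
  useOffHeap = false,
  deserialized = false,
  replication = 2)
```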
There are the following StorageLevel values (the _2 suffix in a name denotes 2 replicas); see the persist example after the list:
- NONE (default)
- DISK_ONLY
- DISK_ONLY_2
- MEMORY_ONLY (default for the cache operation for RDDs)
- MEMORY_ONLY_2
- MEMORY_ONLY_SER
- MEMORY_ONLY_SER_2
- MEMORY_AND_DISK
- MEMORY_AND_DISK_2
- MEMORY_AND_DISK_SER
- MEMORY_AND_DISK_SER_2
- OFF_HEAP
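To persist an RDD with one of these levels, pass the corresponding constant to persist. A short sketch, separate from the example below (the input file name is an assumption):

```scala
import org.apache.spark.storage.StorageLevel

// Any RDD works the same way; logs.txt is just an example input.
val logs = sc.textFile("logs.txt")

// Keep the RDD as serialized bytes in memory, spilling to disk when it does not fit.
logs.persist(StorageLevel.MEMORY_AND_DISK_SER)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY) for RDDs.
```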
You can check the current storage level of an RDD using the getStorageLevel operation.
```
scala> val lines = sc.textFile("README.md")

scala> lines.getStorageLevel
res0: org.apache.spark.storage.StorageLevel = StorageLevel(disk=false, memory=false, offheap=false, deserialized=false, replication=1)
```
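Once a level has been assigned (e.g. via cache or persist), getStorageLevel reports it. A sketch of what to expect for MEMORY_ONLY (the exact toString format may differ between Spark versions):

```scala
lines.cache()   // shorthand for persist(StorageLevel.MEMORY_ONLY)

// For MEMORY_ONLY you would expect something like:
// StorageLevel(disk=false, memory=true, offheap=false, deserialized=true, replication=1)
println(lines.getStorageLevel)
```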